RANLP 2015 Natural Language Processing for Translation Memories (NLP4TM) Proceedings of the Workshop September 11, 2015 Hissar, Bulgaria




RANLP 2015

Natural Language Processing for Translation Memories (NLP4TM)

Proceedings of the Workshop

September 11, 2015, Hissar, Bulgaria

The Workshop on Natural Language Processing

for Translation Memories (NLP4TM), associated with THE INTERNATIONAL CONFERENCE

RECENT ADVANCES IN NATURAL LANGUAGE PROCESSING 2015

PROCEEDINGS

Hissar, Bulgaria, 11 September 2015


Introduction

Translation Memories (TM) are amongst the most used tools by professional translators. The underlying idea of TMs is that a translator should benefit as much as possible from previous translations by being able to retrieve how a similar sentence was translated before. Despite the fact that the core idea of these systems relies on comparing segments (typically of sentence length) from the document to be translated with segments from previous translations, most of the existing TM systems hardly use any language processing for this. Instead of addressing this issue, most of the work on translation memories has focused on improving the user experience, by allowing processing of a variety of document formats, intuitive user interfaces, and so on.

The term second generation translation memories has been around for more than ten years, and it promises translation memory software that integrates linguistic processing in order to improve the translation process. This linguistic processing can involve the matching of subsentential chunks, edit distance operations between syntactic trees, and the incorporation of semantic and discourse information in the matching process. Terminologies, glossaries and ontologies are also very useful for translation memories, facilitating the task of the translator and ensuring a consistent translation. The field of Natural Language Processing (NLP) has proposed numerous methods for terminology extraction and ontology extraction which can be integrated in the translation process. The building of translation memories from corpora is another field where methods from NLP can contribute to improving the translation process.

We are happy that we could include in the workshop programme 4 long contributions and 3 short papers dealing with the aforementioned issues.

Vít Baisa, Aleš Horák and Marek Medveď discuss in Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods how it is possible to extend existing translation memories using linguistically motivated segment combination approaches, concentrating on preserving high translational quality.

Linked to the topic of enhancing existing translation memories, in the paper Spotting false translation segments in translation memories Eduard Barbu presents a method for identifying false translations in translation memories, framed as a classification task.

In Improving translation memory fuzzy matching by paraphrasing, Konstantinos Chatzitheodorou explores the use of paraphrasing in retrieving better segments from translation memories. The method relies on NooJ and performs consistently better than the state of the art on the EN-IT language pair.

CATaLog: New Approaches to TM and Post Editing Interfaces presents a new CAT tool by Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith. The aim of the tool is to improve both the performance and the productivity of post-editing.

Carla Parra Escartiacuten aims at bridging the gap between academic research on Translation Memories (TMs)and the actual needs and wishes of translators in Creation of new TM segments Fulfilling translatorsrsquowishes She presents a pilot study where the requests of translators are being implemented in thetranslation workflow

Katerina Timonera and Ruslan Mitkov explore the use of clause splitting for better retrieval of segments. Their paper Improving Translation Memory Matching through Clause Splitting shows that their method leads to a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

The Organising Committee would like to thank the Programme Committee, who responded with very fast but also substantial reviews for the workshop programme. This workshop would not have been possible


without the support received from the EXPERT project (FP7/2007-2013 under REA grant agreement no. 317471, http://expert-itn.eu).

Constantin Orasan and Rohit Gupta

iv

Organizers

Constantin Orasan, University of Wolverhampton, UK

Rohit Gupta, University of Wolverhampton, UK

Program Committee

Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk

Invited Speaker

Marcello Federico, Fondazione Bruno Kessler, Italy


Table of Contents

Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín . . . . . . . . . . 1

Spotting false translation segments in translation memories
Eduard Barbu . . . . . . . . . . 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov . . . . . . . . . . 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou . . . . . . . . . . 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď . . . . . . . . . . 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith . . . . . . . . . . 36


Conference Program

9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015

Creation of new TM segments: Fulfilling translators' wishes

Carla Parra EscartiacutenHermes Traducciones

C Coacutelquide 6 portal 2 3ž I28230 Las Rozas Madrid Spain

carlaparrahermestranscom

Abstract

This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.

1 Introduction

Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (Autosuggest in SDL Trados Studio¹ and Muses in memoQ²), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

¹ www.sdl.com
² www.memoq.com

In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to make the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received; Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.

2 Our company

Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high-quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).

3 Internal survey

We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e. re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 2014³ and memoQ⁴, it is not always carried out automatically and it is difficult to estimate

³ In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

⁴ memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e. re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html

when such reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g. 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

Ideas more related to NLP included the automatic correction of orthotypography in the target language, and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although currently it is possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g. German > Italian) than the ones selected for that specific project (e.g. German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e. the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out. If a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.⁵

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's such functionality: fragment assembly.⁶

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks

⁵ For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html

⁶ For further information, see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html


for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments will be inserted in the target segment, one after another, without the source sentence words that could not be retrieved.

Figure 1: memoQ's fragment assembly functionality.

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of appearance of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.

Examples 1 and 2 show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa

(2) EN: The window will show the following: The application will close
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación

Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación

In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving these translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.


4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:

(5) EN: Installing new services… Service XXXX for premium clients [2]
ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears.

(7) Installing new services

In these cases, we proceeded as follows:

1. Check that there is an ellipsis/a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
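The splitting procedure can be sketched in a few lines of Python. This is a simplified illustration rather than our production script: the function names are invented, and the exact handling of the delimiter (kept attached to the first fragment) is an assumption.

```python
import re

# Split a bilingual segment at an ellipsis or colon, following the three
# steps above. Returns the new segment pairs, or an empty list when source
# and target do not both contain the delimiter (step 1).
DELIM = re.compile(r'(\.\.\.|…|:)')

def split_at_delimiter(src, tgt):
    src_parts = DELIM.split(src, maxsplit=1)
    tgt_parts = DELIM.split(tgt, maxsplit=1)
    if len(src_parts) != 3 or len(tgt_parts) != 3:
        return []
    # Step 2: fragment up to the delimiter, and fragment after it.
    # Step 3: each fragment pair becomes a new TM segment.
    return [(src_parts[0].strip() + src_parts[1], tgt_parts[0].strip() + tgt_parts[1]),
            (src_parts[2].strip(), tgt_parts[2].strip())]

# Examples 1 and 2 from Section 3
new_segments = []
for src, tgt in [
    ("The following message then appears: Click accept to run the program",
     "Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa"),
    ("The window will show the following: The application will close",
     "La ventana mostrará lo siguiente: Se cerrará la aplicación"),
]:
    new_segments.extend(split_at_delimiter(src, tgt))

tm = dict(new_segments)
# Both halves of the sentence in Example 3 are now covered by the TM
print(tm["The following message then appears:"])  # Aparece el siguiente mensaje:
print(tm["The application will close"])           # Se cerrará la aplicación
```

Note how the two fragments needed to translate Example 3 are retrieved from segments generated out of Examples 1 and 2.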

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters, and the sentence without such content.

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) The sentence with those characters and the content between them removed.

(b) A fragment starting at the opening character and finishing at the closing one, including the content within. In this fragment, the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.

At this preliminary stage, we considered that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work, we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
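A minimal Python sketch of this strategy for one bracket type follows. It is illustrative rather than our production code (the helper names are ours), and it builds in the order-preservation assumption discussed above.

```python
import re

# Derive the three fragment types described above from a bilingual segment:
# (a) the sentence without the bracketed material, (b) the bracketed
# material with its brackets, and (c) the content without the brackets.
PAIRS = {"(": (r"\((.*?)\)", ")"), "[": (r"\[(.*?)\]", "]"), "{": (r"\{(.*?)\}", "}")}

def bracket_fragments(src, tgt, opener="("):
    pattern_str, closer = PAIRS[opener]
    pattern = re.compile(pattern_str)
    src_hits = pattern.findall(src)
    tgt_hits = pattern.findall(tgt)
    # Step 1: bracketed content must exist on both sides; as stated above,
    # we assume bracketed clauses keep their order across languages.
    if not src_hits or len(src_hits) != len(tgt_hits):
        return []
    fragments = [
        # (a) sentence with the brackets and their content removed
        (pattern.sub("", src).strip(), pattern.sub("", tgt).strip()),
    ]
    for s, t in zip(src_hits, tgt_hits):
        fragments.append((opener + s + closer, opener + t + closer))  # (b)
        fragments.append((s, t))                                      # (c)
    return fragments

# Example 8 from the paper
frags = bracket_fragments(
    "Creates an installation package for application installation (if it was not created earlier)",
    "Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)",
)
for en, es in frags:
    print(en, "=>", es)
```

Each of the three resulting pairs would then be stored as a new TM segment.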


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy</1> check box in the <2>Settings inheritance</2> section of the <3>General</3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria</1> en la sección <2>Herencia de configuración</2> que aparece en la sección <3>General</3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
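For the tag case, the handling just described can be sketched as follows. This is an illustrative approximation assuming numbered tag pairs of the form <n>…</n>; the example pair is a shortened, invented variant of Example 12, not a segment from our TM.

```python
import re

# From a bilingual segment with double tags, keep (1) the sentence with the
# tag markup stripped (inner text kept) and (2) the text within each pair
# of tags, matched by tag number, as new segments.
TAG_PAIR = re.compile(r"<(\d+)>(.*?)</\1>")

def tag_segments(src, tgt):
    def strip_tags(s):
        return TAG_PAIR.sub(lambda m: m.group(2), s)

    segments = []
    src_inner = TAG_PAIR.findall(src)
    tgt_inner = TAG_PAIR.findall(tgt)
    if src_inner and len(src_inner) == len(tgt_inner):
        segments.append((strip_tags(src), strip_tags(tgt)))
        for (n_s, s), (n_t, t) in zip(src_inner, tgt_inner):
            if n_s == n_t:  # align tag contents by tag number
                segments.append((s, t))
    return segments

# A shortened, invented pair in the style of Example 12
pairs = tag_segments(
    "Clear the <1>Inherit settings</1> check box in the <2>General</2> section",
    "Anule la casilla <1>Heredar configuración</1> en la sección <2>General</2>",
)
for en, es in pairs:
    print(en, "=>", es)
```

Quotation marks can be handled analogously, with a pattern matching paired quotes instead of numbered tags.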

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English, to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95-99%            4          2
85-94%            0          0
75-84%          285         14
50-74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and was thus the biggest one. Table 2 summarises the size of the three TMs.

             Segments   Words EN   Words ES   Words total
Project TM      16842     212472     244159        456631
Product TM      20923     274542     317797        592339
Client TM      256099    3427861    3951732       7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segmentsgenerated using each strategy and for each TM


                          Segments   Words EN   Words ES   Words total
Project TM new segments       6776      56973      66297        123270
Product TM new segments       7760      71034      83125        154159
Client TM new segments       74041     662714     769705       1432419

Table 3: Size of the new TMs generated using our approach.

used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

                 Project TM   Product TM   Client TM
Ellipsis                  7            6          50
Colon                  1801         2094       17894
Parentheses            1361         1621       29637
Square brackets          78           73         598
Curly braces              0            0           0
Quotations             1085         1146        8797
Tags                   2523         2892       17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segments TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM gen-


             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%         0    5    5    5    5    5    5    5    5     5     5
95-99%       2    3    3    4    4    4    9    9    9     9     9
85-94%       0    1    1    1    1    1    2    2    2     2     2
75-84%      14    9   14    9   15   15   12   16   16    16    16
50-74%     187  130  198  136  202  202  180  223  226   226   226
No match   142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

erated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 187 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations produced by the first method is high, and the errors are relatively few.

1 https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task, where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases where one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is partially translated in Italian as "Primi anni 1980" ("Early 1980s").

The Random Text and Chat errors arise in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.


It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993): in a parallel corpus the corresponding segments have roughly the same length.3 To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1.

CG = (l_s − l_d) / √(3.4 (l_s + l_d))    (1)
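As a concrete illustration, equation 1 can be computed as follows. Measuring segment length in characters is our assumption; the unit is not stated in this excerpt, though character counts are what Gale and Church originally used.

```python
import math

def church_gale_score(source: str, target: str) -> float:
    """Modified Church-Gale length difference (Tiedemann, 2011).

    Scores near zero mean the two segments have similar lengths;
    large absolute values are suspicious for true translations.
    Lengths are measured in characters (an assumption here).
    """
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))
```

For example, a 100-character source paired with a 10-character target scores about 4.7, far in the tail of the Matecat distribution described below.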

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.4 The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots, in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:

• Training Set: contains 1243 bi-segments, of which 373 are negative examples.

• Test Set: contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.
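The tail-biased draw from the Collaborative Set can be sketched as follows, assuming each bi-segment already carries its Church-Gale score. The quadratic weighting is an illustrative choice, not a detail taken from the paper.

```python
import random

def biased_sample(bisegments, cg_scores, k, power=2.0, seed=0):
    """Draw k bi-segments, favouring those whose Church-Gale score
    lies far from the centre of the distribution (likely errors).

    `power` controls how strongly the tails are favoured; it is an
    illustrative knob, not a value from the paper.
    """
    rng = random.Random(seed)
    weights = [abs(s) ** power + 1e-9 for s in cg_scores]
    return rng.choices(bisegments, weights=weights, k=k)
```

Drawn candidates would still be manually validated, as described above, before entering the training set.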

Figure 5: The Q-Q plot for the Matecat Set.

Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.

• has url: The value of the feature is 1 if the source or target segment contains a URL address, otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segment contains a tag, otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segment contains an email address, otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segment contains a number, otherwise it is 0.

• has capital letters: The value of the feature is 1 if the source or target segment contains words that have at least one capital letter, otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/is not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the URL-address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity: The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.


• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails, or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
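The vector-cosine features above all share one recipe; a minimal sketch for punctuation similarity follows. The punctuation inventory and the handling of segments with no punctuation at all are our assumptions, not details given in the paper.

```python
import math
from collections import Counter

PUNCTUATION = list(".,;:!?()[]\"'")  # assumed inventory

def punct_vector(text):
    """Count each punctuation mark occurring in the text."""
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    return [counts[p] for p in PUNCTUATION]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Treat two punctuation-free segments as perfectly similar.
        return 1.0 if norm_u == norm_v else 0.0
    return dot / (norm_u * norm_v)

def punctuation_similarity(source, target):
    return cosine(punct_vector(source), punct_vector(target))
```

The tag, email, URL, and number similarities would be computed the same way, only with a different tokenizer feeding the count vectors.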

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.

• Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than Logistic Regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The predicted output is the majority label among those neighbors. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
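The six algorithms above map directly onto scikit-learn estimators, the package the authors cite. The hyper-parameters below are library defaults (plus fixed random seeds), not the settings used in the paper.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# One estimator per algorithm discussed in section 4.2.
CLASSIFIERS = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear kernel)": SVC(kernel="linear", random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}
```

Each estimator exposes the same `fit`/`predict` interface, so the whole line-up can be trained and scored in one loop over the feature vectors of section 4.1.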

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified cross-validation on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm              Precision  Recall    F1
Random Forest             0.95     0.97    0.96
Decision Tree             0.98     0.97    0.97
SVM                       0.94     0.98    0.96
K-Nearest Neighbors       0.94     0.98    0.96
Logistic Regression       0.92     0.98    0.95
Gaussian Naive Bayes      0.86     0.96    0.91
Baseline Uniform          0.69     0.53    0.60
Baseline Stratified       0.70     0.73    0.71

Table 1: The results of the three-fold stratified cross-validation.

Except for the Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
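This evaluation protocol can be reproduced with scikit-learn's built-ins; `DummyClassifier` provides exactly the two baselines described. The random seeds are our choice, so the sketch will not reproduce the paper's numbers exactly.

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def mean_f1(clf, X, y):
    """Three-fold stratified evaluation, scored with F1."""
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()

# The two baselines of Table 1.
baseline_uniform = DummyClassifier(strategy="uniform", random_state=0)
baseline_stratified = DummyClassifier(strategy="stratified", random_state=0)
```

Stratification keeps the roughly 70/30 positive/negative split of the training set intact in every fold, which matters with so few negative examples.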

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results for the second evaluation are worse than those for the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm              Precision  Recall    F1
Random Forest             0.85     0.63    0.72
Decision Tree             0.82     0.69    0.75
SVM                       0.82     0.81    0.81
K-Nearest Neighbors       0.83     0.66    0.74
Logistic Regression       0.80     0.80    0.80
Gaussian Naive Bayes      0.76     0.61    0.68
Baseline Uniform          0.71     0.72    0.71
Baseline Stratified       0.70     0.51    0.59

Table 2: The results of the classification on the test set.

Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives, and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods, where the precision is increased even if the recall drops. We make some suggestions in the next section.
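The arithmetic behind these figures, worked out from the confusion counts reported for the SVM:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reported for the SVM on the 309-example test set.
tp, fp, fn, tn = 175, 42, 32, 60
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

# Good bi-segments that automatic cleaning would wrongly discard.
false_negative_share = fn / (tp + fp + fn + tn)
```

The false-negative share works out to 32/309 ≈ 0.104, the "around 10%" of good bi-segments that would be thrown away.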

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar: the distribution is closer to the normal distribution for the Matecat Set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
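The probability-ranking idea can be sketched with Logistic Regression's `predict_proba`. The 0.9 threshold mirrors the 90% precision target mentioned above, but the actual cut-off would have to be calibrated on held-out data rather than fixed a priori.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def confident_subset(clf, X, threshold=0.9):
    """Return a boolean mask of instances the classifier is at least
    `threshold` sure about, plus the predicted class indices.

    Uncertain bi-segments would be kept in the database or sent for
    human review instead of being deleted automatically.
    """
    proba = clf.predict_proba(X)
    mask = proba.max(axis=1) >= threshold
    return mask, proba.argmax(axis=1)
```

Only the bi-segments inside the confident subset would be acted on, trading recall for the higher precision the cleaning task requires.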

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable, it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types, such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore, clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the

match retrieval performance of TM tools when

they are used as is and when the input file and

the TM segments are first processed with a

clause splitter before being fed into the TM

tool The paper is organised as follows In

section 2 we discuss related work on TM

matching In section 3 we discuss how clause

splitting can be beneficial to TM matching and

how clause splitting was implemented for this

study We then describe our experiments in

section 4 and discuss the results in section 5

Finally we present our conclusion and future

work in section 6

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorpho TM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
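The effect can be illustrated with a small sketch (our own illustration, not code from any TM product): a word-level Levenshtein distance turned into a simple fuzzy match score, applied to the schematic sentences (a) and (b) above.

```python
def word_levenshtein(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ta
                           cur[j - 1] + 1,               # insert tb
                           prev[j - 1] + (ta != tb)))    # substitute ta -> tb
        prev = cur
    return prev[-1]

def fuzzy_match(a, b):
    """1 - edit_distance / length of the longer segment, a common TM scoring scheme."""
    a, b = a.split(), b.split()
    return 1 - word_levenshtein(a, b) / max(len(a), len(b))

# (a): w1..w20; (b): shares only the clause w1..w5, then continues differently.
a = " ".join(f"w{i}" for i in range(1, 21))
b = " ".join(f"w{i}" for i in [*range(1, 6), *range(21, 36)])

print(fuzzy_match(a, b))                                 # 0.25: below any usual threshold
print(fuzzy_match("w1 w2 w3 w4 w5", "w1 w2 w3 w4 w5"))   # 1.0 once clauses are stored separately
```

The whole-sentence comparison scores only 0.25, while the clause-level comparison is an exact match, which is precisely the gap that clause splitting is meant to close.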

In this paper, we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39 (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
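Puscasu's splitter relies on full morpho-syntactic analysis and hand-crafted rules. Purely as an illustration of the idea (the marker list and function below are our own drastic simplification, not the actual rule set), a naive splitter might cut a sentence before common clause-introducing words:

```python
import re

# Toy list of markers that often introduce a new finite clause
# (an illustrative simplification of a rule-based clause splitter).
CLAUSE_MARKERS = re.compile(r"\b(and that|that|which|because|although|while|when)\b")

def naive_clause_split(sentence):
    """Split a sentence before common clause-introducing markers."""
    parts, start = [], 0
    for m in CLAUSE_MARKERS.finditer(sentence):
        if m.start() > start:
            parts.append(sentence[start:m.start()].strip())
            start = m.start()
    parts.append(sentence[start:].strip())
    return [p for p in parts if p]

print(naive_clause_split(
    "the ministry of defense once indicated "
    "that about 20000 soldiers were missing in the korean war"))
```

A real splitter must also verify that each candidate segment contains a finite verb, which this sketch does not attempt.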

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1. The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. An example is shown in Table 2.

Without clause splitting:

  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war

  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A example

Without clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B example

5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    23.33%                 90.00%
Retrieved (Minimum threshold)    38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    14.00%                 88.89%
Retrieved (Minimum threshold)    14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
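The per-segment scores are not reproduced in the paper, so the numbers below are invented for illustration; the sketch only shows the shape of the computation (averaging split-clause scores into one case, then a paired t statistic), not the actual data or the SPSS procedure.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Paired t statistic: mean of the per-case differences over its standard error."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical match scores per case (0 = no match or wrong match retrieved).
without_cs = [0, 40, 0, 55, 0, 0, 43, 0, 61, 0]
with_cs    = [88, 92, 79, 95, 84, 90, 97, 83, 91, 86]

# An input split into two clauses contributes the average of its clause scores
# as a single case, keeping the two samples pairable.
case_score = mean([90, 84])   # 87.0

t = paired_t(without_cs, with_cs)
print(round(t, 2))            # a large negative t: scores rise sharply with clause splitting
```

With real data, the t statistic would be compared against the t distribution with n-1 degrees of freedom to obtain the p-value.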

SET A

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    19.23                       85.30888891                0.000
Minimum threshold    29.28                       86.48888891                0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    11.78                       83.87777781                0.000
Minimum threshold    11.78                       88.07111113                0.000

SET B

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    2.23                        17.21111111                0.000
Minimum threshold    6.49                        30.62111112                0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    2.71                        23.083                     0.000
Minimum threshold    16.83                       46.36777779                0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    2.67%                  17.84%
Retrieved (Minimum threshold)    10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    3.33%                  25.95%
Retrieved (Minimum threshold)    36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the Two Translation Memory Systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper, we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been human-translated in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is speeded up, and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing as well the match percentage in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVC widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows. Section 2 discusses past related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques have an effect only on the improvement of the raw MT output and not on the improvement of fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
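The edit-distance computation itself can be sketched as follows (our own illustration; the exact token and character counts used by commercial tools vary with tokenization, which is why the raw counts below do not coincide exactly with the percentages quoted above):

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"

# Word level: substitute "make" -> "cancel", delete "the", "cancellation", "of".
word_ops = levenshtein(s1.split(), s2.split())
# Character level, ignoring whitespace.
char_ops = levenshtein(s1.replace(" ", ""), s2.replace(" ", ""))
print(word_ops, char_ops)
```

The fuzzy match score is then typically 1 minus the edit distance divided by the length of the longer segment.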

In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then it would be post-edited. Otherwise, the translator would have to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains a SVC while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing a SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed paraphrasing a SVC can in-

crease the fuzzy match level during the translation

process This section details the pipeline of mod-

ules towards the paraphrase of the TM source

translation units

41 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
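Entries of this shape are straightforward to process programmatically. The sketch below assumes the standard NooJ comma-separated entry format, lemma,CATEGORY+PROPERTY+..., as in the entries above; the field names in the returned dictionary are our own:

```python
def parse_entry(line):
    """Split a NooJ-style dictionary line 'lemma,CAT+PROP1+PROP2' into parts."""
    lemma, rest = line.split(",", 1)
    category, *props = rest.split("+")
    return {"lemma": lemma, "category": category, "properties": props}
```

For example, parse_entry("artist,N+FLX=TABLE+Hum") yields the lemma "artist", category "N", and the inflectional and semantic properties as a list.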

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipelines.

Figure 2: Pipeline of the paraphrase framework.

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for de-tokenization in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step: if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and there will therefore be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
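Outside NooJ, the core paraphrasing step (mapping the predicate noun to its full verb and dropping the remaining SVC elements) can be illustrated with a toy rule. The lexicon and patterns below are invented for illustration only, and, like the grammar of Figure 3, handle only the simple present:

```python
import re

# Illustrative SVC table: (support verb, predicate noun) -> full verb.
# These pairings are examples, not the NooJ module's actual resources.
SVC_LEXICON = {
    ("make", "reservation"): "reserve",
    ("make", "cancellation"): "cancel",
    ("give", "answer"): "answer",
}

def paraphrase_svc(sentence):
    """Replace 'support verb + optional determiner + predicate noun'
    with the corresponding full verb."""
    for (verb, noun), full_verb in SVC_LEXICON.items():
        pattern = rf"\b{verb}\s+(?:a|an|the)?\s*{noun}\b"
        sentence = re.sub(pattern, full_verb, sentence)
    return sentence
```

For instance, "I make a reservation for two" becomes "I reserve for two", mirroring the kind of rewriting the local grammar performs.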

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores in cases where the TM contains a translation unit that has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.1

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), although we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.
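The exact-match gain quoted in the text can be reproduced from the Table 1 counts: each category's share of the 200 analysed segments shifts by (parTM - plTM) / 200. A quick arithmetic check (the row labels below are just keys into the table's counts):

```python
# Counts from Table 1: category -> (plTM, parTM), out of 200 segments each.
ROWS = {
    "100%":     (14, 48),
    "95-99%":   (23, 32),
    "85-94%":   (18, 29),
    "75-84%":   (51, 38),
    "50-74%":   (32, 18),
    "No Match": (62, 35),
}

def delta_points(category):
    """Change in the category's share of all segments, in percentage points."""
    pl, par = ROWS[category]
    return 100.0 * (par - pl) / 200
```

delta_points("100%") gives 17.0, matching the 17% improvement reported for exact matches; note that the figures reported for the lower bands appear to aggregate the categories differently.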

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.

Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain either out-of-vocabulary words or complex syntax. This is one of our approach's weaknesses and will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences have the same meaning but do not share the same lexical items.

The implemented hybridization strategy has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs, and support for other languages as well. Last but not least, a paraphrase framework applied to the target sentences may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The benefit of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly into savings for translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ         EN          probabilities                 alignment
být větší  be greater  0.538 0.053 0.538 0.136      0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148      0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases.

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word "být" from the source language is translated to the first word in the translation, "be", and the second word "větší" to the second, "greater". These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
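The alignment-point notation in Figure 1 can be decoded in a few lines. The sketch below assumes the format shown in the example, space-separated "i-j" index pairs over whitespace-tokenized segments:

```python
def parse_alignment(points, source, target):
    """Map 'i-j' word-alignment pairs onto the actual tokens of the
    source subsegment and its translation."""
    src, tgt = source.split(), target.split()
    pairs = []
    for point in points.split():
        i, j = (int(x) for x in point.split("-"))
        pairs.append((src[i], tgt[j]))
    return pairs
```

For the first row of Figure 1, parse_alignment("0-0 1-1", "být větší", "be greater") recovers the word pairs (být, be) and (větší, greater).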

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

- JOIN: new segments are built by concatenating two segments from SUB; output is J.

  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories

Table 1: An example of the SUBSTITUTEO method, Czech -> English.

- SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
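The two JOIN variants can be sketched as index arithmetic over subsegment spans within a tokenized segment. This is a simplification of the algorithm above, with spans represented as assumed (start, end) token-index pairs:

```python
def try_join(span_a, span_b):
    """Join two subsegment spans (start, end) of one tokenized segment if
    they overlap (JOINO) or are adjacent (JOINN); return None otherwise."""
    (a_start, a_end), (b_start, b_end) = sorted([span_a, span_b])
    if b_start <= a_end:          # spans overlap -> JOINO
        return ("JOINO", (a_start, max(a_end, b_end)))
    if b_start == a_end + 1:      # spans are neighbours -> JOINN
        return ("JOINN", (a_start, b_end))
    return None                   # a gap remains: a candidate for SUBSTITUTE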

2 In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16   10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system's score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and a human evaluation is always needed.

Error analysis: Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language, see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT:
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech -> English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central-European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015.

CATaLog New Approaches to TM and Post Editing InterfacesTapas Nayek1 Sudip Kumar Naskar1 Santanu Pal2 Marcos Zampieri23

Mihaela Vela2 Josef van Genabith23

Jadavpur University, India1

Saarland University, Germany2

German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing in terms of both performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the software most widely used by professional translators on a regular basis to increase their productivity and to improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool, and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simple, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language-pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as free, open-source software. For more information see the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the previous section, one of the recent trends is the development of hybrid systems that combine MT with TM output. Work on MT post-editing therefore overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source–target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT–TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productivity studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators see at a glance how similar the five suggested segments are to the input segment and which of them requires the least post-editing effort.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) expressing how much editing is required to convert one sentence into another, relative to the length of the first sentence. The allowable edit operations are insertion, deletion, substitution and shift. We use the TER metric (via tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.
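This style of edit-rate similarity can be sketched with a plain word-level edit distance. The snippet below is illustrative only: `edit_rate` and `rank_tm` are hypothetical names, not part of CATaLog, and real TER (as computed by the tercom package the tool actually uses) additionally permits block shifts, which are omitted here.

```python
def edit_rate(hyp, ref):
    """Word-level edit rate: minimum insertions, deletions and substitutions
    needed to turn `hyp` into `ref`, normalised by the reference length.
    (Real TER also permits block shifts; this sketch omits them.)"""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits to convert the first i words of h into the first j words of r
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i                      # delete all remaining hyp words
    for j in range(len(r) + 1):
        dp[0][j] = j                      # insert all remaining ref words
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # match / substitution
    return dp[len(h)][len(r)] / max(len(r), 1)

def rank_tm(test, tm_sentences, k=5):
    """Return the k TM sentences most similar to `test` (lowest edit rate first)."""
    return sorted(tm_sentences, key=lambda s: edit_rate(test, s))[:k]
```

Because TER is an error rate, sorting in ascending order puts the most similar TM segments first, which is how the top 5 candidates are selected.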

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions for implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their effect on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is fast and lightweight and directly mimics human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ("we","we",C,0); ("want","",D,0); ("to","would",S,0); ("have","like",S,0); ("a","a",C,0); ("table","table",C,0); ("near","by",S,0); ("the","the",C,0); ("window","window",C,0); (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could use the TER score directly for ranking sentences; however, in our present system we use our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other edit operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
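The effect of cheap deletions can be reproduced with a weighted edit distance. The sketch below is illustrative: `weighted_edit_cost` is a hypothetical name, and the weights (deletion 0.1, insertion and substitution 1.0) are made up for the example; the paper does not publish CATaLog's actual weights.

```python
def weighted_edit_cost(test, tm, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Minimum total cost of turning the TM segment into the test segment,
    with a configurable (here: very low) cost for deleting TM words."""
    t, m = test.split(), tm.split()
    # dp[i][j] = cost of converting the first i TM words into the first j test words
    dp = [[0.0] * (len(t) + 1) for _ in range(len(m) + 1)]
    for i in range(1, len(m) + 1):
        dp[i][0] = dp[i - 1][0] + w_del          # delete a TM word
    for j in range(1, len(t) + 1):
        dp[0][j] = dp[0][j - 1] + w_ins          # insert a test word
    for i in range(1, len(m) + 1):
        for j in range(1, len(t) + 1):
            sub = 0.0 if m[i - 1] == t[j - 1] else w_sub
            dp[i][j] = min(dp[i - 1][j] + w_del,
                           dp[i][j - 1] + w_ins,
                           dp[i - 1][j - 1] + sub)
    return dp[len(m)][len(t)]
```

With these weights, TM segment 1 costs 6 deletions at 0.1 each (total 0.6), while TM segment 2 costs 3 insertions at 1.0 each (total 3.0), so segment 1 is preferred, as argued above.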

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts of each reference translation: green portions mark matched fragments and red portions mark mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source and target sentences; this alignment file is generated offline, only once, for the whole TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green, and the unmatched fragments in red.
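Combining the two alignment sets amounts to a simple projection: every TM source token that the TER alignment marked as matched is followed through the source–target word alignment, and the target tokens it is aligned to are colored green. A minimal sketch (the function name is hypothetical, and the alignment is assumed to be already parsed into a dict):

```python
def color_target(matched_src, src2tgt, n_tgt):
    """Project source-side match information onto the target side.

    matched_src: set of source token indices that the TER alignment
                 marked as matching the input sentence.
    src2tgt:     source->target word alignment (e.g. parsed from GIZA++
                 output), as a dict {src_index: [tgt_index, ...]}.
    n_tgt:       number of target tokens.

    Returns one 'green'/'red' label per target token; target tokens that
    are unaligned or aligned only to unmatched source tokens stay 'red'."""
    labels = ["red"] * n_tgt
    for s in matched_src:
        for t in src2tgt.get(s, []):
            labels[t] = "green"
    return labels
```

Applied per candidate, this yields the green/red rendering of each target match, while the source side can be colored directly from the TER alignment.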

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the green-to-red ratio. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red ones is a good candidate for post-editing. Sometimes shorter sentences may be colored nearly 100% green yet still not be good candidates for post-editing, since post-editors might have to insert translations for many non-matched words of the input sentence; in this context it is worth noting that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as top candidates. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change . i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me

5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system displays these 5 topmost color-coded TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) needed to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, to get good results from a TM, the TM database should be as large as possible, in which case determining the TER alignments for all the reference sentences (i.e., source TM segments) takes a long time. To improve search efficiency we make use of posting lists, a de facto standard in information retrieval based on an inverted index.

We create a (source) vocabulary list from the TM data after removing stop words and other tokens which occur very frequently and contribute little to determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming: we store words in their surface form, so that a word counts as a match only if it appears in exactly the same form in the input sentence. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to switch the posting lists on or off, so that results with and without them can be compared. Ideally the TM output should be the same in both cases, though the time taken to produce it will differ significantly.
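A minimal version of this inverted index can be sketched as follows. The code is illustrative only: the function names are hypothetical, and the stop-word list is a made-up stand-in for the frequency-based filtering described above. Note that, matching the paper's surface-form policy, words are lowercased but not stemmed.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "to", "of", "is"}  # illustrative stand-in list

def build_index(tm_sentences):
    """Posting lists: for each (lowercased, non-stop-word) surface form,
    the ids of the TM sentences containing it."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sentences):
        for word in sent.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(index, input_sentence):
    """Ids of TM sentences sharing at least one vocabulary word with the input;
    only these are passed on to the (expensive) TER similarity computation."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids
```

The index is built once, offline; at query time each input sentence touches only the posting lists of its own words, so the candidate set is typically a small fraction of the TM.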

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at once. In this case the TM output is saved in a log file which post-editors can later work with directly in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are also integrating and refining further features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments; in the future we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weights assigned to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9

Chatzitheodorou, Konstantinos, 24

Horák, Aleš, 31

Medveď, Marek, 31
Mitkov, Ruslan, 17

Naskar, Sudip Kumar, 36
Nayek, Tapas, 36

Pal, Santanu, 36
Parra Escartín, Carla, 1

Raisa Timonera, Katerina, 17

van Genabith, Josef, 36
Vela, Mihaela, 36

Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

We are happy that we could include in the workshop programme 4 long contributions and 3 short papers dealing with the aforementioned issues.

Vít Baisa, Aleš Horák and Marek Medveď discuss in Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods how existing translation memories can be extended using linguistically motivated segment combination approaches, concentrating on preserving high translation quality.

Linked to the topic of enhancing existing translation memories, in the paper Spotting false translation segments in translation memories Eduard Barbu presents a method for identifying false translations in translation memories, framed as a classification task.

In Improving translation memory fuzzy matching by paraphrasing, Konstantinos Chatzitheodorou explores the use of paraphrasing to retrieve better segments from translation memories. The method relies on NooJ and performs consistently better than the state of the art on the EN-IT language pair.

CATaLog: New Approaches to TM and Post Editing Interfaces presents a new CAT tool by Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith. The aim of the tool is to improve both the performance and the productivity of post-editing.

Carla Parra Escartín aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators in Creation of new TM segments: Fulfilling translators' wishes. She presents a pilot study in which the requests of translators are implemented in the translation workflow.

Katerina Timonera and Ruslan Mitkov explore the use of clause splitting for better retrieval of segments. Their paper, Improving Translation Memory Matching through Clause Splitting, shows that their method leads to a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

The Organising Committee would like to thank the Programme Committee, who responded with very fast but also substantial reviews for the workshop programme. This workshop would not have been possible


without the support received from the EXPERT project (FP7/2007-2013, REA grant agreement no. 317471, http://expert-itn.eu).

Constantin Orasan and Rohit Gupta


Organizers

Constantin Orasan, University of Wolverhampton, UK

Rohit Gupta, University of Wolverhampton, UK

Program Committee

Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk

Invited Speaker

Marcello Federico, Fondazione Bruno Kessler, Italy


Table of Contents

Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín 1

Spotting false translation segments in translation memories
Eduard Barbu 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith 36


Conference Program

9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, Sept 2015

Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones

C/ Cólquide 6, portal 2, 3º I, 28230 Las Rozas, Madrid, Spain

carla.parra@hermestrans.com

Abstract

This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.

1 Introduction

Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive text (AutoSuggest in SDL Trados Studio¹ and Muses in memoQ²), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

¹ www.sdl.com
² www.memoq.com

In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed for the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we received. Section 4 explains how we implemented one of the suggestions we received, and Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.

2 Our company

Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has


a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high-quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).

3 Internal survey

We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what, in their view, was the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all of them had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e. re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 2014³ and memoQ⁴, it is not always carried out automatically, and it is difficult to estimate

³ In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

⁴ memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e. re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information see http://kilgray.com/memoq2015/help-en/index.html?repair_resource3.html

when such a reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g. 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

Ideas more related to NLP included the automatic correction of orthotypography in the target language, and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, a language pair has to be selected when starting a new project. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g. German > Italian) than the ones selected for that specific project (e.g. German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e. the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.⁵
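The multilingual-TM idea described above can be sketched as a lookup indexed by source sentence rather than by language pair. The data layout and function names below are illustrative, not an existing CAT-tool API:

```python
from collections import defaultdict

# Translation units indexed by source sentence; each entry stores every
# available target language, so a concordance query made in a German>French
# project can still surface a stored Italian translation.
tm: dict = defaultdict(dict)  # source sentence -> {target language: translation}

def add_tu(source: str, tgt_lang: str, translation: str) -> None:
    """Store one translation unit under its source sentence."""
    tm[source][tgt_lang] = translation

def concordance(source: str) -> dict:
    """Return all stored translations of a source sentence, in any language."""
    return dict(tm.get(source, {}))
```

For example, after `add_tu("Guten Tag", "it", "Buongiorno")`, a translator working into French would still see the Italian translation returned by `concordance("Guten Tag")`.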

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's implementation of this idea: the fragment assembly functionality.⁶

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks

⁵ For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html

⁶ For further information see http://kilgray.com/memoq2015/help-en/index.html?fragment_assembly.html


for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source sentence words that could not be retrieved.

Figure 1: memoQ's fragment assembly functionality.

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.

Examples 1 and 2 show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa

(2) EN: The window will show the following: The application will close
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación

Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación

In technical texts, it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving such translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment was translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.


4.1 Ellipsis and colons

One possible type of segment would be one in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:

(5) EN: Installing new services… Service XXXX for premium clients [2]
ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment including only the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which only the fragment before or after the colon appears.

(7) Installing new services

In these cases we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
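A minimal sketch of the three steps above (the function and the regular expression are ours, for illustration; the paper's production script is not shown):

```python
import re

# Matches an ellipsis ("..." or the single-character form) or a colon.
SPLIT_MARK = re.compile(r"\.\.\.|…|:")

def split_bisegment(source: str, target: str):
    """Steps 1-3: split a bi-segment at the ellipsis/colon found on both sides."""
    src, tgt = SPLIT_MARK.search(source), SPLIT_MARK.search(target)
    if not (src and tgt):           # step 1 fails: no new segments are created
        return []
    return [                        # step 3: one new TM segment per fragment pair
        (source[:src.end()].strip(), target[:tgt.end()].strip()),  # up to the mark
        (source[src.end():].strip(), target[tgt.end():].strip()),  # after the mark
    ]
```

Applied to a bi-segment like Example 5, this yields one new bi-segment per fragment pair.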

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters, as well as the sentence without such content.

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) The sentence removing those characters and the content between them.

(b) A fragment starting at the opening character and finishing at the closing one, including the content within. In this fragment the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
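The three fragments of step 2 can be sketched as follows for parentheses; square brackets and curly braces would use analogous patterns (the function name and regex are illustrative, not the paper's code):

```python
import re

PAREN = re.compile(r"\(([^()]*)\)")  # an innermost (...) group

def paren_fragments(segment: str):
    """Return step 2's fragments: (a) the sentence without the parenthesised
    part, (b) the part with parentheses kept, (c) its content alone."""
    m = PAREN.search(segment)
    if not m:
        return None  # step 1 fails for this bracket type
    without = (segment[:m.start()] + segment[m.end():]).strip()
    return without, m.group(0), m.group(1)
```

Run on both the source and the target segment, the resulting fragments are paired by position, which is what the same-order assumption discussed below permits.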

At this preliminary stage we considered that, when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
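For paired inline tags, the handling just described can be sketched as below. We assume numbered tags written as <1>…<1>, as they appear in Example 12; quotation marks would be handled with an analogous pattern. This is an illustrative sketch, not the paper's implementation:

```python
import re

PAIRED_TAG = re.compile(r"<(\d+)>(.*?)<\1>")  # matches <n>text<n> pairs

def tag_segments(segment: str):
    """Return the segment with tags stripped, plus each tagged span on its own."""
    stripped = PAIRED_TAG.sub(lambda m: m.group(2), segment)
    spans = [m.group(2) for m in PAIRED_TAG.finditer(segment)]
    return stripped, spans
```

Both outputs become new TM segments: the tag-free sentence and each tagged span.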

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English, to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of

our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match       Words   Segments
Repetitions     1064         80
100%               0          0
95–99%             4          2
85–94%             0          0
75–84%           285         14
50–74%          2523        187
No Match        2404        142
Total           6280        425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.

             Segments   Words (EN)   Words (ES)      Total
Project TM      16842       212472       244159     456631
Product TM      20923       274542       317797     592339
Client TM      256099      3427861      3951732    7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated with each strategy for each TM.


                          Segments   Words (EN)   Words (ES)     Total
Project TM new segments       6776        56973        66297    123270
Product TM new segments       7760        71034        83125    154159
Client TM new segments       74041       662714       769705   1432419

Table 3: Size of the new TMs generated using our approach.

As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment-generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

Type              Project TM   Product TM   Client TM
Ellipsis                   7            6          50
Colon                   1801         2094       17894
Parentheses             1361         1621       29637
Square brackets           78           73         598
Curly braces               0            0           0
Quotations              1085         1146        8797
Tags                    2523         2892       17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations, to assess which combination performed better as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM


              TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
Not started   234   204   150   202   143   143    38    27    26     26     26
Pre-trans     155   111   178   108   179   183   139   199   205    201    205
Fragments      36   110    97   115   103    99   248   199   194    198    194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

            TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
100%          0     5     5     5     5     5     5     5     5      5      5
95–99%        2     3     3     4     4     4     9     9     9      9      9
85–94%        0     1     1     1     1     1     2     2     2      2      2
75–84%       14     9    14     9    15    15    12    16    16     16     16
50–74%      187   130   198   136   202   202   180   223   226    226    226
No match    142   197   124   190   118   118   137    90    87     87     87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations

from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments, and to pre-translate files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher


number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, Sept 2015

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory¹ (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high, and the errors are relatively few.

¹ https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
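The task interface can be sketched as follows. The length-ratio feature and the fixed threshold are a toy stand-in for the machine learning classifiers and features evaluated in the paper, chosen only to make the yes/no contract concrete:

```python
def length_ratio(source: str, target: str) -> float:
    """Crude alignment signal: how comparable the two segments' lengths are."""
    if not source or not target:
        return 0.0
    return min(len(source), len(target)) / max(len(source), len(target))

def is_true_translation(source: str, target: str, threshold: float = 0.4) -> bool:
    """Toy classifier: answer 'yes' when the segment lengths are comparable."""
    return length_ratio(source, target) >= threshold
```

A real cleaner would replace `is_true_translation` with a trained model, but the yes/no interface over bi-segments stays the same.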

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench.² Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in that approach are different from ours.

² http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kinds of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text: These errors are cases when one or both segments are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat: This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error: This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations: This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is only partially translated into Italian as "Primi anni 1980" ("Early 1980s").

The Random Text and Chat errors arise from the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.


It is relatively easy to find positive examples because the vast majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length.³ To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (l_s − l_d) / sqrt(3.4 (l_s + l_d))    (1)
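Equation 1 can be transcribed directly; following Gale and Church, segment length is assumed here to be measured in characters (the paper does not state the unit explicitly):

```python
import math

def church_gale(source: str, target: str) -> float:
    """Modified Church-Gale length difference (equation 1):
    CG = (l_s - l_d) / sqrt(3.4 * (l_s + l_d))."""
    ls, ld = len(source), len(target)
    if ls + ld == 0:
        return 0.0  # two empty segments: no length evidence either way
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))
```

Equal-length segments score 0; a strongly positive or negative score signals a suspicious length mismatch between source and target.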

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.⁴ The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set, the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative Set, the

³ This simple idea is implemented in many sentence aligners.

⁴ Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com


Figure 2: The distribution of Church-Gale scores in the Matecat Set.


Figure 3: The distribution of Church-Gale scores in the Collaborative Set.



Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: It contains 1243 bi-segments, of which 373 are negative examples.

• Test Set: It contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.


Figure 5: The Q-Q plot for the Matecat Set.


Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same: This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg_score: This feature is the Church-Gale score described in equation 1. The score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.

• has_url: The value of the feature is 1 if the source or target segment contains a URL address; otherwise it is 0.

• has_tag: The value of the feature is 1 if the source or target segment contains a tag; otherwise it is 0.

• has_email: The value of the feature is 1 if the source or target segment contains an email address; otherwise it is 0.

• has_number: The value of the feature is 1 if the source or target segment contains a number; otherwise it is 0.

• has_capital_letters: The value of the feature is 1 if the source or target segment contains words that have at least one capital letter; otherwise it is 0.

• has_words_capital_letters: The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when there are whole words in capital letters.

• punctuation_similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag_similarity: The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists it is present/is not present in the source and the target segments).

• email_similarity: The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag_similarity. This feature combines with has_email to exhaust all possibilities.

• url_similarity: The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag_similarity.

• number_similarity: The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag_similarity.

• bisegment_similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital_letters_word_difference: The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and in the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.

• only_capletters_dif: The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and in the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.

• lang_dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails, or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
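As an illustration, the punctuation_similarity feature might be computed as sketched below; the punctuation set and the helper names are my own assumptions, and the other *_similarity features follow the same pattern with different token extractors:

```python
import math
from collections import Counter

# Assumed punctuation inventory; the paper does not list the exact set.
PUNCT = set(".,;:!?()[]\"'")

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def punctuation_similarity(source: str, target: str) -> float:
    """Cosine similarity between the punctuation count vectors of
    the source and target segments."""
    u = Counter(ch for ch in source if ch in PUNCT)
    v = Counter(ch for ch in target if ch in PUNCT)
    return cosine(u, v)
```

A true translation pair such as "Hello, world!" / "Ciao, mondo!" yields identical punctuation vectors and therefore a similarity of 1.0.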

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.⁵
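Given the detector's output, the lang_dif feature reduces to counting disagreements between declared and detected codes. A minimal sketch (the detector itself is abstracted away; the detected codes are passed in directly):

```python
def lang_dif(declared: tuple, detected: tuple) -> int:
    """Number of positions (source, target) where the declared language
    code disagrees with the detected one: 0, 1, or 2."""
    return sum(1 for dec, det in zip(declared, detected) if dec != det)

# The paper's examples: declared (en, it) against various detections.
print(lang_dif(("en", "it"), ("en", "it")))  # 0: en-en/it-it
print(lang_dif(("en", "it"), ("en", "fr")))  # 1: en-en/it-fr
print(lang_dif(("en", "it"), ("de", "fr")))  # 2: en-de/it-fr
```

The raw count in {0, 1, 2} would then be normalized to [0, 1] along with the other features.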

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.

• Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on the distance it has to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

⁵ https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
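The comparison can be sketched with scikit-learn, the package named above. The data here is synthetic (random features and a made-up label standing in for the real feature matrix of section 4.1), so the scores are only illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((300, 5))                  # 300 toy bi-segments, 5 features in [0, 1]
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # synthetic "true translation" label

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=50),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear)": SVC(kernel="linear"),
    "Gaussian NB": GaussianNB(),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    # Three-fold cross-validated F1, mirroring the paper's first evaluation.
    f1 = cross_val_score(clf, X, y, cv=3, scoring="f1").mean()
    print(f"{name}: F1 = {f1:.2f}")
```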

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm              Precision  Recall  F1
Random Forest               0.95    0.97  0.96
Decision Tree               0.98    0.97  0.97
SVM                         0.94    0.98  0.96
K-Nearest Neighbors         0.94    0.98  0.96
Logistic Regression         0.92    0.98  0.95
Gaussian Naive Bayes        0.86    0.96  0.91
Baseline Uniform            0.69    0.53  0.60
Baseline Stratified         0.70    0.73  0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Collaborative Sets. Obviously, this variety is not fully captured by the training set.

Algorithm              Precision  Recall  F1
Random Forest               0.85    0.63  0.72
Decision Tree               0.82    0.69  0.75
SVM                         0.82    0.81  0.81
K-Nearest Neighbors         0.83    0.66  0.74
Logistic Regression         0.80    0.80  0.80
Gaussian Naive Bayes        0.76    0.61  0.68
Baseline Uniform            0.71    0.72  0.71
Baseline Stratified         0.70    0.51  0.59

Table 2: The results of the classification on the test set.

Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives, and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
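The confusion counts quoted above can be turned back into the standard metrics with a few lines; small differences from the rounded table values are to be expected:

```python
# SVM confusion counts on the test set, as reported in the text.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn            # 309 test examples

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The false negatives are good bi-segments that cleaning would discard:
discarded_good = fn / total          # the "around 10%" figure in the text
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f} "
      f"discarded good bi-segments: {discarded_good:.1%}")
```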

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the Matecat Set and more spread out for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

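The fuzzy matching referred to above is typically Levenshtein-based. A minimal sketch of such a score follows; the normalisation by the longer segment's length is an assumption for illustration, as commercial tools use their own scoring variants:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance
    (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1]

def fuzzy_score(input_seg: str, tm_seg: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 is an exact match."""
    if not input_seg and not tm_seg:
        return 1.0
    dist = levenshtein(input_seg, tm_seg)
    return 1.0 - dist / max(len(input_seg), len(tm_seg))
```

A TM tool retrieves a stored segment when this kind of score exceeds the configured fuzzy match threshold (e.g. 0.70 or 0.75).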

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic, and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20

(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004).

The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). Translators are not normally advised to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
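The effect described above can be illustrated with a small sketch (illustrative code, not the implementation used in the study): a word-level Levenshtein match score for the Set A example clause against its full two-clause TM segment falls between the minimum and default thresholds, while against a clause-split TM it becomes an exact match.

```python
def word_levenshtein(a, b):
    """Word-level edit distance between two sentences (standard DP)."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy(a, b):
    """Levenshtein-based fuzzy match score in [0, 1]."""
    return 1 - word_levenshtein(a, b) / max(len(a.split()), len(b.split()))

inp = ("the ministry of defense once indicated that about 20000 "
       "soldiers were missing in the korean war")
tm_full = inp + (" and that the ministry of defense believes "
                 "there may still be some survivors")

print(round(fuzzy(inp, tm_full), 2))  # ~0.55: below a 70% default threshold
print(fuzzy(inp, inp))                # 1.0 once the TM side is clause-split
```

With the whole two-clause sentence in the TM, the extra clause counts as pure deletions, dragging the score below the default threshold; storing the component clauses separately removes that penalty.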

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1; the underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is shown in Table 2.

Without clause splitting

Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting

Input segments:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war

Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A example

Without clause splitting

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B example

5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    23.33%                 90.00%
Retrieved (Minimum threshold)    38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    14.00%                 88.89%
Retrieved (Minimum threshold)    14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and count this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
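The paired t statistic itself is simple to reproduce. The sketch below computes it for two hypothetical arrays of per-segment match scores (the numbers are invented for illustration, not the paper's data; the paper used SPSS):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """t statistic of a paired t-test: mean difference over its standard error."""
    d = [a - b for a, b in zip(x, y)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Hypothetical per-segment match scores; 0 = correct match not retrieved.
without_split = [0, 0, 75, 0, 80, 0, 0, 70, 0, 0]
with_split    = [100, 95, 100, 90, 100, 85, 100, 100, 95, 90]

t = paired_t(without_split, with_split)
print(round(t, 2))  # strongly negative: scores with clause splitting are higher
```

The p-value then follows from the t distribution with n-1 degrees of freedom (e.g. via `scipy.stats.ttest_rel`); a large-magnitude t with ten or more pairs lands well below the 0.0001 level reported above.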

SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   19.23                       85.30888891                0.000
Minimum threshold   29.28                       86.48888891                0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   11.78                       83.87777781                0.000
Minimum threshold   11.78                       88.07111113                0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.23                        17.21111111                0.000
Minimum threshold   6.49                        30.62111112                0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.71                        23.083                     0.000
Minimum threshold   16.83                       46.36777779                0.000

Table 5: Paired t-test on all experiments

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    2.67%                  17.84%
Retrieved (Minimum threshold)    10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    3.33%                  25.95%
Retrieved (Minimum threshold)    36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments in which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments were not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, in which NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating and the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin/Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou

School of Italian Language and Literature

Aristotle University of Thessaloniki

University Campus

54124 Thessaloniki Greece

chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar in meaning. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) in all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a, 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator.

Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.

3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based string edit distance score between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based score is 76% (14 deletions for 60 characters) without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
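The word- and character-based scoring described above can be sketched as follows (an illustrative reimplementation, not the authors' code; exact figures depend on tokenization choices, so they need not reproduce the 70% and 76% quoted in the text):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (works on word lists or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy(a, b):
    """Koehn-style score: 1 - edits / length of the longer side."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"

print(fuzzy(s1.split(), s2.split()))                     # word-based score
print(fuzzy(s1.replace(" ", ""), s2.replace(" ", "")))   # char-based, no whitespace
```

The same routine serves both granularities: passing word lists counts word edits, passing whitespace-stripped strings counts character edits.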

In this case, according to the methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then it would be post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced; the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003). SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed; except for some cases, they appear either without a determiner (e.g. to make attention) or with a determiner (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent (sentence (3)) are included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
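Entries of this shape are easy to handle programmatically. The sketch below parses a NooJ-style line into lemma, category and features; the comma-separated format is our reconstruction of the convention, and the helper name is ours, not part of NooJ:

```python
def parse_entry(line):
    """Split a NooJ-style dictionary line 'lemma,CAT+feat1+feat2' into parts."""
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    return {"lemma": lemma, "category": parts[0], "features": parts[1:]}

entry = parse_entry("artist,N+FLX=TABLE+Hum")
print(entry)
# {'lemma': 'artist', 'category': 'N', 'features': ['FLX=TABLE', 'Hum']}
```

The `FLX=` feature names the inflectional paradigm, while bare features such as `Hum` carry the semantic class, matching the property types listed above.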

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is organized in three pipeline stages.

Figure 2: Pipeline of the paraphrase framework

The first stage includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.
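Assuming the TM is stored in the standard TMX exchange format, the source-unit extraction of this first stage can be sketched as follows (the snippet and its inline mini-TM are illustrative, not the authors' code):

```python
import xml.etree.ElementTree as ET

# A minimal, invented TMX fragment standing in for a real translation memory.
TMX = """<tmx version="1.4"><body>
  <tu>
    <tuv xml:lang="en"><seg>You must make the installation of the software</seg></tuv>
    <tuv xml:lang="it"><seg>Si deve installare il software</seg></tuv>
  </tu>
</body></tmx>"""

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring(TMX)
# Collect only English <seg> elements; the target side is left untouched.
sources = [tuv.find("seg").text
           for tuv in root.iter("tuv") if tuv.get(XML_LANG) == "en"]
print(sources)
```

Only the collected source segments are passed on to tokenization and paraphrasing; the Italian `<tuv>` elements never enter the pipeline, which is what "protecting" the target units amounts to.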

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and there will therefore be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
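A drastically simplified, regex-based sketch of this step is shown below. The support-verb list and the nominalization-to-verb lookup are invented stand-ins for NooJ's dictionaries and local grammars, and real SVC detection needs the morphological analysis described above; this only illustrates the transformation itself:

```python
import re

# Hypothetical lookup standing in for NooJ's derivational paradigms.
DERIVED_VERB = {"cancellation": "cancel", "installation": "install",
                "reservation": "reserve"}

# Support verb + optional determiner + predicate noun + optional "of".
SVC = re.compile(
    r"\b(?:make|makes|take|takes|give|gives|have|has)\s+"
    r"(?:(?:the|a|an)\s+)?(\w+)(?:\s+of)?", re.IGNORECASE)

def paraphrase_svc(sentence):
    """Replace 'support verb + (det) + predicate noun (+ of)' with the full verb."""
    def repl(m):
        verb = DERIVED_VERB.get(m.group(1).lower())
        return verb if verb else m.group(0)  # leave non-SVC matches untouched
    return SVC.sub(repl, sentence)

print(paraphrase_svc("Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```

As in the NooJ grammar, the predicate noun is mapped to its derived verb and the support verb, determiner and preposition are dropped, while sentences without a known nominalization pass through unchanged.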

The last stage contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units. This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations taken from a larger TM based on the degree of fuzzy match, such that each at least meets the minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.1 The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%), although we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and, finally, 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.

1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                    14     48
95%-99%                 23     32
85%-94%                 18     29
75%-84%                 51     38
50%-74%                 32     18
No match                62     35
Total                  200    200

Table 1: Statistics for experimental data
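The shifts behind Table 1 follow directly from the category counts and can be sanity-checked with a few lines of code (an illustrative calculation over the published counts; the category labels are ours):

```python
categories = ["100%", "95%-99%", "85%-94%", "75%-84%", "50%-74%", "No match"]
plTM  = [14, 23, 18, 51, 32, 62]   # counts out of 200 test segments
parTM = [48, 32, 29, 38, 18, 35]
total = 200

# Percentage share of each category before and after paraphrasing,
# plus the percentage-point change.
for cat, a, b in zip(categories, plTM, parTM):
    print(f"{cat:9s} {a/total:6.1%} -> {b/total:6.1%}  ({(b-a)/total:+.1%} points)")
```

For example, the 100% category grows from 14 to 48 segments, a gain of 17 percentage points, matching the improvement reported in the text; the drops in the 75%-84%, 50%-74% and No-match rows reflect segments promoted into higher bands.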

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process. According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the proposals without a second thought.

Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only increases retrieval but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is not equal to 100%.

While there are some drops in terms of fuzzy match improvement, our system presents few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased; in addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.

Case 1:
  Seg:      Make sure that the brake hose is not twisted
  TMsl:     Ensure that the brake hose is not twisted
  TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
  parTMsl:  Make sure that the brake hose is not twisted

Case 2:
  Seg:      CAUTION You must make the istallation of the version 6 of the software
  TMsl:     CAUTION You must install the version 6 of the software
  TMtg:     ATTENZIONE Si deve installare la versione 6 del software
  parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments with translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust are high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs and will support other languages as well. Last but not least, a paraphrasing framework applied to the target sentence may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra, Coimbra, Portugal.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49, Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable into savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ          EN          probabilities                 alignment
být větší   be greater  0.538 0.053 0.538 0.136      0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148      0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate from it all consistent phrases as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation: 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
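The subsegment entries of Figure 1 can be processed with a small parser. The sketch below assumes a Moses-style "|||"-separated field layout (source, target, four probabilities, alignment points); this layout is an illustrative assumption reflecting Figure 1, not the authors' exact file format. The helper detects the three alignment cases listed above.

```python
def parse_subsegment(line):
    """Parse one subsegment entry in an assumed Moses-style format:
    'src ||| tgt ||| p1 p2 p3 p4 ||| alignment points'."""
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    return {
        "src": src.split(),
        "tgt": tgt.split(),
        "probs": [float(p) for p in probs.split()],   # the four Moses scores
        "align": [tuple(map(int, p.split("-"))) for p in align.split()],
    }

def alignment_properties(entry):
    """Report the three cases the alignment points reveal:
    unaligned (empty) source words, one-to-many alignment,
    and opposite orientation (reordering)."""
    src_idx = [s for s, _ in entry["align"]]
    tgt_idx = [t for _, t in entry["align"]]
    empty = [i for i in range(len(entry["src"])) if i not in src_idx]
    one_to_many = len(src_idx) != len(set(src_idx))
    reordered = tgt_idx != sorted(tgt_idx)
    return empty, one_to_many, reordered

entry = parse_subsegment("být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1")
print(alignment_properties(entry))  # ([], False, False)
```

For the Figure 1 entry, both words are aligned monotonically, so all three problem indicators are negative.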

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; output is J.

  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
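The combined n-gram fluency score can be approximated with a toy model. The paper uses a KenLM model trained on 50 million enTenTen sentences; the sketch below substitutes add-one-smoothed counts over a tiny in-memory corpus purely for illustration, with the same idea of combining per-order scores for n = 1..5.

```python
from collections import Counter
from math import log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class ToyNgramScorer:
    """Toy stand-in for the target language model: combines add-one
    smoothed n-gram log-probabilities for n = 1..5 into one score."""
    def __init__(self, sentences, max_n=5):
        self.max_n = max_n
        self.counts = {n: Counter() for n in range(1, max_n + 1)}
        for sent in sentences:
            toks = sent.lower().split()
            for n in range(1, max_n + 1):
                self.counts[n].update(ngrams(toks, n))

    def score(self, sentence):
        toks = sentence.lower().split()
        total = 0.0
        for n in range(1, self.max_n + 1):
            grams = ngrams(toks, n)
            if not grams:
                continue
            denom = sum(self.counts[n].values()) + len(self.counts[n]) + 1
            total += sum(log((self.counts[n][g] + 1) / denom) for g in grams) / len(grams)
        return total  # higher (closer to 0) means more fluent

lm = ToyNgramScorer(["the brake hose is not twisted",
                     "the data area may have any shape"])
print(lm.score("the brake hose is not twisted") >
      lm.score("twisted not is hose brake the"))  # True
```

A fluent combination scores higher than a scrambled one because its higher-order n-grams are attested in the training data; this is exactly the property used to rank candidate subsegment combinations.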

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
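A much simplified sketch of this joining loop is given below. It operates on (start, end) token spans rather than the authors' full data structures, and the greedy merge policy shown here is an illustrative assumption: the longest span is repeatedly merged with any span that overlaps it (the JOINO case) or directly neighbours it (the JOINN case).

```python
def join_subsegments(spans):
    """Greedy sketch of JOIN over (start, end) token spans (end exclusive):
    prefer the longest span and merge it with overlapping (JOINO) or
    adjacent (JOINN) spans until no merge is possible."""
    spans = sorted(set(spans), key=lambda s: (-(s[1] - s[0]), s[0]))
    merged = True
    while merged and len(spans) > 1:
        merged = False
        head, rest = spans[0], spans[1:]
        for i, (s, e) in enumerate(rest):
            if s <= head[1] and head[0] <= e:  # overlap or adjacency
                new = (min(head[0], s), max(head[1], e))
                spans = sorted([new] + rest[:i] + rest[i + 1:],
                               key=lambda x: (-(x[1] - x[0]), x[0]))
                merged = True
                break
    return spans

# Two overlapping subsegments covering one 9-token segment join fully:
print(join_subsegments([(0, 4), (3, 9)]))  # [(0, 9)]
```

Spans that neither overlap nor touch are left as separate partial covers, mirroring the situation where only parts of a document segment can be pre-translated.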

2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM; a 100% match corresponds to the situation when a whole segment from the input document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95–99%   0.84    0.91    0.64    0.90    1.37
85–94%   0.07    0.05    0.25    0.76    0.81
75–84%   0.80    0.91    1.71    3.78    4.40
50–74%   8.16   10.05   25.09   40.95   42.58
any     10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95–99%   0.75    0.44    0.49    0.66
85–94%   0.05    0.08    0.49    0.61
75–84%   0.46    0.96    3.67    3.85
50–74%  10.24   27.77   41.90   44.47
all     11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95–99%   0.98    0.59    1.24    1.45
85–94%   0.09    0.22    1.34    1.37
75–84%   1.03    2.26    6.35    7.07
50–74%  12.15   34.84   49.82   51.62
all     14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations into the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; however, in this case we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English
source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering out of phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics – Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators employ on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors to use either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
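The retrieval step above amounts to a rank-and-truncate over similarity scores. The tool itself uses the TER edit rate computed by tercom; the sketch below substitutes difflib's sequence ratio as a lightweight stand-in similarity, purely to illustrate the ranking pipeline (the example segments are invented).

```python
import difflib

def rank_tm_segments(test_seg, tm_segments, k=5):
    """Rank TM source segments by word-level similarity to the test
    segment and keep the top k. CATaLog ranks by TER (tercom); difflib's
    ratio stands in here as an illustrative similarity function."""
    return sorted(tm_segments,
                  key=lambda seg: difflib.SequenceMatcher(
                      None, test_seg.split(), seg.split()).ratio(),
                  reverse=True)[:k]

tm = ["where is the bus stop",
      "how much is this",
      "where is the train station"]
print(rank_tm_segments("where is the station", tm, k=2))
# ['where is the train station', 'where is the bus stop']
```

Any similarity function that correlates with post-editing effort can be plugged in at this point; the paper's argument for TER is precisely that its edit rate mimics that effort directly.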

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantly how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.1) to find the edit rate between a test sentence and the TM reference sentences.
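As a rough illustration of the edit rate computation, here is a word-level Levenshtein sketch. It is a simplified TER: tercom additionally allows block shifts, which plain Levenshtein does not model.

```python
def edit_rate(hyp_tokens, ref_tokens):
    """Word-level edit rate: insertions, deletions and substitutions,
    normalised by reference length (a simplified TER without shifts)."""
    m, n = len(hyp_tokens), len(ref_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[m][n] / max(n, 1)

hyp = "we want to have a table near the window".split()
ref = "we would like a table by the window".split()
print(edit_rate(hyp, ref))  # 4 edits (1 D + 3 S) / 8 reference words = 0.5
```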

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we   want   to      have   a   table   near   the   window
|    D      S       S      |   |       S      |     |
we   ---    would   like   a   table   by     the   window
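The TER alignment shown above maps directly onto the green/red colour coding described earlier. The snippet below is a sketch; `colour_tm_tokens` is a hypothetical helper name, not part of the tool.

```python
# Each TER alignment entry pairs a TM-match token with an input token,
# labelled C (correct), S (substitution), D (deletion) or I (insertion).
alignment = [
    ("we", "we", "C"), ("want", "", "D"), ("to", "would", "S"),
    ("have", "like", "S"), ("a", "a", "C"), ("table", "table", "C"),
    ("near", "by", "S"), ("the", "the", "C"), ("window", "window", "C"),
]

def colour_tm_tokens(alignment):
    """Label each TM-match token green (C) or red (needs editing)."""
    return [(tm_tok, "green" if op == "C" else "red")
            for tm_tok, inp_tok, op in alignment if tm_tok]

for token, colour in colour_tm_tokens(alignment):
    print(f"{token}: {colour}")
```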

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
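The effect described above can be reproduced with a weighted edit distance. The 0.1 deletion weight below is an illustrative value, not the one used in CATaLog:

```python
def weighted_edit_cost(tm_tokens, inp_tokens,
                       w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Cost of editing a TM match into the input segment, with a cheap
    deletion weight (w_del) reflecting lower post-editing effort."""
    m, n = len(tm_tokens), len(inp_tokens)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_del
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm_tokens[i - 1] == inp_tokens[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,
                          d[i][j - 1] + w_ins,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
print(weighted_edit_cost(tm1, test))  # ≈ 0.6 (6 cheap deletions)
print(weighted_edit_cost(tm2, test))  # 3.0 (3 full-cost insertions)
```

With equal weights TM segment 2 would win (cost 3 vs. 6); with cheap deletions TM segment 1 wins, as the text argues.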

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
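The `word ({ indices })` notation above can be parsed into word-position pairs with a small helper. This is a sketch of the notation only; GIZA++'s actual alignment files carry additional header lines per sentence pair.

```python
import re

def parse_giza_alignment(line):
    """Parse GIZA++-style 'word ({ i j })' entries into
    (source word, 1-based target position) pairs."""
    pairs = []
    for match in re.finditer(r"(\S+)\s+\(\{([\d\s]*)\}\)", line):
        word, indices = match.group(1), match.group(2).split()
        for tgt_pos in indices:
            pairs.append((word, int(tgt_pos)))
    return pairs

line = "NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 })"
print(parse_giza_alignment(line))  # [('we', 1), ('want', 6), ('a', 4)]
```

Unaligned words (such as "to" above, and the NULL token) simply contribute no pairs.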

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, in the selected source sentences and their corresponding translations.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.
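One plausible way to combine the two alignment sets into target-side colours is sketched below. The propagation rule and all names here are our illustration of the idea, not the paper's exact algorithm:

```python
def colour_target(matched_src, src2tgt):
    """Colour target positions: a target word is green only if every
    source word aligned to it was matched (C) in the TER alignment."""
    colours = {}
    for s, targets in src2tgt.items():
        for t in targets:
            ok = s in matched_src and colours.get(t, "green") == "green"
            colours[t] = "green" if ok else "red"
    return colours

# TM source: "we want to have a table near the window"
# 0-based positions matched against the input (from the TER alignment):
matched = {0, 4, 5, 7, 8}          # we, a, table, the, window
src2tgt = {0: [0], 1: [4], 4: [2], 5: [3], 6: [1], 8: [1]}  # toy alignment
print(colour_target(matched, src2tgt))
```

Note the "all aligned source words matched" condition: a target word linked to both a matched and an unmatched source word (position 1 above) stays red, since part of it still needs editing.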

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me


5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phone korechen.)

3 আপিন ভল (Gloss: apni vul.)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval based on inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
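A minimal sketch of the posting-list filter follows; the stop-word list and TM contents are invented for illustration:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "we", "you", "me", "i"}  # illustrative list

def build_posting_lists(tm_sources):
    """Map each lowercased, non-stop-word surface form to the ids of
    the TM sentences containing it (no stemming, as described above)."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidate_sentences(input_seg, postings):
    """Only TM sentences sharing at least one vocabulary word with the
    input are scored, shrinking the search space."""
    cands = set()
    for word in input_seg.lower().split():
        cands |= postings.get(word, set())
    return cands

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
postings = build_posting_lists(tm)
print(candidate_sentences("we would like a table by the window", postings))
# {0} — only the first TM sentence shares a content word (table, window)
```

Only the surviving candidates are then passed to the (expensive) TER alignment step.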

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs. quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít
Barbu, Eduard

Chatzitheodoroou, Konstantinos

Horák, Aleš

Medveď, Marek
Mitkov, Ruslan

Naskar, Sudip Kumar
Nayek, Tapas

Pal, Santanu
Parra Escartín, Carla

Raisa Timonera, Katerina

van Genabith, Josef
Vela, Mihaela

Zampieri, Marcos


Introduction

Translation Memories (TMs) are amongst the tools most used by professional translators. The underlying idea of TMs is that a translator should benefit as much as possible from previous translations, by being able to retrieve how a similar sentence was translated before. Despite the fact that the core idea of these systems relies on comparing segments (typically of sentence length) from the document to be translated with segments from previous translations, most of the existing TM systems hardly use any language processing for this. Instead of addressing this issue, most of the work on translation memories has focused on improving the user experience by allowing the processing of a variety of document formats, intuitive user interfaces, and so on.

The term second-generation translation memories has been around for more than ten years, and it promises translation memory software that integrates linguistic processing in order to improve the translation process. This linguistic processing can involve the matching of subsentential chunks, edit distance operations between syntactic trees, and the incorporation of semantic and discourse information in the matching process. Terminologies, glossaries, and ontologies are also very useful for translation memories, facilitating the task of the translator and ensuring a consistent translation. The field of Natural Language Processing (NLP) has proposed numerous methods for terminology extraction and ontology extraction which can be integrated in the translation process. The building of translation memories from corpora is another field where methods from NLP can contribute to improving the translation process.

We are happy we could include in the workshop programme four long contributions and three short papers dealing with the aforementioned issues.

Vít Baisa, Aleš Horák, and Marek Medveď discuss in Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods how it is possible to extend existing translation memories using linguistically motivated segment combination approaches, concentrating on preserving high translation quality.

Linked to the topic of enhancing existing translation memories, in the paper Spotting false translation segments in translation memories, Eduard Barbu presents a method for identifying false translations in translation memories, framed as a classification task.

In Improving translation memory fuzzy matching by paraphrasing, Konstantinos Chatzitheodoroou explores the use of paraphrasing to retrieve better segments from translation memories. The method relies on NooJ and performs consistently better than the state of the art on the EN-IT language pair.

CATaLog: New Approaches to TM and Post Editing Interfaces presents a new CAT tool by Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. The aim of the tool is to improve both the performance and the productivity of post-editing.

Carla Parra Escartín aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators in Creation of new TM segments: Fulfilling translators' wishes. She presents a pilot study where the requests of translators are implemented in the translation workflow.

Katerina Timonera and Ruslan Mitkov explore the use of clause splitting for better retrieval of segments. Their paper Improving Translation Memory Matching through Clause Splitting shows that their method leads to a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

The Organising Committee would like to thank the Programme Committee, who responded with very fast but also substantial reviews for the workshop programme. This workshop would not have been possible


without the support received from the EXPERT project (FP7/2007-2013, under REA grant agreement no. 317471, http://expert-itn.eu).

Constantin Orasan and Rohit Gupta


Organizers

Constantin Orasan, University of Wolverhampton, UK

Rohit Gupta, University of Wolverhampton, UK

Program Committee

Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk

Invited Speaker

Marcello Federico, Fondazione Bruno Kessler, Italy


Table of Contents

Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

Spotting false translation segments in translation memories
Eduard Barbu

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, and Marek Medveď

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela, and Josef van Genabith


Conference Program

9:00-9:10 Welcome
Constantin Orasan

9:10-10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15-11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30-12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15-13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30-15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou

15:00-15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák, and Marek Medveď

15:30-16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela, and Josef van Genabith

16:15-17:00 Round table and final discussion


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1-8, Hissar, Bulgaria, September 2015.

Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3.º I
28230 Las Rozas, Madrid, Spain

carlaparra@hermestrans.com

Abstract

This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.

1 Introduction

Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive text (Autosuggest in SDL Trados Studio and Muses in memoQ), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

1 www.sdl.com
2 www.memoq.com

In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to produce the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise, and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper, we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received; Subsections 4.1-4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation of a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.

2 Our company

Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium, and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high-quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).

3 Internal survey

We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 2014 and memoQ, it is not always carried out automatically, and it is difficult to estimate

3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

4 memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html

when such a reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

Ideas more related to NLP included the automatic correction of orthotypography in the target language, and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if translators understand other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on the corresponding memoQ functionality: fragment assembly.6

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks

5 For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs with different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated into the CAT tool and become a new feature. However, it does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see httpwwwtranslationzonecomopenexchangeappsdltradosstudioanytmtranslationprovider-669html

6 For further information, see httpkilgraycommemoq2015help-enindexhtmlfragment_assemblyhtml


for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source-sentence words that could not be retrieved.

Figure 1: memoQ's fragment assembly functionality.

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.

Examples 1 and 2 show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.

Now imagine we need to translate the sentence in Example 3:

(3) EN: The following message then appears: The application will close.

Taking the part of Example 1 before the colon and the part of Example 2 after the colon, we would be able to produce the right translation, as shown in Example 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.
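The recombination just illustrated can be sketched as a toy substitution over stored fragment translations. This is only an illustration of the idea, not memoQ's actual algorithm; the function name and the fragment dictionary are hypothetical:

```python
# Toy fragment assembly: substitute stored source fragments with their
# translations; "assemble" and "fragment_tm" are hypothetical names.
def assemble(sentence, fragment_tm):
    # Substitute longer fragments first so they are not clobbered by
    # shorter overlapping ones; unmatched words stay in the source
    # language, as in memoQ's partially translated suggestions.
    for src, tgt in sorted(fragment_tm.items(), key=lambda kv: -len(kv[0])):
        sentence = sentence.replace(src, tgt)
    return sentence
```

With the fragments of Examples 1 and 2 stored, the sentence of Example 3 assembles into the translation shown in Example 4.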

In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving such translations without having to run concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (Section 4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipses
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.


4.1 Ellipses and colons

One possible type of segment is one in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in Examples 5 and 6 may appear:

(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment including only the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which only the fragment before or after the colon appears.

(7) Installing new services

In these cases we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
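The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the delimiter pattern and the function name are our assumptions:

```python
import re

# Delimiters of this section: an ellipsis (three dots or the "…" character)
# or a colon. Pattern and function name are our assumptions.
DELIM = re.compile(r"\.\.\.|\u2026|:")

def split_bisegment(source, target):
    """Split a bi-segment at an ellipsis/colon present on both sides
    (step 1), into the part up to the delimiter and the part after it
    (step 2), returning one new TM segment pair per fragment (step 3)."""
    src_m, tgt_m = DELIM.search(source), DELIM.search(target)
    if src_m is None or tgt_m is None:
        return []  # step 1 failed: no delimiter on one of the sides
    src_parts = (source[:src_m.end()].strip(), source[src_m.end():].strip())
    tgt_parts = (target[:tgt_m.end()].strip(), target[tgt_m.end():].strip())
    return [(s, t) for s, t in zip(src_parts, tgt_parts) if s and t]
```

Applied to Example 5, this yields one new segment pair per side of the ellipsis.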

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without the parenthetical content. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store both the translation of the fragment between the aforementioned characters and the sentence without such content.

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) The sentence with those characters and the content between them removed.

(b) A fragment starting at the opening character and finishing at the closing one, including the content within. In this fragment the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
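The strategy can be sketched as follows, under the simplifying assumption of a single bracketed span per sentence; the names and the regular expression are ours, not the authors' implementation:

```python
import re

# One alternative per bracket type; the inner groups capture the content.
BRACKETS = re.compile(r"\(([^)]*)\)|\[([^\]]*)\]|\{([^}]*)\}")

def bracket_fragments(sentence):
    """Return (sentence without the bracketed span, span with brackets,
    bare content), or None when no bracketed span is found."""
    m = BRACKETS.search(sentence)
    if m is None:
        return None
    without = " ".join((sentence[:m.start()] + sentence[m.end():]).split())
    content = next(g for g in m.groups() if g is not None)
    return without, m.group(0), content

def new_segments(source, target):
    src, tgt = bracket_fragments(source), bracket_fragments(target)
    if src is None or tgt is None:
        return []  # step 1: bracketed content must occur on both sides
    return list(zip(src, tgt))  # step 3: one new TM segment per fragment
```

For Example 10, this produces the sentence without "[2]", the fragment "[2]" itself, and the bare content "2" as aligned pairs.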

At this preliminary stage we assumed that, when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segment.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
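A possible sketch of this treatment is shown below. The paired-tag pattern (<1>...<1>, as in the memoQ-style examples above) and the function name are our assumptions, not the authors' implementation:

```python
import re

TAG = re.compile(r"<(\d+)>(.*?)<\1>")                  # paired <n>...<n> tags
QUOTE = re.compile(r"\"([^\"]+)\"|\u201c([^\u201d]+)\u201d")  # "..." or “...”

def quote_tag_segments(sentence):
    """Return the sentence with quotes/tags stripped, plus the quoted or
    tagged texts as candidate segments of their own."""
    inner = [m.group(2) for m in TAG.finditer(sentence)]
    inner += [m.group(1) or m.group(2) for m in QUOTE.finditer(sentence)]
    stripped = TAG.sub(r"\2", sentence)    # drop the tags, keep their text
    stripped = QUOTE.sub(lambda m: m.group(1) or m.group(2), stripped)
    return stripped, inner
```

The same function applied to both sides of a bi-segment yields the aligned new segments.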

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients: a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of

our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95-99%            4          2
85-94%            0          0
75-84%          285         14
50-74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.

             Segments   Words (EN)   Words (ES)     Total
Project TM      16842       212472       244159    456631
Product TM      20923       274542       317797    592339
Client TM      256099      3427861      3951732   7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM


                Segments   Words (EN)   Words (ES)     Total
New Project TM      6776        56973        66297    123270
New Product TM      7760        71034        83125    154159
New Client TM      74041       662714       769705   1432419

Table 3: Size of the new TMs generated using our approach.

used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment-generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

                 Proj. TM   Prod. TM   Client TM
Ellipsis                7          6          50
Colon                1801       2094       17894
Parenthesis          1361       1621       29637
Square brackets        78         73         598
Curly braces            0          0           0
Quotations           1085       1146        8797
Tags                 2523       2892       17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations, to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new-segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality of memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM gen-


             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans.   155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%         0    5    5    5    5    5    5    5    5     5     5
95-99%       2    3    3    4    4    4    9    9    9     9     9
85-94%       0    1    1    1    1    1    2    2    2     2     2
75-84%      14    9   14    9   15   15   12   16   16    16    16
50-74%     187  130  198  136  202  202  180  223  226   226   226
No match   142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

erated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 187 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed; it would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable even though the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations

from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments, and to pre-translate files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher


number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high, and the errors are relatively few.

1 https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or contain low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localisation of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfil that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that the translation memory bi-segments contain accidental errors (e.g., tags not closed), or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning, as envisioned in Section 1, is Quality Estimation in Machine Translation. Quality Estimation can also be modelled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help discriminate between good and bad translations in this approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g., the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield C.C." is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.


It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (Gale and Church, 1993): in a parallel corpus, the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in Equation 1:

CG = (ls − ld) / √(3.4 · (ls + ld))    (1)
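In code, Equation 1 can be computed directly. A minimal sketch; using character lengths for ls and ld is our assumption, as the unit is not stated in this excerpt:

```python
import math

def church_gale(source, target):
    """Modified Church-Gale length difference of Equation 1; ls and ld
    are the source and destination segment lengths (characters here)."""
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))
```

Equal-length segments score 0; the score grows in magnitude as the lengths diverge.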

In Figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right, but close to a normal plot. In Figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set, the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in Figure 3. In Figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative Set, the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool available at https://www.matecat.com

[Figure: histogram with normal curve; x axis: Church-Gale score, y axis: probability]

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

[Figure: histogram; x axis: Church-Gale score, y axis: probability]

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


[Figure: histogram with normal curve; x axis: Church-Gale score, y axis: probability]

Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots, in Figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set; as said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments whose scores are away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:

• Training Set: contains 1243 bi-segments, of which 373 are negative examples.

• Test Set: contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.
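The biased draw from the Collaborative Set can be sketched as follows. This is a minimal illustration, not the authors' code: the sampling weight of a bi-segment is assumed to grow with the distance of its Church-Gale score from the center of the distribution, so that likely negatives are over-sampled.

```python
import random
from statistics import mean, stdev

def biased_sample(bisegments, cg_scores, k):
    """Draw k bi-segments, favouring Church-Gale scores that lie far
    from the center of the distribution, where negative examples are
    more likely to be found."""
    m, s = mean(cg_scores), stdev(cg_scores)
    # Small floor keeps every weight strictly positive.
    weights = [abs(x - m) / s + 1e-6 for x in cg_scores]
    return random.choices(bisegments, weights=weights, k=k)

# Toy data: 98 unremarkable scores around zero plus two extreme outliers.
random.seed(0)
segments = list(range(100))
scores = [0.0] * 98 + [100.0, -100.0]
sample = biased_sample(segments, scores, k=1000)
outlier_draws = sum(1 for i in sample if i in (98, 99))
```

With these toy scores, almost every draw picks one of the two outlier bi-segments, which is the intended bias.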

Figure 5: The Q-Q plot for the Matecat set (theoretical vs. sample quantiles).

Figure 6: The Q-Q plot for the Collaborative set (theoretical vs. sample quantiles).


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g., when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score: This feature is the Church-Gale score described in equation 1. The score reflects the idea that the lengths of source and target segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url: The value of the feature is 1 if the source or target segments contain a URL address, and 0 otherwise.

• has tag: The value of the feature is 1 if the source or target segments contain a tag, and 0 otherwise.

• has email: The value of the feature is 1 if the source or target segments contain an email address, and 0 otherwise.

• has number: The value of the feature is 1 if the source or target segments contain a number, and 0 otherwise.

• has capital letters: The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, and 0 otherwise.

• has words capital letters: The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, and 0 otherwise. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity: The value of this feature is the cosine similarity between the source and target segments' punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the source segment and target segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g., the tag exists/does not exist, and, if it exists, it is present/not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the source segment and target segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the source segment and target segment URL-address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity: The value of the feature is the cosine similarity between the source segment and target segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.


• bisegment similarity: The value of the feature is the cosine similarity between the target segment and the translation of the source segment into the target language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu⁵.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.

• Logistic Regression: Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support vector machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among those instances. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

⁵https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
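The evaluation workflow with these scikit-learn classifiers can be sketched as below. The feature matrix here is a toy stand-in for the real MyMemory data, and the exact estimator settings are our assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the real data: each row is a normalized feature
# vector (e.g. [bisegment_similarity, lang_dif, ...]) and y marks
# positive (1) vs. negative (0) bi-segments.
X = [[0.9, 0.0], [0.8, 0.1], [0.7, 0.0], [0.9, 0.1],
     [0.1, 0.9], [0.2, 1.0], [0.0, 0.8], [0.1, 0.8]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Three-fold cross-validation, as in the paper's first evaluation;
# cross_val_score stratifies the folds by default for classifiers.
results = {
    clf.__class__.__name__:
        cross_val_score(clf, X, y, cv=3, scoring="f1").mean()
    for clf in (DecisionTreeClassifier(), LogisticRegression(), LinearSVC())
}
```

On this linearly separable toy data every classifier reaches a near-perfect F1; the interesting differences only appear on real, noisy bi-segments.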

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm              Precision  Recall  F1
Random Forest            0.95      0.97   0.96
Decision Tree            0.98      0.97   0.97
SVM                      0.94      0.98   0.96
K-Nearest Neighbors      0.94      0.98   0.96
Logistic Regression      0.92      0.98   0.95
Gaussian Naive Bayes     0.86      0.96   0.91
Baseline Uniform         0.69      0.53   0.60
Baseline Stratified      0.70      0.73   0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in table 2.

The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm              Precision  Recall  F1
Random Forest            0.85      0.63   0.72
Decision Tree            0.82      0.69   0.75
SVM                      0.82      0.81   0.81
K-Nearest Neighbors      0.83      0.66   0.74
Logistic Regression      0.80      0.80   0.80
Gaussian Naive Bayes     0.76      0.61   0.68
Baseline Uniform         0.71      0.72   0.71
Baseline Stratified      0.70      0.51   0.59

Table 2: The results of the classification on the test set.

Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion matrix for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where precision is increased, even if recall drops. We make some suggestions in the next section.

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the Matecat Set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with a moderate number of bi-segments; for example, the machine translation system we employ can handle 40000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jorg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Clause matches are therefore more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Nonfinite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
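The intuition can be illustrated with a small sketch. Here difflib's similarity ratio stands in for the Levenshtein-style scoring of TM tools, and the clause boundaries are hard-coded rather than produced by the Puscasu (2004) splitter:

```python
from difflib import SequenceMatcher

def fuzzy(a, b):
    """Similarity ratio in [0, 1], akin to a TM fuzzy match score."""
    return SequenceMatcher(None, a, b).ratio()

# A long TM segment and its (hand-split) component clauses.
tm_segment = ("the ministry of defense once indicated that about 20000 "
              "soldiers were missing in the korean war and that the "
              "ministry of defense believes there may still be some survivors")
tm_clauses = [
    "the ministry of defense once indicated",
    "that about 20000 soldiers were missing in the korean war",
    "and that the ministry of defense believes",
    "there may still be some survivors",
]
query = "that about 20000 soldiers were missing in the korean war"

whole = fuzzy(query, tm_segment)                  # clause buried in a long segment
split = max(fuzzy(query, c) for c in tm_clauses)  # clause stored as its own segment
```

Against the whole segment the score falls below typical fuzzy match thresholds, while against the clause-split TM the same query scores a perfect match.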

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.

The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and the TM stores the component clauses of the original longer segment; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting
  Input:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
  Segments in TM:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors

Table 1: Set A example

Without clause splitting
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B example

5 Results

WORDFAST                        w/o clause splitting   w/ clause splitting
Retrieved (default threshold)          23.33%                90.00%
Retrieved (minimum threshold)          38.00%                92.22%

TRADOS                          w/o clause splitting   w/ clause splitting
Retrieved (default threshold)          14.00%                88.89%
Retrieved (minimum threshold)          14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST                        w/o clause splitting   w/ clause splitting
Retrieved (default threshold)           2.67%                17.84%
Retrieved (minimum threshold)          10.00%                41.08%

TRADOS                          w/o clause splitting   w/ clause splitting
Retrieved (default threshold)           3.33%                25.95%
Retrieved (minimum threshold)          36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
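The significance test can be reproduced in a few lines. This is a stdlib sketch of the paired t statistic (the study used SPSS, and the clause-score averaging step is simplified here); the toy scores are illustrative only:

```python
import math
from statistics import mean, stdev

def paired_t(without_cs, with_cs):
    """Paired t statistic over per-segment match scores.
    A score of 0 stands for 'no match retrieved'."""
    diffs = [b - a for a, b in zip(without_cs, with_cs)]
    n = len(diffs)
    # t = mean(d) / (sd(d) / sqrt(n)), with sample standard deviation.
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Toy scores: without clause splitting most segments retrieve nothing.
t = paired_t([0, 0, 10], [80, 90, 70])
```

A large positive t, as here, indicates that the with-splitting scores are systematically higher than the without-splitting scores.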

SET A

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold              19.23                      85.31              0.000
Minimum threshold              29.28                      86.49              0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold              11.78                      83.88              0.000
Minimum threshold              11.78                      88.07              0.000

SET B

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold               2.23                      17.21              0.000
Minimum threshold               6.49                      30.62              0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold               2.71                      23.08              0.000
Minimum threshold              16.83                      46.37              0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate

alignment so that on the target side what is

retrieved is not the original target segment but

the corresponding clause as in its current

state our method would only be able to

retrieve the original target segment given that

we perform clause splitting only on the source

side It would also be desirable to implement a

working TM tool that incorporates clause

splitting and examine to what extent these help

a translator working on an actual translation

project as the final test of the usefulness of the

methods employed is how they actually

increase the productivity of translators in

terms of time saved

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4):597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, Sept 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzik@itl.auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, and so cannot retrieve matches that are similar in meaning but lexically different. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator, because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases also penalizing the match percentage. However, given the perplexity of a natural language, the fuzzy matching level for similar but not identical sentences is sometimes too low, and this confuses the translator.

This paper presents a framework that improves the fuzzy match of similar, but not identical, sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human centralized evaluations.

The rest of the paper is organized as follows: Section 2 discusses past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
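A word-level scorer in the spirit of Koehn's (2010) formula can be sketched as follows; the exact score depends on tokenization and length normalization, so the figures reported in the text need not match this toy implementation:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def fuzzy_match(s1, s2):
    """Fuzzy score = 1 - edit distance / length of the longer sentence."""
    w1, w2 = s1.split(), s2.split()
    return 1 - edit_distance(w1, w2) / max(len(w1), len(w2))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(edit_distance(s1.split(), s2.split()))  # 4 (1 substitution + 3 deletions)
print(fuzzy_match(s1, s2))
```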

In this case, according to the methodologies proposed by researchers in this field, this sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Apart from some exceptional cases, they appear without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrases the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
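Entries like those in Figure 1 combine a lemma, a category, and +-separated properties. For illustration, a minimal parser for this lemma,CATEGORY+FEATURE format might look like this (a sketch based on the entries shown, not NooJ's own API):

```python
def parse_entry(entry):
    """Split a NooJ-style dictionary line into lemma, category and features."""
    lemma, rest = entry.split(",", 1)
    category, *features = rest.split("+")
    info = {"lemma": lemma, "category": category}
    for feat in features:
        if "=" in feat:                 # key=value property, e.g. FLX=TABLE
            key, value = feat.split("=", 1)
            info[key] = value
        else:                           # bare flag, e.g. Hum
            info[feat] = True
    return info

print(parse_entry("artist,N+FLX=TABLE+Hum"))
# {'lemma': 'artist', 'category': 'N', 'FLX': 'TABLE', 'Hum': True}
```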

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected, so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to the verb it derives from, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
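Outside NooJ, the effect of this grammar can be approximated with a simple pattern rewrite. The tiny lexicon below is hypothetical and stands in for NooJ's dictionaries mapping predicate nouns to their underlying verbs; real coverage and tense/person agreement require the full module:

```python
import re

# Hypothetical stand-in for NooJ's dictionary: predicate noun -> full verb.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = ("make", "makes", "take", "takes", "give", "gives")

# support verb + at most one determiner/adjective/adverb + predicate noun + optional "of"
PATTERN = re.compile(
    r"\b(%s)\s+(?:\w+\s+)?(%s)\b(?:\s+of)?"
    % ("|".join(SUPPORT_VERBS), "|".join(NOMINALIZATIONS)))

def paraphrase_svc(sentence):
    """Rewrite e.g. 'make the cancellation of' -> 'cancel'."""
    return PATTERN.sub(lambda m: NOMINALIZATIONS[m.group(2)], sentence)

print(paraphrase_svc("Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```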

The last pipeline contains the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can only be applied to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
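Taken together, the three pipelines amount to a loop over the TM that leaves the target side untouched. A sketch, with a stand-in paraphrase function and with tokenization and tag handling omitted:

```python
def expand_tm(tm_units, paraphrase_svc):
    """Append a (paraphrased source, original target) pair for every
    source unit in which an SVC paraphrase was produced.
    The target side is never modified."""
    expanded = list(tm_units)
    for source, target in tm_units:
        paraphrased = paraphrase_svc(source)
        if paraphrased != source:      # an SVC was found and rewritten
            expanded.append((paraphrased, target))
    return expanded

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
rewriter = lambda s: s.replace("make the cancellation of", "cancel")
print(len(expand_tm(tm, rewriter)))  # 2 (1 original + 1 paraphrase)
```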

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain, to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually, to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹

The results of both analyses are given in Table 1. Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category    plTM    parTM
100%                      14       48
95%-99%                   23       32
85%-94%                   18       29
75%-84%                   51       38
50%-74%                   32       18
No Match                  62       35
Total                    200      200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is not equal to 100%.

While there are some drops in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the lexical items are not shared.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM across all fuzzy-match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is abundant compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs, as well as support for other languages. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size, real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CATsystem with provided or self-built translationmemories (TM) The translation memories areusually in-house costly and manually created re-sources of varying sizes of thousands to millionsof translation pairs

Only recently some TMs have been made pub-licly available DGT (Steinberger et al 2013) orMyMemory (Trombetti 2009) to mention just afew But there is still a heavy demand on enlarg-ing and improving TMs and their coverage of new

input documents to be translatedObviously the aim is to have a TM that best fits

to the content of a new document as this is crucialfor speeding up the translation process when alarger part of a document can be pre-translated bya CAT system the translation itself can be cheaperCoverage of TMs is directly translatable to savingsby translation companies and their customers

We present two methods for expanding TMssubsegment generation and subsegment combina-tion The idea behind these methods is based onthe fact that even if the topic of the new documentis well covered by the memory only very rarelythe memory includes exact sentences (segments)as they appear in the document The differencesbetween known and new segments often consist ofsubstitutions or combinations of particular knownsubsegments

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

Figure 1: An example of generated subsegments (consistent phrases):

CZ: být větší   EN: be greater   probabilities: 0.538 0.053 0.538 0.136   alignment: 0-0 1-1
CZ: být větší   EN: be larger    probabilities: 0.170 0.054 0.019 0.148   alignment: 0-0 1-1

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which builds on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
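The three alignment phenomena can be illustrated with a short sketch (our own illustration, not code from the paper; the helper names `parse_alignment` and `describe` are hypothetical):

```python
def parse_alignment(points):
    """Parse Moses-style alignment points like '0-0 1-1' into (src, tgt) pairs."""
    return [tuple(int(i) for i in p.split("-")) for p in points.split()]

def describe(pairs, src_len, tgt_len):
    """Detect the three phenomena named above: empty alignment,
    one-to-many alignment, and opposite (reversed) orientation."""
    aligned_src = {s for s, _ in pairs}
    aligned_tgt = {t for _, t in pairs}
    # a word on either side with no alignment point at all
    empty = (set(range(src_len)) - aligned_src) or (set(range(tgt_len)) - aligned_tgt)
    # one source word linked to several target words
    one_to_many = any(sum(1 for s, _ in pairs if s == i) > 1 for i in aligned_src)
    # target indices not monotonically increasing -> reordering
    tgt_order = [t for _, t in sorted(pairs)]
    reversed_order = tgt_order != sorted(tgt_order)
    return {"empty": bool(empty), "one_to_many": one_to_many, "reversed": reversed_order}

pairs = parse_alignment("0-0 1-1")   # 'byt vetsi' -> 'be greater' from Figure 1
print(describe(pairs, 2, 2))
```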

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
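The overlap versus neighbour distinction used by the JOIN variants above can be sketched on token spans (a simplified reading of the method; the `(start, end)` span representation and the `join` helper are our assumptions, not the authors' implementation):

```python
def join(a, b):
    """Try to join two subsegment occurrences, given as (start, end) token
    spans in the document segment (end exclusive).
    Returns the joined span and the variant name, or None if a gap remains."""
    (s1, e1), (s2, e2) = sorted([a, b])
    if s2 == e1:                       # adjacent in the segment -> JOIN-N
        return (s1, e2), "JOIN-N"
    if s2 < e1 <= e2:                  # overlapping -> JOIN-O
        return (s1, e2), "JOIN-O"
    return None                        # gap between spans: a SUBSTITUTE candidate

# segment: "lze rozdelit do nasledujicich kategorii" (5 tokens)
print(join((0, 3), (3, 5)))   # neighbouring subsegments
print(join((0, 4), (2, 5)))   # overlapping subsegments
```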

During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
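The iteration described above can be sketched roughly as follows. This is a simplified reconstruction on (start, end) token spans under our own assumptions: the paper does not publish its code, joinability is reduced to spans touching or overlapping, and the fluency filtering and translation assembly are omitted:

```python
def expand(subsegments):
    """Greedy JOIN loop: process subsegments longest-first, join the current
    one with every compatible other, and prepend the newly created (longer)
    subsegments so they are tried first in later iterations."""
    seen = set(subsegments)
    I = sorted(subsegments, key=lambda s: s[1] - s[0], reverse=True)
    results = []
    while I:
        current, I = I[0], I[1:]          # discard one processed subsegment
        T = []                             # temporary list of new joins
        for other in I:
            lo, hi = min(current[0], other[0]), max(current[1], other[1])
            covered = (current[1] - current[0]) + (other[1] - other[0])
            joined = (lo, hi)
            # spans touch or overlap, and the join is genuinely new
            if covered >= hi - lo and joined not in seen:
                seen.add(joined)
                T.append(joined)
        results.extend(T)
        I = T + I                          # prepend T so longer joins go first
    return results

# subsegment occurrences found in a 6-token segment
print(expand([(0, 3), (3, 5), (5, 6)]))
```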

2 In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95–99%   0.84   0.91   0.64   0.90   1.37
85–94%   0.07   0.05   0.25   0.76   0.81
75–84%   0.80   0.91   1.71   3.78   4.40
50–74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95–99%   0.75   0.44   0.49   0.66
85–94%   0.05   0.08   0.49   0.61
75–84%   0.46   0.96   3.67   3.85
50–74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95–99%   0.98   0.59   1.24   1.45
85–94%   0.09   0.22   1.34   1.37
75–84%   1.03   2.26   6.35   7.07
50–74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system's score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the benefit of particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and a human evaluation is always needed.

Error analysis: Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature  SUB   OJ    NJ    NS
prec     0.60  0.63  0.70  0.66
recall   0.67  0.74  0.74  0.71
F1       0.64  0.68  0.72  0.68
METEOR   0.31  0.37  0.38  0.38

DGT
prec     0.76  0.93  0.91  0.81
recall   0.78  0.86  0.88  0.85
F1       0.77  0.89  0.89  0.83
METEOR   0.40  0.50  0.51  0.45

Table 6: Error example, Czech → English.

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and to measure the time needed for the translation.

6 Conclusion

We have presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the Lindat-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and to improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool, and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.
4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to post-edit.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric, and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we want  to   have a table near the window .
|   D    S    S    |   |    S    |   |     |
we  -  would like  a table  by  the window .
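Such an alignment can be grouped into the maximal matched (C) and edited runs on which the color coding of Section 3.2 relies. A minimal sketch of our own (not the tool's code), assuming one edit symbol per TM-match token, which holds for the example above since it contains no insertions:

```python
def ter_fragments(ops, match_tokens):
    """Group TER edit operations (one of C/I/D/S per position) and the
    TM-match tokens into maximal 'match' (C) and 'edit' runs."""
    fragments, run, run_ok = [], [], None
    for op, tok in zip(ops, match_tokens):
        ok = (op == "C")
        if run and ok != run_ok:            # run boundary: flush current run
            fragments.append((" ".join(run), "match" if run_ok else "edit"))
            run = []
        run.append(tok)
        run_ok = ok
    if run:                                  # flush the final run
        fragments.append((" ".join(run), "match" if run_ok else "edit"))
    return fragments

match = "we want to have a table near the window .".split()
ops = list("CDSSCCSCCC")   # the operations from the alignment above
print(ter_fragments(ops, match))
```

The "match" runs would be shown in green and the "edit" runs in red.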

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
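A sketch of such weighted scoring on the example above (the weight values are purely illustrative; the paper only states that deletion gets a very low weight and the other operations equal weights):

```python
# Illustrative edit weights: deletion is made much cheaper than the other
# operations, as motivated above. H stands for shift.
WEIGHTS = {"C": 0, "D": 1, "S": 10, "I": 10, "H": 10}

def edit_cost(ops):
    """Score a TM candidate from its TER edit operations (cheaper = better)."""
    return sum(WEIGHTS[op] for op in ops)

# Test segment: "how much does it cost"
# TM segment 1 matches all 5 words but needs 6 deletions;
# TM segment 2 matches 2 words but needs 3 insertions.
cost1 = edit_cost("CCCCC" + "DDDDDD")
cost2 = edit_cost("CC" + "III")
print(cost1, cost2)   # segment 1 comes out cheaper, so it is ranked first
```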

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: a green portion implies a matched fragment, and a red portion implies a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
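The projection through the two alignment sets can be sketched as follows (our own simplification, not the tool's published code; indices are 0-based and taken from the example pair above):

```python
def color_target(ter_matched_src, giza_pairs, tgt_len):
    """Color target tokens using the two alignment sets described above:
    a target token is green when GIZA++ links it to a TM-source token that
    the TER alignment marked as matching the input sentence, red otherwise."""
    green_tgt = {t for s, t in giza_pairs if s in ter_matched_src}
    return ["green" if t in green_tgt else "red" for t in range(tgt_len)]

# TM pair: "we want to have a table near the window ." -> 7 Bengali tokens.
# GIZA++ links (0-based): we-0, want-5, a-3, table-4, near-2, window-1, .-6
giza = [(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1), (9, 6)]
# TER 'C' positions from Section 3.1: we, a, table, the, window, .
matched_src = {0, 4, 5, 7, 8, 9}
print(color_target(matched_src, giza, 7))
```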

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get a nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change. i paid eighty dollars.

2. i think you've got the wrong number.

3. you are wrong.

4. you pay me.


5. you're overcharging me.

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval based on inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words: we want to store the words in their surface form, so that we consider a match only if a word appears in exactly the same form in an input sentence. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
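A minimal sketch of this posting-list filter (our own illustration; the function names and the toy stop-word list are invented, not taken from the tool): build an inverted index over the TM source side, then score only sentences sharing at least one vocabulary word with the input.

```python
from collections import defaultdict

STOP = {"the", "a", "an", "to", "of", "i", "you"}     # toy stop-word list

def build_index(tm_sources):
    """Map each lowercased, non-stop surface form to the set of TM sentence
    ids containing it (no stemming, as described in the text)."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP:
                index[word].add(sid)
    return index

def candidates(input_sentence, index):
    """Union of the posting lists of the input's words: only these TM
    sentences need TER alignment and similarity scoring."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["you gave me the wrong change i paid eighty dollars",
      "i think you 've got the wrong number",
      "we want to have a table near the window"]
index = build_index(tm)
cands = candidates("you gave me wrong number", index)
```

With this toy data, only the first two TM sentences share content words with the input, so the third is never scored.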

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme.

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodoroou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36



without the support received from the EXPERT project (FP7/2007-2013 under REA grant agreement no. 317471, http://expert-itn.eu).

Constantin Orasan and Rohit Gupta


Organizers

Constantin Orasan, University of Wolverhampton, UK

Rohit Gupta, University of Wolverhampton, UK

Program Committee

Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk

Invited Speaker

Marcello Federico, Fondazione Bruno Kessler, Italy


Table of Contents

Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín 1

Spotting false translation segments in translation memories
Eduard Barbu 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith 36


Conference Program

9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015.

Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones

C/ Cólquide 6, portal 2, 3º I, 28230 Las Rozas, Madrid, Spain

carla.parra@hermestrans.com

Abstract

This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.

1 Introduction

Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (Autosuggest in SDL Trados Studio1 and Muses in memoQ2), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

1 www.sdl.com
2 www.memoq.com

In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to make the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite the positive results reported, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received; Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation of a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.

2 Our company

Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high-quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).

3 Internal survey

We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what was, in their view, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically and it is difficult to estimate

3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

4 memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html#repair_resource3.html

when such reorganisation should be carried out. Scheduling it to be run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

Ideas more related to NLP included the automatic correction of orthotypography in the target language and allowing for multilingual TMs, where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, a language pair has to be selected when starting a new project. This leads to an underuse of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.5

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's version of this functionality: fragment assembly.6

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks

5 For SDL Studio there seems to be an external app, AnyTM, which allows users to use TMs with different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html

6 For further information, see http://kilgray.com/memoq/2015/help-en/index.html#fragment_assembly.html


for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source sentence words that could not be retrieved.

Figure 1: memoQ's fragment assembly functionality.

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.

Examples (1) and (2) show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa

(2) EN: The window will show the following: The application will close
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación

Now imagine we need to translate the sentence in (3):

(3) EN: The following message then appears: The application will close

Taking the part of (1) before the colon and the part of (2) after the colon, we would be able to produce the right translation, as shown in (4):

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación

In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving such translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have translated already.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.


4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in (5) and (6) could appear:

(5) EN: Installing new services… Service XXXX for premium clients [2]
ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment including only the text before the ellipsis appears, as in Example (7), the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which only the fragment before or after the colon appears.

(7) Installing new services

In these cases we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
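This procedure can be sketched as follows (our own simplified illustration, assuming that splitting at the first separator on each side is sufficient):

```python
# Hedged sketch of the ellipsis/colon rule: it applies only when both sides
# of the translation unit contain the separator, and yields two new pairs.
def split_on_separator(src, tgt, sep=":"):
    if sep in src and sep in tgt:
        s_head, s_tail = src.split(sep, 1)
        t_head, t_tail = tgt.split(sep, 1)
        return [(s_head.strip(), t_head.strip()),
                (s_tail.strip(), t_tail.strip())]
    return []   # rule does not apply: no new segments

pairs = split_on_separator(
    "The following message then appears: Click accept to run the program",
    "Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa")
```

Applied to Example (1), this yields exactly the two sub-segment pairs reused in Example (4).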

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets, or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples (8)–(10) appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters as well as the sentence without such content.

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets, or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) a sentence removing those characters and the content between them;

(b) a fragment starting at the opening character and finishing at the closing one, including the content within (in this fragment the parentheses, square brackets or curly braces are maintained);

(c) a fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
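For parentheses, the three fragments (a)–(c) could be extracted as in this simplified sketch (our own illustration; it handles one bracketed span per sentence):

```python
import re

PAREN = re.compile(r"\(([^)]*)\)")   # one parenthesised span

def bracket_fragments(src, tgt):
    ms, mt = PAREN.search(src), PAREN.search(tgt)
    if not (ms and mt):              # rule requires the span on both sides
        return []
    return [
        # (a) the sentence without the span and without the brackets
        (PAREN.sub("", src).replace("  ", " ").strip(),
         PAREN.sub("", tgt).replace("  ", " ").strip()),
        # (b) the span including the brackets
        (ms.group(0), mt.group(0)),
        # (c) the span content only
        (ms.group(1), mt.group(1)),
    ]

frags = bracket_fragments(
    "Creates an installation package (if it was not created earlier)",
    "Crea un paquete de instalación (si no se creó antes)")
```

The same pattern, with a different regular expression, would cover square brackets and curly braces.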

At this preliminary stage, we considered that when a sentence has several clauses in parentheses, square brackets, or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples (11)–(12) illustrate this kind of segments.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados

(12) EN: If you clear the <1>Inherit settings from parent policy</1> check box in the <2>Settings inheritance</2> section of the <3>General</3> section in the properties window of an inherited policy, the lock is lifted for that policy
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria</1> en la sección <2>Herencia de configuración</2> que aparece en la sección <3>General</3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
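A simplified sketch of this rule (our own illustration, assuming inline tags of the form <1>…</1> and straight double quotes):

```python
import re

TAG = re.compile(r"</?\d+>")          # opening or closing inline tag
QUOTE = re.compile(r'"([^"]+)"')

def strip_and_extract(sentence):
    """Return the sentence with tags/quotes removed, plus the text of each
    tagged or quoted span as candidate new segments."""
    spans = [text for _, text in re.findall(r"<(\d+)>(.*?)</\1>", sentence)]
    spans += QUOTE.findall(sentence)
    cleaned = QUOTE.sub(r"\1", TAG.sub("", sentence))
    return re.sub(r"\s{2,}", " ", cleaned).strip(), spans

cleaned, spans = strip_and_extract(
    "If you clear the <1>Inherit settings from parent policy</1> check box")
```

Running this on both sides of a translation unit would give the tag-free sentence pair plus one new segment pair per tagged span.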

5 Pilot test

Before integrating our system into a CAT tool and into our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English, to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6,280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95–99%            4          2
85–94%            0          0
75–84%          285         14
50–74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and was thus the biggest one. Table 2 summarises the size of the three TMs.

             Segments   Words (EN)   Words (ES)      Total
Project TM      16842       212472       244159     456631
Product TM      20923       274542       317797     592339
Client TM      256099      3427861      3951732    7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM used.


                 Segments   Words EN   Words ES   Words total
new Project TM      6776      56973      66297        123270
new Product TM      7760      71034      83125        154159
new Client TM      74041     662714     769705       1432419

Table 3: Size of the new TMs generated using our approach.

As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

Type              Proj. TM   Prod. TM   Client TM
Ellipsis                 7          6          50
Colon                 1801       2094       17894
Parenthesis           1361       1621       29637
Square brackets         78         73         598
Curly braces             0          0           0
Quotations            1085       1146        8797
Tags                  2523       2892       17792

Table 4: Number of newly generated segments per type and TM used.
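The paired-delimiter strategies in Table 4 can be sketched with regular expressions. This is a minimal illustration under our own assumptions (the exact extraction rules of Section 4 are not reproduced here, and the ellipsis and colon strategies are omitted):

```python
import re

# Hypothetical patterns approximating four of the segment-generator
# categories of Table 4: text between parentheses, square brackets,
# curly braces, quotation marks, and XML-like tags.
PATTERNS = {
    "parenthesis": re.compile(r"\(([^()]+)\)"),
    "square_brackets": re.compile(r"\[([^\[\]]+)\]"),
    "curly_braces": re.compile(r"\{([^{}]+)\}"),
    "quotations": re.compile(r"\"([^\"]+)\""),
    "tags": re.compile(r"<(\w+)[^>]*>([^<]+)</\1>"),
}

def extract_fragments(segment):
    """Return candidate sub-segments found inside paired delimiters."""
    fragments = {}
    for name, pattern in PATTERNS.items():
        hits = pattern.findall(segment)
        if name == "tags":  # findall returns (tag, text) tuples here
            hits = [text for _, text in hits]
        fragments[name] = [h.strip() for h in hits if h.strip()]
    return fragments

frags = extract_fragments('Press <b>Save</b> (the "Save all" button).')
```

Extracting the same spans from the aligned target segment would then yield the new bi-segments stored in the new TMs.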

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segments TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM, and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9) or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%         0    5    5    5    5    5    5    5    5     5     5
95-99%       2    3    3    4    4    4    9    9    9     9     9
85-94%       0    1    1    1    1    1    2    2    2     2     2
75-84%      14    9   14    9   15   15   12   16   16    16    16
50-74%     187  130  198  136  202  202  180  223  226   226   226
No match   142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 187 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this was proven true, a productivity increase may be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory - perceptions, expectations and reality. The Journal of Specialised Translation, (23):64-88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25-30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

1 https://mymemory.translated.net

[Figure: pie chart showing the breakdown of record sources]

Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill which interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.


It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (l_s - l_d) / sqrt(3.4 * (l_s + l_d))    (1)
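Equation 1 can be computed directly from segment lengths. A minimal sketch, assuming lengths are counted in characters (as in Gale and Church's original work):

```python
import math

def church_gale_score(source, target):
    """Modified Church-Gale length difference (Tiedemann, 2011):
    near 0 for segments of similar length, large in absolute value
    when one side is much longer than the other."""
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# A plausible translation pair yields a score close to zero...
ok = church_gale_score("How are you?", "Come stai?")
# ...while a partial translation drifts away from zero.
bad = church_gale_score(
    "Early 1980s Muirfield C.C. championship results",
    "Primi anni 1980")
```

The classifiers are then expected to learn where on this scale the suspicious bi-segments lie.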

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [-4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the

3 This simple idea is implemented in many sentence aligners.
4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

[Figure 2: histogram of Church-Gale scores with fitted normal curve]

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

[Figure 3: histogram of Church-Gale scores]

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


[Figure 4: histogram of Church-Gale scores with fitted normal curve]

Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [-131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:

- Training Set: it contains 1243 bi-segments and has 373 negative examples.

- Test Set: it contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
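The biased draw from the Collaborative Set can be sketched as follows. This is a minimal illustration; the exact sampling weights are not specified in the text, so weighting by the absolute Church-Gale score is our own assumption:

```python
import random

def biased_sample(bisegments, cg_scores, k, seed=0):
    """Draw k bi-segments, favouring those whose Church-Gale score
    lies far from the center of the distribution (|score| large),
    where spam and partial translations are more likely."""
    rng = random.Random(seed)
    weights = [abs(s) + 1e-6 for s in cg_scores]  # assumed weighting scheme
    return rng.choices(bisegments, weights=weights, k=k)

pairs = [("How are you?", "Come stai?"),                      # plausible translation
         ("How are you?", "Bene"),                            # chat-style error
         ("Early 1980s Muirfield CC", "Primi anni 1980")]     # partial translation
scores = [0.23, 0.55, 1.30]
sample = biased_sample(pairs, scores, k=2)
```

The drawn candidates would still be validated manually, as described above, before entering the training set.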

[Figure 5: quantile-quantile plot against the normal distribution]

Figure 5: The Q-Q plot for the Matecat Set.

[Figure 6: quantile-quantile plot against the normal distribution]

Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

- same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

- cg_score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

- has_url: The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

- has_tag: The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

- has_email: The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

- has_number: The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.

- has_capital_letters: The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

- has_words_capital_letters: The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

- punctuation_similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

- tag_similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/is not present in the source and the target segments).

- email_similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity. This feature combines with the feature has_email to exhaust all possibilities.

- url_similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.

- number_similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.


- bisegment_similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

- capital_letters_word_difference: The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.

- only_capletters_dif: The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.

- lang_dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
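The similarity features above can all be sketched as a cosine over count vectors; taking punctuation as the example, a minimal illustration (the exact punctuation inventory used by the authors is an assumption):

```python
import math
import string

def punctuation_vector(segment, marks=string.punctuation):
    """Count occurrences of each punctuation mark in the segment."""
    return [segment.count(m) for m in marks]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def punctuation_similarity(source, target):
    return cosine(punctuation_vector(source), punctuation_vector(target))

same = punctuation_similarity("Hello, world!", "Ciao, mondo!")  # identical punctuation
diff = punctuation_similarity("Hello, world!", "Ciao mondo")    # punctuation dropped
```

The tag, email, URL, and number similarity features follow the same pattern, with the count vectors built over tags, email addresses, URLs, and numbers respectively.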

All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails, or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.

The translation of the source English segment to Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

- Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the rules inferred are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

- Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.

- Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

- Support Vector Machines with the linear kernel: Support Vector Machines are one of the most used classification algorithms.

- Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

- K-Nearest Neighbors: This algorithm classifies a new instance based on the distance it has to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm              Precision   Recall   F1
Random Forest             0.95      0.97   0.96
Decision Tree             0.98      0.97   0.97
SVM                       0.94      0.98   0.96
K-Nearest Neighbors       0.94      0.98   0.96
Logistic Regression       0.92      0.98   0.95
Gaussian Naive Bayes      0.86      0.96   0.91
Baseline Uniform          0.69      0.53   0.60
Baseline Stratified       0.70      0.73   0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
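This evaluation protocol can be reproduced with scikit-learn along these lines. A minimal sketch on synthetic data: the real 17-feature vectors and labels are not public, so the toy matrix X and labels y below are purely illustrative stand-ins.

```python
# Three-fold stratified cross-validation with dummy baselines,
# mirroring the protocol of the first evaluation.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((1243, 5))                    # 1243 training bi-segments, toy features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # toy labels: 1 = true translation

models = {
    "Logistic Regression": LogisticRegression(),
    "SVM (linear)": LinearSVC(),
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = {name: cross_val_score(m, X, y, cv=cv, scoring="f1").mean()
          for name, m in models.items()}
```

On the real features, each of the six classifiers listed in section 4.2 would be slotted into the same loop.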

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm              Precision   Recall   F1
Random Forest             0.85      0.63   0.72
Decision Tree             0.82      0.69   0.75
SVM                       0.82      0.81   0.81
K-Nearest Neighbors       0.83      0.66   0.74
Logistic Regression       0.80      0.80   0.80
Gaussian Naive Bayes      0.76      0.61   0.68
Baseline Uniform          0.71      0.72   0.71
Baseline Stratified       0.70      0.51   0.59

Table 2: The results of the classification on the test set.

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives, and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where precision is increased even if the recall drops. We make some suggestions in the next section.
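The ~10% figure follows directly from the confusion counts above; a quick arithmetic check (treating "true translation" as the positive class, and noting that rounding here may differ slightly from the scores reported in Table 2):

```python
# Metrics derived from the SVM confusion counts reported above.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn            # 309 test examples

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Fraction of the test set that would be wrongly discarded
# (good bi-segments classified as false translations).
false_negative_fraction = fn / total
```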

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and sparser for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
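A minimal sketch of the second idea: treat the classifier's probability output as a ranking and pick the lowest decision threshold whose predicted-positive set still reaches a target precision. The helper and the toy scores below are illustrative, not the actual MyMemory data:

```python
def high_precision_threshold(scored, target=0.9):
    """scored: (probability_of_positive, true_label) pairs, label 1 = good
    bi-segment. Returns the lowest threshold whose predicted-positive set
    reaches the target precision on these examples, or None if none does."""
    best = None
    for threshold, _ in sorted(scored, reverse=True):
        kept = [label for prob, label in scored if prob >= threshold]
        if sum(kept) / len(kept) >= target:
            best = threshold
    return best

# Toy probabilities, e.g. as returned by LogisticRegression.predict_proba.
scores = [(0.95, 1), (0.90, 1), (0.80, 1), (0.70, 0), (0.60, 1), (0.50, 0)]
threshold = high_precision_threshold(scores)  # 0.80 on this toy data
```

Everything at or above the returned threshold would be kept automatically; the rest would be left for manual inspection rather than deleted.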

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments: for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
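The dictionary-based check could look roughly like this: score a bi-segment by the fraction of source words whose known translations occur on the target side, and flag pairs with low coverage. The tiny EN-IT dictionary here is invented for illustration; in practice it would be extracted with a word aligner:

```python
def coverage(source, target, dictionary):
    """Fraction of source tokens that have at least one dictionary
    translation present in the target segment."""
    src = source.lower().split()
    tgt = set(target.lower().split())
    hits = sum(1 for tok in src if dictionary.get(tok, set()) & tgt)
    return hits / len(src)

# Invented toy dictionary; a real one would come from an aligner.
en_it = {"white": {"bianca", "bianco"}, "house": {"casa"}}
score = coverage("white house", "casa bianca", en_it)  # 1.0
```

Because lookups are simple set intersections over an indexed dictionary, this scales to a full database far more cheaply than translating every segment.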

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015.

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches: if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence will be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance of a match being found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; clause matches are therefore more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas and allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions to a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
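The effect is easy to reproduce with a word-level Levenshtein similarity. The sketch below is our own illustration (not the matching code of any particular TM tool): it builds sentences (a) and (b) and shows the match score falling to 25%, far below any usual fuzzy threshold:

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance over word lists.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

# Sentence (a): w1..w20; sentence (b): w1..w5 followed by w21..w35.
a = [f"w{i}" for i in range(1, 21)]
b = [f"w{i}" for i in range(1, 6)] + [f"w{i}" for i in range(21, 36)]

distance = levenshtein(a, b)                     # 15 substitutions
similarity = 1 - distance / max(len(a), len(b))  # 0.25
```

A 25% similarity is well under even the lowest thresholds discussed later (30-40%), so the shared clause is never surfaced.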

In this paper we compare the effect of

performing clause splitting before retrieving

matches in a TM In each experiment we

identify the difference in matching

performance when a TM tool is used as is and

when the input file and the translation memory

are first run through a clause splitter The

clause splitter we use in this study is a modified version of the one described in Puscasu (2004)

The original version employs both machine

learning and linguistic rules to identify finite

clauses for both English and Romanian but in

this version only the rule-based module is

used Puscasu (2004) developed a clause

splitting method for both English and

Romanian and to maintain consistency

between the two languages her definition of a

clause is the one prescribed by the Romanian

Academy of Grammar which is that a clause is

18

group of words containing a finite verb

Nonfinite and verbless clauses are therefore

not considered The reported F-measure for

identifying complete clauses in English is

8139 (Marsic 2011)

In this study the clause splitter is used on

both the segments in the input file and the translation memory database After processing

the input and the TM segments with the clause

splitter these were then imported into existing

TM tools to examine how well these tools will

perform if clause splitting is used in pre-

processing
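To make the idea concrete, here is a heavily simplified, purely illustrative splitter in the spirit of the rule-based module: break at a few conjunction boundaries and accept a chunk as a clause only once it contains a finite verb. The conjunction list and the verb check are toy stand-ins for Puscasu's linguistic rules:

```python
import re

# Toy boundary markers; the real system uses rich linguistic rules.
BOUNDARY = re.compile(r"\s+(that|and|but|because|although)\s+", re.I)

def split_clauses(sentence, finite_verbs):
    parts = BOUNDARY.split(sentence)
    # Re-attach each conjunction to the chunk that follows it.
    chunks = [parts[0]] + [parts[i] + " " + parts[i + 1]
                           for i in range(1, len(parts), 2)]
    clauses, buf = [], ""
    for chunk in chunks:
        buf = (buf + " " + chunk).strip()
        if any(tok.lower() in finite_verbs for tok in buf.split()):
            clauses.append(buf)  # a chunk with a finite verb is a clause
            buf = ""
    if buf:  # verbless leftovers attach to the previous clause
        if clauses:
            clauses[-1] += " " + buf
        else:
            clauses.append(buf)
    return clauses

sentence = ("the ministry of defense once indicated that about 20000 "
            "soldiers were missing in the korean war")
clauses = split_clauses(sentence, {"indicated", "were"})
```

On this sentence the sketch yields the two finite clauses shown in the Table 1 example; a real splitter would of course detect finite verbs with POS tagging rather than a hand-supplied set.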

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008; in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, we argue that in this study it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contain more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:
  Input:          the ministry of defense once indicated that about
                  20000 soldiers were missing in the korean war
  Segment in TM:  the ministry of defense once indicated that about
                  20000 soldiers were missing in the korean war and
                  that the ministry of defense believes there may
                  still be some survivors

With clause splitting:
  Input:          - the ministry of defense once indicated
                  - that about 20000 soldiers were missing in the
                    korean war
  Segments in TM: - the ministry of defense once indicated
                  - that about 20000 soldiers were missing in the
                    korean war
                  - and that the ministry of defense believes
                  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:
  Input:          a member of the chart-topping collective so solid
                  crew dumped a loaded pistol in an alleyway
  Segment in TM:  a member of the rap group so solid crew threw away
                  a loaded gun during a police chase southwark crown
                  court was told yesterday

With clause splitting:
  Input:          a member of the chart-topping collective so solid
                  crew dumped a loaded pistol in an alleyway
  Segments in TM: - a member of the rap group so solid crew threw
                    away a loaded gun during a police chase
                  - southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    23.33%                 90.00%
Retrieved (Minimum threshold)    38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    14.00%                 88.89%
Retrieved (Minimum threshold)    14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as a segment with more than one clause not being split at all, or a segment being incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
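The test statistic itself is straightforward to reproduce. A sketch with invented per-segment scores, following the conventions above (unretrieved segments scoring 0, averaged clauses counted as one case); the real analysis was run in SPSS:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic over per-segment match scores."""
    diffs = [b - a for a, b in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Invented scores for four segments, without vs with clause splitting.
without = [0, 20, 0, 40]
with_splitting = [90, 95, 0, 100]
t = paired_t(without, with_splitting)  # about 2.85 on this toy data
```

With 150 paired cases rather than four, even modest mean differences yield the very small p-values reported in Table 5.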

SET A

WORDFAST             Without clause splitting   With clause splitting   p-value
Default threshold    19.23                      85.31                   0.000
Minimum threshold    29.28                      86.49                   0.000

TRADOS               Without clause splitting   With clause splitting   p-value
Default threshold    11.78                      83.88                   0.000
Minimum threshold    11.78                      88.07                   0.000

SET B

WORDFAST             Without clause splitting   With clause splitting   p-value
Default threshold    2.23                       17.21                   0.000
Minimum threshold    6.49                       30.62                   0.000

TRADOS               Without clause splitting   With clause splitting   p-value
Default threshold    2.71                       23.08                   0.000
Minimum threshold    16.83                      46.37                   0.000

Table 5: Paired t-test (mean match scores) on all experiments

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    2.67%                  17.84%
Retrieved (Minimum threshold)    10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    3.33%                  25.95%
Retrieved (Minimum threshold)    36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments were not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzikitl@auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, and do not allow the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, selecting the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVC widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.

The rest of the paper is organized as follows: section 2 discusses related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no sufficiently close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human-translated hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespace, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
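Commercial tools do not disclose their exact scoring formula, so purely as an illustration, the following sketch implements the common "1 - edit distance / length" formulation of a word-based fuzzy match score. The function names are ours, and the resulting percentage depends on tokenization details, so it need not match the figures quoted in the text exactly.

```python
def edit_distance(a, b):
    # Classic Levenshtein distance over token lists: the minimum number
    # of insertions, deletions and substitutions to turn a into b.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(source, tm_segment):
    # Word-based fuzzy match: 1 - distance / length of the longer segment.
    s, t = source.split(), tm_segment.split()
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

score = fuzzy_match(
    "Press Cancel to make the cancellation of your personal information",
    "Press Cancel to cancel your personal information")
```

Under this tokenization, sentences (1) and (2) yield a score of 0.6, a low match in line with the discussion above.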

In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains its full-verb counterpart. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed; for example, they can appear without a determiner (e.g. to make attention) or with one (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent (sentence (3)) are included in the TM. In addition to this, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and there will therefore be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, created for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
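NooJ's local grammars cannot be reproduced here, but the pattern they encode (support verb + optional determiner/adjective/adverb + predicate noun + optional preposition, rewritten as a full verb) can be sketched with a toy regular expression. The small nominalization table below is a hypothetical stand-in for NooJ's derivational dictionary, not the actual resource.

```python
import re

# Hypothetical stand-in for NooJ's derivational dictionary:
# predicate noun -> full verb ("deriver").
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = {"make", "makes", "take", "takes", "have", "has",
                 "give", "gives"}

# support verb + optional determiner/adjective/adverb
# + predicate noun + optional preposition
PATTERN = re.compile(
    r"\b(%s)\s+(?:\w+\s+)?(%s)(?:\s+of)?\b"
    % ("|".join(SUPPORT_VERBS), "|".join(NOMINALIZATIONS)),
    re.IGNORECASE)

def paraphrase_svc(sentence):
    # Replace each recognized SVC by its full-verb paraphrase.
    def repl(match):
        return NOMINALIZATIONS[match.group(2).lower()]
    return PATTERN.sub(repl, sentence)

result = paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information")
```

On sentence (1), this toy rule produces exactly sentence (2); the real module additionally handles tense, person and lexical constraints via the dictionary entries described above.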

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in the fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we achieve very high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84% and, finally, 27% in category 0-74% (No match + 50-74%). This is a clear indication that the paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for the experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the matches without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.

Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased; in addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy matching of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences have the same meaning but do not share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM for all fuzzy match ranges, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust are high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs, and we will support other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of the translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1, Centro Interuniversitário de Estudos Germanísticos. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, D. S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of the fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated. Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ         EN          probabilities                alignment
být větší  be greater  0.538 0.053 0.538 0.136      0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148      0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word be in the translation, and the second word větší to the second word, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
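As a small illustration (the helper names below are ours, not part of Moses), the alignment-point string can be parsed and checked for the three phenomena just mentioned:

```python
def parse_alignment(points):
    # "0-0 1-1" -> list of (source_index, target_index) pairs
    return [tuple(map(int, p.split("-"))) for p in points.split()]

def alignment_properties(pairs, src_len, tgt_len):
    # Detects: unaligned words (empty alignment), one-to-many links,
    # and reordering (opposite orientation) between the two sides.
    aligned_src = {s for s, _ in pairs}
    aligned_tgt = {t for _, t in pairs}
    empty = len(aligned_src) < src_len or len(aligned_tgt) < tgt_len
    one_to_many = (len(pairs) > len(aligned_src)
                   or len(pairs) > len(aligned_tgt))
    reordered = any(s1 < s2 and t1 > t2
                    for s1, t1 in pairs for s2, t2 in pairs)
    return {"empty": empty, "one_to_many": one_to_many,
            "reordered": reordered}

# "byt vetsi" -> "be greater" from Figure 1: a monotone 1-to-1 alignment.
pairs = parse_alignment("0-0 1-1")
props = alignment_properties(pairs, 2, 2)
```

For the Figure 1 entry, all three flags are False; an alignment string like "0-1 1-0" would instead be reported as reordered.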

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        • "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     • "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); the output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.

During the subsegment non-overlapping combination, the acceptability of each combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
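The actual system uses a KenLM model trained on enTenTen (see the footnote); as a self-contained sketch of the idea of a combined n-gram fluency score, the toy count-based model below stands in for the real language model, with add-one smoothing instead of KenLM's estimates.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class ToyNgramLM:
    # Hypothetical stand-in for the KenLM model used in the paper:
    # relative-frequency n-gram counts with add-one smoothing, n = 1..5.
    def __init__(self, corpus, max_n=5):
        self.max_n = max_n
        self.counts = {n: Counter() for n in range(1, max_n + 1)}
        for sent in corpus:
            toks = sent.split()
            for n in range(1, max_n + 1):
                self.counts[n].update(ngrams(toks, n))

    def fluency(self, sentence):
        # Combined score: mean log-probability per n-gram,
        # summed over the orders n = 1..max_n.
        toks = sentence.split()
        score = 0.0
        for n in range(1, self.max_n + 1):
            grams = ngrams(toks, n)
            if not grams:
                continue
            total = sum(self.counts[n].values()) + 1
            score += sum(math.log((self.counts[n][g] + 1) / total)
                         for g in grams) / len(grams)
        return score

lm = ToyNgramLM(["the brake hose is not twisted",
                 "the data area may have an arbitrary shape"])
good = lm.fluency("the brake hose is not twisted")
bad = lm.fluency("brake the twisted not hose is")
```

A scrambled candidate receives a lower combined score than a fluent one, which is exactly what lets the acceptability check order the combinations.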

The quality of the subsegment translation can beincreased by filtering the used subsegments onnoun phrase boundaries

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all the other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments; in each iteration it generates new (longer) subsegments and discards one processed subsegment.
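The joining loop described above can be sketched as follows. This is a simplified reading of the algorithm that works only with (start, end) token spans and adds a seen-set to guarantee termination; the span-based representation is our assumption, and the translation concatenation and fluency filtering are omitted.

```python
def try_join(a, b):
    # Join two spans if they overlap (JOINO) or neighbour (JOINN);
    # spans are (start, end) with end exclusive.
    if a[1] >= b[0] and b[1] >= a[0]:
        return (min(a[0], b[0]), max(a[1], b[1]))
    return None

def join_subsegments(subsegments):
    # Greedily joins the longest spans first, mirroring the lists
    # I and T of the algorithm described above.
    I = sorted(set(subsegments), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(I)
    results = []
    while I:
        current = I.pop(0)
        T = []
        for other in I:
            joined = try_join(current, other)
            if joined is not None and joined not in seen:
                seen.add(joined)
                T.append(joined)
        T.sort(key=lambda s: s[1] - s[0], reverse=True)
        I = T + I            # new, longer spans are tried first
        results.append(current)
    return results
```

For the spans (0, 2), (2, 4) and (5, 7), the neighbouring first two are joined into (0, 4), while the disjoint (5, 7) stays separate.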

² In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all the tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document D can be translated using a segment from the respective TM; translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
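The match categories used in the analysis can be expressed as a simple bucketing function (a sketch with our own function name; the thresholds follow the category rows reported in the tables):

```python
def match_category(score_percent):
    # Buckets a fuzzy match score (in %) into the categories
    # used in the MemoQ-style coverage analysis.
    if score_percent >= 100:
        return "100"
    if score_percent >= 95:
        return "95-99"
    if score_percent >= 85:
        return "85-94"
    if score_percent >= 75:
        return "75-84"
    if score_percent >= 50:
        return "50-74"
    return "No match"

categories = [match_category(s) for s in (100, 97, 87, 80, 60, 30)]
```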


Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16   10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For a comparison, we have also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release; see Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

Company in-house translation memory
feature    SUB     OJ      NJ      NS
prec       0.60    0.63    0.70    0.66
recall     0.67    0.74    0.74    0.71
F1         0.64    0.68    0.72    0.68
METEOR     0.31    0.37    0.38    0.38

DGT
prec       0.76    0.93    0.91    0.81
recall     0.78    0.86    0.88    0.85
F1         0.77    0.89    0.89    0.83
METEOR     0.40    0.50    0.51    0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

account just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
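The planned fluency filter can be sketched roughly as follows. Everything here is illustrative: the toy bigram table, the `lm_score` function, and the threshold are invented stand-ins for a real n-gram language model query (the paper's references point to KenLM for that role).

```python
# Toy bigram "language model" standing in for a real LM such as KenLM;
# all names and probabilities here are illustrative, not the authors' code.
BIGRAM_LOGPROB = {
    ("<s>", "the"): -0.5, ("the", "data"): -1.0, ("data", "area"): -1.0,
    ("<s>", "an"): -0.7, ("an", "arbitrary"): -1.5, ("arbitrary", "shape"): -1.2,
}
OOV_LOGPROB = -7.0  # penalty for unseen bigrams

def lm_score(phrase):
    """Average per-bigram log-probability of a candidate translation."""
    words = ["<s>"] + phrase.lower().split()
    total = sum(BIGRAM_LOGPROB.get(pair, OOV_LOGPROB)
                for pair in zip(words, words[1:]))
    return total / (len(words) - 1)

def filter_candidates(phrases, threshold=-5.0):
    """Keep only candidate phrase translations whose fluency score passes
    the threshold; the rest are excluded from segment combination."""
    return [p for p in phrases if lm_score(p) >= threshold]

candidates = ["the data area", "area data the the", "an arbitrary shape"]
print(filter_candidates(candidates))
```

With this toy model, the scrambled candidate "area data the the" scores below the threshold and is dropped, which is the intended effect of the filter.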

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205gmailcom sudipnaskarcsejdvuacin mvelamxuni-saarlanddesantanupal|marcoszampieri|josefvangenabithuni-saarlandde

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows

it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
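The retrieval step just described, scoring every TM segment against the input and keeping the most similar ones, can be sketched as below. This is a simplified stand-in, not the tool's implementation: plain word-level Levenshtein distance replaces TER (real TER additionally allows block shifts), and the function names are invented for illustration.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists (a simplified
    stand-in for TER, which also permits block shifts)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete from a
                          d[i][j - 1] + 1,        # insert from b
                          d[i - 1][j - 1] + cost) # match / substitute
    return d[m][n]

def rank_tm(test_segment, tm_segments, k=5):
    """Return the k TM source segments with the lowest edit rate
    (edits divided by the length of the test segment)."""
    test = test_segment.split()
    scored = [(edit_distance(test, s.split()) / max(len(test), 1), s)
              for s in tm_segments]
    scored.sort(key=lambda x: x[0])
    return scored[:k]

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
print(rank_tm("we would like a table by the window", tm, k=2))
```

On the paper's own example pair, this yields an edit rate of 0.5 (4 edits over 8 input words), consistent with the TER alignment shown in Section 3.1.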

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0], ["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0], ["window","window",C,0], [".",".",C,0]

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
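The matched/unmatched split that drives the color coding can be derived from such an alignment roughly as follows. The tuple format below is a simplification invented for illustration; tercom's actual output syntax differs.

```python
# Simplified TER-style alignment: (TM-match word, input word, edit op).
# 'C' = correct (no edit); 'D'/'S'/'I' require post-editing.
alignment = [
    ("we", "we", "C"), ("want", "", "D"), ("to", "would", "S"),
    ("have", "like", "S"), ("a", "a", "C"), ("table", "table", "C"),
    ("near", "by", "S"), ("the", "the", "C"), ("window", "window", "C"),
]

def split_matches(alignment):
    """Collect TM-match words that need no editing (op 'C') and those
    that require post-editing (any other op)."""
    matched = [tm for tm, _, op in alignment if op == "C"]
    unmatched = [tm for tm, _, op in alignment if op != "C" and tm]
    return matched, unmatched

green, red = split_matches(alignment)
print("green:", green)  # displayed in green in the interface
print("red:", red)      # displayed in red in the interface
```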

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
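The effect of cheap deletions can be demonstrated with a weighted edit distance. This is a sketch of the idea, not the authors' actual scoring mechanism; the weight values are invented for illustration.

```python
def weighted_edit_cost(test, tm, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Levenshtein with per-operation weights. 'Deletion' removes a
    TM-match word absent from the test segment (cheap when post-editing);
    'insertion' covers a test word missing from the TM match."""
    m, n = len(test), len(tm)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_ins
    for j in range(1, n + 1):
        d[0][j] = j * w_del
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (0.0 if test[i - 1] == tm[j - 1] else w_sub)
            d[i][j] = min(sub, d[i - 1][j] + w_ins, d[i][j - 1] + w_del)
    return d[m][n]

test = "how much does it cost".split()
seg1 = "how much does it cost to the holiday inn by taxi".split()
seg2 = "how much".split()

for w_del in (1.0, 0.1):  # equal weights vs. cheap deletions
    c1 = weighted_edit_cost(test, seg1, w_del=w_del)
    c2 = weighted_edit_cost(test, seg2, w_del=w_del)
    best = "segment 1" if c1 < c2 else "segment 2"
    print(f"w_del={w_del}: seg1={c1:.1f} seg2={c2:.1f} -> {best}")
```

With equal weights, segment 2 wins (3 insertions vs. 6 deletions); lowering the deletion weight flips the preference to segment 1, matching the argument above.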

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. A green portion implies a matched fragment and a red portion implies a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green, and the unmatched fragments in red.
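Projecting the source-side match/mismatch coloring onto the target through the word alignment can be sketched as follows. Everything here is illustrative: the romanized target tokens are a gloss, the 0-based alignment dictionary is a hand-converted version of the example above, and GIZA++'s real output format differs in syntax.

```python
src = "we want to have a table near the window".split()
tgt = ["amra", "janalar", "kachhe", "ekta", "tebil", "chai"]  # romanized gloss

# Source index -> set of aligned target indices (0-based here; the
# example alignment in the text is 1-based).
align = {0: {0}, 1: {5}, 4: {3}, 5: {4}, 6: {2}, 8: {1}}

# Source positions marked 'C' by the TER alignment with the input
# sentence: 'we', 'a', 'table', 'the', 'window'.
matched_src = {0, 4, 5, 7, 8}

def color_target(tgt, align, matched_src):
    """A target word is green iff every source word aligned to it matched;
    unaligned or mismatch-linked target words are red."""
    colors = []
    for j, word in enumerate(tgt):
        sources = [i for i, js in align.items() if j in js]
        ok = bool(sources) and all(i in matched_src for i in sources)
        colors.append((word, "green" if ok else "red"))
    return colors

print(color_target(tgt, align, matched_src))
```

Here the words aligned to "want" and "near" come out red, exactly the fragments the post-editor would need to change.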

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you've got the wrong number

3 you are wrong

4 you pay me


5 you're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one

or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature exists to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
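A minimal sketch of this posting-list filter is given below; the stop-word list and function names are invented for illustration.

```python
from collections import defaultdict

STOP = {"the", "a", "to", "of"}  # illustrative stop-word list

def build_index(tm_sentences):
    """Map each lowercased, non-stop-word surface form to the set of
    TM sentence ids containing it: a posting list per vocabulary word."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sentences):
        for word in sent.lower().split():
            if word not in STOP:
                index[word].add(sid)
    return index

def candidates(index, query):
    """Only TM sentences sharing at least one vocabulary word with the
    query need to be scored with TER; the rest are skipped."""
    ids = set()
    for word in query.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
idx = build_index(tm)
print(sorted(candidates(idx, "we would like a table by the window")))
```

For the example query, only the first TM sentence shares a vocabulary word, so the expensive TER computation runs on one candidate instead of three.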

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodoroou, Konstantinos 24
Horak, Ales 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36



Organizers

Constantin Orasan, University of Wolverhampton, UK

Rohit Gupta, University of Wolverhampton, UK

Program Committee

Manuel Arcedillo, Hermes, Spain
Juanjo Arevalillo, Hermes, Spain
Eduard Barbu, Translated, Italy
Yves Champollion, WordFast, France
Gloria Corpas, University of Malaga, Spain
Maud Ehrmann, EPFL, Switzerland
Kevin Flanagan, Swansea University, UK
Gabriela Gonzalez, eTrad, Argentina
Manuel Herranz, Pangeanic, Spain
Qun Liu, DCU, Ireland
Ruslan Mitkov, University of Wolverhampton, UK
Gabor Proszeky, Morphologic, Hungary
Uwe Reinke, Cologne University of Applied Sciences, Germany
Michel Simard, NRC, Canada
Mark Shuttleworth, UCL, UK
Masao Utiyama, NICT, Japan
Andy Way, DCU, Ireland
Marcos Zampieri, Saarland University and DFKI, Germany
Ventsislav Zhechev, Autodesk

Invited Speaker

Marcello Federico, Fondazione Bruno Kessler, Italy


Table of Contents

Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín 1

Spotting false translation segments in translation memories
Eduard Barbu 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodoroou 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Ales Horak and Marek Medveď 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith 36


Conference Program

9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015

Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones

C/ Cólquide 6, portal 2, 3.º I, 28230 Las Rozas, Madrid, Spain

carla.parra@hermestrans.com

Abstract

This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.

1 Introduction

Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (AutoSuggest in SDL Trados Studio1 and Muses in memoQ2), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

1 www.sdl.com
2 www.memoq.com

In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to make the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received; Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.

2 Our company

Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).

3 Internal survey

We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what, according to them, was the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e. re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically, and it is difficult to estimate

3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

4 memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e. re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html

when such reorganisation should be carried out. Scheduling it to be run automatically when a particular number of segments (e.g. 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

Ideas more related to NLP included the automatic correction of orthotypography in the target language and allowing for multilingual TMs, where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, a language pair has to be selected when starting a new project. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g. German > Italian) than the ones selected for that specific project (e.g. German > French), and the same source sentence appears in the text currently being translated, the translation into a different target language will not be shown (i.e. the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.5

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's implementation of such functionality: fragment assembly.6

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks

5 For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html

6 For further information, see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html


for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments will be inserted in the target segment, one after another, without the source sentence words that could not be retrieved.

Figure 1: memoQ's fragment assembly functionality

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of appearance of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.
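The left-to-right behaviour just described can be illustrated with a toy sketch. This is our own simplification for illustration, not memoQ's actual algorithm; the fragment memory and function names are hypothetical, with entries taken from the examples below:

```python
# Toy illustration of order-preserving fragment assembly (NOT memoQ's actual
# algorithm): known sub-segment translations are substituted into the source
# sentence from left to right, so the output always keeps the source order.
fragment_tm = {  # hypothetical sub-segment memory
    "The following message then appears:": "Aparece el siguiente mensaje:",
    "The application will close": "Se cerrará la aplicación",
}

def assemble(sentence):
    out = sentence
    for src, tgt in fragment_tm.items():
        out = out.replace(src, tgt)  # translation inserted at the source position
    return out

print(assemble("The following message then appears: The application will close"))
# prints: Aparece el siguiente mensaje: Se cerrará la aplicación
```

Here the source order happens to match Spanish; for language pairs that require reordering, this left-to-right substitution would yield an unnatural target sentence, which is exactly the limitation noted above.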

Examples 1 and 2 show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa

(2) EN: The window will show the following: The application will close
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación

Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación

In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have translated already.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.


4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:

(5) EN: Installing new services… Service XXXX for premium clients [2]
ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears:

(7) Installing new services

In these cases, we proceeded as follows:

1. Check that there is an ellipsis/a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
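The steps above can be sketched in a few lines. This is a minimal illustration under our own naming conventions, not the actual implementation; it splits a bi-segment only when the mark occurs mid-segment on both sides:

```python
import re

# Split a bilingual segment on an ellipsis or colon that occurs in BOTH the
# source and the target, and emit the two halves as new TM segment pairs.
# Function and variable names are our own; the real tool may differ.
SPLIT_RE = re.compile(r"(\.\.\.|…|:)")

def split_on_ellipsis_or_colon(src, tgt):
    """Return a list of (src_fragment, tgt_fragment) pairs, or [] if the
    segment cannot be split consistently on both sides."""
    src_parts = SPLIT_RE.split(src, maxsplit=1)  # [before, mark, after]
    tgt_parts = SPLIT_RE.split(tgt, maxsplit=1)
    if len(src_parts) != 3 or len(tgt_parts) != 3:
        return []  # mark missing on one side
    src_before, _, src_after = (p.strip() for p in src_parts)
    tgt_before, _, tgt_after = (p.strip() for p in tgt_parts)
    if not (src_before and src_after and tgt_before and tgt_after):
        return []  # mark is not mid-segment
    return [(src_before, tgt_before), (src_after, tgt_after)]

pairs = split_on_ellipsis_or_colon(
    "The following message then appears: Click accept to run the program",
    "Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa")
```

Applied to Example 1, this yields the two new segment pairs ("The following message then appears" / "Aparece el siguiente mensaje") and ("Click accept to run the program" / "Haga click en aceptar para ejecutar el programa").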

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content:

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) A sentence removing those characters and the content between them.

(b) A fragment starting at the opening character and finishing on the closing one, including the content within. In this fragment the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
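The three fragments of step 2 can be sketched for one language side as follows (a simplified illustration with our own function names; source and target fragments would then be paired positionally):

```python
import re

# For a sentence containing bracketed material, keep (a) the sentence without
# the bracketed span, (b) the span with its brackets, and (c) the span content
# without brackets. Names are our own; the real implementation may differ.
BRACKET_PAIRS = [("(", ")"), ("[", "]"), ("{", "}")]

def bracket_fragments(sentence):
    for open_b, close_b in BRACKET_PAIRS:
        pattern = re.compile(
            re.escape(open_b) + r"([^()\[\]{}]*)" + re.escape(close_b))
        m = pattern.search(sentence)
        if m:
            without = (sentence[:m.start()] + sentence[m.end():]).strip()
            return [without,        # (a) sentence minus bracketed span
                    m.group(0),     # (b) span including the brackets
                    m.group(1)]     # (c) span content without the brackets
    return []

frags = bracket_fragments(
    "Creates an installation package for application installation "
    "(if it was not created earlier)")
```

On Example 8 this returns the sentence without the parenthesis, "(if it was not created earlier)", and "if it was not created earlier" as three candidate segments.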

At this preliminary stage, we considered that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments:

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
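For tagged text, this treatment can be sketched as below. The <1>...<1> tag convention follows Example 12; the function names and regular expression are our own simplification, not the actual tool:

```python
import re

# Handle double tags of the form <1>...<1> (as in Example 12): return the
# sentence with the tags removed but their content kept, plus each tagged
# span as a new segment of its own. Names are our own.
TAG_RE = re.compile(r"<(\d+)>(.*?)<\1>")

def strip_tags(sentence):
    inner_segments = [m.group(2) for m in TAG_RE.finditer(sentence)]
    cleaned = TAG_RE.sub(r"\2", sentence)  # drop tags, keep their content
    return cleaned, inner_segments

cleaned, inner = strip_tags(
    "If you clear the <1>Inherit settings from parent policy<1> check box "
    "in the <2>Settings inheritance<2> section")
```

Here `cleaned` is the tag-free sentence kept as one new segment, and `inner` holds "Inherit settings from parent policy" and "Settings inheritance" as further new segments; quotation marks can be handled analogously.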

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English, to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6,280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95–99%            4          2
85–94%            0          0
75–84%          285         14
50–74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ using the project TM

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and was thus the biggest one. Table 2 summarises the size of the three TMs.

            Segments   Words EN   Words ES      Total
Project TM     16842     212472     244159     456631
Product TM     20923     274542     317797     592339
Client TM     256099    3427861    3951732    7379593

Table 2: Size of the different TMs used

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach

Table 4 breaks down the number of segments generated using each strategy and for each TM


                          Segments   Words EN   Words ES     Total
Project TM new segments       6776      56973      66297    123270
Product TM new segments       7760      71034      83125    154159
Client TM new segments       74041     662714     769705   1432419

Table 3: Size of the new TMs generated using our approach

used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

Type              Project TM   Product TM   Client TM
Ellipsis                   7            6          50
Colon                   1801         2094       17894
Parentheses             1361         1621       29637
Square brackets           78           73         598
Curly braces               0            0           0
Quotations              1085         1146        8797
Tags                    2523         2892       17792

Table 4: Number of newly generated segments per type and TM used

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%           0    5    5    5    5    5    5    5    5     5     5
95–99%         2    3    3    4    4    4    9    9    9     9     9
85–94%         0    1    1    1    1    1    2    2    2     2     2
75–84%        14    9   14    9   15   15   12   16   16    16    16
50–74%       187  130  198  136  202  202  180  223  226   226   226
No match     142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

1 https://mymemory.translated.net

[Pie chart: sources breakdown of the records coming from uploaded TMs]

Figure 1: The distribution of automatically acquired memories in MyMemory

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
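To give a flavour of what such a classifier can exploit, a length-based (Church-Gale style) signal alone already separates many false pairs: true translations tend to have comparable lengths, so an extreme length ratio is evidence against a pair. The toy score and threshold below are our own simplification for illustration, not the paper's actual features or algorithms:

```python
import math

# Toy length-based signal for spotting false translations: the absolute log
# ratio of the segment lengths is near 0 for plausible translations and large
# for mismatched pairs. The threshold is an illustrative assumption.
def length_score(src, tgt):
    return abs(math.log((len(src) + 1) / (len(tgt) + 1)))

def looks_false(src, tgt, threshold=0.7):
    return length_score(src, tgt) > threshold

print(looks_false("Click OK to continue",
                  "Haga clic en Aceptar para continuar"))  # prints False
print(looks_false("The application will close", "No"))     # prints True
```

A real classifier would combine such length scores with further features and train on labelled positive and negative bi-segments, as described in the following sections.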

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill which interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases where one or both segments are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated into Italian only partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.


It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993): in a parallel corpus, the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1.

CG = (l_s − l_d) / √(3.4 (l_s + l_d))    (1)
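Equation 1 is straightforward to compute. A minimal sketch in Python, assuming segment lengths are measured in characters (the paper does not specify the unit):

```python
import math

def church_gale(len_source: int, len_target: int) -> float:
    """Modified Church-Gale length-difference score (equation 1):
    the length difference normalised by a variance term
    proportional to the summed lengths (Tiedemann, 2011)."""
    return (len_source - len_target) / math.sqrt(3.4 * (len_source + len_target))

# Segments of equal length score 0; a large length imbalance pushes
# the score away from 0, which a classifier can learn to penalise.
print(church_gale(40, 40))   # 0.0
print(church_gale(40, 5))    # large positive score
```

The score is antisymmetric in its arguments, so swapping source and destination only flips its sign.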

In figures 2 and 3 we plot the distribution of the relative frequency of Church Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church Gale scores obtained for the bi-segments of the Matecat set, adding a normal curve over the histogram to better visualize the difference from the gaussian curve. For the Matecat set the Church Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative set the Church Gale score varies in the interval [−131.51, 60.15].

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church Gale Scores in the Matecat Set

Figure 3: The distribution of Church Gale Scores in the Collaborative Set

Figure 4: The normal curve added to the distribution of Church Gale Scores in the Collaborative Set

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:

• Training Set: it contains 1,243 bi-segments, of which 373 are negative examples.

• Test Set: it contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.

Figure 5: The Q-Q plot for the Matecat set

Figure 6: The Q-Q plot for the Collaborative set


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg_score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has_url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has_tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has_email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has_number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.

• has_capital_letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has_words_capital_letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation_similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag_similarity. The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in both the source and the target segments).

• email_similarity. The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for tag_similarity. This feature combines with the feature has_email to exhaust all possibilities.

• url_similarity. The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for tag_similarity.

• number_similarity. The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for tag_similarity.


• bisegment_similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital_letters_word_difference. The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source and target segments, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.

• only_capletters_dif. The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.

• lang_dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
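The lang_dif value described above amounts to counting how many of the two declared language codes disagree with the detector's output. A minimal sketch:

```python
def lang_dif(declared, detected):
    """Number of mismatches between the declared (source, target)
    language codes of a bi-segment and the codes returned by a
    language detector: 0, 1 or 2."""
    return sum(1 for want, got in zip(declared, detected) if want != got)

print(lang_dif(("en", "it"), ("en", "it")))  # 0  (en-en/it-it)
print(lang_dif(("en", "it"), ("en", "fr")))  # 1  (en-en/it-fr)
print(lang_dif(("en", "it"), ("de", "fr")))  # 2  (en-de/it-fr)
```

Since all feature values are normalized to [0, 1], this raw count would presumably be divided by 2 before being fed to the classifiers.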

All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
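The similarity features above are all cosine similarities between small count vectors. A self-contained sketch of two of them; the tokenisation regexes are our assumptions, not the authors' exact ones:

```python
import math
import re
from collections import Counter

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0 if either is empty)."""
    dot = sum(count * v[token] for token, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def punctuation_similarity(source: str, target: str) -> float:
    """Cosine similarity of the punctuation-mark count vectors."""
    marks = re.compile(r"[^\w\s]")
    return cosine(Counter(marks.findall(source)), Counter(marks.findall(target)))

def number_similarity(source: str, target: str) -> float:
    """Cosine similarity of the number count vectors."""
    nums = re.compile(r"\d+")
    return cosine(Counter(nums.findall(source)), Counter(nums.findall(target)))

# A true translation tends to preserve punctuation and numbers.
print(punctuation_similarity("Hello, world!", "Ciao, mondo!"))
print(number_similarity("Early 1980s Muirfield CC", "Primi anni 1980"))
```

Tag, email and URL similarity follow the same pattern with different extraction regexes.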

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.

• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among those neighbors. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
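The comparison above can be sketched with scikit-learn, which the paper uses. This is not the authors' pipeline: a synthetic feature matrix from make_classification stands in for the real MyMemory features, with the ~30% negative class balance reported in section 3, and DummyClassifier reproduces the two baselines of section 5:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real training set: 1243 examples, ~30% negatives.
X, y = make_classification(n_samples=1243, n_features=16,
                           weights=[0.3], random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}

cv = StratifiedKFold(n_splits=3)  # three-fold stratified evaluation, as in section 5
for name, clf in classifiers.items():
    f1 = cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
    print(f"{name:22s} F1 = {f1:.2f}")
```

On real data the feature matrix would hold the values described in section 4.1, one row per bi-segment.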

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59

Table 2: The results of the classification on the test set

Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
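The positive-class metrics can be recomputed directly from these confusion counts; they come out close to the reported ≈0.8 (small deviations from table 2 presumably stem from averaging choices):

```python
# Confusion counts for the SVM on the 309-example test set (section 5).
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# False negatives are good bi-segments the cleaner would wrongly delete.
discarded = fn / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"fraction of test set wrongly discarded: {discarded:.2f}")
```

The last figure is the ~10% loss of good bi-segments that makes the classifier unsuitable for fully automatic cleaning.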

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the MateCat set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM, but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore, clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas and thus allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007) we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 … w20
(b) w1 w2 w3 w4 w5 w21 … w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is applied to both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
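The intuition of section 3 can be illustrated with a toy Levenshtein-based matcher. The example sentences and the uniform-cost, character-level scoring are our own simplifications; commercial TM tools use more elaborate match scores:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match_score(query: str, segment: str) -> float:
    """Fuzzy match score as 1 minus the normalised edit distance."""
    return 1 - levenshtein(query, segment) / max(len(query), len(segment))

query = "the committee approved the budget"
tm_sentence = ("the committee approved the budget "
               "although several members voted against the proposal")

# Whole-sentence matching: the score falls well below a typical 70% threshold,
# so the segment would not be retrieved even though one clause matches exactly.
print(round(match_score(query, tm_sentence), 2))

# With clause splitting, each clause is stored as a separate TM segment,
# and the first clause now matches the query perfectly.
clauses = ["the committee approved the budget",
           "although several members voted against the proposal"]
print(max(round(match_score(query, c), 2) for c in clauses))
```

This is exactly the situation of sentences (a) and (b) above: the shared substring is too small a fraction of the whole sentence to survive sentence-level fuzzy matching.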

4 Experiments

Experiments were performed to study the

impact of clause splitting when used in pre-

processing for the retrieval of segments in a

TM Our hypothesis is that when a clause

splitter is used the number of relevant

retrieved matches will increase

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.
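This intuition can be made concrete with a word-level Levenshtein similarity of the kind most TM tools approximate. The sketch below assumes score = 1 - distance / length of the longer segment; commercial tools use undisclosed variants, so the exact numbers differ from tool to tool.

```python
def word_levenshtein(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming edit distance over word tokens.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy_score(src: str, tm_seg: str) -> float:
    # Similarity in [0, 1]: 1 - edit distance / length of the longer segment.
    a, b = src.split(), tm_seg.split()
    return 1.0 - word_levenshtein(a, b) / max(len(a), len(b))

# A one-clause input against a two-clause TM segment: the unmatched second
# clause drags the score down, so only a low threshold retrieves the match.
inp = "the ministry of defense once indicated"
tm = "the ministry of defense once indicated and that there may still be some survivors"
score = fuzzy_score(inp, tm)
```

With a default threshold around 0.70 the two-clause TM segment is missed, while a minimum threshold of 0.30 still retrieves it, which is why the minimum-threshold condition is examined.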

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1. The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. An example is presented in Table 2.

Without clause splitting

Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war

Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A example

Without clause splitting

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B example

5 Results

WORDFAST                        Without clause splitting   With clause splitting
Retrieved (Default threshold)            23.33%                   90.00%
Retrieved (Minimum threshold)            38.00%                   92.22%

TRADOS                          Without clause splitting   With clause splitting
Retrieved (Default threshold)            14.00%                   88.89%
Retrieved (Minimum threshold)            14.00%                   96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that, although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST                        Without clause splitting   With clause splitting
Retrieved (Default threshold)             2.67%                   17.84%
Retrieved (Minimum threshold)            10.00%                   41.08%

TRADOS                          Without clause splitting   With clause splitting
Retrieved (Default threshold)             3.33%                   25.95%
Retrieved (Minimum threshold)            36.00%                   70.67%

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
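The statistic reported here (the study used SPSS) can be reproduced with a standard paired t-test over per-segment match scores. The sketch below uses only the standard library, and the score lists are invented purely for illustration.

```python
import math
from statistics import mean, stdev

def paired_t(before: list[float], after: list[float]) -> float:
    # t = mean(d) / (sd(d) / sqrt(n)) over per-case differences d.
    # Segments split into several clauses contribute one averaged case,
    # as described above, so the two samples stay paired.
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Illustrative match scores (0 = no correct match retrieved):
without = [0.0, 0.0, 75.0, 0.0, 80.0, 0.0]
with_cs = [92.0, 85.0, 100.0, 78.0, 95.0, 88.0]
t = paired_t(without, with_cs)
```

A positive t with a sufficiently small p-value (looked up against the t distribution with n-1 degrees of freedom) supports rejecting the null hypothesis, as in the experiments above.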

SET A

WORDFAST            Mean without clause splitting   Mean with clause splitting   p-value
Default threshold              19.23                          85.31               0.000
Minimum threshold              29.28                          86.49               0.000

TRADOS              Mean without clause splitting   Mean with clause splitting   p-value
Default threshold              11.78                          83.88               0.000
Minimum threshold              11.78                          88.07               0.000

SET B

WORDFAST            Mean without clause splitting   Mean with clause splitting   p-value
Default threshold               2.23                          17.21               0.000
Minimum threshold               6.49                          30.62               0.000

TRADOS              Mean without clause splitting   Mean with clause splitting   p-value
Default threshold               2.71                          23.08               0.000
Minimum threshold              16.83                          46.37               0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, Sept 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar in meaning. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore left confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows. Section 2 discusses past related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.

3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters) without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
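The word-level figures above come from an edit-distance computation of this kind. This is a sketch only: the exact percentages depend on the normalization in Koehn (2010), which we do not reproduce here.

```python
def levenshtein(a, b):
    # Edit distance with a rolling row; works on word lists or strings.
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                      # deletion
                       d[j - 1] + 1,                  # insertion
                       prev + (a[i - 1] != b[j - 1])) # substitution
            prev = cur
    return d[n]

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
word_ops = levenshtein(s1.split(), s2.split())  # 1 substitution + 3 deletions
```

The same function applied character by character (after removing whitespace) gives the character-based distance used for the second figure.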

In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then it would be post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. In some cases they appear with a bare predicate noun (e.g. to make attention) and in others with a determined direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three stages.

Figure 2: Pipeline of the paraphrase framework

The first stage includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, following the same structure, created for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
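The behaviour of this grammar can be approximated, for the simple present only, with a toy rewrite rule. Everything below is a simplified stand-in for the NooJ resources: the regular expression, the two-entry nominalization lexicon and the function name are ours, not the module's.

```python
import re

# Hypothetical mini-lexicon mapping predicate nouns to their full verbs;
# the real system derives these from OpenLogos dictionaries.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install"}
SUPPORT_VERBS = r"(?:make|makes|take|takes|give|gives|have|has)"
DET = r"(?:the|a|an)"

# support verb + optional determiner + candidate noun + optional "of",
# mirroring the pattern described for the local grammar above.
PATTERN = re.compile(
    rf"\b{SUPPORT_VERBS}\s+(?:{DET}\s+)?(\w+)(?:\s+of)?\b", re.IGNORECASE
)

def paraphrase_svc(sentence: str) -> str:
    """Rewrite e.g. 'make the cancellation of X' as 'cancel X'."""
    def repl(m):
        noun = m.group(1).lower()
        if noun in NOMINALIZATIONS:
            return NOMINALIZATIONS[noun]  # full verb replaces the whole SVC
        return m.group(0)                 # not a known SVC: leave untouched
    return PATTERN.sub(repl, sentence)
```

Unlike this sketch, the NooJ grammar also carries tense and person over to the generated verb; here only the simple present is handled, like the grammar displayed in Figure 3.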

The last stage contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
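The overall effect of the three stages on the TM can be sketched as follows. The names are hypothetical: `paraphrase` stands for the whole NooJ step, and the target side of every unit is copied unchanged, as described above.

```python
from typing import Callable

# (source segment, target segment)
TMUnit = tuple[str, str]

def expand_tm(tm: list[TMUnit], paraphrase: Callable[[str], str]) -> list[TMUnit]:
    """Append a paraphrased copy of each unit whose source contains an SVC;
    the target side is left untouched ("as is")."""
    extra = []
    for src, tgt in tm:
        new_src = paraphrase(src)
        if new_src != src:                # an SVC was found and rewritten
            extra.append((new_src, tgt))  # same target as the original unit
    return tm + extra
```

This is the sense in which, in Section 5, 1025 original units plus 587 recognized SVCs yield a 1612-unit parTM.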

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014. The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%). However, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No Match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category    plTM    parTM
100%                      14       48
95-99%                    23       32
85-94%                    18       29
75-84%                    51       38
50-74%                    32       18
No Match                  62       35
Total                    200      200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the matches without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval but also ensures that the proposed translation is a human translation, so no machine translation errors will appear and less post-editing is required when the match is not equal to 100%.

Despite some drops in terms of fuzzy match improvement, our system presents only a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the sentences have the same meaning but do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust are considerably higher compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and will support other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop on Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING/ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand on enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied on them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments

CZ         EN          probabilities                alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
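Under the assumption that joining means overlap concatenation, the longest-first strategy described above can be sketched as follows. This is a simplification of the described procedure: the real algorithm also tracks subsegment positions inside the tokenized segment, which we omit here:

```python
def overlap_merge(a, b):
    # concatenate two token lists when a suffix of a equals a prefix of b
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None

def greedy_join_longest(subsegments, merge):
    # Longest-first greedy joining: newly built (strictly longer)
    # subsegments are prepended to the work list so they are tried before
    # shorter ones; a head that cannot be extended further is discarded.
    queue = sorted(subsegments, key=len, reverse=True)
    best = max(queue, key=len) if queue else []
    while queue:
        head, rest = queue[0], queue[1:]
        new = [m for s in rest
               for m in (merge(head, s), merge(s, head))
               if m is not None and len(m) > len(head)]
        if new:
            best = max([best] + new, key=len)
        queue = sorted(new, key=len, reverse=True) + rest
    return best

parts = [["lze", "rozdělit", "do"], ["do", "následujících"],
         ["následujících", "kategorií"]]
longest = greedy_join_longest(parts, overlap_merge)
# longest == ["lze", "rozdělit", "do", "následujících", "kategorií"]
```

Requiring each merge to be strictly longer than the current head guarantees termination, since subsegment length is bounded by the total number of tokens.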

² In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95–99%   0.84   0.91   0.64   0.90   1.37
85–94%   0.07   0.05   0.25   0.76   0.81
75–84%   0.80   0.91   1.71   3.78   4.40
50–74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95–99%   0.75   0.44   0.49   0.66
85–94%   0.05   0.08   0.49   0.61
75–84%   0.46   0.96   3.67   3.85
50–74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95–99%   0.98   0.59   1.24   1.45
85–94%   0.09   0.22   1.34   1.37
75–84%   1.03   2.26   6.35   7.07
50–74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into ac-

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

Company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error example, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics – Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²³, Mihaela Vela², Josef van Genabith²³

Jadavpur University, India¹
Saarland University, Germany²
German Research Center for Artificial Intelligence (DFKI), Germany³

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de
{santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog² which is language pair independent and allows users to upload their own memories in

¹ www.matecat.com
² The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations⁴. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

³ The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

⁴ Work in progress.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1⁵) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

⁵ http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

TM:    we  want  to    have  a  table  near  the  window
        |   D    S      S    |    |     S     |     |
Input: we   -    would like  a  table  by    the  window
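The insert/delete/substitute part of such an alignment can be sketched with standard dynamic programming over word sequences. This is an illustrative sketch only, not tercom's code: TER's block-shift operation is omitted.

```python
def align(hyp, ref):
    """Return the edit operations turning `hyp` into `ref`:
    C = match, S = substitute, I = insert, D = delete."""
    m, n = len(hyp), len(ref)
    # cost[i][j] = minimum edits between hyp[:i] and ref[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for j in range(1, n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back to recover the operation sequence.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append('C' if hyp[i - 1] == ref[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return ops[::-1]

tm = "we want to have a table near the window".split()
inp = "we would like a table by the window".split()
print(align(tm, inp))  # → ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C']
```

On the example above, this reproduces the C/D/S sequence shown in the alignment display.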

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system, we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other edit operations. Different costs for the different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer TM segment 1 over TM segment 2 for the test segment shown above.
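The effect of cheap deletions on the ranking can be sketched with a weighted word-level edit distance. The weights below are illustrative, not the tool's actual values, and block shifts are again omitted.

```python
def weighted_edits(tm_seg, test_seg, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Cost of editing `tm_seg` into `test_seg` (word level, no shifts)."""
    m, n = len(tm_seg), len(test_seg)
    cost = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * w_del          # delete leftover TM words
    for j in range(1, n + 1):
        cost[0][j] = j * w_ins          # insert missing test words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = cost[i - 1][j - 1] + \
                (w_sub if tm_seg[i - 1] != test_seg[j - 1] else 0.0)
            cost[i][j] = min(sub, cost[i - 1][j] + w_del,
                             cost[i][j - 1] + w_ins)
    return cost[m][n]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
# Equal weights: tm2 looks closer (3 inserts beat 6 deletes)...
print(weighted_edits(tm1, test, w_del=1.0),
      weighted_edits(tm2, test, w_del=1.0))   # → 6.0 3.0
# ...while cheap deletions reverse the ranking in favour of tm1.
print(weighted_edits(tm1, test) < weighted_edits(tm2, test))  # → True
```

With equal weights, TM segment 2 wins (cost 3 vs. 6); with deletions at a tenth of the other costs, TM segment 1 wins, as argued above.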

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color-code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
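This "word ({ positions })" notation is GIZA++'s A3 output style, where the numbers index words of the target sentence. A minimal parser sketch for such a line (our own helper, not part of GIZA++ or CATaLog):

```python
import re

def parse_giza(line):
    """Parse a GIZA++-style alignment line into
    (source_word, [1-based target positions]) pairs."""
    pairs = []
    for word, pos in re.findall(r'(\S+)\s+\(\{([\d\s]*)\}\)', line):
        pairs.append((word, [int(p) for p in pos.split()]))
    return pairs

line = ("NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) "
        "a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) "
        ". ({ 7 })")
for word, targets in parse_giza(line):
    print(word, targets)
```

For the example above, this yields, e.g., `we [1]`, `want [6]` and `to []` (unaligned).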

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
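Combining the two alignment sets can be sketched as follows: TM source words that TER marks as exact matches (C) are green, and the target words they align to inherit that color. The function and the alignment dictionary below are our own illustration (indices follow the GIZA++ example above), not CATaLog's actual code.

```python
def color_target(tm_target, ter_ops, alignment):
    """ter_ops: TER operations over the TM source words.
    alignment: 1-based {source_index: [target_indices]}."""
    green, src_pos = set(), 0
    for op in ter_ops:
        if op != 'I':          # C/S/D each consume one TM source word
            src_pos += 1
        if op == 'C':          # exact match: color its aligned target words
            green.update(alignment.get(src_pos, []))
    return [(word, 'green' if j in green else 'red')
            for j, word in enumerate(tm_target, start=1)]

# TER ops for "we want to have a table near the window" vs. the input
ops = list("CDSSCCSCC")
word_align = {1: [1], 2: [6], 5: [4], 6: [5], 7: [3], 9: [2]}
target = "আমরা জানালার কাছে একটা টেবিল চাই".split()
print(color_target(target, ops, word_align))
```

With these inputs, the target words aligned to "we", "a", "table" and "window" come out green and the rest red.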

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me

Target Matches

1. আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপনি ভুল (Gloss: apni vul)

4. আপনি আমাকে টাকা দিন (Gloss: apni amake taka din)

5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) required to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether or not to use these posting lists; this feature is there to compare results with and without them. In an ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
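The posting-list filter described above can be sketched as an inverted index from lowercased, stop-word-filtered, unstemmed words to the TM sentences containing them; only sentences sharing at least one vocabulary word with the input go on to TER scoring. The stop-word list and TM data here are illustrative.

```python
from collections import defaultdict

STOP = {"the", "a", "to", "is", "we", "you", "it", "do", "does"}

def build_index(tm_sources):
    """Inverted index: word -> set of TM sentence ids (built once, offline)."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for word in sent.lower().split():
            if word not in STOP:
                index[word].add(sid)
    return index

def candidates(index, input_sentence):
    """Union of posting lists for the input's words: only these
    candidate sentences are scored with TER."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["how much does it cost",
      "you gave me the wrong change",
      "we want to have a table near the window"]
idx = build_index(tm)
print(sorted(candidates(idx, "you gave me wrong number")))  # → [1]
```

Only sentence 1 shares a content word with the input, so the other two TM segments are never passed to the (comparatively expensive) TER computation.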

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semicolons as good indicators. It should further be noted that in this paper we considered only word alignments. In the future, we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


Table of Contents

Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín ......... 1

Spotting false translation segments in translation memories
Eduard Barbu ......... 9

Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov ......... 17

Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou ......... 24

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď ......... 31

CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith ......... 36

Conference Program

9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015.

Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3º I
28230 Las Rozas, Madrid, Spain

carlaparra@hermestrans.com

Abstract

This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and on the results obtained in a pilot study.

1 Introduction

Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive text (AutoSuggest in SDL Trados Studio¹ and Muses in memoQ²), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

¹ www.sdl.com
² www.memoq.com

In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to carry out the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper, we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received; Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.

2 Our company

Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge of and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom WorldServer, Microsoft Helium and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (the UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers are committed to providing our clients daily with high-quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).

3 Internal survey

We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 2014³ and memoQ⁴, it is not always carried out automatically, and it is difficult to estimate when such a reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

³ In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

⁴ memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html

Ideas more related to NLP included the automatic correction of orthotypography in the target language, and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, a language pair has to be selected when starting a new project. This leads to an under-usage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue⁵.

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here, we will focus on memoQ's version of such functionality: fragment assembly⁶.

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source sentence words that could not be retrieved.

⁵ For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs with different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see http://www.translationzone.com/openexchange/apps/sdltradosstudioanytmtranslationprovider-669.html

⁶ For further information, see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html

Figure 1: memoQ's fragment assembly functionality

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.

Examples (1) and (2) show two similar sentences in our TM.

(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.

Now imagine we need to translate the sentence in (3):

(3) EN: The following message then appears: The application will close.

Taking the part of (1) before the colon and the part of (2) after the colon, we would be able to produce the right translation, as shown in (4):

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.

In technical texts, it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment was translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea that originated from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.


4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:

(5) EN: Installing new services… Service XXXX for premium clients [2]
ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears:

(7) Installing new services

In these cases we proceeded as follows:

1. Check that there is an ellipsis or a colon on both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
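The three steps above can be sketched as follows (a minimal hypothetical helper, not the authors' actual implementation); it assumes bi-segments are (source, target) string pairs and splits at the first ellipsis or colon present on both sides:

```python
import re

# Matches the first ellipsis ("..." or "…") or colon in a segment.
SPLIT_RE = re.compile(r'\.\.\.|…|:')

def split_bisegment(source, target):
    """Split a (source, target) bi-segment at an ellipsis/colon and
    return new sub-segment pairs, or [] if either side lacks one."""
    s_match, t_match = SPLIT_RE.search(source), SPLIT_RE.search(target)
    if not (s_match and t_match):          # step 1: both sides must contain it
        return []
    s_head, s_tail = source[:s_match.end()], source[s_match.end():]
    t_head, t_tail = target[:t_match.end()], target[t_match.end():]
    # steps 2-3: one new TM segment per fragment (head keeps the punctuation)
    return [(s_head.strip(), t_head.strip()), (s_tail.strip(), t_tail.strip())]

pairs = split_bisegment(
    "The following message then appears: Click accept to run the program",
    "Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa")
# pairs[0] -> ("The following message then appears:", "Aparece el siguiente mensaje:")
```

Applied to Examples 1 and 2, the head of the first pair and the tail of the second are exactly the fragments recombined in Example 4.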

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content.

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces on both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) A sentence removing those characters and the content between them.

(b) A fragment starting at the opening character and finishing on the closing one, including the content within. In this fragment the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
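The bracket strategy can be sketched as follows (a hypothetical helper under the simplifying assumption of one non-nested bracket pair per type, not the authors' code); it returns the three fragment pairs of steps 2a–2c:

```python
import re

# One pattern per bracket type; each captures the inner content.
BRACKET_RES = [re.compile(p) for p in
               (r'\(([^()]*)\)', r'\[([^\[\]]*)\]', r'\{([^{}]*)\}')]

def bracket_fragments(source, target):
    """For each bracket type present in both segments, return the
    three new bi-segments of steps 2a-2c."""
    pairs = []
    for rx in BRACKET_RES:
        s, t = rx.search(source), rx.search(target)
        if not (s and t):                        # step 1: both sides
            continue
        pairs += [
            (rx.sub('', source).strip(), rx.sub('', target).strip()),  # 2a
            (s.group(0), t.group(0)),            # 2b: brackets kept
            (s.group(1), t.group(1)),            # 2c: content only
        ]
    return pairs
```

On Example 8, the second pair is ("(if it was not created earlier)", "(si no se creó antes)") and the third is the same content without the parentheses.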

At this preliminary stage we considered that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, "unlocked settings"
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los "parámetros desbloqueados"

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
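The quotation/tag treatment can be sketched as follows (a hypothetical re-implementation; the numbered-tag pattern `<n>…<n>` is assumed from Example 12 and is not necessarily the authors' exact tag syntax):

```python
import re

QUOTED = re.compile(r'"([^"]+)"')          # text between double quotes
TAGGED = re.compile(r'<(\d+)>(.*?)<\1>')   # text between paired numbered tags

def quote_tag_fragments(source, target):
    """Return the bi-segment with quotes/tags stripped, plus one
    bi-segment per quoted/tagged span, pairing spans by order."""
    pairs = []
    for rx in (QUOTED, TAGGED):
        if not (rx.search(source) and rx.search(target)):
            continue
        # keep the sentence itself, minus the quote marks / tags
        strip = lambda seg: rx.sub(lambda m: m.group(m.lastindex), seg)
        pairs.append((strip(source), strip(target)))
        # keep each inner text as a new segment of its own
        s_inner = [m.group(m.lastindex) for m in rx.finditer(source)]
        t_inner = [m.group(m.lastindex) for m in rx.finditer(target)]
        pairs.extend(zip(s_inner, t_inner))
    return pairs
```

Pairing inner spans by order relies on the same assumption as Section 4.2: that the spans appear in the same order in source and target.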

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 to be the CAT tool used for our testing because it is one of the common CAT tools used by our translators and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match      Words   Segments
Repetitions   1064    80
100%          0       0
95-99%        4       2
85-94%        0       0
75-84%        285     14
50-74%        2523    187
No Match      2404    142
Total         6280    425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.

             Segments   Words EN   Words ES   Total
Project TM   16842      212472     244159     456631
Product TM   20923      274542     317797     592339
Client TM    256099     3427861    3951732    7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM used.


                 Segments   Words EN   Words ES   Total
New Project TM   6776       56973      66297      123270
New Product TM   7760       71034      83125      154159
New Client TM    74041      662714     769705     1432419

Table 3: Size of the new TMs generated using our approach.

As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

                  Proj. TM   Prod. TM   Client TM
Ellipsis          7          6          50
Colon             1801       2094       17894
Parenthesis       1361       1621       29637
Square brackets   78         73         598
Curly braces      0          0          0
Quotations        1085       1146       8797
Tags              2523       2892       17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segments TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

             TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
Not started  234   204   150   202   143   143   38    27    26    26     26
Pre-trans    155   111   178   108   179   183   139   199   205   201    205
Fragments    36    110   97    115   103   99    248   199   194   198    194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

             TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
100%         0     5     5     5     5     5     5     5     5     5      5
95-99%       2     3     3     4     4     4     9     9     9     9      9
85-94%       0     1     1     1     1     1     2     2     2     2      2
75-84%       14    9     14    9     15    15    12    16    16    16     16
50-74%       187   130   198   136   202   202   180   223   226   226    226
No match     142   197   124   190   118   118   137   90    87    87     87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 187 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase may be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on the TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach, and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together, and pre-translating files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64-88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25-30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net
eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

1 https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan the further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if the opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that the translation memory bi-segments contain accidental errors (e.g. tags not closed), or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.


It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (l_s - l_d) / sqrt(3.4 * (l_s + l_d))        (1)
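Equation 1 can be computed directly from the two segment lengths; a minimal sketch (the choice of character lengths as input is an assumption, as the paper does not state the length unit):

```python
from math import sqrt

def church_gale(len_source, len_target):
    """Modified Church-Gale length difference (equation 1): the length
    gap between source and target, normalised by the combined length."""
    return (len_source - len_target) / sqrt(3.4 * (len_source + len_target))

# A true translation pair scores near 0; a spam target much longer
# than its source drifts far into the tails of the distribution.
score = church_gale(len("How are you?"), len("Come stai?"))
```

Equal-length pairs score exactly 0, and the sign indicates which side is longer.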

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [-4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [-131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using the Quantile-to-Quantile plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling to the bi-segments that have scores away from the center of the distribution. In this way we hope that we draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:

• Training Set: It contains 1243 bi-segments and has 373 negative examples.

• Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
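The tail-biased draw from the Collaborative Set can be sketched as follows (a hypothetical weighting scheme; the paper does not specify how exactly the sampling is biased towards the tails):

```python
import random

def sample_tail_biased(bisegments, k, seed=0):
    """Draw k bi-segments, weighting each by |Church-Gale score| so that
    the tails of the distribution (likely errors) are over-represented.
    `bisegments` is a list of (source, target, cg_score) triples."""
    rng = random.Random(seed)
    weights = [abs(cg) + 1e-6 for _, _, cg in bisegments]  # keep weights > 0
    return rng.choices(bisegments, weights=weights, k=k)
```

The candidates drawn this way would still be validated manually, as described above, before entering the training or test set.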

Figure 5: The Q-Q plot for the Matecat Set.

Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url. The value of the feature is 1 if the source or target segments contain a URL address, and 0 otherwise.

• has tag. The value of the feature is 1 if the source or target segments contain a tag, and 0 otherwise.

• has email. The value of the feature is 1 if the source or target segments contain an email address, and 0 otherwise.

• has number. The value of the feature is 1 if the source or target segments contain a number, and 0 otherwise.

• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least a capital letter, and 0 otherwise.

• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist completely of capital letters, and 0 otherwise. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, is present/is not present in the source and the target segments).

• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.

13

bull bisegment similarity The value of the fea-ture is the cosine similarity between the desti-nation segment and the source segment trans-lation in the destination language It formal-izes the idea that if the target segment is atrue translation of the source segment thena machine translation of the source segmentshould be similar to the target segment

bull capital letters word difference The value ofthe feature is the ratio between the differenceof the number of words containing at least acapital letter in the source segment and thetarget segment and the sum of the capital let-ter words in the bi-segment It is complemen-tary to the feature has capital letters

bull only capletters dif The value of the featureis the ratio between the difference of the num-ber of words containing only capital letters inthe source segment and the target segmentsand the sum of the only capital letter wordsin the bi-segment It is complementary to thefeature has words capital letters

bull lang dif The value of the feature is calcu-lated from the language codes declared in thesegment and the language codes detected bya language detector For example if we ex-pect the source segment language code to berdquoenrdquo and the target segment language code tobe rdquoitrdquo and the language detector detects rdquoenrdquoand rdquoitrdquo then the value of the feature is 0 (en-enit-it) If instead the language detector de-tects rdquoenrdquo and rdquofrrdquo then the value of the fea-ture is 1 (en-enit-fr) and if it detects rdquoderdquo andrdquofrrdquo (en-deit-fr) then the value is 2
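To make the feature definitions concrete, here is a minimal sketch of how a few of them could be computed. The function names, the stand-in punctuation inventory and the vector construction are ours for illustration; they are not the authors' code.

```python
import re
from collections import Counter
from math import sqrt

PUNCT = set(".,;:!?()[]\"'")  # stand-in punctuation inventory

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def punctuation_similarity(src, tgt):
    """Cosine similarity between the punctuation count vectors of two segments."""
    return cosine(Counter(c for c in src if c in PUNCT),
                  Counter(c for c in tgt if c in PUNCT))

def has_url(src, tgt):
    """1 if either segment contains a URL, otherwise 0."""
    return 1 if re.search(r"https?://\S+", src + " " + tgt) else 0

def lang_dif(declared, detected):
    """Number of mismatches between declared and detected language codes."""
    return sum(1 for d, e in zip(declared, detected) if d != e)

print(punctuation_similarity("Hello, world!", "Ciao, mondo!"))  # -> 1.0 (identical punctuation)
print(lang_dif(("en", "it"), ("en", "fr")))  # -> 1
```

The other binary and cosine features follow the same two patterns, substituting tag, email and number extractors for the punctuation counter.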

All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu[5].

4.2 Algorithms

As we showed in Section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the rules inferred are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.

• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among those neighbors. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

[5] https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
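The comparison of these classifiers could be set up with scikit-learn along the following lines. This is a sketch with synthetic data standing in for the real bi-segment feature vectors; the hyper-parameters are library defaults, not necessarily those used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the normalized bi-segment features described above.
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear kernel)": LinearSVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

cv = StratifiedKFold(n_splits=3)  # three-fold stratified evaluation, as in Section 5
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.2f}")
```

The same `cross_val_score` call with a held-out test split corresponds to the second evaluation reported in Section 5.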

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in Table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in Table 2.

The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This might be explained partially by the great variety of the bi-segments in the MateCat and Web sets. Obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike the first evaluation, the second has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, those corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where precision is increased, even if recall drops. We make some suggestions in the next section.
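The trade-off can be reproduced directly from the reported confusion counts (a short arithmetic sketch; the derived figures differ slightly from the rounded Table 2 values, which is common when metrics are averaged over folds):

```python
# SVM confusion counts on the 309 test examples, as reported in the text.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
fn_share = fn / (tp + fp + fn + tn)  # good bi-segments that would be discarded

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} fn_share={fn_share:.1%}")
# -> precision=0.81 recall=0.85 f1=0.83 fn_share=10.4%
```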

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar: the distribution is closer to the normal distribution for the MateCat set and more sparse for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In future work we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
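The probability-ranking idea can be sketched as follows: instead of the default 0.5 decision threshold, only bi-segments whose predicted probability of being a true translation is very low are discarded. This is an illustrative sketch on synthetic data, not the authors' implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bi-segment feature vectors.
X, y = make_classification(n_samples=2000, n_features=16, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_pos = clf.predict_proba(X_te)[:, 1]  # P(bi-segment is a true translation)

# Discard only bi-segments we are very confident are false translations:
# lowering the cut-off trades coverage (fewer discards) for precision.
for cut in (0.5, 0.2, 0.1):
    discard = p_pos < cut
    if discard.any():
        prec_neg = (y_te[discard] == 0).mean()
        print(f"cut-off {cut}: discard {discard.sum()} segments, "
              f"{prec_neg:.0%} of them truly negative")
```

With a well-calibrated classifier, the stricter cut-off discards fewer segments but a larger share of them are genuinely false translations.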

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with a moderate number of bi-segments: the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, the ultimate goal of which is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20

(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
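The effect can be sketched with a word-level Levenshtein similarity, a common formulation of fuzzy-match scoring (not necessarily the exact formula used by commercial tools):

```python
def levenshtein(a, b):
    """Word-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(a, b):
    """Similarity as 1 minus normalized edit distance, in percent."""
    return 100 * (1 - levenshtein(a, b) / max(len(a), len(b)))

inp = [f"w{i}" for i in range(1, 21)]                                   # (a): w1 ... w20
tm = [f"w{i}" for i in range(1, 6)] + [f"w{i}" for i in range(21, 36)]  # (b): w1..w5 w21..w35
clause = [f"w{i}" for i in range(1, 6)]                                 # the shared clause w1..w5

print(fuzzy_score(inp, tm))        # -> 25.0, far below a 70-75% threshold
print(fuzzy_score(clause, tm[:5]))  # -> 100.0 once both sides are split into clauses
```

The whole-sentence comparison scores 25%, so (b) is never offered as a fuzzy match, while the clause-level comparison after splitting is a perfect match.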

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is applied to both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well the tools perform when clause splitting is used in pre-processing.

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one), and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and the component clauses of the original longer segment are stored in the TM; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:

  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war

  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A example.

Without clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B example.

5 Results

WORDFAST                        W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   23.33%                90.00%
Retrieved (minimum threshold)   38.00%                92.22%

TRADOS                          W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   14.00%                88.89%
Retrieved (minimum threshold)   14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A.

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
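The paired t-test on per-segment match scores can be sketched as follows. The scores below are illustrative, not the experimental data, and the statistic is computed directly rather than through SPSS.

```python
from math import sqrt

def paired_t(x, y):
    """Paired t statistic for matched samples x and y."""
    d = [b - a for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance of differences
    return mean / sqrt(var / n)

# Illustrative per-segment match scores; 0 means no correct match retrieved.
without_split = [20, 0, 30, 0, 25]
with_split = [85, 80, 90, 70, 88]

t = paired_t(without_split, with_split)
print(round(t, 2))  # a large t statistic corresponds to a very small p-value
```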

SET A

WORDFAST            Without clause splitting  With clause splitting  p-value
Default threshold   19.23                     85.30888891            0.000
Minimum threshold   29.28                     86.48888891            0.000

TRADOS              Without clause splitting  With clause splitting  p-value
Default threshold   11.78                     83.87777781            0.000
Minimum threshold   11.78                     88.07111113            0.000

SET B

WORDFAST            Without clause splitting  With clause splitting  p-value
Default threshold   2.23                      17.21111111            0.000
Minimum threshold   6.49                      30.62111112            0.000

TRADOS              Without clause splitting  With clause splitting  p-value
Default threshold   2.71                      23.083                 0.000
Minimum threshold   16.83                     46.36777779            0.000

Table 5: Paired t-test on all experiments (mean match scores).

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

WORDFAST                        W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   2.67%                 17.84%
Retrieved (minimum threshold)   10.00%                41.08%

TRADOS                          W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   3.33%                 25.95%
Retrieved (minimum threshold)   36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation-memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzik@itl.auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, and so cannot retrieve matches that are similar in meaning but different in wording. In this paper we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) in all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increasing demand, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the resulting translations are more consistent.

The fuzzy match level refers to all the corrections a professional translator must make so that the retrieved suggestion meets all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases also penalizing the match percentage. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy match level is sometimes too low, and the translator is left confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
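Sentence pairs like (1) and (2) make the scoring concrete. Below is a minimal, illustrative sketch of word-level edit distance and a fuzzy match score in the spirit of Koehn's (2010) formulation (score = 1 − distance / length of the longer segment); since commercial CAT tools do not disclose their exact algorithms, actual scores will differ:

```python
# Sketch of word-level fuzzy matching as used conceptually by TM tools.
# Illustrative only: real CAT systems vary in tokenization and normalization.

def edit_distance(a, b):
    """Levenshtein distance over token lists (deletions, insertions, substitutions)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def fuzzy_match(source, candidate):
    """1 - dist / max length, over whitespace tokens."""
    s, c = source.split(), candidate.split()
    return 1.0 - edit_distance(s, c) / max(len(s), len(c))

seg1 = "Press Cancel to make the cancellation of your personal information"
seg2 = "Press Cancel to cancel your personal information"
score = fuzzy_match(seg1, seg2)
```

With this simple formulation the pair (1)/(2) scores 60%; the 70% reported above illustrates how sensitive the figure is to the tool's tokenization and normalization choices.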

In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation, given the low fuzzy match score, and then post-edited; otherwise, the translator would have to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch. Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb, maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except for some cases, they appear with a direct object without a determiner (e.g., to make attention) or with a determiner (e.g., to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent (sentence (3)) are included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1. Dictionary entries in NooJ for nouns
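The entries in Figure 1 follow a lemma, category, properties layout. As a small illustration of how such a shape can be consumed (this parses only the comma-separated form shown above; NooJ's real dictionary format carries considerably more machinery):

```python
# Illustrative parser for dictionary entries of the shape shown in Figure 1,
# e.g. "artist,N+FLX=TABLE+Hum". Not NooJ's actual file format, just the
# lemma / category / property layout it demonstrates.

def parse_entry(line):
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    category = parts[0]          # part of speech, e.g. N for noun
    props = {}
    for p in parts[1:]:
        if "=" in p:             # valued property, e.g. FLX=TABLE
            key, value = p.split("=", 1)
            props[key] = value
        else:                    # boolean property, e.g. Hum
            props[p] = True
    return {"lemma": lemma, "category": category, "properties": props}

entry = parse_entry("artist,N+FLX=TABLE+Hum")
```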

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipeline stages.

Figure 2. Pipeline of the paraphrase framework

The first stage extracts the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3. Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to the verb from which it derives, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
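As a rough illustration of what the local grammar does (NooJ itself works over its lexicon, inflectional paradigms and graph grammars, not regexes), the following toy sketch matches a support verb + determiner + predicate noun (+ optional preposition) and substitutes the full verb; the noun-to-verb table is a hypothetical fragment, not NooJ's lexicon:

```python
import re

# Toy stand-in for the local grammar of Figure 3. Finds "support verb +
# determiner + predicate noun (+ optional 'of')" and replaces the whole
# construction with the corresponding full verb.
# The mapping below is an invented illustrative fragment.
NOMINALIZATION_TO_VERB = {
    "cancellation": "cancel",
    "installation": "install",
    "reservation": "reserve",
}

SUPPORT_VERBS = r"(?:make|makes|take|takes|give|gives|have|has)"
PATTERN = re.compile(
    r"\b" + SUPPORT_VERBS + r"\s+(?:the|a|an)\s+(\w+)(?:\s+of)?",
    re.IGNORECASE,
)

def paraphrase_svc(sentence):
    def repl(match):
        noun = match.group(1).lower()
        verb = NOMINALIZATION_TO_VERB.get(noun)
        # Unknown predicate nouns are left untouched.
        return verb if verb else match.group(0)
    return PATTERN.sub(repl, sentence)

source = "Press Cancel to make the cancellation of your personal information"
paraphrased = paraphrase_svc(source)
```

Unlike this sketch, the real grammar also carries tense and person over to the generated verb, which is why the module needs one graph per tense and mood.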

The last stage consists of the de-tokenization as well as the concatenation of the paraphrased translation units, if any, onto the original TM. The paraphrased translation units have the same proprietary tags, etc., as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can only be applied to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.
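Put together, the three stages amount to: extract the source units, paraphrase them while leaving the targets untouched, and append the changed units to the original TM. A minimal sketch under those assumptions, with a toy stand-in for the NooJ paraphrasing step and tokenization elided:

```python
# Minimal sketch of the Section 4.2 pipeline over a TM represented as
# (source, target) pairs. `paraphrase` stands in for the NooJ step;
# the Berkeley tokenization/de-tokenization steps are elided.

def expand_tm(tm, paraphrase):
    expanded = list(tm)                 # original units are kept unchanged
    for source, target in tm:
        new_source = paraphrase(source)
        if new_source != source:        # add only units that actually changed
            expanded.append((new_source, target))  # target side stays "as is"
    return expanded

tm = [
    ("Press Cancel to cancel your personal information",
     "Premere Cancel per cancellare i propri dati personali"),
]

# Hypothetical single-rule paraphraser, for illustration only.
def toy_paraphrase(s):
    rule = ("cancel your", "make the cancellation of your")
    return s.replace(*rule) if rule[0] in s else s

expanded = expand_tm(tm, toy_paraphrase)
```

The resulting memory contains both the original unit and its paraphrased twin pointing at the same Italian target, which is exactly what lets the CAT tool surface a high match for either phrasing.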

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually so as to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations taken from a larger TM, selected so that the degree of fuzzy match at least meets a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹.

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%–74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%–99%, 6% in category 85%–94%, 28% in category 75%–84% and, finally, 27% in category 0%–74% (No Match + 50%–74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95%–99%                  23      32
85%–94%                  18      29
75%–84%                  51      38
50%–74%                  32      18
No Match                 62      35
Total                   200     200

Table 1. Statistics for experimental data

To check the quality of the retrieved segments, a human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.

While there are some drops in the fuzzy match improvement, our system presents few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TM_sl:    Ensure that the brake hose is not twisted
TM_tg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTM_sl: Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TM_sl:    CAUTION You must install the version 6 of the software
TM_tg:    ATTENZIONE Si deve installare la versione 6 del software
parTM_sl: CAUTION You must make the istallation of the version 6 of the software

Figure 4. Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future, we will continue to explore the paraphrasing of other support verbs, and support for other languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of translations even further.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331–339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combination approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated. Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

within CAT systems (Planas and Furuse, 2000). In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step the proposed TM expansion meth-ods process the available translation memory andgenerate all consistent phrases as subsegmentsfrom it Subsegments and the corresponding trans-lations are generated using the Moses (Koehn etal 2007) tool directly from the TM no additionaldata is used

The word alignment is based on MGIZA++(Gao and Vogel 2008) (parallel version ofGIZA++ (Och and Ney 2003)) and the default

Moses heuristic grow-diag-final1 The next stepsare phrase extraction and scoring (Koehn et al2007) The corresponding extended TM is de-noted as SUB The output from subsegment gen-eration contains for each subsegment its transla-tion probabilities and alignment points see Fig-ure 1 for an example

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability, and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. "0-0 1-1" means that the first word, "být", from the source language is translated to the first word in the translation, "be", and the second word, "větší", to the second, "greater". These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
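Moses emits each subsegment with the four probabilities and the alignment points on one phrase-table line. The sketch below is a minimal illustration rather than the authors' code: it parses such a line and flags the three alignment cases listed above. The numeric values in the example line are made up.

```python
# Illustrative sketch: parse a Moses phrase-table line and inspect its
# alignment points. The " ||| " field layout follows the standard Moses
# phrase table; the classification labels mirror the three cases above.
def parse_phrase_table_line(line):
    fields = [f.strip() for f in line.split("|||")]
    src, tgt = fields[0], fields[1]
    # inverse phrase prob, inverse lex weight, direct phrase prob, direct lex weight
    probs = [float(p) for p in fields[2].split()]
    align = [tuple(map(int, pt.split("-"))) for pt in fields[3].split()]
    return src, tgt, probs, align

def classify_alignment(src, tgt, align):
    """Detect the three cases: empty, one-to-many, opposite orientation."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    aligned_src = {s for s, _ in align}
    aligned_tgt = {t for _, t in align}
    info = []
    if len(aligned_src) < src_len or len(aligned_tgt) < tgt_len:
        info.append("empty alignment")
    if len(align) > len(aligned_src) or len(align) > len(aligned_tgt):
        info.append("one-to-many alignment")
    if any(s1 < s2 and t1 > t2 for s1, t1 in align for s2, t2 in align):
        info.append("opposite orientation")
    return info

line = "být větší ||| be greater ||| 0.5 0.2 0.4 0.1 ||| 0-0 1-1"
src, tgt, probs, align = parse_phrase_table_line(line)
print(probs)   # [0.5, 0.2, 0.4, 0.1]
print(align)   # [(0, 0), (1, 1)]
print(classify_alignment(src, tgt, align))  # [] - monotone, fully aligned
```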

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; output is J.

1. JOINO: joint subsegments overlap in a segment from the document; output = OJ.

2. JOINN: joint subsegments neighbour in a segment from the document; output = NJ.

1http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: "can be divided into the following categories"

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2
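The fluency check can be sketched as follows. This toy model replaces the KenLM-based language model of footnote 2 with simple smoothed n-gram counts; the tiny corpus, the additive smoothing, and the averaging formula are illustrative assumptions, not the paper's exact combination.

```python
import math
from collections import Counter

# Toy sketch of the combined n-gram fluency score: average per-n-gram
# log-probabilities for n = 1..5 over a target-language corpus.
def ngram_counts(corpus, max_n=5):
    counts = Counter()
    for sent in corpus:
        toks = sent.split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    return counts

def fluency_score(candidate, counts, max_n=5, alpha=0.1):
    toks = candidate.split()
    total = len(counts) or 1
    score = 0.0
    for n in range(1, max_n + 1):
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        if not grams:
            continue
        # additive smoothing so unseen n-grams are penalized, not zeroed out
        logp = sum(math.log((counts[g] + alpha) / (total + alpha)) for g in grams)
        score += logp / len(grams)
    return score / max_n

corpus = ["can be divided into the following categories"]
counts = ngram_counts(corpus)
fluent = "divided into the following categories"
disfluent = "categories following the into divided"
assert fluency_score(fluent, counts) > fluency_score(disfluent, counts)
```

A combination whose score falls below a threshold would be discarded before the JOIN or SUBSTITUTE step.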

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
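The iteration above can be sketched as follows, assuming subsegments are represented as (start, end) token spans within one input segment. Only the neighbouring (JOINN-style) join test is shown; the overlap case is analogous, and translations and scoring are omitted.

```python
# Simplified sketch of the JOIN loop described above. Subsegments are
# (start, end) token spans; two spans join when they are adjacent.
def join_subsegments(spans):
    # list I, ordered by span length, descending
    I = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    results = set()
    while I:
        current = I.pop(0)                    # biggest remaining subsegment
        T = []                                # temporary list of new joins
        for other in I:
            if current[1] == other[0]:        # current directly precedes other
                T.append((current[0], other[1]))
            elif other[1] == current[0]:      # other directly precedes current
                T.append((other[0], current[1]))
        T = [t for t in T if t not in results]
        results.update(T)
        I = T + I                             # prepend new, longer subsegments
    return sorted(results)

# e.g. "lze rozdělit do" as span (0, 3) plus "následujících kategorií" (3, 5)
print(join_subsegments([(0, 3), (3, 5)]))  # [(0, 5)]
```

Each new span is strictly longer than the span that produced it and span lengths are bounded by the segment length, so the loop terminates.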

2In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match    | TM    | SUB   | OJ    | NJ    | all
100%     |  0.41 |  0.12 |  0.10 |  0.17 |  0.46
95-99%   |  0.84 |  0.91 |  0.64 |  0.90 |  1.37
85-94%   |  0.07 |  0.05 |  0.25 |  0.76 |  0.81
75-84%   |  0.80 |  0.91 |  1.71 |  3.78 |  4.40
50-74%   |  8.16 | 10.05 | 25.09 | 40.95 | 42.58
any      | 10.28 | 12.04 | 27.79 | 46.56 | 49.62

Table 3: MemoQ analysis, DGT-TM.

Match    | SUB   | OJ    | NJ    | all
100%     |  0.08 |  0.07 |  0.11 |  0.28
95-99%   |  0.75 |  0.44 |  0.49 |  0.66
85-94%   |  0.05 |  0.08 |  0.49 |  0.61
75-84%   |  0.46 |  0.96 |  3.67 |  3.85
50-74%   | 10.24 | 27.77 | 41.90 | 44.47
all      | 11.58 | 29.32 | 46.66 | 49.87

For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match    | SUB   | OJ    | NJ    | all
100%     |  0.15 |  0.13 |  0.29 |  0.57
95-99%   |  0.98 |  0.59 |  1.24 |  1.45
85-94%   |  0.09 |  0.22 |  1.34 |  1.37
75-84%   |  1.03 |  2.26 |  6.35 |  7.07
50-74%   | 12.15 | 34.84 | 49.82 | 51.62
all      | 14.40 | 38.04 | 59.04 | 61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into

3http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature  | SUB  | OJ   | NJ   | NS
prec     | 0.60 | 0.63 | 0.70 | 0.66
recall   | 0.67 | 0.74 | 0.74 | 0.71
F1       | 0.64 | 0.68 | 0.72 | 0.68
METEOR   | 0.31 | 0.37 | 0.38 | 0.38

DGT
prec     | 0.76 | 0.93 | 0.91 | 0.81
recall   | 0.78 | 0.86 | 0.88 | 0.85
F1       | 0.77 | 0.89 | 0.89 | 0.83
METEOR   | 0.40 | 0.50 | 0.51 | 0.45

Table 6: Error examples, Czech → English.

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape

account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, Sept 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1Jadavpur University, India
2Saarland University, Germany
3German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in

1www.matecat.com
2The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3The BTEC corpus contains tourism-related sentences, similar to those that are usually found in phrase books for tourists going abroad.

4Work in progress


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with regard to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |   |        |
we  -     would  like  a  table  by    the  window  .
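The matched and to-be-edited portions of a TM match can be read off such an edit sequence. The helper below is a hypothetical illustration, not the tercom output parser itself.

```python
# Sketch: recover matched vs. to-be-edited tokens of a TM match from a
# TER-style edit sequence ("C" match, "S" substitute, "D" delete,
# "I" insert), as in the alignment shown above.
def split_by_edits(tm_tokens, ops):
    matched, to_edit = [], []
    ti = 0
    for op in ops:
        if op == "I":          # insertion consumes no TM token
            continue
        tok = tm_tokens[ti]
        ti += 1
        (matched if op == "C" else to_edit).append(tok)
    return matched, to_edit

tm = "we want to have a table near the window".split()
ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C"]
matched, to_edit = split_by_edits(tm, ops)
print(matched)  # ['we', 'a', 'table', 'the', 'window']
print(to_edit)  # ['want', 'to', 'have', 'near']
```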

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment


2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
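The weighted re-ranking can be sketched as follows. The paper only states that deletion gets a very low weight and the other operations equal weights; the concrete weight values here are assumptions for illustration.

```python
# Sketch of the weighted re-ranking idea: count TER edit operations and
# score them with a reduced deletion weight. The weights below are
# illustrative assumptions, not the system's actual values.
WEIGHTS = {"C": 0, "D": 1, "I": 4, "S": 4}

def weighted_cost(ops, weights=WEIGHTS):
    return sum(weights[op] for op in ops)

# test segment: "how much does it cost"
# TM segment 1 needs 6 deletions; TM segment 2 needs 3 insertions
ops_tm1 = ["C"] * 5 + ["D"] * 6
ops_tm2 = ["C"] * 2 + ["I"] * 3
print(weighted_cost(ops_tm1))  # 6  -> preferred
print(weighted_cost(ops_tm2))  # 12
```

With equal weights the plain TER score would prefer TM segment 2; the cheap deletions flip the ranking in favour of TM segment 1, as argued above.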

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: a green portion implies a matched fragment and a red portion implies a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই ।

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
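A minimal sketch of this projection, under the following assumptions: the matched source indexes come from a TER alignment like the one in Section 3.1, and the (source, target) pairs correspond to the GIZA++ example above, converted to 0-based indexes.

```python
# Sketch of the color-coding projection: given which TM source tokens
# matched the input (from the TER alignment) and the source-target word
# alignment (from GIZA++), mark each target token green or red.
def color_target(matched_src_idx, src_to_tgt_alignment, tgt_len):
    """Return 'green'/'red' per target token; unaligned tokens stay red."""
    colors = ["red"] * tgt_len
    for s, t in src_to_tgt_alignment:
        if s in matched_src_idx:
            colors[t] = "green"
    return colors

# TM pair "we want to have a table near the window ." and its 7-token
# Bengali translation; alignment pairs are 0-based (src, tgt) indexes
alignment = [(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1), (9, 6)]
matched = {0, 4, 5, 7, 8, 9}   # TM tokens marked C by the TER alignment
print(color_target(matched, alignment, 7))
# ['green', 'green', 'red', 'green', 'green', 'red', 'green']
```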

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change. i paid eighty dollars.

2. i think you've got the wrong number

3. you are wrong

4. you pay me


5. you're overcharging me

Target Matches

1. আপনি আমাকে ভুল খুচরো দিয়েছেন। আমি আশি ডলার দিয়েছি। (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন। (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপনি ভুল। (Gloss: apni vul.)

4. আপনি আমাকে টাকা দিন। (Gloss: apni amake taka din.)

5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন। (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e. source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
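The posting-list filter can be sketched as follows; the stop-word list here is an illustrative assumption.

```python
from collections import defaultdict

# Minimal sketch of the posting-list filter: build an inverted index
# over the TM source side, then restrict similarity computation to
# sentences sharing at least one vocabulary word with the input.
STOP_WORDS = {"the", "a", "to", "of"}

def build_index(tm_sentences):
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sentences):
        for word in sent.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidate_sentences(query, index):
    hits = set()
    for word in query.lower().split():
        hits |= index.get(word, set())
    return hits

tm = ["how much does it cost to the holiday inn by taxi",
      "we want to have a table near the window",
      "you pay me"]
index = build_index(tm)
print(sorted(candidate_sentences("how much does it cost", index)))  # [0]
```

Only the returned candidates are passed on to the expensive TER alignment step.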

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme.

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality. A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9

Chatzitheodorou, Konstantinos, 24

Horák, Aleš, 31

Medveď, Marek, 31
Mitkov, Ruslan, 17

Naskar, Sudip Kumar, 36
Nayek, Tapas, 36

Pal, Santanu, 36
Parra Escartín, Carla, 1

Raisa Timonera, Katerina, 17

van Genabith, Josef, 36
Vela, Mihaela, 36

Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Conference Program

9:00–9:10 Welcome
Constantin Orasan

9:10–10:10 Automatically tidying up and extending translation memories (invited talk)
Marcello Federico, FBK, Italy

10:15–11:00 Creation of new TM segments: Fulfilling translators' wishes
Carla Parra Escartín

11:30–12:15 Spotting false translation segments in translation memories
Eduard Barbu

12:15–13:00 Improving Translation Memory Matching through Clause Splitting
Katerina Raisa Timonera and Ruslan Mitkov

14:30–15:00 Improving translation memory fuzzy matching by paraphrasing
Konstantinos Chatzitheodorou

15:00–15:30 Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
Vít Baisa, Aleš Horák and Marek Medveď

15:30–16:00 CATaLog: New Approaches to TM and Post Editing Interfaces
Tapas Nayek, Sudip Kumar Naskar, Santanu Pal, Marcos Zampieri, Mihaela Vela and Josef van Genabith

16:15–17:00 Round table and final discussion


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, September 2015.

Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones

C/ Cólquide 6, portal 2, 3º I, 28230 Las Rozas, Madrid, Spain

carlaparrahermestranscom

Abstract

This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.

1 Introduction

Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (Autosuggest in SDL Trados Studio1

and Muses in memoQ2), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

1 www.sdl.com
2 www.memoq.com

In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to make the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl 2005; Pekar and Mitkov 2007; Mitkov and Corpas 2008; Simard and Fujita 2012; Gupta and Orasan 2014; Vanallemeersch and Vandeghinste 2015; Gupta et al. 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received; Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.

2 Our company

Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to provide our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).

3 Internal survey

We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them which was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well known fact that large TMs need to be periodically reorganised (i.e. re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically and it is difficult to estimate

3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

4 memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e. re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information see http://kilgray.com/memoq/2015/help-en/index.html?repair_resource3.html

when such a reorganisation should be carried out. Scheduling it to be run automatically when a particular number of segments (e.g. 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

Ideas more related to NLP included the automatic correction of orthotypography in the target language and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g. German > Italian) than the ones selected for that specific project (e.g. German > French), and the same source sentence appears in the text currently being translated, the translation into a different target language will not be shown (i.e. the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out. If a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.5

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's such functionality: fragment assembly.6

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments will be inserted in the target segment, one after another, without the source sentence words that could not be retrieved.

5 For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool see http://www.translationzone.com/openexchange/app/sdltradosstudioanytmtranslationprovider-669.html

6 For further information see http://kilgray.com/memoq/2015/help-en/index.html?fragment_assembly.html

Figure 1: memoQ's fragment assembly functionality.

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of appearance of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.

Examples 1 and 2 show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa

(2) EN: The window will show the following: The application will close
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación

Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación

In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have translated already.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.


4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:

(5) EN: Installing new services... ServiceXXXX for premium clients [2]
ES: Instalando nuevos servicios... ServicioXXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which only the fragment before or after the colon appears:

(7) Installing new services

In these cases we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
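The steps above can be sketched as follows. This is a minimal illustration under our own assumptions (the function name and the marker regular expression are not from the paper), and real TM segments would carry more metadata:

```python
import re

# First ellipsis ("..." or "…") or colon found in a segment (step 1)
MARKER = re.compile(r"\.\.\.|…|:")

def split_segment(source, target):
    """Split a bilingual segment at the ellipsis/colon, keeping the marker
    with the first fragment. Returns [] if the marker is missing on either
    side, since then the two halves cannot be aligned reliably."""
    src_m, tgt_m = MARKER.search(source), MARKER.search(target)
    if not (src_m and tgt_m):
        return []
    src_a, src_b = source[:src_m.end()], source[src_m.end():].strip()
    tgt_a, tgt_b = target[:tgt_m.end()], target[tgt_m.end():].strip()
    return [(src_a, tgt_a), (src_b, tgt_b)]   # one new TM segment each (step 3)

pairs = split_segment(
    "The following message then appears: Click accept to run the program",
    "Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa")
print(pairs[0])
# → ('The following message then appears:', 'Aparece el siguiente mensaje:')
```

Splitting only when the marker is present on both sides mirrors the check in step 1 and avoids creating misaligned fragment pairs.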

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content:

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) the sentence with those characters and the content between them removed;

(b) a fragment starting at the opening character and finishing on the closing one, including the content within (in this fragment the parentheses, square brackets or curly braces are maintained);

(c) a fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
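The three-fragment strategy can be sketched as follows for a single sentence (the actual tool processes source and target in parallel; the function name and regex details are our own assumptions):

```python
import re

# One non-greedy pattern per bracket kind
BRACKETS = [r"\(.*?\)", r"\[.*?\]", r"\{.*?\}"]

def bracket_fragments(sentence):
    """For each bracketed span return the three fragments described above:
    (a) the sentence without the span, (b) the span with its brackets kept,
    and (c) the inner content with the brackets removed."""
    results = []
    for pattern in BRACKETS:
        for match in re.finditer(pattern, sentence):
            span = match.group(0)
            results.append({
                "without": re.sub(pattern, "", sentence).replace("  ", " ").strip(),
                "span": span,              # fragment (b): brackets maintained
                "inner": span[1:-1],       # fragment (c): brackets stripped
            })
    return results

frags = bracket_fragments(
    "Creates an installation package (if it was not created earlier)")
print(frags[0]["without"])  # → Creates an installation package
print(frags[0]["inner"])    # → if it was not created earlier
```

Each of the three dictionary entries would then be stored, paired with its target-side counterpart, as a new TM segment.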

At this preliminary stage we considered that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments:

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, "unlocked" settings
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros "desbloqueados"

(12) EN: If you clear the <1>Inherit settings from parent policy</1> check box in the <2>Settings inheritance</2> section of the <3>General</3> section in the properties window of an inherited policy, the lock is lifted for that policy
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria</1> en la sección <2>Herencia de configuración</2> que aparece en la sección <3>General</3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
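The handling of quotations and paired tags can be sketched as follows (a minimal monolingual illustration under our own assumptions, using numbered tags of the `<1>...</1>` kind shown in Example 12):

```python
import re

TAG = re.compile(r"</?\w[^>]*>")       # opening or closing tags such as <1>, </1>
QUOTE = re.compile(r'"([^"]*)"')       # double-quoted spans

def quote_tag_segments(sentence):
    """New segments: the sentence with quotes and tags stripped (but the
    quoted/tagged text kept in place), plus the text found inside each
    quotation and inside each pair of numbered tags."""
    stripped = TAG.sub("", QUOTE.sub(r"\1", sentence))
    inner_quotes = QUOTE.findall(sentence)
    inner_tags = [m.group(2) for m in
                  re.finditer(r"<(\d+)>(.*?)</\1>", sentence)]
    return [stripped] + inner_quotes + inner_tags

s = "If you clear the <1>Inherit settings from parent policy</1> check box"
print(quote_tag_segments(s))
```

Unlike the bracket case, the quoted or tagged text stays inside the stripped sentence, which is exactly the "without adding too much noise" behaviour described above.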

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of

our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match      Words  Segments
Repetitions    1064        80
100%              0         0
95–99%            4         2
85–94%            0         0
75–84%          285        14
50–74%         2523       187
No Match       2404       142
Total          6280       425

Table 1: Project statistics according to memoQ, using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.

            Segments   Words EN   Words ES   Total words
Project TM     16842     212472     244159        456631
Product TM     20923     274542     317797        592339
Client TM     256099    3427861    3951732       7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM


                          Segments   Words EN   Words ES   Total words
Project TM new segments       6776      56973      66297        123270
Product TM new segments       7760      71034      83125        154159
Client TM new segments       74041     662714     769705       1432419

Table 3: Size of the new TMs generated using our approach.

used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

Type             Proj. TM  Prod. TM  Client TM
Ellipsis                7         6         50
Colon                1801      2094      17894
Parenthesis          1361      1621      29637
Square brackets        78        73        598
Curly braces            0         0          0
Quotations           1085      1146       8797
Tags                 2523      2892      17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations, to assess which combination performed better as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%           0    5    5    5    5    5    5    5    5     5     5
95–99%         2    3    3    4    4    4    9    9    9     9     9
85–94%         0    1    1    1    1    1    2    2    2     2     2
75–84%        14    9   14    9   15   15   12   16   16    16    16
50–74%       187  130  198  136  202  202  180  223  226   226   226
No match     142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.
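As a rough illustration of what such punctuation- and tag-based fragment generation might look like, the sketch below splits aligned segments at punctuation marks and inline tags, and pairs the resulting fragments positionally. All function names and the boundary character set are our assumptions; the authors' actual script is not reproduced in the paper.

```python
import re

def split_into_fragments(segment):
    """Split a segment into fragments at punctuation marks and inline tags.

    Simplified illustration of punctuation/tag-based fragment generation;
    the boundary set used here is an assumption.
    """
    # Treat inline tags (e.g. <b>, </i>) and strong punctuation as boundaries.
    parts = re.split(r'<[^>]+>|[.,;:!?()]', segment)
    return [p.strip() for p in parts if p.strip()]

def generate_fragment_pairs(source, target):
    """Pair source/target fragments positionally when their counts match;
    otherwise give up, since no alignment information is available."""
    src, tgt = split_into_fragments(source), split_into_fragments(target)
    if len(src) == len(tgt):
        return list(zip(src, tgt))
    return []

pairs = generate_fragment_pairs(
    "Insert the battery, then close the cover.",
    "Inserire la batteria, quindi chiudere il coperchio.")
```

Pairing fragments by position is only safe when both sides split into the same number of pieces, which is why the sketch discards mismatched cases.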

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory - perceptions, expectations and reality. The Journal of Specialised Translation, (23):64-88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25-30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16, Hissar, Bulgaria, Sept 2015

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net
eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations obtained with the first method is high and the errors are relatively few.

1 https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements of interest to the research community that translation memory systems should fulfill: accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in that approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is partially translated in Italian as "Primi anni 1980" ("Early 1980s").

The Random Text and Chat errors arise from the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.


It is relatively easy to find positive examples because the great majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (ls - ld) / sqrt(3.4 (ls + ld))    (1)

where ls and ld are the lengths of the source and destination segments.
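Equation 1 is straightforward to compute. A minimal sketch follows; the function name is ours, and we assume lengths are measured in characters, as is common in sentence-length-based aligners:

```python
import math

def church_gale_score(source, target):
    """Modified Church-Gale length difference (Tiedemann, 2011):
    CG = (ls - ld) / sqrt(3.4 * (ls + ld)),
    where ls and ld are the character lengths of the source and
    destination segments.
    """
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# True translations of similar length score near zero,
# while a partial translation drifts away from zero.
print(church_gale_score("How are you?", "Come stai?"))
print(church_gale_score("Early 1980s Muirfield CC", "Primi anni 1980"))
```

Segments of equal length score exactly zero; the score grows with the length mismatch, normalized by the combined segment length.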

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set, the Church-Gale score varies in the interval [-4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative Set, the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set

Figure 3: The distribution of Church-Gale scores in the Collaborative Set


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set

Church-Gale score varies in the interval [-131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using quantile-quantile (Q-Q) plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set. It contains 1243 bi-segments, of which 373 are negative examples.

• Test Set. It contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.
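The biased sampling step described above can be sketched as follows. This is an illustrative stand-in: the cut-off threshold, the function name, and the top-up strategy are our assumptions, not taken from the paper.

```python
import random

def biased_sample(bisegments, scores, n, threshold=5.0):
    """Draw n bi-segments, preferring those whose Church-Gale score is
    far from the centre of the distribution (likely error candidates).

    'threshold' is an assumed cut-off for "away from the center";
    the paper does not specify an exact value.
    """
    tails = [b for b, s in zip(bisegments, scores) if abs(s) > threshold]
    rest = [b for b, s in zip(bisegments, scores) if abs(s) <= threshold]
    random.shuffle(tails)
    random.shuffle(rest)
    sample = tails[:n]
    if len(sample) < n:
        # Top up from the centre of the distribution if the tails
        # do not contain enough bi-segments.
        sample += rest[:n - len(sample)]
    return sample
```

The sampled candidates would then still be validated manually, as in the paper, since an extreme length-difference score only suggests (and does not prove) that a bi-segment is a false translation.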

Figure 5: The Q-Q plot for the Matecat Set

Figure 6: The Q-Q plot for the Collaborative Set


4 Machine Learning

In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg_score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has_url. The value of the feature is 1 if the source or target segments contain a URL address; otherwise it is 0.

• has_tag. The value of the feature is 1 if the source or target segments contain a tag; otherwise it is 0.

• has_email. The value of the feature is 1 if the source or target segments contain an email address; otherwise it is 0.

• has_number. The value of the feature is 1 if the source or target segments contain a number; otherwise it is 0.

• has_capital_letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter; otherwise it is 0.

• has_words_capital_letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation_similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag_similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).

• email_similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag_similarity. This feature combines with has_email to exhaust all possibilities.

• url_similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag_similarity.

• number_similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag_similarity.


• bisegment_similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital_letters_word_difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.

• only_capletters_dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.

• lang_dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails, or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
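As an illustration of the cosine-similarity features, here is a sketch of punctuation_similarity. It assumes the punctuation vector is a simple count vector over punctuation characters; the exact character set used in the paper is not specified, so the one below is our choice.

```python
import math
from collections import Counter

# Assumed punctuation inventory; the paper does not list the exact set.
PUNCT = ".,;:!?()\"'"

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def punctuation_similarity(source, target):
    """Cosine similarity of the punctuation-count vectors of the
    two segments; 0.0 when either segment has no punctuation."""
    u = Counter(c for c in source if c in PUNCT)
    v = Counter(c for c in target if c in PUNCT)
    return cosine(u, v)
```

The tag, email, URL, and number similarity features follow the same pattern, with the count vectors built over tags, email addresses, URLs, and numbers respectively.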

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.

• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
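A sketch of how these scikit-learn classifiers could be compared under three-fold stratified cross-validation. The random feature matrix below is a stand-in for the real features of section 4.1, and the hyperparameters shown are defaults, not values from the paper:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Synthetic stand-in for the feature matrix; labels:
# 1 = true translation, 0 = false translation.
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear)": LinearSVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

cv = StratifiedKFold(n_splits=3)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```

Stratified folds preserve the roughly 70/30 positive/negative class proportions of the training set in each fold, which matters when the classes are imbalanced.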

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification

Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives, and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
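The ~10% figure follows directly from the stated confusion counts; the arithmetic below recomputes it along with precision, recall, and F1 (small discrepancies with Table 2 may stem from rounding or how the scores were averaged):

```python
# Confusion counts for the SVM on the 309-example test set,
# as reported in the text.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn  # 309

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The false negatives are good bi-segments that would be thrown away.
discarded = fn / total

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"share of good bi-segments discarded: {discarded:.1%}")
```

This makes the cleaning trade-off concrete: every false negative is a correct translation deleted from the database, which is why the paper argues for trading recall for higher precision.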

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the Matecat Set and more sparse for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, Sept 2015

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study, we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; clause matches are therefore more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In Section 2 we discuss related work on TM matching. In Section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in Section 4 and discuss the results in Section 5. Finally, we present our conclusion and future work in Section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele 2006 and Macken 2009, cited in Reinke 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then, a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
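The effect on match scores can be sketched in a few lines. Commercial tools do not disclose their exact scoring, so the word-level Levenshtein score below (1 minus edit distance divided by the length of the longer segment) is only an approximation of how such tools behave, and the 0.70 threshold is illustrative:

```python
def word_edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance over word tokens.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy_score(seg, tm_seg):
    # Score in [0, 1]: 1 - (edit distance / length of the longer segment).
    a, b = seg.split(), tm_seg.split()
    return 1.0 - word_edit_distance(a, b) / max(len(a), len(b))

# Input (a): words w1..w20; TM sentence (b): w1..w5 followed by w21..w35.
inp = " ".join("w%d" % i for i in range(1, 21))
tm = " ".join("w%d" % i for i in list(range(1, 6)) + list(range(21, 36)))
print(fuzzy_score(inp, tm))  # 0.25: the whole sentence falls below a 0.70 threshold
# After clause splitting, the shared clause w1..w5 is compared on its own:
print(fuzzy_score("w1 w2 w3 w4 w5", "w1 w2 w3 w4 w5"))  # 1.0: an exact match
```

The shared five-word clause scores 1.0 on its own, while the same material buried in the full twenty-word sentence scores only 0.25 and would never be retrieved.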

In this paper, we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004).

The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39 (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
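The shape of this pre-processing step can be illustrated with a deliberately naive sketch that splits segments at a handful of conjunctions and expands a TM accordingly. This is a hypothetical stand-in, not Puscasu's (2004) splitter, which identifies finite clauses with linguistic rules; it only shows how segments would be expanded before import:

```python
import re

def naive_clause_split(segment):
    # Toy splitter: break at a few conjunctions, keeping each conjunction
    # with the clause it introduces. (A real splitter works over finite verbs.)
    tokens = re.split(r"\b(and that|that|and)\b", segment)
    clauses = []
    buf = tokens[0].strip()
    for i in range(1, len(tokens), 2):
        if buf:
            clauses.append(buf)
        buf = (tokens[i] + " " + tokens[i + 1].strip()).strip()
    if buf:
        clauses.append(buf)
    return clauses

def preprocess_tm(segments):
    # Replace every TM segment by its component clauses before import.
    return [clause for seg in segments for clause in naive_clause_split(seg)]

tm = ["the ministry of defense once indicated that about 20000 soldiers "
      "were missing in the korean war"]
for clause in preprocess_tm(tm):
    print(clause)
```

On this sentence the sketch yields the two clause segments used in the Set A example; a rule-based splitter is needed for anything beyond such trivial cases.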

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM, the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting
  Input:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting
  Input:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
  Segments in TM:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors

Table 1: Set A Example

Without clause splitting
  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM:
    a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting
  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33%                 90.00%
Retrieved (minimum threshold)          38.00%                 92.22%

TRADOS
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00%                 88.89%
Retrieved (minimum threshold)          14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and take this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
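The paired t statistic computed by SPSS can be reproduced directly from the per-segment score differences. The scores below are hypothetical, and segments with no retrieved match score 0, as in the setup described above:

```python
import math

def paired_t(scores_without, scores_with):
    # Paired t-test over per-segment match scores:
    # t = mean(d) / (sd(d) / sqrt(n)), where d is the per-segment difference.
    n = len(scores_without)
    diffs = [w - wo for wo, w in zip(scores_without, scores_with)]
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

# Hypothetical match scores for ten input segments (0 = nothing retrieved):
without_cs = [0, 0, 75, 0, 80, 0, 0, 70, 0, 0]
with_cs = [100, 95, 100, 85, 100, 90, 0, 100, 80, 95]
print(round(paired_t(without_cs, with_cs), 2))
```

A segment split into several clauses would first be collapsed to the average of its clause scores, so each original segment contributes exactly one paired observation.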

SET A

WORDFAST            Mean (without CS)   Mean (with CS)   p-value
Default threshold        19.23           85.30888891      0.000
Minimum threshold        29.28           86.48888891      0.000

TRADOS              Mean (without CS)   Mean (with CS)   p-value
Default threshold        11.78           83.87777781      0.000
Minimum threshold        11.78           88.07111113      0.000

SET B

WORDFAST            Mean (without CS)   Mean (with CS)   p-value
Default threshold         2.23           17.21111111      0.000
Minimum threshold         6.49           30.62111112      0.000

TRADOS              Mean (without CS)   Mean (with CS)   p-value
Default threshold         2.71           23.083           0.000
Minimum threshold        16.83           46.36777779      0.000

Table 5: Paired t-test on all experiments

WORDFAST
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67%                 17.84%
Retrieved (minimum threshold)          10.00%                 41.08%

TRADOS
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33%                 25.95%
Retrieved (minimum threshold)          36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent this helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, Sept 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitl@auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper, we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been human-translated in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage as well in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques have an effect only on the improvement of the raw MT output, not on the improvement of fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
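Both flavours of edit distance can be computed with one generic routine. Since the exact normalization commercial tools apply is undisclosed, the sketch below uses the common formulation 1 - distance / length of the longer segment, which is why its percentages differ slightly from the counts quoted above:

```python
def edit_distance(a, b):
    # Levenshtein distance over any two sequences (word lists or strings).
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy(a, b):
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(round(100 * fuzzy(s1.split(), s2.split())))                     # word-based: 60
print(round(100 * fuzzy(s1.replace(" ", ""), s2.replace(" ", ""))))   # character-based: 74
```

The word-level distance between (1) and (2) is 4 operations (one substitution plus three deletions), which under this normalization yields a score well below the 70-75% default thresholds discussed earlier.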

In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation, given the low fuzzy match score, and then it should be post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC, while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003). SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed; except for some cases, they appear with a direct object (e.g. to make attention or to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses that are usually linguistically correct.
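A toy sketch of this idea: the source side of each translation unit is expanded with paraphrased copies while the target side is left untouched. The single rewrite rule below is a hypothetical stand-in for NooJ's dictionaries and local grammars:

```python
# Hypothetical full-verb -> SVC rewrite rule, standing in for NooJ's resources.
PARAPHRASE_RULES = [
    ("cancel", "make the cancellation of"),
]

def expand_tm(tm):
    # tm: list of (source, target) pairs. Paraphrased copies of the source
    # are added; every copy keeps the original human-translated target.
    expanded = list(tm)
    for src, tgt in tm:
        for old, new in PARAPHRASE_RULES:
            if old in src:
                expanded.append((src.replace(old, new), tgt))
    return expanded

tm = [("Press Cancel to cancel your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
for src, tgt in expand_tm(tm):
    print(src, "=>", tgt)
```

After expansion, an input containing the SVC variant finds an exact source match whose target is still the original human translation, which is what makes the technique safe.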

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
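Read programmatically, each entry pairs a lemma with a category and a list of "+"-separated properties, some of which carry values. A small parser sketch, assuming the comma-separated entry syntax shown in Figure 1:

```python
def parse_nooj_entry(entry: str):
    """Split a NooJ-style dictionary line, e.g. 'artist,N+FLX=TABLE+Hum',
    into (lemma, category, properties)."""
    lemma, rest = entry.split(",", 1)
    fields = rest.split("+")
    category = fields[0]
    props = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        props[key] = value or True   # flag properties carry no value
    return lemma, category, props
```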

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework.

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: the English data is tokenized using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for de-tokenization in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step: if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective, or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables, i.e. the verb and the predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
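The rule just described can be illustrated with a deliberately simplified sketch: a support verb, an optional determiner, and a predicate noun (plus an optional preposition) are rewritten as the corresponding full verb. The nominalization table and support-verb list below are invented for the example; the actual module derives such mappings from the OpenLogos resources and preserves tense and person across all moods.

```python
import re

# Hypothetical mapping; the real resource links predicate nouns to derived verbs.
NOMINALIZATIONS = {"installation": "install",
                   "reservation": "reserve",
                   "cancellation": "cancel"}
SUPPORT_VERBS = r"(?:make|makes|take|takes)"   # toy list, present tense only

def paraphrase_svc(sentence: str) -> str:
    """Rewrite 'support verb (+ determiner) + nominalization (+ of)' as a full verb."""
    pattern = rf"\b{SUPPORT_VERBS}\s+(?:(?:the|a|an)\s+)?(\w+)(?:\s+of)?"
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(1).lower())
        return verb if verb else match.group(0)   # leave non-SVCs untouched
    return re.sub(pattern, repl, sentence)
```

For instance, "You must make the installation of the software" becomes "You must install the software", while ordinary verb-object combinations pass through unchanged.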

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in all CAT tools in the same way as before. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores in cases where the TM contains a translation unit with the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrased). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹ The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95-99% category, 6% in the 85-94% category, 28% in the 75-84% category, and finally 27% in the 0-74% category (No match + 50-74%). This is a clear indication that paraphrasing SVCs significantly improves the retrieval results and hence productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio/

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for the experimental data.

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is two-fold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.

Despite these gains in fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy matching of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences have the same meaning but do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types, and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and the support of other languages as well. Last but not least, a paraphrase framework applied to the target sentences may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: Modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of the fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly, manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly into savings for translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching, and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases.

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability, and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in cases where there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
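Moses stores such entries in its plain-text phrase-table format, `source ||| target ||| four scores ||| alignment points`, so the rows of Figure 1 can be read back with a few lines of parsing:

```python
def parse_phrase_entry(line: str):
    """Parse one Moses phrase-table line into its four fields.

    The scores are, in order: inverse phrase translation probability,
    inverse lexical weighting, direct phrase translation probability,
    and direct lexical weighting.
    """
    fields = [f.strip() for f in line.split("|||")]
    source, target, scores, alignment = fields[:4]
    probs = [float(x) for x in scores.split()]
    points = [tuple(map(int, p.split("-"))) for p in alignment.split()]
    return source, target, probs, points

src, tgt, probs, points = parse_phrase_entry(
    "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1")
```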

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
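On the source side, the SUBSTITUTE operation itself is simple list surgery over token positions. A minimal sketch (the target side is rebuilt analogously through the alignment points, and candidates are filtered afterwards):

```python
def substitute(segment, gap, subsegment):
    """Fill the token gap (start, end), end exclusive, of a known TM segment
    with another subsegment, producing a new candidate source segment."""
    start, end = gap
    return segment[:start] + subsegment + segment[end:]
```

This mirrors the Table 1 example: substituting "následujících" for "těchto" yields "lze rozdělit do následujících kategorií".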

During the non-overlapping combination of subsegments, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
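The joinability test at the heart of both JOIN variants reduces to span arithmetic over token indexes. A sketch, with the ordering heuristic and the language-model fluency check described above omitted:

```python
def join_spans(a, b):
    """Merge two subsegment spans, given as (start, end) token indexes
    (end exclusive), if they overlap (JOINO) or are adjacent (JOINN)
    in the input segment; return None when there is a gap between them."""
    (s1, e1), (s2, e2) = sorted([a, b])
    if s2 < e1:     # JOINO: overlapping subsegments
        return (s1, max(e1, e2))
    if s2 == e1:    # JOINN: neighbouring subsegments
        return (s1, e2)
    return None     # a gap: not joinable, a SUBSTITUTE step would be needed
```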

² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software; the same evaluation is used by translation companies for assessing the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables); each category denotes how many words from the segment were found in the analysed TM. A 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment then count as matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show the analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match     TM     SUB     OJ      NJ      all
100%      0.41   0.12   0.10    0.17    0.46
95-99%    0.84   0.91   0.64    0.90    1.37
85-94%    0.07   0.05   0.25    0.76    0.81
75-84%    0.80   0.91   1.71    3.78    4.40
50-74%    8.16  10.05  25.09   40.95   42.58
any      10.28  12.04  27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08   0.07    0.11    0.28
95-99%    0.75   0.44    0.49    0.66
85-94%    0.05   0.08    0.49    0.61
75-84%    0.46   0.96    3.67    3.85
50-74%   10.24  27.77   41.90   44.47
all      11.58  29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15   0.13    0.29    0.57
95-99%    0.98   0.59    1.24    1.45
85-94%    0.09   0.22    1.34    1.37
75-84%    1.03   2.26    6.35    7.07
50-74%   12.15  34.84   49.82   51.62
all      14.40  38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; in this case, however, we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature    SUB    OJ    NJ    NS
prec      0.60  0.63  0.70  0.66
recall    0.67  0.74  0.74  0.71
F1        0.64  0.68  0.72  0.68
METEOR    0.31  0.37  0.38  0.38

DGT:
prec      0.76  0.93  0.91  0.81
recall    0.78  0.86  0.88  0.85
F1        0.77  0.89  0.89  0.83
METEOR    0.40  0.50  0.51  0.45

Table 6: Error example, Czech → English.

source seg.:     Oblast dat může mít libovolný tvar
reference:       The data area may have an arbitrary shape
generated seg.:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases that would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1Jadavpur University, India
2Saarland University, Germany
3German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows

it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simple, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors whether to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we   want   to      have   a   table   near   the   window   .
|    D      S       S      |   |       S      |    |         |
we   -      would   like   a   table   by     the   window   .
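The alignment above can be approximated with a plain dynamic-programming edit alignment. The following is a simplified, hypothetical sketch (word-level Levenshtein with traceback; unlike full TER/tercom, it does not model the shift operation):

```python
def edit_alignment(hyp, ref):
    """Word-level edit alignment between a TM match (hyp) and an input
    sentence (ref); returns a C/S/D/I operation sequence. A simplified
    stand-in for tercom: shifts are not modeled."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = minimum edits to turn h[:i] into r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = dp[i - 1][j - 1] + (h[i - 1] != r[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # trace back to recover the operation sequence
    ops, i, j = [], len(h), len(r)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (h[i - 1] != r[j - 1]):
            ops.append('C' if h[i - 1] == r[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append('D')  # word in the TM match that the input lacks
            i -= 1
        else:
            ops.append('I')  # word in the input that the TM match lacks
            j -= 1
    return ops[::-1]

ops = edit_alignment("we want to have a table near the window",
                     "we would like a table by the window")
print(ops)  # a minimal alignment: 5 matches, 3 substitutions, 1 deletion
```

Exactly which of the equal-cost paths the traceback picks may differ from tercom's output, but the edit counts agree with the example above (ignoring the final period).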

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We can directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
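This weighting scheme can be sketched as follows. The paper does not disclose the actual weight values used in CATaLog, so the numbers below are purely illustrative; deletions are simply made an order of magnitude cheaper than the other operations:

```python
# Illustrative weights (assumptions, not CATaLog's real values):
# deletion is cheap for post-editors, the other edits are not.
WEIGHTS = {'C': 0, 'D': 1, 'I': 10, 'S': 10}

def weighted_score(ops):
    """Score a TER-style edit-operation sequence; lower is better."""
    return sum(WEIGHTS[op] for op in ops)

# Test segment "how much does it cost" (5 words) against:
ops_tm1 = ['C'] * 5 + ['D'] * 6   # TM segment 1: 6 extra words to delete
ops_tm2 = ['C'] * 2 + ['I'] * 3   # TM segment 2: 3 missing words to insert

print(weighted_score(ops_tm1))  # 6  -> TM segment 1 is preferred
print(weighted_score(ops_tm2))  # 30
```

With uniform weights (plain TER), TM segment 2 would win with 3 edits against 6; the low deletion weight reverses that ranking, matching the argument above.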

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. Green portions indicate matched fragments and red portions indicate mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
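This alignment notation can be parsed mechanically. The sketch below assumes the whitespace-separated "word ( target positions )" format shown in the example, not the full GIZA++ A3 output file with its sentence-score headers:

```python
import re

def parse_giza_alignment(line):
    """Parse a GIZA++-style alignment line 'word ( i j ) word ( ) ...'
    into (source_word, [1-based target positions]) pairs, dropping NUL."""
    pairs = re.findall(r'(\S+)\s+\(\s*([\d\s]*)\)', line)
    return [(word, [int(i) for i in idx.split()])
            for word, idx in pairs if word != 'NUL']

line = ('NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) '
        'table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )')
print(parse_giza_alignment(line))
# [('we', [1]), ('want', [6]), ('to', []), ('have', []), ('a', [4]),
#  ('table', [5]), ('near', [3]), ('the', []), ('window', [2]), ('.', [7])]
```

Unaligned source words (e.g. "to", "have", "the") come out with an empty position list, which is what the color-projection step below needs.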

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, in the selected source sentences and their corresponding translations.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
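One plausible way to combine the two alignment sets is sketched below; the tool's actual projection logic is not spelled out in the paper, so this is an assumption. Each TM source word is labeled from the TER operations (C is green, S and D are red; I does not consume a source word), and the labels are then projected onto the target words through the word alignment, with unaligned target words defaulting to red:

```python
def color_code(ter_ops, word_alignment):
    """ter_ops: TER operations over the TM source words.
    word_alignment: (source_word, [1-based target positions]) pairs,
    as produced by a word aligner for the TM sentence pair."""
    src_colors = ['green' if op == 'C' else 'red'
                  for op in ter_ops if op != 'I']
    tgt_len = max((p for _, ps in word_alignment for p in ps), default=0)
    tgt_colors = ['red'] * tgt_len   # unaligned target words default to red
    for (_, positions), color in zip(word_alignment, src_colors):
        for p in positions:
            tgt_colors[p - 1] = color
    return src_colors, tgt_colors

# TM match "we want to have a table near the window" vs. input
# "we would like a table by the window": TER ops from the earlier
# example, plus the source-target word alignment of the TM pair.
ops = ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C']
align = [('we', [1]), ('want', [6]), ('to', []), ('have', []), ('a', [4]),
         ('table', [5]), ('near', [3]), ('the', []), ('window', [2])]
src, tgt = color_code(ops, align)
print(src)  # one green/red label per TM source word
print(tgt)  # one green/red label per Bengali target word
```

In this example the target side comes out as positions 1, 2, 4, 5 green and 3, 6 red, mirroring how "near" (red) aligns to target position 3 and "want" (red) to position 6.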

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you've got the wrong number

3 you are wrong

4 you pay me


5 you're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e. source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to choose whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In an ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
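A minimal sketch of such an inverted index follows; the stop-word list is illustrative (the paper does not specify one), and segments are keyed by their position in the TM:

```python
from collections import defaultdict

STOP_WORDS = {'the', 'a', 'an', 'to', 'of', 'is', 'are'}  # illustrative list

def build_index(tm_sources):
    """Map each (lowercased, surface-form) content word to the ids of the
    TM segments containing it; no stemming, as described above."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for word in sent.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(index, query):
    """Ids of TM segments sharing at least one vocabulary word with the
    input sentence; only these are scored with TER."""
    ids = set()
    for word in query.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["how much does it cost to the holiday inn by taxi",
      "how much",
      "you are wrong"]
idx = build_index(tm)
print(sorted(candidates(idx, "how much does it cost")))  # [0, 1]
```

Only segments 0 and 1 are passed on to the TER scorer here; segment 2 shares no content word with the query and is skipped, which is exactly the search-space reduction described above.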

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9

Chatzitheodorou, Konstantinos, 24

Horak, Ales, 31

Medveď, Marek, 31
Mitkov, Ruslan, 17

Naskar, Sudip Kumar, 36
Nayek, Tapas, 36

Pal, Santanu, 36
Parra Escartín, Carla, 1

Raisa Timonera, Katerina, 17

van Genabith, Josef, 36
Vela, Mihaela, 36

Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 1–8, Hissar, Bulgaria, Sept 2015

Creation of new TM segments: Fulfilling translators' wishes

Carla Parra Escartín
Hermes Traducciones
C/ Cólquide 6, portal 2, 3º I
28230 Las Rozas, Madrid, Spain

carla.parra@hermestrans.com

Abstract

This paper aims at bridging the gap between academic research on Translation Memories (TMs) and the actual needs and wishes of translators. Starting from an internal survey in a translation company, we analyse what translators wished translation memories offered them. We are currently implementing one of their suggestions to assess its feasibility and whether or not it retrieves more TM matches, as originally expected. We report on how the suggestion is being implemented and the results obtained in a pilot study.

1 Introduction

Professional translators use translation memories on a daily basis. In fact, most translation projects nowadays require the usage of Computer Assisted Translation (CAT) tools. Sometimes translators will freely choose to work with a particular tool, and sometimes it will be the client who imposes the usage of such a tool. One of the main advantages of CAT tools is that they allow for the integration of different productivity enhancement features.

Translation Memories (TMs) are used to retrieve past translations and partial past translations (fuzzy matches). Terminology databases (TBs) are used to enhance the coherence of terminology and to ensure that the right terminology is used in all projects. Moreover, some CAT tools offer additional functionalities, such as the usage of predictive texts (Autosuggest in SDL Trados Studio1 and Muses in MemoQ2), the automatic assembly of fragments to produce translations of new segments, or specific customizable Quality Assurance (QA) features.

1 www.sdl.com
2 www.memoq.com

In the context of a translation company, the usage of these productivity enhancement tools is part of the whole translation workflow. Project managers use them to generate word counts and estimate the time and resources needed to produce the translation. They also use them to pre-translate files and clean them up prior to delivery to the client. Translators use these tools to translate, revise, and proofread the translations prior to final delivery.

Several researchers have worked on enhancing TMs using Natural Language Processing (NLP) techniques (Hodász and Pohl, 2005; Pekar and Mitkov, 2007; Mitkov and Corpas, 2008; Simard and Fujita, 2012; Gupta and Orasan, 2014; Vanallemeersch and Vandeghinste, 2015; Gupta et al., 2015). Despite reporting positive results, it seems that the gap between academic research and industry still exists. We carried out an internal survey in our company to detect potential new features for TMs that could be implemented by our R&D team. The aim was to identify potential new features that project managers and translators wished for and that would enhance their productivity. In this paper, we report on the results of that survey and further explain how we are implementing one of the suggestions we received. The remainder of this paper is structured as follows: our company is briefly introduced in Section 2, and Section 3 summarises the survey we carried out and the replies we got. Section 4 explains how we implemented one of the suggestions we received, and Subsections 4.1–4.3 describe how we created new TM segments out of existing TMs. Section 5 reports on the evaluation in a pilot test to assess the real usability of this approach. Finally, Section 6 summarises our work and discusses future work.

2 Our company

Hermes is a leading company in the translation industry in Spain. It was founded in 1991 and has vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium, and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation, which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).

3 Internal survey

We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them what was, according to them, the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e. re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 20143 and memoQ4, it is not always carried out automatically and it is difficult to estimate

3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information see http://producthelp.sdl.com/SDL%20Trados%20Studio/client_en/Ref/O-T/TM_Settings/Translation_Memory_Settings_Dialog_Box__Performance_and_Tuning.htm

(4) memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information see httpkilgraycommemoq2015help-enindexhtmlrepair_resource3html

when such reorganisation should be carried out. Scheduling it to run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.
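This scheduling idea can be sketched as follows (a minimal illustration; the class and callback names are ours, and the 500-segment threshold is the example value from the text; a real implementation would call the CAT tool's own re-indexing routine):

```python
# Sketch: trigger a TM reorganisation once a threshold of newly added
# segments has accumulated since the last reorganisation.
REORG_THRESHOLD = 500  # segments added since the last reorganisation

class TMMaintenanceScheduler:
    def __init__(self, reorganise_tm, threshold=REORG_THRESHOLD):
        self.reorganise_tm = reorganise_tm  # hypothetical re-indexing callback
        self.threshold = threshold
        self.added_since_reorg = 0

    def on_segment_added(self, n=1):
        # Count new segments; reorganise and reset when the threshold is hit.
        self.added_since_reorg += n
        if self.added_since_reorg >= self.threshold:
            self.reorganise_tm()
            self.added_since_reorg = 0
```
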

Ideas more related to NLP included the automatic correction of orthotypography in the target language, and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an under-usage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.(5)

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's implementation of this functionality: fragment assembly.(6)

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks

(5) For SDL Studio there seems to be an external app, AnyTM, that allows users to use TMs with different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and has become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool see httpwwwtranslationzonecomopenexchangeappsdltradosstudioanytmtranslationprovider-669html

(6) For further information see httpkilgraycommemoq2015help-enindexhtmlfragment_assemblyhtml


for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated, sentence as a match. Alternatively, only the translated fragments will be inserted in the target segment, one after another, without the source-sentence words that could not be retrieved.

Figure 1: memoQ's fragment assembly functionality.

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.

Examples 1 and 2 show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa

(2) EN: The window will show the following: The application will close
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación

Now imagine we need to translate the sentence in Example 3:

(3) EN: The following message then appears: The application will close

Taking the part of Example 1 before the colon and the part of Example 2 after the colon, we would be able to produce the right translation, as shown in Example 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación

In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving such translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next section (4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1-4.3) we describe how each of these types of fragments was treated.


4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in Examples 5 and 6 could appear:

(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment including only the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which only the fragment before or after the colon appears.

(7) Installing new services

In these cases we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
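The three steps above can be sketched as follows (a minimal illustration; the helper name is ours, and keeping the separator attached to the first fragment is our assumption about step 2):

```python
# Sketch: split a bi-segment at an ellipsis or colon present on BOTH
# sides, producing one new TM segment per fragment.
SEPARATORS = ["...", ":"]

def split_bisegment(source, target):
    new_segments = []
    for sep in SEPARATORS:
        # only split when the separator occurs exactly once on each side
        if source.count(sep) == 1 and target.count(sep) == 1:
            src_head, src_tail = source.split(sep)
            tgt_head, tgt_tail = target.split(sep)
            # keep the separator with the first fragment (an assumption)
            new_segments.append((src_head.strip() + sep, tgt_head.strip() + sep))
            new_segments.append((src_tail.strip(), tgt_tail.strip()))
    return new_segments
```
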

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8-10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters, as well as the sentence without such content.

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key or contact your support personnel
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) The sentence with those characters and the content between them removed.

(b) A fragment starting at the opening character and finishing at the closing one, including the content within. In this fragment the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
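A sketch of this strategy for a single sentence (illustrative only; the function name is ours, and nesting is ignored here):

```python
import re

# One pattern per bracket type; group(0) keeps the brackets, group(1) is
# the bare content.
BRACKET_PATTERNS = {
    "paren": re.compile(r"\(([^()]*)\)"),
    "square": re.compile(r"\[([^\[\]]*)\]"),
    "curly": re.compile(r"\{([^{}]*)\}"),
}

def bracket_fragments(sentence):
    """Return (stripped_sentence, [fragments_with_brackets], [content_only])."""
    stripped = sentence
    with_brackets, content_only = [], []
    for pat in BRACKET_PATTERNS.values():
        for m in pat.finditer(sentence):
            with_brackets.append(m.group(0))   # fragment (b)
            content_only.append(m.group(1))    # fragment (c)
        stripped = pat.sub("", stripped)       # fragment (a)
    # collapse doubled spaces left behind after removal
    stripped = re.sub(r"\s{2,}", " ", stripped).strip()
    return stripped, with_brackets, content_only
```

Each returned fragment would then be paired with its counterpart from the target side to form the new TM segments.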

At this preliminary stage, we assumed that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the language pair used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11-12 illustrate this kind of segments.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
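A minimal sketch of this treatment (assuming the numbered placeholder tags take the form <1>...<1>, as in Example 12; the function and pattern names are ours):

```python
import re

TAG_RE = re.compile(r"<(\d+)>(.*?)<\1>")        # paired numbered tags, e.g. <1>...<1>
QUOTE_RE = re.compile(r"\"([^\"]+)\"|“([^”]+)”")  # straight or curly double quotes

def quote_tag_segments(sentence):
    """Return (sentence without quotes/tags, [inner texts as new segments])."""
    new_segments = []
    for m in TAG_RE.finditer(sentence):
        new_segments.append(m.group(2))                      # text inside the tags
    for m in QUOTE_RE.finditer(sentence):
        new_segments.append(m.group(1) or m.group(2))        # text inside quotes
    without = TAG_RE.sub(lambda m: m.group(2), sentence)     # drop tags, keep text
    without = QUOTE_RE.sub(lambda m: m.group(1) or m.group(2), without)
    return without, new_segments
```
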

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to what extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English, to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of

our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool, using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match      Words   Segments
Repetitions    1064         80
100%              0          0
95-99%            4          2
85-94%            0          0
75-84%          285         14
50-74%         2523        187
No Match       2404        142
Total          6280        425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and was thus the biggest one. Table 2 summarises the size of the three TMs.

              Segments   Words (EN)   Words (ES)   Words (total)
Project TM       16842       212472       244159         456631
Product TM       20923       274542       317797         592339
Client TM       256099      3427861      3951732        7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segmentsgenerated using each strategy and for each TM


                           Segments   Words (EN)   Words (ES)   Words (total)
Project TM new segments       6776        56973        66297         123270
Product TM new segments       7760        71034        83125         154159
Client TM new segments       74041       662714       769705        1432419

Table 3: Size of the new TMs generated using our approach.

used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment-generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

Type              Project TM   Product TM   Client TM
Ellipsis                   7            6          50
Colon                   1801         2094       17894
Parentheses             1361         1621       29637
Square brackets           78           73         598
Curly braces               0            0           0
Quotations              1085         1146        8797
Tags                    2523         2892       17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations, to assess which combination performed better as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new-segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM gen-


             TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10  TM11
Not started  234   204   150   202   143   143    38    27    26    26    26
Pre-trans    155   111   178   108   179   183   139   199   205   201   205
Fragments     36   110    97   115   103    99   248   199   194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

             TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10  TM11
100%           0     5     5     5     5     5     5     5     5     5     5
95-99%         2     3     3     4     4     4     9     9     9     9     9
85-94%         0     1     1     1     1     1     2     2     2     2     2
75-84%        14     9    14     9    15    15    12    16    16    16    16
50-74%       187   130   198   136   202   202   180   223   226   226   226
No match     142   197   124   190   118   118   137    90    87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

erated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 to 201), while the number of segments translated using fragments increases slightly (194 to 198).

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 187 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable even when the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy-match retrieval, as well as on the generation of translations

from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach, only making use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language-independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination ofalready existing methods for retrieving a higher


number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory - perceptions, expectations and reality. The Journal of Specialised Translation, (23):64-88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association for Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25-30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9-16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory(1) (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

(1) https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.
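As an illustration of this framing, a minimal scikit-learn sketch (the two features and the toy data here are our illustrative assumptions, not the feature set or the classifiers actually evaluated in the paper):

```python
import math
from sklearn.ensemble import RandomForestClassifier

def features(source, target):
    # Hypothetical features: Church-Gale length score and a crude
    # shared-token ratio between the two sides.
    ls, ld = len(source), len(target)
    cg = (ls - ld) / math.sqrt(3.4 * (ls + ld)) if ls + ld else 0.0
    shared = len(set(source.lower().split()) & set(target.lower().split()))
    return [cg, shared / max(1, len(source.split()))]

# Toy training data: (source, target, is_true_translation)
train = [
    ("the cat", "il gatto", 1),
    ("click accept", "haga clic en aceptar", 1),
    ("how are you", "bene", 0),
    ("early 1980s muirfield cc", "primi anni 1980", 0),
]
X = [features(s, t) for s, t, _ in train]
y = [label for _, _, label in train]
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
```

Any off-the-shelf classifier could be dropped into the same yes/no interface.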

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory, and shows how we built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench.(2) Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, whether an opened tag is closed, whether a word is repeated, or whether a word is misspelled. It also integrates terminological dictionaries and verifies whether the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g., tags not closed), or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning, as envisioned in Section 1, is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in that approach are different from those in ours.

(2) http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases where one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases where a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g., the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.


It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length.(3) To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in Equation 1:

CG = (l_s - l_d) / sqrt(3.4 * (l_s + l_d))    (1)

In Figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.(4) The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right, but close to a normal plot. In Figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [-4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in Figure 3. In Figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com


Figure 2: The distribution of Church-Gale scores in the Matecat Set


Figure 3: The distribution of Church-Gale scores in the Collaborative Set



Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set

Church-Gale score varies in the interval [-131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile (Q-Q) plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling toward the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: It contains 1243 bi-segments and has 373 negative examples.

• Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
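The biased sampling step described above can be sketched in Python. This is our own illustration, assuming bi-segments paired with precomputed Church-Gale scores; weighting `random.choices` by the absolute score is one simple way to favor the tails of the distribution:

```python
import random

def sample_biased(bisegments, cg_scores, k, seed=0):
    # Draw k bi-segments, biasing the sampling toward extreme Church-Gale
    # scores, i.e. away from the center of the distribution, where the
    # suspicious bi-segments are expected to live.
    rng = random.Random(seed)
    weights = [abs(s) + 1e-6 for s in cg_scores]  # extreme scores weigh more
    return rng.choices(bisegments, weights=weights, k=k)

# A segment with |CG| = 10 is drawn far more often than one with CG = 0.
drawn = sample_biased(["centered", "extreme"], [0.0, 10.0], k=100, seed=42)
```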


Figure 5: The Q-Q plot for the Matecat Set


Figure 6: The Q-Q plot for the Collaborative Set


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g., when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.

• has url: The value of the feature is 1 if the source or target segment contains a URL address, and 0 otherwise.

• has tag: The value of the feature is 1 if the source or target segment contains a tag, and 0 otherwise.

• has email: The value of the feature is 1 if the source or target segment contains an email address, and 0 otherwise.

• has number: The value of the feature is 1 if the source or target segment contains a number, and 0 otherwise.

• has capital letters: The value of the feature is 1 if the source or target segment contains words with at least one capital letter, and 0 otherwise.

• has words capital letters: The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters, and 0 otherwise. Unlike the previous feature, this one is activated only when there exist whole words in capital letters.

• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g., the tag exists/does not exist, and if it exists, it is present/is not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.


• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source segment and the target segment, and the total number of capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source segment and the target segment, and the total number of only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
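Two of these features can be sketched in a few lines of Python. The punctuation inventory and the pairwise comparison used for lang dif below are our own assumptions, not the paper's exact implementation:

```python
import math
from collections import Counter

PUNCT = set(".,;:!?\"'()")  # assumed punctuation inventory

def punctuation_similarity(src, tgt):
    # Cosine similarity between the punctuation-count vectors
    # of the source and target segments.
    v1 = Counter(c for c in src if c in PUNCT)
    v2 = Counter(c for c in tgt if c in PUNCT)
    if not v1 or not v2:
        return 0.0
    dot = sum(v1[c] * v2[c] for c in v1)
    norm = math.sqrt(sum(x * x for x in v1.values())) * \
           math.sqrt(sum(x * x for x in v2.values()))
    return dot / norm

def lang_dif(declared, detected):
    # Number of mismatches between declared and detected language codes,
    # e.g. (("en", "it"), ("en", "fr")) -> 1.
    return sum(1 for a, b in zip(declared, detected) if a != b)
```

Identical punctuation yields a similarity of 1.0, while disjoint punctuation yields 0.0.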

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation is detected. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. Random Forests have a lower probability of overfitting the data than Decision Trees.

• Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The predicted output is the majority label among those neighbors. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
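The paper relies on scikit-learn's implementations of these classifiers. As an illustration of the non-parametric k-NN idea from the last bullet, here is a minimal pure-Python sketch (function names ours):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # Classify `query` by majority vote among the k nearest training points.
    # `train` is a list of (feature_vector, label) pairs.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated clusters: points near (0, 0) get label 0,
# points near (5, 5) get label 1.
train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
print(knn_predict(train, (0.5, 0.5)))  # -> 0
print(knn_predict(train, (5.5, 5.5)))  # -> 1
```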

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 percentage points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 percentage points, roughly half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
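The precision/recall trade-off discussed here follows directly from the confusion counts. As our own quick check of the metrics implied by the numbers reported above (small deviations from the rounded table figures are to be expected):

```python
def prf(tp, fp, fn):
    # Precision, recall and F1 from confusion-table counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# SVM confusion counts on the 309-example test set:
# 175 TP, 42 FP, 32 FN, 60 TN.
p, r, f = prf(175, 42, 32)

# The 32 false negatives are good bi-segments that would be deleted:
fn_share = 32 / 309  # roughly 10% of all examples
```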

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the Matecat Set and more concentrated for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
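The Levenshtein-based fuzzy matching that the example above reasons about can be sketched as follows. The character-level normalization used here is one common choice for a similarity score; actual TM tools differ in the exact formula, so this is an illustration rather than any specific tool's algorithm:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(seg, tm_seg):
    # Similarity in [0, 1]: 1 means identical, 0 means entirely different.
    m = max(len(seg), len(tm_seg), 1)
    return 1 - levenshtein(seg, tm_seg) / m

# A short clause scores low against a long TM sentence that contains it,
# but scores 1.0 against the same clause stored as a separate segment.
clause = "the ministry of defense once indicated"
long_sentence = clause + " that about 20000 soldiers were missing"
print(fuzzy_score(clause, long_sentence))  # well below typical thresholds
print(fuzzy_score(clause, clause))         # exact match after splitting
```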

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:

Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                        W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   23.33%                90.00%
Retrieved (minimum threshold)   38.00%                92.22%

TRADOS                          W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   14.00%                88.89%
Retrieved (minimum threshold)   14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST                        W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   2.67%                 17.84%
Retrieved (minimum threshold)   10.00%                41.08%

TRADOS                          W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   3.33%                 25.95%
Retrieved (minimum threshold)   36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause, and is thus split by the clause splitter, we take the average match score of the clauses and count this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
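The paired comparison described above can be sketched as follows. The match scores below are illustrative only (not the paper's data), and the t statistic is computed with the standard paired formula rather than SPSS; as in the paper, a split input contributes the average of its clause scores as a single case.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(before, after):
    """Paired t statistic over per-segment match scores.
    Unretrieved segments score 0; a multi-clause input contributes
    the average of its clauses' scores as a single case."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Illustrative match percentages for four segments, without and
# with clause splitting; the first input was split into two
# clauses, so its two clause scores are averaged into one case.
without_split = [0, 0, 50, 0]
with_split = [mean([85, 90]), 90, 95, 80]

t_stat = paired_t(without_split, with_split)
```

With scores this far apart the statistic is large (here about 7.25), in line with the paper's finding that the difference is highly significant.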

Table 5. Paired t-test on all experiments (mean match scores).

SET A
WORDFAST            Without clause splitting   With clause splitting   p-value
Default threshold   19.23                      85.31                   0.000
Minimum threshold   29.28                      86.49                   0.000
TRADOS
Default threshold   11.78                      83.88                   0.000
Minimum threshold   11.78                      88.07                   0.000

SET B
WORDFAST            Without clause splitting   With clause splitting   p-value
Default threshold    2.23                      17.21                   0.000
Minimum threshold    6.49                      30.62                   0.000
TRADOS
Default threshold    2.71                      23.08                   0.000
Minimum threshold   16.83                      46.37                   0.000

Table 4. Percentage of correctly retrieved segments in Set B.

WORDFAST                        W/o clause splitting   With clause splitting
Retrieved (default threshold)    2.67                  17.84
Retrieved (minimum threshold)   10.00                  41.08
TRADOS
Retrieved (default threshold)    3.33                  25.95
Retrieved (minimum threshold)   36.00                  70.67

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments one of whose clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments were not imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that what is retrieved on the target side is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting, and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner.

Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou

School of Italian Language and Literature

Aristotle University of Thessaloniki

University Campus

54124 Thessaloniki Greece

chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which does not allow the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of support verb constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue growing for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique can be very useful for a translator, because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work they generate additional segments, based on the paraphrases in a database, while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human-translated hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM, so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items and no machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed to make the two sentences identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information.
(2) Press Cancel to cancel your personal information.
(3) Premere Cancel per cancellare i propri dati personali.
(4) Press Cancel to cancel your booking information.
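A word-level version of this computation can be sketched as follows. Note that the normalisation used here (dividing by the longer segment's length) is only one common variant of the fuzzy match formula, so it yields 60% for sentences (1) and (2) rather than the 70% quoted above, which is based on a different word count.

```python
def edit_distance(a, b):
    """Levenshtein distance over token sequences
    (deletions, insertions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(source, candidate):
    """Word-based fuzzy match score in the spirit of Koehn (2010):
    1 - edit_distance / length of the longer segment."""
    s, c = source.split(), candidate.split()
    return 1 - edit_distance(s, c) / max(len(s), len(c))

score = fuzzy_match(
    "Press Cancel to make the cancellation of your personal information",
    "Press Cancel to cancel your personal information")
```

Here the distance is 4 edits (1 substitution, 3 deletions) over a 10-word segment, giving a score of 0.6.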

In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score, and would then have to be post-edited; otherwise the translator would have to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: apart from some exceptions, they appear either with a bare direct object (e.g. to make attention) or with a determined direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases the SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units; hence the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008) machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part of speech (e.g. V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1. Dictionary entries in NooJ for nouns.

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipeline steps.

Figure 2. Pipeline of the paraphrase framework.

The first step is the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and paraphrased.

Figure 3. Local grammar for identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its verbal derivation, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
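A heavily simplified, regex-based analogue of this local grammar is sketched below. The support-verb list and the nominalization lexicon are illustrative stand-ins for NooJ's dictionaries, and unlike the NooJ module the sketch does not conjugate the generated verb, handling only the base form.

```python
import re

# Toy nominalization lexicon: predicate noun -> full verb.
# Illustrative entries, not taken from NooJ's actual dictionary.
NOMINALIZATIONS = {
    "cancellation": "cancel",
    "installation": "install",
    "reservation": "reserve",
}

# Support verb + optional determiner + predicate noun + optional "of".
SVC_PATTERN = re.compile(
    r"\b(?:make|makes|take|takes|give|gives|have|has)"
    r"\s+(?:the\s+|a\s+|an\s+)?(\w+)(?:\s+of)?\b")

def paraphrase_svc(sentence):
    """Replace recognized SVCs with their full-verb paraphrase,
    leaving everything else (including unknown nouns) untouched."""
    def repl(match):
        noun = match.group(1).lower()
        verb = NOMINALIZATIONS.get(noun)
        return verb if verb else match.group(0)
    return SVC_PATTERN.sub(repl, sentence)

result = paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information")
```

Applied to sentence (1), the sketch yields sentence (2): "Press Cancel to cancel your personal information".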

The last step comprises the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
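Ignoring TM file formats, the overall expansion step reduces to the following sketch, in which `paraphrase` stands in for the whole NooJ pipeline; the function and data shown are illustrative, not the paper's.

```python
def expand_tm(units, paraphrase):
    """units: list of (source, target) pairs from the original TM.
    Appends one new unit per source segment whose SVCs could be
    paraphrased; the target side is reused verbatim, since the
    method never touches target translation units."""
    expanded = list(units)
    for src, tgt in units:
        new_src = paraphrase(src)
        if new_src != src:          # only add genuinely new sources
            expanded.append((new_src, tgt))
    return expanded

# Tiny illustrative TM and a stand-in paraphraser for one SVC.
tm = [
    ("You must make a reservation now", "Devi prenotare ora"),
    ("Press Cancel", "Premere Cancel"),
]
toy_paraphrase = lambda s: s.replace("make a reservation", "reserve")
new_tm = expand_tm(tm, toy_paraphrase)
```

This is the step that, in the experiments below, grows the 1,025-unit TM into a 1,612-unit one: each recognized SVC contributes one extra source variant paired with the original target.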

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually, in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, such that it at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95-99% category, 6% in the 85-94% category, 28% in the 75-84% category and, finally, 27% in the 0-74% category (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results, and hence productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No match                 62      35
Total                   200     200

Table 1. Statistics for experimental data.

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.

While there are some drops in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words, or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4. Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs, and support for other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly, manually created resources of sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available, DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009) to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ          EN          probabilities                alignment
být větší   be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
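A record like those in Figure 1 could be handled roughly as follows. This is a sketch only: the Moses-style `|||`-separated layout, the field order, and the helper names (`parse_entry`, `best_translation`) are illustrative assumptions, not the authors' actual code.

```python
def parse_entry(line):
    """Parse one 'source ||| target ||| p1 p2 p3 p4 ||| alignment' record."""
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    inv_phrase, inv_lex, dir_phrase, dir_lex = map(float, probs.split())
    # alignment points like "0-0 1-1" become (source, target) index pairs
    points = [tuple(map(int, p.split("-"))) for p in align.split()]
    return {"src": src, "tgt": tgt, "inv_phrase": inv_phrase,
            "inv_lex": inv_lex, "dir_phrase": dir_phrase,
            "dir_lex": dir_lex, "align": points}

def best_translation(entries):
    """One plausible selection rule: highest direct phrase probability."""
    return max(entries, key=lambda e: e["dir_phrase"])

entries = [parse_entry(l) for l in [
    "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1",
    "být větší ||| be larger ||| 0.170 0.054 0.019 0.148 ||| 0-0 1-1",
]]
print(best_translation(entries)["tgt"])  # be greater
```

With multiple candidate translations per subsegment, any monotone combination of the four scores could serve as the selection criterion; the direct phrase probability is used here only for concreteness.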

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; the output is OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; the output is NJ.

1 http://www.statmt.org/moses?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
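The paper uses a KenLM model trained on enTenTen for this fluency score; as a self-contained stand-in, the toy model below illustrates the idea of a combined n-gram score for n = 1..5. The add-one smoothing and the log-probability averaging are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class ToyLM:
    """Tiny add-one-smoothed n-gram counter; a KenLM stand-in (assumption)."""
    def __init__(self, corpus, max_n=5):
        self.max_n = max_n
        self.counts = {n: Counter() for n in range(1, max_n + 1)}
        for sent in corpus:
            toks = sent.split()
            for n in range(1, max_n + 1):
                self.counts[n].update(ngrams(toks, n))

    def score(self, sentence):
        """Combined fluency score: mean smoothed log-probability over n = 1..5."""
        toks = sentence.split()
        total, terms = 0.0, 0
        for n in range(1, self.max_n + 1):
            denom = sum(self.counts[n].values()) + len(self.counts[n]) + 1
            for g in ngrams(toks, n):
                total += math.log((self.counts[n][g] + 1) / denom)
                terms += 1
        return total / max(terms, 1)

lm = ToyLM(["a table by the window", "a table near the door"])
# a fluent ordering scores higher than a scrambled one
assert lm.score("a table by the window") > lm.score("window the by table a")
```

In the actual system a real language model (KenLM over a large corpus) would replace this toy counter; only the combination-and-rank step is the point here.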

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
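A minimal sketch of the JOIN idea, assuming subsegments are represented by (start, end) token spans in the input segment. The greedy longest-first merging below only approximates the loop described above; it is not the authors' exact algorithm.

```python
def merge(a, b):
    """Combine two token spans if they overlap (JOINO) or neighbour (JOINN)."""
    if max(a[0], b[0]) <= min(a[1], b[1]) + 1:
        return (min(a[0], b[0]), max(a[1], b[1]))
    return None

def join_spans(spans):
    """Greedy JOIN sketch: start from the longest subsegment span and keep
    absorbing joinable spans until no join applies; returns the final cover."""
    pending = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    cur, pending = pending[0], pending[1:]
    changed = True
    while changed:
        changed = False
        for other in list(pending):
            m = merge(cur, other)
            if m:
                cur = m
                pending.remove(other)
                changed = True
    return cur

# spans over an 8-token segment: joining yields the full cover (0, 7)
print(join_spans([(0, 2), (2, 4), (5, 7), (4, 5)]))  # (0, 7)
```

In the full method each successful merge would also be scored with the language model before being kept, which this sketch omits.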

2 In current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods.

All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95–99%   0.84    0.91    0.64    0.90    1.37
85–94%   0.07    0.05    0.25    0.76    0.81
75–84%   0.80    0.91    1.71    3.78    4.40
50–74%   8.16    10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95–99%   0.75    0.44    0.49    0.66
85–94%   0.05    0.08    0.49    0.61
75–84%   0.46    0.96    3.67    3.85
50–74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95–99%   0.98    0.59    1.24    1.45
85–94%   0.09    0.22    1.34    1.37
75–84%   1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup.

For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric.

We have also measured the asset of particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filters for removing phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. The allowable edit operations are insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
"we","we",C,0
"want","",D,0
"to","would",S,0
"have","like",S,0
"a","a",C,0
"table","table",C,0
"near","by",S,0
"the","the",C,0
"window","window",C,0
".",".",C,0

we want to have a table near the window .
|   D    S    S   |   |     S   |    |   |
we  -  would like  a table  by the window .
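The TER-style alignment shown above can be approximated with a plain token-level Levenshtein traceback. Real TER additionally allows block shifts, which this sketch omits, and punctuation is left out for brevity; the function name is an illustrative assumption.

```python
def edit_alignment(hyp, ref):
    """Return the C/S/I/D operation string turning hyp into ref."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = minimum number of edits aligning h[:i] with r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        dp[i][0] = i
    for j in range(1, len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            dp[i][j] = min(dp[i-1][j-1] + (h[i-1] != r[j-1]),
                           dp[i-1][j] + 1,    # delete from hyp
                           dp[i][j-1] + 1)    # insert into hyp
    ops, i, j = [], len(h), len(r)
    while i or j:                             # trace back one cheapest path
        if i and j and dp[i][j] == dp[i-1][j-1] + (h[i-1] != r[j-1]):
            ops.append("C" if h[i-1] == r[j-1] else "S")
            i, j = i - 1, j - 1
        elif i and dp[i][j] == dp[i-1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return "".join(reversed(ops))

print(edit_alignment("we want to have a table near the window",
                     "we would like a table by the window"))
# CDSSCCSCC (the | D S S | | S | | | pattern from the example)
```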

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
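This preference can be sketched as a weighted edit distance in which deletions are nearly free. The particular weights below are illustrative assumptions, not the values used in CATaLog.

```python
W_INS, W_DEL, W_SUB = 1.0, 0.1, 1.0   # illustrative weights (assumption)

def weighted_edit_cost(tm, test):
    """Weighted Levenshtein cost of editing the TM segment into the test
    segment; deletions from the TM side are almost free."""
    t, s = tm.split(), test.split()
    dp = [[0.0] * (len(s) + 1) for _ in range(len(t) + 1)]
    for i in range(1, len(t) + 1):
        dp[i][0] = dp[i-1][0] + W_DEL
    for j in range(1, len(s) + 1):
        dp[0][j] = dp[0][j-1] + W_INS
    for i in range(1, len(t) + 1):
        for j in range(1, len(s) + 1):
            dp[i][j] = min(dp[i-1][j-1] + (0.0 if t[i-1] == s[j-1] else W_SUB),
                           dp[i-1][j] + W_DEL,   # cheap deletion
                           dp[i][j-1] + W_INS)   # expensive insertion
    return dp[-1][-1]

test = "how much does it cost"
tm1 = "how much does it cost to the holiday inn by taxi"
tm2 = "how much"
# segment 1 (6 cheap deletions) now beats segment 2 (3 expensive insertions)
assert weighted_edit_cost(tm1, test) < weighted_edit_cost(tm2, test)
```

With equal weights the ordering would flip, which is exactly the effect discussed in the example above.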

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই ।

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red in the selected source sentences and their corresponding translations.

Color coding the TM source sentences makes explicit which portions of matching TM source sentences match with the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you've got the wrong number

3 you are wrong

4 you pay me


5 you're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )

For the input sentence shown above, the TM system shows the above-mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation for the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form as in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
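The posting-list filtering can be sketched as follows; the stop-word list and whitespace tokenisation are simplified assumptions.

```python
# Build an inverted index (posting lists) over the TM source side and
# use it to restrict similarity measurement to candidate sentences
# sharing at least one vocabulary word with the input sentence.
STOP_WORDS = {"the", "a", "an", "to", "of", "you", "me", "i"}  # illustrative

def build_posting_lists(tm_sources):
    index = {}
    for sid, sentence in enumerate(tm_sources):
        for word in set(sentence.lower().split()) - STOP_WORDS:
            index.setdefault(word, set()).add(sid)
    return index

def candidate_sentences(index, input_sentence):
    candidates = set()
    for word in set(input_sentence.lower().split()) - STOP_WORDS:
        candidates |= index.get(word, set())
    return candidates

tm = ["you gave me the wrong change",
      "i think you 've got the wrong number",
      "good morning"]
index = build_posting_lists(tm)
# Only sentences 0 and 1 share content words with the input,
# so sentence 2 is never scored.
cands = candidate_sentences(index, "you gave me wrong number")
```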

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example, sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme.

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces

a vast experience in multilingual localisation and translation projects. With offices in Madrid and Málaga, the company has broad knowledge and experience in computer-assisted translation software and specific localisation software, including SDL Trados, memoQ, Déjà Vu, IBM TranslationManager, Star Transit, WordFast, Catalyst, Passolo, Across, Idiom World Server, Microsoft Helium and Microsoft Localisation Studio, among others.

Our company objectives are built upon a solid foundation which has allowed us to achieve a double quality certification for translation services (UNE-EN 15038:2006 and ISO 9001:2008 standards). Our in-house translators and project managers commit daily to providing our clients with high quality translations for different specialised fields (IT, medicine, technical manuals, general texts, etc.).

3 Internal survey

We asked our in-house translators and project managers for potential new functionalities that CAT tools could offer them. More concretely, we asked them which, according to them, was the missing functionality as far as TMs are concerned. In total, 10 project managers and 14 translators participated in this internal survey. While not all had a clear idea of what could be implemented, some interesting suggestions came up.

We gathered ideas such as scheduling automatic TM reorganisation to prevent large TMs from ending up corrupted. It is a well-known fact that large TMs need to be periodically reorganised (i.e., re-indexed and eventually cleaned up). While this feature is available in standard CAT tools such as Studio 2014³ and memoQ,⁴ it is not always carried out automatically, and it is difficult to estimate

3 In previous versions of Studio, TM reorganisation was available for all types of TMs in the Translation Memory Settings Dialog Box > Performance and Tuning. As of Studio 2014, file-based TMs are automatically reorganised, while server-based TMs require periodical reorganisation. For further information, see httpproducthelpsdlcomSDL20Trados20Studioclient_enRefO-TTM_SettingsTranslation_Memory_Settings_Dialog_Box__Performance_and_Tuninghtm

4 memoQ actually has a repairing function, the Translation memory repair wizard, which aims at repairing (i.e., re-indexing) a corrupted TM. According to the documentation, it is also possible to run this function on TMs which are not corrupted. For further information, see httpkilgraycommemoq2015help-enindexhtmlrepair_resource3html

when such reorganisation should be carried out. Scheduling it to be run automatically when a particular number of segments (e.g., 500) have been added since the last reorganisation, for instance, would prevent the loss of a TM because of bad maintenance.

Ideas more related to NLP included the automatic correction of orthotypography in the target language and allowing for multilingual TMs where several source and target languages can be used for concordance searches at once. Although it is currently possible to use multilingual TMs in CAT tools, when starting a new project a language pair has to be selected. This leads to an underusage of the TM, and the potential benefits of querying multiple languages at once are missed. If, for instance, the TM contains a translation unit (TU) for a different pair of languages (e.g., German > Italian) than the ones selected for that specific project (e.g., German > French), and the same source sentence appears in the text currently being translated, the translation into the different target language will not be shown (i.e., the Italian translation of the German TU will not be matched). The same would occur with concordance searches. While matches for a different language pair will not be used in the translation, they may be useful for carrying it out: if a translator understands other target languages, these translations may give them a hint as to how to translate the same sentence into their mother tongue.⁵

Finally, an interesting idea was to generate new segments on the fly from fragments of previously translated segments. Flanagan (2015) offers an interesting overview of the techniques used by different CAT tools for subsegment matching. Here we will focus on memoQ's fragment assembly functionality.⁶

Figure 1 shows how this functionality works in memoQ. As may be observed, memoQ looks

5 For SDL Studio, there seems to be an external app, AnyTM, that allows users to use TMs having different language pairs than the ones in the current translation project. As of Studio 2015, this app has been integrated in the CAT tool and become a new feature. However, this tool does not seem to support the usage of truly multilingual TMs (TMs including several target languages for each segment). For further information on the tool, see httpwwwtranslationzonecomopenexchangeappsdltradosstudioanytmtranslationprovider-669html

6 For further information, see httpkilgraycommemoq2015help-enindexhtmlfragment_assemblyhtml


for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated sentence as a match. Alternatively, only the translated fragments are inserted in the target segment, one after another, without the source sentence words that could not be retrieved.

Figure 1: memoQ's fragment assembly functionality.

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies of occurrence of the fragments to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.

Examples 1 and 2 show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.

Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close.

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.

In technical texts, it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey. The idea was to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parentheses
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3), we describe how each of these types of fragments was treated.


4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:

(5) EN: Installing new services… Service XXXX for premium clients [2]
ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears:

(7) Installing new services…

In these cases, we proceeded as follows:

1. Check that there is an ellipsis or a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
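The steps above can be sketched as follows, assuming the segments are plain strings (real TMX handling and whitespace normalisation are omitted):

```python
def split_on_marker(src, tgt, marker):
    """Return two new (source, target) TM segments split at the first
    marker, or an empty list if the marker is missing on either side."""
    if marker not in src or marker not in tgt:
        return []
    s_head, s_tail = src.split(marker, 1)
    t_head, t_tail = tgt.split(marker, 1)
    # Keep the marker with the first fragment; whether the marker is
    # retained is an assumption, not stated explicitly in the paper.
    return [(s_head + marker, t_head + marker),
            (s_tail.strip(), t_tail.strip())]

# Example 5 from the text, split on the ellipsis.
segments = split_on_marker(
    "Installing new services… Service XXXX for premium clients [2]",
    "Instalando nuevos servicios… Servicio XXXX para clientes premium [2]",
    "…")
```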

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content:

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces in both the source and the target segment.

2. Keep three fragments out of each sentence:

(a) A sentence removing those characters and the content between them.

(b) A fragment starting at the opening character and finishing on the closing one, including the content within. In this fragment, the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.
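The three fragments of step 2 can be sketched with a regular expression; handling only the first, non-nested bracketed span per sentence is a simplifying assumption.

```python
import re

def bracket_fragments(sentence, open_ch="(", close_ch=")"):
    """Return (a) the sentence without the brackets and their content,
    (b) the bracketed fragment with brackets, and (c) the bare content,
    or None if no bracketed span is present."""
    pattern = (re.escape(open_ch) + r"([^" + re.escape(close_ch) + r"]*)"
               + re.escape(close_ch))
    match = re.search(pattern, sentence)
    if match is None:
        return None
    without = (sentence[:match.start()] + sentence[match.end():]).strip()
    without = re.sub(r"\s{2,}", " ", without)  # collapse doubled spaces
    return without, match.group(0), match.group(1)

# Example 8 from the text (English side).
frags = bracket_fragments(
    "Creates an installation package for application installation "
    "(if it was not created earlier)")
```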

At this preliminary stage, we considered that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work, we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments:

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
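This procedure can be sketched as below; the `<1>…<1>` placeholder syntax follows Example 12, and the double-quote handling is a simplified assumption.

```python
import re

def quote_tag_segments(sentence):
    """Return the new segments: the inner text of each <n>...<n> pair
    and each quoted span, plus the sentence stripped of tags/quotes."""
    segments = [m.group(2)
                for m in re.finditer(r"<(\d+)>(.*?)<\1>", sentence)]
    segments += [m.group(1)
                 for m in re.finditer(r'"([^"]*)"', sentence)]
    stripped = re.sub(r"<\d+>", "", sentence).replace('"', "")
    segments.append(re.sub(r"\s{2,}", " ", stripped).strip())
    return segments

# Shortened version of Example 12 (English side).
segs = quote_tag_segments(
    'If you clear the <1>Inherit settings from parent policy<1> check box '
    'in the <2>Settings inheritance<2> section, the lock is lifted.')
```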

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool used for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of

our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match     Words   Segments
Repetitions   1064         80
100%             0          0
95-99%           4          2
85-94%           0          0
75-84%         285         14
50-74%        2523        187
No Match      2404        142
Total         6280        425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.

             Segments   Words EN   Words ES      Total
Project TM      16842     212472     244159     456631
Product TM      20923     274542     317797     592339
Client TM      256099    3427861    3951732    7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segmentsgenerated using each strategy and for each TM

5

                          Segments        EN        ES     Total
Project TM new segments       6776     56973     66297    123270
Product TM new segments       7760     71034     83125    154159
Client TM new segments       74041    662714    769705   1432419

Table 3: Size of the new TMs generated using our approach.

used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

                  Project TM   Product TM   Client TM
Ellipsis                   7            6          50
Colon                   1801         2094       17894
Parenthesis             1361         1621       29637
Square brackets           78           73         598
Curly braces               0            0           0
Quotations              1085         1146        8797
Tags                    2523         2892       17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. memoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pre-translated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pre-translated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM


              TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
Not started   234   204   150   202   143   143    38    27    26     26     26
Pre-trans     155   111   178   108   179   183   139   199   205    201    205
Fragments      36   110    97   115   103    99   248   199   194    198    194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

            TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
100%          0     5     5     5     5     5     5     5     5      5      5
95-99%        2     3     3     4     4     4     9     9     9      9      9
85-94%        0     1     1     1     1     1     2     2     2      2      2
75-84%       14     9    14     9    15    15    12    16    16     16     16
50-74%      187   130   198   136   202   202   180   223   226    226    226
No match    142   197   124   190   118   118   137    90    87     87     87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 142 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on the TM fuzzy match retrieval as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.
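To make the naive approach concrete, the sketch below splits a TM segment into candidate fragments at punctuation marks and inline tags. This is an illustrative stand-in under our own assumptions (the punctuation set and tag pattern are ours); the authors' actual script is not shown in the paper.

```python
import re

def generate_fragments(segment: str) -> list[str]:
    # Split a TM segment into candidate fragments at punctuation marks
    # and XML-like inline tags, in the spirit of the naive approach
    # described above. Hypothetical implementation, not the authors' code.
    parts = re.split(r"[.;:,]|</?\w+>", segment)
    return [p.strip() for p in parts if p.strip()]

print(generate_fragments("Press <b>Start</b>, then choose a language; click OK."))
```

Each non-empty part between two boundaries becomes one fragment, so both the tagged term and the surrounding clauses are extracted.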

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, Sept 2015

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net
eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

1 https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements of interest to the research community that translation memory systems should fulfill: accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010); (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases where one or both segments are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. Language Error and Partial Translations are pervasive errors.


It is relatively easy to find positive examples because the large majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

$CG = \frac{l_s - l_d}{\sqrt{3.4\,(l_s + l_d)}}$    (1)
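Equation 1 can be computed directly from the lengths of the two segments. A minimal sketch, assuming the lengths $l_s$ and $l_d$ are measured in characters (the paper does not state the unit explicitly):

```python
import math

def church_gale_score(source: str, target: str) -> float:
    # Modified Church-Gale length difference (equation 1); segment
    # length is assumed to be measured in characters.
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# Segments of similar length score near 0; a large length mismatch
# pushes the score far from 0 in either direction.
print(church_gale_score("How are you?", "Come stai?"))
```

The 3.4 constant normalizes the raw length difference by its expected variance, so scores from segments of very different sizes remain comparable.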

In figures 2 and 3 we plot the distribution of the relative frequency of Church Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right but close to a normal plot. In figure 2 we plot the Church Gale scores obtained for the bi-segments of the Matecat set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat set, the Church Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative set, the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

[Figure: histogram with normal curve; x axis: Church Gale Score (−4 to 4), y axis: probability (0.0 to 0.8)]

Figure 2: The distribution of Church Gale Scores in the Matecat Set

[Figure: histogram; x axis: Church Gale Score (−100 to 50), y axis: probability (0.0 to 0.4)]

Figure 3: The distribution of Church Gale Scores in the Collaborative Set


[Figure: histogram with normal curve; x axis: Church Gale Score (−100 to 50), y axis: probability (0.000 to 0.008)]

Figure 4: The normal curve added to the distribution of Church Gale Scores in the Collaborative Set

Church Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using the Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: It contains 1243 bi-segments and has 373 negative examples.

• Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
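The reported class balance can be checked directly from the set sizes:

```python
# Class balance of the two sets, from the counts reported above.
train_total, train_neg = 1243, 373
test_total, test_neg = 309, 87

print(f"training set negatives: {train_neg / train_total:.1%}")  # 30.0%
print(f"test set negatives:     {test_neg / test_total:.1%}")    # 28.2%
```

Both ratios round to roughly 30%, consistent with the stratified baselines used in section 5.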

[Figure: Q-Q plot; x axis: Theoretical Quantiles (−4 to 4), y axis: Sample Quantiles (−4 to 4)]

Figure 5: The Q-Q plot for the Matecat set

[Figure: Q-Q plot; x axis: Theoretical Quantiles (−4 to 4), y axis: Sample Quantiles (−100 to 50)]

Figure 6: The Q-Q plot for the Collaborative set


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test set are the following:

• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url. The value of the feature is 1 if the source or target segments contain a URL address; otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segments contain a tag; otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segments contain an email address; otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segments contain a number; otherwise it is 0.

• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least a capital letter; otherwise it is 0.

• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist completely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when whole words in capital letters exist.

• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/is not present in the source and the target segments).

• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.


• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference. The value of the feature is the ratio between the difference in the number of words containing at least a capital letter in the source and target segments and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
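As an illustration of how two of these features could be computed, here is a minimal sketch of punctuation similarity and lang dif. The helper names and the exact punctuation inventory are our own assumptions, not the authors' code:

```python
from collections import Counter
from math import sqrt

# Assumed punctuation inventory; the paper does not list one.
PUNCTUATION = set(".,;:!?()[]'\"")

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def punctuation_similarity(source: str, target: str) -> float:
    # Compare the punctuation-mark count vectors of the two segments.
    return cosine(Counter(c for c in source if c in PUNCTUATION),
                  Counter(c for c in target if c in PUNCTUATION))

def lang_dif(declared: tuple, detected: tuple) -> int:
    # Number of segments whose detected language disagrees with the
    # declared code: 0, 1 or 2, as in the example above.
    return sum(d != t for d, t in zip(declared, detected))

print(punctuation_similarity("Hello, world!", "Ciao, mondo!"))  # 1.0
print(lang_dif(("en", "it"), ("en", "fr")))                     # 1
```

The other vector-based features (tag, email, URL, number similarity) follow the same pattern: build a count vector per segment, then take the cosine of the two vectors.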

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails, or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.

The translation of the source English segment to Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.

• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes. If the conditional independence assumption that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md


give good results in classification problems where the decision boundary is irregular.
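The setup above can be sketched with scikit-learn using the same six classifier families. This is a sketch under stated assumptions: the feature matrix X and labels y below are synthetic placeholders, not the paper's bi-segment data, and all hyperparameters are library defaults rather than anything the paper specifies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 5))                    # one row per bi-segment, one column per feature
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # 1 = true translation, 0 = false (synthetic)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM (linear kernel)": LinearSVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Gaussian Naive Bayes": GaussianNB(),
}

for name, clf in classifiers.items():
    # Stratified 3-fold cross-validation, as in the paper's first evaluation.
    scores = cross_val_score(clf, X, y, cv=3, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```

With an integer `cv` and a classifier, `cross_val_score` uses stratified folds, which matches the three-fold stratified evaluation described in section 5.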

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in Table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in Table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives, and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, will be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
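The roughly 10% figure follows directly from the reported confusion counts:

```python
# The SVM confusion counts on the test set, as reported above.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn
assert total == 309

# Share of all test examples that are false negatives, i.e. the good
# bi-segments that automatic cleaning would wrongly delete.
print(f"{fn / total:.1%}")  # ≈ 10.4%
```

Scaled to a database with over a billion bi-segments, a 10% false-negative rate would discard an unacceptable amount of valid data, which motivates the precision-oriented alternatives discussed next.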

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the Matecat Set and more sparse for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the


performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments. For example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable, it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015

Abstract

We propose the integration of clause

splitting as a pre-processing step for match

retrieval in Translation Memory (TM)

systems to increase the number of relevant

sub-segment matches Through a series of

experiments we investigate the impact of

clause splitting in instances where the input

does not match an entire segment in the

TM but only a clause from a segment Our

results show that there is a statistically

significant increase in the number of

retrieved matches when both the input

segments and the segments in the TM are

first processed with a clause splitter

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches: if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence will be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that both contain a subject and a verb, hence a "complete thought"; clause matches are therefore more likely to be in context, and to actually be used by the translator, than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusions and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas, allowing sub-sentence matching. However, Reinke (2013) observes that for certain language pairs, such as English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions to a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
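The effect can be illustrated with a minimal word-level fuzzy matching sketch (standard Levenshtein distance normalised by the longer segment; commercial tools use their own undisclosed variants). The full sentence (b) scores only 25% against (a), while the shared clause on its own is an exact match:

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def fuzzy_match(a, b):
    """1 minus normalised edit distance; real TM scores differ in detail."""
    return 1.0 - word_edit_distance(a, b) / max(len(a), len(b))

# Sentence (a): twenty words; sentence (b) shares only the first five.
a = [f"w{i}" for i in range(1, 21)]                                   # w1 ... w20
b = [f"w{i}" for i in range(1, 6)] + [f"w{i}" for i in range(21, 36)] # w1..w5 w21..w35

print(fuzzy_match(a, b))        # 0.25: below any usable fuzzy threshold
print(fuzzy_match(a[:5], b[:5]))  # 1.0: the shared clause alone is an exact match
```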

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is applied to both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
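The pre-processing step can be sketched as follows. Here `split_clauses` stands in for the rule-based splitter of Puscasu (2004), which is not publicly available, so a toy conjunction-based stand-in is used purely for illustration:

```python
def preprocess_tm(segments, split_clauses):
    """Expand each segment into one TM entry per clause before import.

    `split_clauses` is a pluggable clause-splitting function; the actual
    study used a modified version of the Puscasu (2004) rule-based splitter.
    """
    expanded = []
    for seg in segments:
        clauses = split_clauses(seg)
        # A segment the splitter cannot divide is kept as is.
        expanded.extend(clauses if clauses else [seg])
    return expanded

def toy_splitter(seg):
    # Toy stand-in: split at ", and " (a real splitter uses POS tagging
    # and linguistic rules to locate finite clauses).
    parts = seg.split(", and ")
    return [parts[0]] + ["and " + p for p in parts[1:]]

tm = ["the ministry of defense once indicated that soldiers were missing, "
      "and there may still be some survivors"]
print(preprocess_tm(tm, toy_splitter))  # two clause-level TM entries
```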

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of them contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:
  Input: the ministry of defense once indicated that about 20,000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20,000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
  - the ministry of defense once indicated
  - that about 20,000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20,000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33%                 90.00%
Retrieved (minimum threshold)          38.00%                 92.22%

TRADOS
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00%                 88.89%
Retrieved (minimum threshold)          14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, we observed that although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as a segment with more than one clause not being split at all, or a segment being incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
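As a rough illustration of the test (the paper's statistics were computed with SPSS), the paired t statistic over per-segment scores can be computed directly. The score lists below are invented placeholders, not the study's data:

```python
import math

def paired_t(x, y):
    """Paired t statistic: mean of the pairwise differences over its standard error."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Invented per-segment match scores; 0 means the correct match was not retrieved.
# Multi-clause inputs would contribute the average of their clause scores as one case.
without_splitting = [0, 0, 42.0, 0, 55.0, 0, 0, 61.0, 0, 0]
with_splitting = [88.0, 91.5, 100.0, 76.0, 95.0, 83.0, 90.0, 100.0, 72.0, 85.5]

t = paired_t(with_splitting, without_splitting)
print(t > 4.781)  # exceeds the two-tailed critical value for p = 0.001 at df = 9
```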

SET A

WORDFAST
                    Mean (w/o clause splitting)   Mean (w/ clause splitting)   p-value
Default threshold            19.23                       85.30888891            0.000
Minimum threshold            29.28                       86.48888891            0.000

TRADOS
                    Mean (w/o clause splitting)   Mean (w/ clause splitting)   p-value
Default threshold            11.78                       83.87777781            0.000
Minimum threshold            11.78                       88.07111113            0.000

SET B

WORDFAST
                    Mean (w/o clause splitting)   Mean (w/ clause splitting)   p-value
Default threshold             2.23                       17.21111111            0.000
Minimum threshold             6.49                       30.62111112            0.000

TRADOS
                    Mean (w/o clause splitting)   Mean (w/ clause splitting)   p-value
Default threshold             2.71                       23.083                 0.000
Minimum threshold            16.83                       46.36777779            0.000

Table 5: Paired t-test on all experiments

WORDFAST
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67%                 17.84%
Retrieved (minimum threshold)          10.00%                 41.08%

TRADOS
                                 W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33%                 25.95%
Retrieved (minimum threshold)          36.00%                 70.67%

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner.

Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitl@auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance and do not allow the retrieval of matches that are merely similar. In this paper we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue growing for the foreseeable future. To support this increasing demand, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique can be very useful for a translator, because all previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent.

The fuzzy match level refers to all the necessary corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalising the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only have an effect on the improvement of the raw MT output, not on the improvement of fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no sufficiently close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have used different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human-translated hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information.
(2) Press Cancel to cancel your personal information.
(3) Premere Cancel per cancellare i propri dati personali.
(4) Press Cancel to cancel your booking information.
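Since commercial tools do not disclose their exact formulas, the sketch below uses a common word-level formulation that normalises the Levenshtein distance by the longer sentence. For sentences (1) and (2) it confirms the 4 edit operations (1 substitution, 3 deletions) but yields 60% rather than the 70% quoted above, since Koehn's (2010) normalisation differs:

```python
def edit_distance(a, b):
    """Levenshtein distance over token lists (deletions, insertions, substitutions)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

s1 = "press cancel to make the cancellation of your personal information".split()
s2 = "press cancel to cancel your personal information".split()

dist = edit_distance(s1, s2)              # 4: one substitution plus three deletions
score = 1 - dist / max(len(s1), len(s2))  # normalise by the longer sentence
print(dist, score)                        # 4 0.6
```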

In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases the predicate noun appears as a bare direct object, in others with a determiner (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent (sentence (3)) are included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.

41 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
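A NooJ entry is essentially a lemma followed by a category and a list of +-separated attributes, some carrying values. As a rough illustration (assuming the comma-separated entry syntax shown above; the attribute names are taken from Figure 1), such entries can be decoded as follows:

```python
def parse_nooj_entry(entry):
    # "artist,N+FLX=TABLE+Hum" -> (lemma, category, {attribute: value})
    lemma, rest = entry.split(",", 1)
    fields = rest.split("+")
    category = fields[0]  # e.g. N for nouns, V for verbs
    attrs = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        attrs[key] = value if sep else True  # valueless attributes act as flags
    return lemma, category, attrs

lemma, category, attrs = parse_nooj_entry("artist,N+FLX=TABLE+Hum")
```

Here `FLX=TABLE` names the inflectional paradigm and `Hum` the semantic class, as described in the text.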

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework.

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
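Because only the source side is rewritten, the whole expansion step reduces to appending, for every unit whose source yields a paraphrase, a copy of that unit with the new source and the untouched target. A minimal sketch (the `paraphrase` callable standing in for the NooJ pipeline is hypothetical):

```python
def expand_tm(units, paraphrase):
    """units: list of (source, target) pairs.
    paraphrase: function mapping a source string to a list of variants.
    Originals are kept; each distinct paraphrase is added as a new unit
    with the target side copied verbatim."""
    expanded = list(units)
    for src, tgt in units:
        for variant in paraphrase(src):
            if variant != src:
                expanded.append((variant, tgt))
    return expanded
```

With one paraphrasable unit out of two, the expanded TM has three units, mirroring how 587 recognized SVCs turn a 1,025-unit TM into a 1,612-unit one in Section 5.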

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 originals + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that they at least meet the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹

The results of both analyses are given in Table 1. Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); nevertheless, we achieve substantial gains in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95–99%                   23      32
85–94%                   18      29
75–84%                   51      38
50–74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no machine errors will appear and less post-editing is required for matches below 100%.

Alongside the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.

Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION You must make the istallation of the version 6 of the software
TMsl:    CAUTION You must install the version 6 of the software
TMtg:    ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences don't share the same lexical items.

The strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs and support for other source languages as well. Last but not least, applying the paraphrase framework to the target sentences may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The benefits of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The resulting extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation: 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping combination of subsegments, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be further increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
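A much-simplified sketch of this joining step, representing subsegments as half-open (start, end) token spans in the input segment (the bookkeeping with lists I and T is omitted; only the longest-first preference and the overlap/neighbour join conditions are kept):

```python
def try_join(a, b, mode):
    # a, b: half-open (start, end) token spans; returns the joined span or None
    a, b = sorted((a, b))
    if mode == "overlap":        # JOINO: spans share at least one token
        if a[0] < b[0] < a[1] < b[1]:
            return (a[0], b[1])
    elif mode == "neighbour":    # JOINN: spans are directly adjacent
        if b[0] == a[1]:
            return (a[0], b[1])
    return None

def join_all(spans, mode):
    # repeatedly try to join spans, longest first, until nothing new appears
    spans = set(spans)
    grown = True
    while grown:
        grown = False
        ordered = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
        for a in ordered:
            for b in ordered:
                joined = try_join(a, b, mode)
                if joined and joined not in spans:
                    spans.add(joined)
                    grown = True
    return spans
```

For example, spans (0, 3) and (2, 5) join to (0, 5) in overlap mode, while (2, 5) and (5, 7) join to (2, 7) in neighbour mode.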

² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system's score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the benefit of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT:
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Table 6: Error example, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3

Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1

Saarland University, Germany2

German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation tools (CAT) are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool, called CATaLog2, which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com

2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that there were many studies on MT post-editing published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.

3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match with the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
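The weighting idea above can be sketched as a standard dynamic-programming edit distance with a reduced deletion cost. This is an illustrative reconstruction, not the tool's actual code: the concrete weights (1.0 for insertion/substitution, 0.1 for deletion) are invented for the example, and shift operations are omitted for brevity.

```python
def weighted_edit_cost(test, tm, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Weighted edit cost of converting TM tokens into test tokens.
    Deletions are deliberately cheap, mirroring Section 3.1."""
    n, m = len(test), len(tm)
    # dp[i][j] = cost of turning tm[:j] into test[:i]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + w_ins          # insert missing test words
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + w_del          # delete surplus TM words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if test[i - 1] == tm[j - 1] else w_sub
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # match / substitute
                           dp[i - 1][j] + w_ins,     # insert
                           dp[i][j - 1] + w_del)     # delete
    return dp[n][m]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()

# With cheap deletions, TM segment 1 (6 deletions, cost 0.6) outranks
# TM segment 2 (3 insertions, cost 3.0).
ranked = sorted([tm1, tm2], key=lambda s: weighted_edit_cost(test, s))
```

With equal weights the ranking would flip, which is exactly the behavior the example in the text is meant to avoid.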

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. Green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green, and the unmatched fragments in red.
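The propagation of match information through the two alignment sets can be sketched as follows. This is a hypothetical illustration, not the tool's published code: a TM source token counts as green when TER marks it as a match (`C`), and a target token is colored green only if every source token aligned to it is green.

```python
def color_target(ter_ops, src_to_tgt, n_target):
    """ter_ops: one TER op per TM-source token ('C' = match).
    src_to_tgt: source-token index -> set of target-token indices
    (GIZA++-style word alignment, 0-indexed).
    Returns a 'green'/'red' label per target token."""
    green_src = {i for i, op in enumerate(ter_ops) if op == "C"}
    colors = []
    for t in range(n_target):
        aligned = [s for s, ts in src_to_tgt.items() if t in ts]
        # green only if aligned to at least one source token and
        # all of those source tokens are TER matches
        ok = bool(aligned) and all(s in green_src for s in aligned)
        colors.append("green" if ok else "red")
    return colors

# TM source: "we want to have a table near the window ."
ter_ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
# 0-indexed version of the GIZA++ alignment shown above
src_to_tgt = {0: {0}, 1: {5}, 4: {3}, 5: {4}, 6: {2}, 8: {1}, 9: {6}}
colors = color_target(ter_ops, src_to_tgt, 7)
# target tokens aligned to "want" (deleted) and "near" (substituted)
# come out red; the rest are green
```

In the real tool the red/green labels would drive the highlighting in the interface rather than being returned as strings.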

Color coding the TM source sentences makes explicit which portions of matching TM source sentences match with the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me

39

5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss apni amake vul khuchro diyechen ami ashi dollar diyechi )

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss amar dharona apni vul nombore phon korechen )

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amake taka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )

For the input sentence shown above, the TM system shows the above mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form; only if a word appears in the same form in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In an ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
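A minimal sketch of this posting-list filter, under the simplifying assumptions of a whitespace tokenizer and a toy stop-word list (the paper specifies only lowercasing, stop-word removal and no stemming, not the exact preprocessing):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "is"}   # illustrative only

def build_index(tm_sources):
    """Map each lowercased, unstemmed content word to the ids of the
    TM sentences containing it (the posting lists)."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(index, input_sentence):
    """TM sentences sharing at least one vocabulary word with the
    input; only these reach the expensive TER comparison."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me"]
index = build_index(tm)   # built offline, once, on the TM database
cands = candidates(index, "we would like a table by the window")
# only sentence 0 shares content words ("we", "table", "window")
```

With or without the filter the ranked TM output should be identical, as the text notes; only the number of TER computations changes.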

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWE) and named entities (NE) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations. We believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9

Chatzitheodoroou, Konstantinos, 24

Horák, Aleš, 31

Medveď, Marek, 31
Mitkov, Ruslan, 17

Naskar, Sudip Kumar, 36
Nayek, Tapas, 36

Pal, Santanu, 36
Parra Escartín, Carla, 1

Raisa Timonera, Katerina, 17

van Genabith, Josef, 36
Vela, Mihaela, 36

Zampieri, Marcos, 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

for fragments of the source sentence in other TM segments and internally computes their alignment probabilities. It then inserts the translations into the source segment and suggests this new, sometimes partially translated sentence as a match. Alternatively, only the translated fragments will be inserted in the target segment one after another, without the source sentence words that could not be retrieved.

Figure 1: memoQ's fragment assembly functionality

One limitation of this functionality is that the fragment translations follow the order in which they appear in the source language. Thus, while it may be very useful for pairs of languages which follow a similar structure, it may be problematic for pairs of languages which require reordering. memoQ uses the frequencies with which the fragments appear to select one translation or another for each particular segment. As a consequence, in some cases the translation selection is wrong, thus yielding wrong translation suggestions.

Examples 1 and 2 show two similar sentences in our TM:

(1) EN: The following message then appears: Click accept to run the program.
ES: Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.

(2) EN: The window will show the following: The application will close.
ES: La ventana mostrará lo siguiente: Se cerrará la aplicación.

Now imagine we need to translate the sentence in 3:

(3) EN: The following message then appears: The application will close.

Taking the part of 1 before the colon and the part of 2 after the colon, we would be able to produce the right translation, as shown in 4:

(4) ES: Aparece el siguiente mensaje: Se cerrará la aplicación.

In technical texts it is often the case that situations like the one just described happen more than once. Thus, it is not surprising that the translators liked the idea and thought it would be nice to find a way of automatically retrieving their translations without having to do concordance searches in the TM. Moreover, remembering that a particular fragment of a segment has been translated in the past is not always possible, as translators may have forgotten it, or different translators may have been involved in the project, thus not seeing fragments of a segment that other translators have already translated.

As this idea seemed to have great potential to increase the number of TM and fuzzy matches, we decided to implement it and test whether it actually worked. The next Section (4) explains how we proceeded.

4 The new segment generator

As explained in Section 3, we decided to test one idea originating from our internal survey: to generate new TM segments from fragments of already existing segments. We called our new tool the new segment generator.

The first step was to assess the type of texts that are translated in our company and identify the segment fragments that could be easily extracted. Upon analysis of several sample texts, we identified 7 different types of fragments we could work with:

1. Ellipsis
2. Colons
3. Parenthesis
4. Square brackets
5. Curly braces
6. Quotations
7. Text surrounded by tags

In the following subsections (4.1–4.3) we describe how each of these types of fragments was treated.


4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (...) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in 5 and 6 could appear:

(5) EN: Installing new services... Service XXXX for premium clients [2]
ES: Instalando nuevos servicios... Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which only the fragment before or after the colon appears.

(7) Installing new services...

In these cases we proceeded as follows:

1. Check that there is an ellipsis / a colon in both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.
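The three steps above might be sketched as follows, under the assumption that segments are plain strings and that the first ellipsis or colon found marks the split point; the authors do not publish their implementation, so the function and its behavior at the edges are illustrative.

```python
import re

# first "..." (or "…") or ":" in the segment is the split point
SPLIT_RE = re.compile(r"\.\.\.|…|:")

def split_segment(source, target):
    """Return new (source, target) TM segment pairs, or [] when the
    delimiter is not present in both sides (step 1)."""
    s_match, t_match = SPLIT_RE.search(source), SPLIT_RE.search(target)
    if not (s_match and t_match):
        return []
    # step 2: split each side into the part up to (and including)
    # the delimiter and the part after it
    s_parts = [source[:s_match.end()].strip(), source[s_match.end():].strip()]
    t_parts = [target[:t_match.end()].strip(), target[t_match.end():].strip()]
    # step 3: one new TM segment per aligned non-empty fragment
    return [p for p in zip(s_parts, t_parts) if p[0] and p[1]]

pairs = split_segment(
    "The following message then appears: Click accept to run the program.",
    "Aparece el siguiente mensaje: Haga click en aceptar para ejecutar el programa.")
# -> two new segments: the parts before and after the colon
```

Applied to Examples 1 and 2 above, this is exactly what makes the recombined translation in Example 4 retrievable.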

4.2 Parenthesis, square brackets and curly braces

Sometimes a sentence includes a fragment between parenthesis, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parenthesis. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content.

(8) EN: Creates an installation package for application installation (if it was not created earlier)
ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso

The strategy to create new segments was the following:

1 Check that there is content between parenthe-sis square brackets curly braces on both thesource and the target segment

2 Keep three fragments out of each sentence

(a) A sentence removing those charactersand the content between them

(b) A fragment starting at the opening character and finishing on the closing one, including the content within. In this fragment the parentheses, square brackets or curly braces are maintained.

(c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces) but without them.

3. Create a new TM segment for each fragment.
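The three fragments of step 2 can be sketched with a simple regular expression (a simplification of our own that handles one bracketed span per sentence and does not enforce that the opening and closing characters are of the same type):

```python
import re

# Matches a span opened by (, [ or { and closed by ), ] or }.
BRACKETED = re.compile(r"[(\[{]([^()\[\]{}]*)[)\]}]")

def bracket_fragments(sentence):
    """Return the three fragments described in step 2, or None if the
    sentence contains no bracketed span."""
    match = BRACKETED.search(sentence)
    if match is None:
        return None
    return {
        "without": (sentence[:match.start()] + sentence[match.end():]).strip(),  # (a)
        "with_brackets": match.group(0),                                         # (b)
        "content_only": match.group(1),                                          # (c)
    }

frags = bracket_fragments(
    "Creates an installation package (if it was not created earlier)")
```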

At this preliminary stage we considered that, when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
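For paired tags the same idea can be sketched as follows (we assume numeric paired tags of the form <n>…<n>, as in Example 12; the helper is ours, not the paper's implementation):

```python
import re

# A paired tag such as <1>text<1>; \1 backreferences the tag number.
TAG = re.compile(r"<(\d+)>(.*?)<\1>")

def tag_segments(sentence):
    """Return the sentence with tags stripped, plus each tagged span
    as a new segment of its own."""
    inner = [m.group(2) for m in TAG.finditer(sentence)]
    without_tags = TAG.sub(r"\2", sentence)
    return [without_tags] + inner

segs = tag_segments("Clear the <1>Inherit settings<1> check box in <2>General<2>.")
```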

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool used for our testing because it is one of the common CAT tools used by our translators and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match       Words   Segments
Repetitions     1064         80
100%               0          0
95-99%             4          2
85-94%             0          0
75-84%           285         14
50-74%          2523        187
No Match        2404        142
Total           6280        425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client and thus was the biggest one. Table 2 summarises the size of the three TMs.

            Segments   Words EN   Words ES   Words total
Project TM     16842     212472     244159        456631
Product TM     20923     274542     317797        592339
Client TM     256099    3427861    3951732       7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

                           Segments   Words EN   Words ES   Words total
Project TM new segments        6776      56973      66297        123270
Product TM new segments        7760      71034      83125        154159
Client TM new segments        74041     662714     769705       1432419

Table 3: Size of the new TMs generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

              Proj TM   Prod TM   Client TM
Ellipsis            7         6          50
Colon            1801      2094       17894
Parenthesis      1361      1621       29637
Square bra.        78        73         598
Curly braces        0         0           0
Quotations       1085      1146        8797
Tags             2523      2892       17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segments TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9) or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

              TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
Not started   234   204   150   202   143   143    38    27    26     26     26
Pre-trans     155   111   178   108   179   183   139   199   205    201    205
Fragments      36   110    97   115   103    99   248   199   194    198    194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

              TM1   TM2   TM3   TM4   TM5   TM6   TM7   TM8   TM9   TM10   TM11
100%            0     5     5     5     5     5     5     5     5      5      5
95-99%          2     3     3     4     4     4     9     9     9      9      9
85-94%          0     1     1     1     1     1     2     2     2      2      2
75-84%         14     9    14     9    15    15    12    16    16     16     16
50-74%        187   130   198   136   202   202   180   223   226    226    226
No match      142   197   124   190   118   118   137    90    87     87     87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches in the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on the TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments, and to pre-translate files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net
eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms in finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory[1] (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

[1] https://mymemory.translated.net

[Figure 1 chart; its embedded title reads "Sources breakdown: …% of the records come from uploaded TMs"]

Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements of translation memory systems that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench[2]. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, whether an opened tag is closed, whether a word is repeated, or whether a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in that approach are different from ours.

[2] http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.


It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length[3]. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (ls − ld) / √(3.4 (ls + ld))    (1)
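In code, equation 1 is straightforward (a sketch; we use character counts as segment lengths, which is an assumption since the paper does not specify the length unit):

```python
import math

def church_gale(len_source, len_dest):
    """Modified Church-Gale length difference of equation 1:
    CG = (ls - ld) / sqrt(3.4 * (ls + ld))."""
    return (len_source - len_dest) / math.sqrt(3.4 * (len_source + len_dest))

# True translations have roughly equal lengths, so |CG| stays small;
# a partial translation is noticeably unbalanced.
balanced = church_gale(len("How are you"), len("Come stai"))
unbalanced = church_gale(len("Early 1980s Muirfield CC"), len("Primi anni 1980"))
```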

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat[4]. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat set, adding a normal curve over the histogram to better visualize the difference from the gaussian curve. For the Matecat set the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative set the

[3] This simple idea is implemented in many sentence aligners.

[4] Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

[Figure 2: histogram, x axis: Church Gale Score (about −4 to 4), y axis: probability, with a normal curve overlaid]

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

[Figure 3: histogram, x axis: Church Gale Score (about −100 to 50), y axis: probability]

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


[Figure 4: the same histogram with the y axis truncated and a normal curve overlaid]

Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using the Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: it contains 1243 bi-segments, of which 373 are negative examples.

• Test Set: it contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.
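The biased sampling described above can be sketched as follows; the threshold and bias values are illustrative assumptions of ours, not figures from the paper:

```python
import random

def sample_candidates(bisegments, cg_scores, n, threshold=2.0, bias=0.9):
    """Draw n candidate bi-segments, preferring (with probability `bias`)
    those whose Church-Gale score lies away from the center of the
    distribution, where errors are more likely."""
    extreme = [b for b, s in zip(bisegments, cg_scores) if abs(s) > threshold]
    central = [b for b, s in zip(bisegments, cg_scores) if abs(s) <= threshold]
    drawn = []
    for _ in range(n):
        pool = extreme if extreme and (not central or random.random() < bias) else central
        drawn.append(random.choice(pool))
    return drawn

random.seed(0)  # deterministic for the example
candidates = sample_candidates(["a", "b", "c", "d"], [0.1, 5.0, -6.0, 0.2], n=10)
```

The drawn candidates would then be manually validated, as done for the training and test sets above.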

[Figure 5: Q-Q plot, x axis: theoretical quantiles, y axis: sample quantiles]

Figure 5: The Q-Q plot for the Matecat set.

[Figure 6: Q-Q plot, x axis: theoretical quantiles, y axis: sample quantiles]

Figure 6: The Q-Q plot for the Collaborative set.


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url: The value of the feature is 1 if the source or target segment contains a URL address; otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segment contains a tag; otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segment contains an email address; otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segment contains a number; otherwise it is 0.

• has capital letters: The value of the feature is 1 if the source or target segment contains words that have at least one capital letter; otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists it is present/is not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity: The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for the feature tag similarity.


• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
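As an illustration, the punctuation similarity feature can be sketched like this (our reading of the feature; the exact punctuation inventory is an assumption). The same cosine helper would serve the tag, email, URL and number similarity features, with the counts taken over tags, addresses or numbers instead of punctuation marks:

```python
import math
from collections import Counter

# Assumed inventory of punctuation characters.
PUNCTUATION = ".,;:!?()[]{}\"'"

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in set(u) | set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def punctuation_similarity(source, target):
    u = Counter(ch for ch in source if ch in PUNCTUATION)
    v = Counter(ch for ch in target if ch in PUNCTUATION)
    return cosine(u, v)

sim = punctuation_similarity(
    "Could not open key [2]. System error [3].",
    "Error al abrir la clave [2]. Error en el sistema [3].")
```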

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu[5].

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. Random forests have a lower probability of overfitting the data than decision trees.

• Logistic Regression: Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support vector machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

⁵ https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
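The pre-processing step described at the start of this section, inverting the source and target segments when the declared language codes appear swapped, can be sketched as below. The dict fields and the `detect` callable are our assumptions, not MyMemory's actual schema:

```python
def maybe_swap(bi_segment, detect):
    """Invert source and target texts when the detected languages are
    the declared pair swapped (i.e. the contributor mistook the codes
    but the pair may still be a true translation)."""
    src_lang = detect(bi_segment["source_text"])
    tgt_lang = detect(bi_segment["target_text"])
    if (src_lang == bi_segment["target_lang"]
            and tgt_lang == bi_segment["source_lang"]
            and src_lang != tgt_lang):
        bi_segment["source_text"], bi_segment["target_text"] = (
            bi_segment["target_text"], bi_segment["source_text"])
    return bi_segment

# A contributor declared en -> it but pasted the segments the other way
# round; a toy lookup-table detector stands in for the real one.
toy_detect = {"hello world": "en", "ciao mondo": "it"}.get
fixed = maybe_swap(
    {"source_lang": "en", "target_lang": "it",
     "source_text": "ciao mondo", "target_text": "hello world"},
    toy_detect,
)
```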

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in Table 1.

Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
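The three-fold stratified evaluation against the two baselines maps directly onto scikit-learn primitives; a sketch, where the feature matrix is a synthetic stand-in for the paper's real features:

```python
# Three-fold stratified cross-validation of one classifier against the
# two dummy baselines. X/y are synthetic; the real features are the
# normalised bi-segment features described in section 4.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier

rng = np.random.RandomState(0)
X = rng.rand(120, 5)                   # 120 bi-segments, 5 features in [0, 1]
y = (X[:, 0] > 0.4).astype(int)        # toy "true translation" label

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
models = {
    "SVM (linear)": SVC(kernel="linear"),
    "Baseline Uniform": DummyClassifier(strategy="uniform", random_state=0),
    "Baseline Stratified": DummyClassifier(strategy="stratified", random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=cv, scoring="f1").mean()
          for name, m in models.items()}
```

On this toy data, as in Table 1, the real classifier beats both baselines by a wide margin.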

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in Table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the MateCat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, will be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased even if the recall drops. We make some suggestions in the next section.
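The figures above can be checked directly from the reported confusion counts; the derived precision and recall land close to the SVM scores in Table 2 (small differences come down to rounding):

```python
# Arithmetic on the SVM confusion counts reported above
# (309 test examples: 175 TP, 42 FP, 32 FN, 60 TN).
tp, fp, fn, tn = 175, 42, 32, 60
assert tp + fp + fn + tn == 309

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Share of all examples lost as false negatives: roughly 10%, i.e. the
# good bi-segments that automatic cleaning would throw away.
false_negative_share = fn / 309
```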

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and sparser for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
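A sketch of this second solution: act only on bi-segments the classifier is very sure about, leaving the rest alone. The 0.9 cut-off and the label names are our assumptions, mirroring the >90% precision target mentioned above:

```python
def confident_labels(probabilities, threshold=0.9):
    """Map P(true translation) scores to 'keep', 'discard' or 'unsure'.
    Only high-confidence bi-segments are acted upon, trading recall
    for precision when cleaning the database."""
    labels = []
    for p in probabilities:
        if p >= threshold:
            labels.append("keep")        # confidently a true translation
        elif p <= 1 - threshold:
            labels.append("discard")     # confidently a false bi-segment
        else:
            labels.append("unsure")      # leave in place / manual review
    return labels
```

For example, `confident_labels([0.95, 0.05, 0.5])` returns `["keep", "discard", "unsure"]`.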

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bulletin of the Medical Library Association, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM, but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35
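The effect can be reproduced with a character-level similarity ratio in the spirit of Levenshtein-based fuzzy matching; `difflib`'s ratio here is a stand-in for the proprietary scores of commercial TM tools, and the word forms are placeholders for w1 ... w35:

```python
from difflib import SequenceMatcher

# Sentence (a): twenty five-character words; (b) shares only the first
# five (one clause), i.e. 25% of (a)'s characters.
clause = "alpha bravo crane delta eagle"              # w1 ... w5
sent_a = clause + " " + " ".join(["xxxxx"] * 15)      # w6 ... w20
sent_b = clause + " " + " ".join(["YYYYY"] * 15)      # w21 ... w35

full_score = SequenceMatcher(None, sent_a, sent_b).ratio()

# With clause splitting, the shared clause is a segment of its own and
# matches the input clause exactly.
clause_score = SequenceMatcher(None, clause, clause).ratio()
```

The whole-sentence score stays well below typical 70-75% fuzzy thresholds, while the clause-level comparison is an exact match.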

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
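Purely as an illustration of the idea of splitting at clause boundaries, a toy splitter might cut before a handful of clause-introducing words. This is far simpler than the actual rule-based module of Puscasu (2004), which relies on POS tagging and finite-verb identification; the marker list below is invented:

```python
import re

# Toy illustration only: split before a few conjunctions/relativizers
# that often introduce a new finite clause. The real splitter uses far
# richer linguistic rules.
BOUNDARY = re.compile(r"\b(?=(?:and|but|that|because|which)\b)")

def toy_clause_split(sentence):
    """Return the candidate clauses of `sentence`, in order."""
    return [p.strip() for p in BOUNDARY.split(sentence) if p.strip()]

parts = toy_clause_split(
    "the ministry of defense once indicated "
    "that about 20000 soldiers were missing in the korean war")
```

On this example (taken from Table 1 below), the toy rule happens to reproduce the two clauses used in the experiments.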

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.

The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    23.33%                 90.00%
Retrieved (Minimum threshold)    38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    14.00%                 88.89%
Retrieved (Minimum threshold)    14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    2.67%                  17.84%
Retrieved (Minimum threshold)    10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    3.33%                  25.95%
Retrieved (Minimum threshold)    36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and count this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.

SET A
WORDFAST               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold      19.23                       85.31                      0.000
Minimum threshold      29.28                       86.49                      0.000
TRADOS
Default threshold      11.78                       83.88                      0.000
Minimum threshold      11.78                       88.07                      0.000

SET B
WORDFAST               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold      2.23                        17.21                      0.000
Minimum threshold      6.49                        30.62                      0.000
TRADOS
Default threshold      2.71                        23.08                      0.000
Minimum threshold      16.83                       46.37                      0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.
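The paired t-test on per-segment match scores (with 0 for segments whose correct match was not retrieved) can be sketched in pure Python; the paper used SPSS, and the three score pairs below are invented for illustration, not taken from the experiments:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_without, scores_with):
    """Paired t statistic over per-segment match scores; segments whose
    correct match was not retrieved contribute a score of 0."""
    diffs = [w - wo for wo, w in zip(scores_without, scores_with)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Invented match scores for three input segments (0 = not retrieved):
t_stat = paired_t([0, 30, 45], [80, 85, 100])
```

A large positive t statistic, as here, corresponds to the consistently higher with-clause-splitting scores seen in Table 5.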

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzikitl@auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand of professional translation services

has been increased over the last few years and it is

forecast to continue to grow for the foreseeable

future Researchers to support this increasing

have been proposed and implemented new com-

puter-based tools and methodologies that assist

the translation process The idea behind the com-

puter-assisted software is that a translator should

benefit as much as possible from reusing transla-

tions that have been human translated in the past

The first thoughts can be traced back to the 1960s

when the European Coal and Steel Community

proposed the development of a memory system

that retrieves terms and their equivalent contexts

from earlier translations stored in its memory by

the sentences whose lexical items are close to the

lexical items of the sentence to be translated (Kay

1980)

Since then TM systems have become indispen-

sable tools for professional translators who work

mostly with content that is highly repetitive such

as technical documentation games and software

localization etc TM systems typically exploit not

only exact matches between segments from the

document to be translated with segments from

previous translations but also approximate

matches (often referred to as fuzzy matches)

(Biccedilici and Dymetman 2008) As concept this

technique might be more useful for a translator be-

cause all the previous human translations become

a starting point of the new translation Further-

more the whole process is speeded up and the

translation quality is more consistent and effi-

cient

The fuzzy match level refers to all the neces-

sary corrections made by a professional transla-

tion in order to make the retrieved suggestion to

meet all the standards of the translation process

This effort is typically less than translating the

sentence from scratch To help the translator

CAT tools suggest or highlight all the differences

or similarities between the sentences penaltizing

as well the match percent in some cases However

given the perplexity of a natural language for

similar but not identical sentences the fuzzy

matching level sometimes is too low and therefore

the translator is confused

This paper presents a framework that improves the fuzzy match of similar, but not identical, sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the paraphrases share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2014). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations for a given sentence as much as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2014). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or on in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments during matching, based on the paraphrases in a database. Their approach relies on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based fuzzy match between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based fuzzy match is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use it and therefore not gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
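The word-based fuzzy match computation described above can be sketched as follows. This is a minimal illustration of Koehn's (2010) formulation (one minus the edit distance divided by the longer segment's length); note that naive whitespace tokenization yields 10 words for sentence (1), so the score below differs slightly from the paper's 13-word count.

```python
def levenshtein(a, b):
    # Word-level edit distance (deletions, insertions, substitutions),
    # computed with a single-row dynamic programming table.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def fuzzy_match(s1, s2):
    # Koehn's (2010) fuzzy match: 1 - edit_distance / max segment length.
    t1, t2 = s1.split(), s2.split()
    return 1.0 - levenshtein(t1, t2) / max(len(t1), len(t2))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
# 1 substitution + 3 deletions over 10 whitespace tokens -> 60%
print(round(fuzzy_match(s1, s2) * 100))
```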

In this case, according to methodologies proposed by researchers in this field, the sentence will be sent for machine translation given the low fuzzy match score, and should then be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb, maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear with a bare direct object (e.g. to make attention) or with a determined direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.

The last pipeline contains the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
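The three pipelines described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical paraphrase_source function that stands in for NooJ's SVC local grammar (the hand-written regex rules below are illustrations, not the actual NooJ resources), and skipping the Berkeley tokenization/de-tokenization steps:

```python
import re

# Hypothetical stand-in for NooJ's SVC local grammar: a few hand-written
# patterns mapping "support verb + predicate noun" to a full verb.
SVC_RULES = [
    (re.compile(r"\bmake the cancellation of\b"), "cancel"),
    (re.compile(r"\bmake the installation of\b"), "install"),
    (re.compile(r"\bmake a reservation for\b"), "reserve"),
]

def paraphrase_source(segment):
    """Return an SVC-free paraphrase of the segment, or None if no rule fires."""
    out = segment
    for pattern, verb in SVC_RULES:
        out = pattern.sub(verb, out)
    return out if out != segment else None

def expand_tm(tm):
    """tm: list of (source, target) units. The target side is kept 'as is';
    only new units with paraphrased sources are appended."""
    expanded = list(tm)
    for src, tgt in tm:
        para = paraphrase_source(src)
        if para is not None:
            expanded.append((para, tgt))
    return expanded

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
print(expand_tm(tm)[-1][0])
# Press Cancel to cancel your personal information
```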

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in the fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually, in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations taken from a larger TM, based on the degree of fuzzy match, so that the minimum threshold of 50% is met. To create the analysis report logs, we used Trados Studio 2014¹.

The results of both analyses are given in Table 1. Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%–74%). However, we achieve very high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%–99%, 6% in category 85%–94%, 28% in category 75%–84% and, finally, 27% in category 0%–74% (No match + 50%–74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95%–99%                  23      32
85%–94%                  18      29
75%–84%                  51      38
50%–74%                  32      18
No match                 62      35
Total                   200     200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, a human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval and, secondly, to assess the translation quality of those segments for which the fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.

Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon, along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens either because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the installation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the installation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper, we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future, we will continue to explore the paraphrasing of other support verbs and support for other languages as well. Last but not least, a paraphrasing framework applied to the target sentences may improve the quality of the translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1. Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra, Coimbra, Portugal.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9, Papers from the GSAS/Dudley House Workshop on Light Verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop on Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Xerox Palo Alto Research Center, Palo Alto, CA, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of the fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. The coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

The evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities                  alignment
být větší   be greater  0.538 0.053 0.538 0.136       0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148       0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

within CAT systems (Planas and Furuse, 2000). In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. The subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of the subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation: e.g. 0-0 1-1 means that the first word, být, in the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
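The subsegment records shown in Figure 1 can be read mechanically. A minimal parser, assuming the four-field layout of the figure (source, target, four probabilities, alignment points; real Moses phrase tables typically add a counts field):

```python
def parse_phrase_table_line(line):
    # Fields are separated by "|||": src ||| tgt ||| probabilities ||| alignment
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    # Four scores: inverse phrase prob., inverse lexical weighting,
    # direct phrase prob., direct lexical weighting.
    probs = [float(p) for p in probs.split()]
    # Alignment points like "0-0 1-1" become (source_index, target_index) pairs.
    align = [tuple(int(i) for i in pt.split("-")) for pt in align.split()]
    return src, tgt, probs, align

line = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
print(parse_phrase_table_line(line))
```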

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
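As a rough illustration of the SUBSTITUTEO idea, the following sketch splices a subsegment into a segment over their shared tokens. It is a simplification: it handles only the tail-overlap case from Table 1 and omits the indices, translations, and scoring of the actual method.

```python
def substitute_overlap(seg, sub):
    # SUBSTITUTEO sketch over token lists: find the longest overlap
    # between the tail of 'sub' and a window of 'seg', then splice
    # 'sub' in so the overlapping tokens appear only once.
    for k in range(min(len(seg), len(sub)), 0, -1):
        for i in range(len(seg) - k + 1):
            if seg[i:i + k] == sub[-k:]:
                return seg[:max(i - (len(sub) - k), 0)] + sub + seg[i + k:]
    return None  # no overlap: SUBSTITUTEO does not apply

seg = "lze rozdělit do těchto kategorií".split()
sub = "následujících kategorií".split()
print(" ".join(substitute_overlap(seg, sub)))
# lze rozdělit do následujících kategorií
```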

During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If this succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration, it generates new (longer) subsegments and discards one processed subsegment.
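A simplified, order-insensitive version of the JOIN idea can be sketched as follows: subsegments are token-index spans in the segment, spans that overlap (JOINO) or neighbour each other (JOINN) are merged until nothing new appears, and the longest covered span is returned. This ignores the concatenation of translations and the language-model scoring of the actual method.

```python
def joinable(a, b):
    # Token-index spans overlap (JOINO) or are immediate neighbours (JOINN).
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1 + 1

def merge(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def join_all(spans):
    # Keep merging joinable spans until no new span appears,
    # then return the longest covered span.
    spans = set(spans)
    changed = True
    while changed:
        changed = False
        for a in list(spans):
            for b in list(spans):
                m = merge(a, b)
                if a != b and joinable(a, b) and m not in spans:
                    spans.add(m)
                    changed = True
    return max(spans, key=lambda s: s[1] - s[0])

# three subsegments found in a 9-token segment
print(join_all([(0, 3), (2, 5), (6, 8)]))  # (0, 8)
```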

² In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For a comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
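The banding used in this analysis can be illustrated with a small helper. This is only a sketch of the idea: the function name and band labels are assumptions, and MemoQ's real pre-translation analysis computes the fuzzy-match scores itself.

```python
def match_bands(scores):
    """Bin each segment's best fuzzy-match score (in %) into the match
    categories used in the coverage tables, returning the share of
    segments (in %) per band."""
    bands = {"100": 0, "95-99": 0, "85-94": 0, "75-84": 0,
             "50-74": 0, "no match": 0}
    for s in scores:
        if s == 100:
            bands["100"] += 1
        elif s >= 95:
            bands["95-99"] += 1
        elif s >= 85:
            bands["85-94"] += 1
        elif s >= 75:
            bands["75-84"] += 1
        elif s >= 50:
            bands["50-74"] += 1
        else:
            bands["no match"] += 1
    total = len(scores) or 1
    return {k: round(100 * v / total, 2) for k, v in bands.items()}
```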


Table 2: MemoQ analysis, TM coverage in %

Match     TM     SUB    OJ     NJ     all
100%      0.41   0.12   0.10   0.17   0.46
95–99%    0.84   0.91   0.64   0.90   1.37
85–94%    0.07   0.05   0.25   0.76   0.81
75–84%    0.80   0.91   1.71   3.78   4.40
50–74%    8.16   10.05  25.09  40.95  42.58
any       10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM

Match     SUB    OJ     NJ     all
100%      0.08   0.07   0.11   0.28
95–99%    0.75   0.44   0.49   0.66
85–94%    0.05   0.08   0.49   0.61
75–84%    0.46   0.96   3.67   3.85
50–74%    10.24  27.77  41.90  44.47
all       11.58  29.32  46.66  49.87

For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB    OJ     NJ     all
100%      0.15   0.13   0.29   0.57
95–99%    0.98   0.59   1.24   1.45
85–94%    0.09   0.22   1.34   1.37
75–84%    1.03   2.26   6.35   7.07
50–74%    12.15  34.84  49.82  51.62
all       14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system3. Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg     Oblast dat může mít libovolný tvar
reference      The data area may have an arbitrary shape
generated seg  Area data may have any shape

When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language, see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows

it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.

3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behaviors on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0],
["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0],
["window","window",C,0], [".",".",C,0]

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
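The ranking step described above can be sketched with a plain word-level edit distance. TER's shift operation is omitted here for brevity, so this is only an approximation of tercom's score; the function names are assumptions of the sketch.

```python
def edit_rate(hyp, ref):
    """Word-level edit distance (insert/delete/substitute; no shift)
    divided by the length of the first sentence. Lower means a closer
    TM match."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # match / substitute
    return d[m][n] / max(m, 1)

def rank_tm(test, tm_sources, k=5):
    """Return the k TM source segments most similar to the test segment."""
    return sorted(tm_sources,
                  key=lambda s: edit_rate(s.split(), test.split()))[:k]
```

On the example above, "we want to have a table near the window" needs one deletion and three substitutions (4 edits over 9 words), so it ranks ahead of unrelated TM segments.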

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
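The weighting argument above can be made concrete with a weighted edit distance in which deletions from the TM side are cheap. The weights below are illustrative placeholders, not the tuned values of the system.

```python
def weighted_edit_cost(tm_seg, test_seg, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Weighted edit distance from a TM segment to the test segment.
    Deleting TM words is cheap (they are already highlighted for removal),
    while insertions and substitutions carry full post-editing cost."""
    m, n = len(tm_seg), len(test_seg)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_del
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (0.0 if tm_seg[i - 1] == test_seg[j - 1] else w_sub)
            d[i][j] = min(d[i - 1][j] + w_del, d[i][j - 1] + w_ins, sub)
    return d[m][n]
```

With these weights, TM segment 1 costs 6 cheap deletions (0.6) while TM segment 2 costs 3 full-price insertions (3.0), so segment 1 is preferred, matching the reasoning above.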

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color-code the matched and unmatched parts in each reference translation: a green portion implies a matched fragment and a red portion implies a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

bull Bengali আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, both in the selected source sentences and in their corresponding translations.
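Projecting the TER match information onto the target side via the word alignment can be sketched as follows. The representation (source token index → set of target token indices) and the conservative red default for unaligned target tokens are assumptions of this sketch, not details stated in the paper.

```python
def color_target(matched_src_idx, src2tgt, tgt_len):
    """Mark a target token green iff every source token aligned to it
    matched the input (per the TER alignment); unaligned target tokens
    are marked red to be safe. `src2tgt` maps a source token index to
    the set of target token indices it aligns to."""
    tgt2src = {}
    for s, tgts in src2tgt.items():
        for t in tgts:
            tgt2src.setdefault(t, set()).add(s)
    colors = []
    for t in range(tgt_len):
        srcs = tgt2src.get(t, set())
        colors.append("green" if srcs and srcs <= matched_src_idx else "red")
    return colors
```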

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you rsquove got the wrong number

3 you are wrong

4 you pay me


5 you rsquore overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e. source TM segments). To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature exists to compare results with and without posting lists: in the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
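The posting-list filter can be sketched in a few lines. The stop-word list here is a placeholder, and the function names are assumptions of the sketch; only the design (lowercasing, no stemming, one posting list per surface form) follows the description above.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "to", "of"}  # illustrative placeholder list

def build_index(tm_sources):
    """Inverted index: lowercase surface form -> ids of TM segments
    containing it (no stemming, per the exact-form matching above)."""
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for tok in sent.lower().split():
            if tok not in STOPWORDS:
                index[tok].add(sid)
    return index

def candidates(index, test_sentence):
    """Only TM segments sharing at least one vocabulary word with the
    input are later scored with TER, shrinking the search space."""
    ids = set()
    for tok in test_sentence.lower().split():
        ids |= index.get(tok, set())
    return ids
```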

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira and Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover Nitin Madnani Bonnie Dorr andRichard Schwartz 2009 Fluency Adequacyor HTER Exploring Different Human Judg-ments with a Tunable MT Metric In Proceed-ings of the Fourth Workshop on Statistical Ma-chine Translation EACL 2009

Lucia Specia 2011 Exploiting objective annota-tions for measuring translation post-editing ef-fort In Proceedings of the 15th Conference ofthe European Association for Machine Transla-tion pages 73ndash80 Leuven Belgium

Masao Utiyama Graham Neubig Takashi Onishiand Eiichiro Sumita 2011 Searching Transla-tion Memories for Paraphrases pages 325ndash331

Marcos Zampieri and Mihaela Vela 2014 Quanti-fying the Influence of MT Output in the Trans-latorsrsquo Performance A Case Study in Techni-cal Translation In Proceedings of the Workshopon Humans and Computer-assisted Translation(HaCaT 2014)


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

4.1 Ellipsis and colons

One possible type of segment would be that in which an ellipsis (…) or a colon (:) is used in the middle of the segment. In software localisation or user guides, sentences such as the ones in Examples 5 and 6 could appear:

(5) EN: Installing new services… Service XXXX for premium clients [2]
    ES: Instalando nuevos servicios… Servicio XXXX para clientes premium [2]

(6) EN: You can use the line [abcdef] to describe any of the following characters: a, b, c, d, e, f
    ES: Puede utilizar la línea [abcdef] para describir cualquiera de los siguientes caracteres: a, b, c, d, e, f

If a different segment only including the text before the ellipsis appears, as in Example 7, the TM may not retrieve any fuzzy match. The same would occur with other sentences with colons in which the fragment before or after the colon appears.

(7) Installing new services

In these cases, we proceeded as follows:

1. Check that there is an ellipsis / a colon on both the source and the target segment.

2. Split the segment in two, the first part being the fragment of the segment up to the ellipsis/colon, and the second part the fragment of the segment after the ellipsis/colon.

3. Create a new TM segment for each fragment.

4.2 Parentheses, square brackets and curly braces

Sometimes a sentence includes a fragment between parentheses, square brackets or curly braces. The content within such characters may constitute a new segment on its own or appear in a different sentence. At the same time, it may also be the case that the same sentence appears in the text but without such parentheses. When sentences like the ones in Examples 8–10 appear, it may thus be desirable to store the translation of the fragment between the aforementioned characters and the sentence without such content.

(8) EN: Creates an installation package for application installation (if it was not created earlier)
    ES: Crea un paquete de instalación para la instalación de la aplicación (si no se creó antes)

(9) EN: <return code 1>=[<description>]
    ES: <código de retorno 1>=[<descripción>]

(10) EN: Could not open key [2]. System error [3]. Verify that you have sufficient access to that key, or contact your support personnel.
     ES: Error al abrir la clave [2]. Error en el sistema [3]. Compruebe que dispone de los derechos de acceso.

The strategy to create new segments was the following:

1. Check that there is content between parentheses, square brackets or curly braces on both the source and the target segment.

2. Keep three fragments out of each sentence:

   (a) A sentence removing those characters and the content between them.

   (b) A fragment starting at the opening character and finishing on the closing one, including the content within. In this fragment, the parentheses, square brackets or curly braces are maintained.

   (c) A fragment containing only the content within those characters (parentheses, square brackets or curly braces), but without them.

3. Create a new TM segment for each fragment.

At this preliminary stage, we considered that when a sentence has several clauses in parentheses, square brackets or curly braces, these appear in the same order in the target language. This was done because, for the type of texts used so far to test our application (software manuals) and the pair of languages used (English into Spanish), this seems to be the usual case. In future work we plan to further evaluate this issue and consider other ways of ensuring that the right translation is assigned to each clause.
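A minimal sketch of steps 2a–2c, assuming a single, non-nested bracketed span per sentence (names are illustrative, not from the paper's implementation); it would be run on the source and the target sentence separately, pairing the resulting fragments:

```python
import re

# One alternation branch per bracket type; the inner group captures the
# content without the brackets.
BRACKETS = re.compile(r"\(([^)]*)\)|\[([^\]]*)\]|\{([^}]*)\}")

def bracket_fragments(sentence):
    """Return the three fragments of steps 2a-2c, or None if no brackets."""
    m = BRACKETS.search(sentence)
    if m is None:
        return None
    inner = next(g for g in m.groups() if g is not None)
    # 2a: the sentence with the brackets and their content removed.
    without = " ".join((sentence[:m.start()] + sentence[m.end():]).split())
    return {
        "without": without,           # 2a
        "with_brackets": m.group(0),  # 2b: brackets maintained
        "inner": inner,               # 2c: content only, brackets dropped
    }

frags = bracket_fragments(
    "Creates an installation package for application installation "
    "(if it was not created earlier)")
```

Each of the three fragments would then become a new TM segment (step 3), paired with the corresponding fragment from the other language.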


4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings
     ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados

(12) EN: If you clear the <1>Inherit settings from parent policy<1> check box in the <2>Settings inheritance<2> section of the <3>General<3> section in the properties window of an inherited policy, the lock is lifted for that policy
     ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria<1> en la sección <2>Herencia de configuración<2> que aparece en la sección <3>General<3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva
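This handling can be sketched as follows, assuming paired <n>…<n> tags as in Example 12 and straight double quotes; the patterns and names are our assumptions, not the paper's code:

```python
import re

TAG = re.compile(r"<(\d+)>(.*?)<\1>")  # paired <n>...<n> tags (Example 12)
QUOTE = re.compile(r'"([^"]*)"')       # double-quoted spans

def segments_from_tags_and_quotes(sentence):
    """Return (sentence with markup removed, inner texts kept as segments)."""
    # Keep the text inside each tagged or quoted span as a new segment.
    inner = [m.group(2) for m in TAG.finditer(sentence)]
    inner += [m.group(1) for m in QUOTE.finditer(sentence)]
    # Keep the sentence itself, dropping tags/quotes but not their text.
    stripped = TAG.sub(r"\2", sentence)
    stripped = QUOTE.sub(r"\1", stripped)
    return stripped, inner
```

Applied to Example 12, the inner texts ("Inherit settings from parent policy", "Settings inheritance", "General") and the tag-free sentence each become new TM segments.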

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English, to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators, and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match       Words   Segments
Repetitions     1064         80
100%               0          0
95-99%             4          2
85-94%             0          0
75-84%           285         14
50-74%          2523        187
No Match        2404        142
Total           6280        425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client, and thus was the biggest one. Table 2 summarises the size of the three TMs.

             Segments   Words EN   Words ES   Words total
Project TM      16842     212472     244159        456631
Product TM      20923     274542     317797        592339
Client TM      256099    3427861    3951732       7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segments generated using each strategy and for each TM


                          Segments   Words EN   Words ES   Words total
New Project TM segments       6776      56973      66297        123270
New Product TM segments       7760      71034      83125        154159
New Client TM segments       74041     662714     769705       1432419

Table 3: Size of the new TMs generated using our approach.

used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

                  Proj. TM   Prod. TM   Client TM
Ellipsis                 7          6          50
Colon                 1801       2094       17894
Parenthesis           1361       1621       29637
Square brackets         78         73         598
Curly braces             0          0           0
Quotations            1085       1146        8797
Tags                  2523       2892       17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM, and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started, and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters, and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%           0    5    5    5    5    5    5    5    5     5     5
95-99%         2    3    3    4    4    4    9    9    9     9     9
85-94%         0    1    1    1    1    1    2    2    2     2     2
75-84%        14    9   14    9   15   15   12   16   16    16    16
50-74%       187  130  198  136  202  202  180  223  226   226   226
No match     142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 142 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this was proven true, a productivity increase may be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the fragments retrieved together and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether by using this approach translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the firstmethod is high and the errors are relatively few

1 https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations, and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain, and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan the further developments.


2 Related Work

The translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While for improving the accuracy of retrieved segments there is a fair amount of work (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if the opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that the translation memory bi-segments contain accidental errors (e.g. tags not closed), or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somehow similar to translation memory cleaning, as envisioned in Section 1, is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task, where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified in the four types:

1. Random Text. The Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield C.C." is translated in Italian partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.


It is relatively easy to find positive examples, because the large majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (l_s − l_d) / sqrt(3.4 (l_s + l_d))    (1)
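Equation 1 is straightforward to compute; a minimal sketch, assuming character lengths for l_s and l_d (the function name is ours):

```python
import math

# Modified Church-Gale length difference of equation (1).
# l_s and l_d are the lengths of the source and destination segments
# (character counts assumed); a score near 0 means similar lengths.
def church_gale(l_s, l_d):
    return (l_s - l_d) / math.sqrt(3.4 * (l_s + l_d))
```

For example, for a bi-segment like ("How are you", "Come stai"), church_gale(11, 9) = 2 / sqrt(68) ≈ 0.24, close to the center of the distribution; a spammed segment pairing a sentence with a long pasted text would land far in a tail.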

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval −4.18 … 4.26.

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval 0 … 0.01. For the Collaborative Set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set (x axis: Church-Gale score; y axis: probability).

Figure 3: The distribution of Church-Gale scores in the Collaborative Set (x axis: Church-Gale score; y axis: probability).


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval −131.5 … 160.15.

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using the Quantile-Quantile plot in figures 5 and 6.
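The Q-Q comparison behind figures 5 and 6 pairs the sorted sample scores with the normal quantiles Φ⁻¹((i − 0.5)/n); plotting these pairs gives the Q-Q plot. A stdlib-only sketch (the paper's plots were presumably made with a statistics package; the data here is a synthetic stand-in):

```python
from statistics import NormalDist
import random

def qq_pairs(scores):
    """Return (theoretical quantile, sample quantile) pairs for a Q-Q plot."""
    n = len(scores)
    sample_q = sorted(scores)
    # Phi^{-1}((i - 0.5) / n): the expected quantile of the i-th order
    # statistic under a standard normal distribution.
    theo_q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theo_q, sample_q))

random.seed(0)
qq_points = qq_pairs([random.gauss(0, 1) for _ in range(500)])
```

For near-normal data (the Matecat Set), the points lie close to the diagonal; heavy tails (the Collaborative Set) bend the ends of the curve away from it.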

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling to the bi-segments that have scores away from the center of the distribution. In this way we hope that we draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:

• Training Set: It contains 1243 bi-segments and has 373 negative examples.

• Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
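The biased draw from the Collaborative Set can be sketched as follows. The |score| weighting and the exponential-key trick for weighted sampling without replacement are our own illustration; the paper does not specify its exact sampling procedure.

```python
import random

def biased_sample(bisegments, cg_scores, k, seed=0):
    """Draw k bi-segments, favoring those with scores far from the center."""
    rng = random.Random(seed)
    # Exponential-key trick: key ~ Exp(1) / weight; the k items with the
    # smallest keys form a weighted sample without replacement, with
    # weight |CG score| (plus a tiny floor so zero scores stay eligible).
    keyed = sorted(zip(bisegments, cg_scores),
                   key=lambda bs: rng.expovariate(1.0) / (abs(bs[1]) + 1e-6))
    return [b for b, _ in keyed[:k]]

items = list(range(100))
scores = [0.01] * 50 + [50.0] * 50  # 50 central, 50 tail bi-segments
sample = biased_sample(items, scores, 10)
```

With weights this skewed, virtually all drawn items come from the tails, which is where the likely negative examples sit before manual validation.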


Figure 5: The Q-Q plot for the Matecat set.


Figure 6: The Q-Q plot for the Collaborative set.


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test set are the following:

• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url: The value of the feature is 1 if the source or target segments contain a URL address; otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segments contain a tag; otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segments contain an email address; otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segments contain a number; otherwise it is 0.

• has capital letters: The value of the feature is 1 if the source or target segments contain words that have at least one capital letter; otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity: The value of this feature is the cosine similarity between the source and destination segments' punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in both the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.


• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
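As an illustrative sketch of two of these features (the punctuation mark set and the exact counting scheme are assumptions for the example, not the paper's implementation):

```python
from collections import Counter
import math

PUNCT = set(".,;:!?()[]")  # the exact mark set is an assumption

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def punctuation_similarity(source, target):
    """Cosine similarity of the two segments' punctuation-count vectors."""
    vec = lambda seg: Counter(ch for ch in seg if ch in PUNCT)
    return cosine(vec(source), vec(target))

def lang_dif(declared, detected):
    """Number of mismatches between declared and detected
    (source, target) language-code pairs, as in the paper's example."""
    return sum(1 for a, b in zip(declared, detected) if a != b)

# Identical punctuation patterns give a similarity close to 1
print(punctuation_similarity("Hello, world: one, two!", "Ciao, mondo: uno, due!"))
print(lang_dif(("en", "it"), ("en", "fr")))  # 1 (en-en, it-fr)
```

The same cosine helper applies unchanged to the tag, email, URL and number vectors described above.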

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.

• Logistic Regression: Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support vector machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can

5. https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md


give good results in classification problems where the decision boundary is irregular.

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline is called Baseline Uniform, and it generates predictions randomly. The second baseline is called Baseline Stratified, and it generates predictions by respecting the training set class distribution. The results of the first evaluation are given in table 1.
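Stratified folds preserve the roughly 30% negative proportion in each split. The paper uses scikit-learn, whose StratifiedKFold implements this properly; the sketch below reproduces only the core idea with the standard library, so the split logic is visible:

```python
from collections import defaultdict

def stratified_folds(labels, k=3):
    """Assign example indices to k folds while preserving the class
    distribution (the idea behind scikit-learn's StratifiedKFold,
    sketched here with the standard library only)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # deal each class's indices round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# 1243 examples with 373 negatives, as in the paper's training set
labels = [1] * 870 + [0] * 373
folds = stratified_folds(labels, k=3)
print([len(f) for f in folds])
print([round(sum(1 for i in f if labels[i] == 0) / len(f), 3) for f in folds])
```

Each fold ends up with close to 30% negatives, so every held-out fold mirrors the full training set.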

Algorithm              Precision  Recall  F1
Random Forest            0.95      0.97   0.96
Decision Tree            0.98      0.97   0.97
SVM                      0.94      0.98   0.96
K-Nearest Neighbors      0.94      0.98   0.96
Logistic Regression      0.92      0.98   0.95
Gaussian Naive Bayes     0.86      0.96   0.91
Baseline Uniform         0.69      0.53   0.60
Baseline Stratified      0.70      0.73   0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm              Precision  Recall  F1
Random Forest            0.85      0.63   0.72
Decision Tree            0.82      0.69   0.75
SVM                      0.82      0.81   0.81
K-Nearest Neighbors      0.83      0.66   0.74
Logistic Regression      0.80      0.80   0.80
Gaussian Naive Bayes     0.76      0.61   0.68
Baseline Uniform         0.71      0.72   0.71
Baseline Stratified      0.70      0.51   0.59

Table 2: The results of the classification on the test set.

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, will be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
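The ~10% figure can be re-derived directly from the quoted confusion counts. This is only a sanity-check sketch of the standard definitions, not the paper's own evaluation script:

```python
# Confusion counts quoted in the text for the SVM on the test set
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2))
# share of all test examples that are false negatives, i.e. good
# bi-segments that would be discarded
print(round(fn / (tp + fp + fn + tn), 2))  # roughly 0.10
```

32 false negatives out of 309 examples is about 10.4%, which is the loss of good material that motivates trading recall for precision in the next section.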

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat Set and more sparse for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the


performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007) we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20

(b) w1 w2 w3 w4 w5 w21 ... w35
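The exact scoring formulas used by commercial TM tools are proprietary; assuming the widely used 1 − distance/length form of a Levenshtein-based fuzzy score at word level, sentence (b) indeed scores far below typical thresholds against input (a):

```python
def levenshtein(a, b):
    """Word-level edit distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[-1] + 1,             # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def match_score(a, b):
    """Fuzzy match as 1 - distance / length of the longer segment."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

a = [f"w{i}" for i in range(1, 21)]             # input sentence (a)
b = a[:5] + [f"w{i}" for i in range(21, 36)]    # TM sentence (b)
print(match_score(a, b))  # 0.25, far below a 70-75% threshold
```

Only the five shared words survive the edit-distance alignment, so the score stalls at 25%, matching the estimate in the text.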

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is


a group of words containing a finite verb. Nonfinite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
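The clause splitter used here is Puscasu's rule-based module, which is not reproduced in this paper. Purely as an illustration of the general idea of splitting at clause boundaries and keeping chunks that appear to contain a finite verb, a toy sketch might look as follows; the boundary and verb patterns are invented for the example and are far cruder than the real system:

```python
import re

# Toy illustration only: NOT the Puscasu (2004) system.
# Split at commas and before common clause-introducing conjunctions,
# then keep chunks that appear to contain a finite verb form.
CLAUSE_BOUNDARY = re.compile(r",\s*|\s+(?=(?:and|but|that|which|because)\b)")
FINITE_VERB = re.compile(r"\b(?:is|are|was|were|has|have|had|"
                         r"\w+ed|\w+s)\b")

def split_clauses(sentence):
    chunks = [c.strip() for c in CLAUSE_BOUNDARY.split(sentence) if c.strip()]
    return [c for c in chunks if FINITE_VERB.search(c)]

sentence = ("the ministry of defense once indicated that about 20000 "
            "soldiers were missing in the korean war")
print(split_clauses(sentence))
```

Applied to the Set A example discussed later, this splits the sentence into the two finite clauses that would be stored as separate TM segments.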

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.

19

The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:

  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war

  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A example.

Without clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B example.

5 Results

WORDFAST
                               W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)         23.33%                90.00%
Retrieved (Minimum threshold)         38.00%                92.22%

TRADOS
                               W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)         14.00%                88.89%
Retrieved (Minimum threshold)         14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A.

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST
                               W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)          2.67%                17.84%
Retrieved (Minimum threshold)         10.00%                41.08%

TRADOS
                               W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)          3.33%                25.95%
Retrieved (Minimum threshold)         36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
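The paired t-test compares, segment by segment, the match score without and with clause splitting. A minimal sketch of the statistic follows; the paper computes it with SPSS, and the score values below are invented for illustration only:

```python
import math

def paired_t(xs, ys):
    """Paired t statistic over matched score pairs; compare |t|
    against the t distribution with n-1 degrees of freedom."""
    n = len(xs)
    diffs = [y - x for x, y in zip(xs, ys)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Illustrative match scores for the same ten segments without and
# with clause splitting (invented values, not the paper's data)
without_cs = [0, 0, 75, 0, 80, 0, 0, 90, 0, 0]
with_cs = [85, 90, 95, 70, 92, 88, 0, 96, 80, 85]
print(paired_t(without_cs, with_cs))  # clearly positive: scores improve
```

Because the zeros for unretrieved segments mostly disappear after clause splitting, the paired differences are large and consistent, which is what drives the very small p-values in Table 5.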

SET A

WORDFAST
                    Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            19.23                      85.31               0.000
Minimum threshold            29.28                      86.49               0.000

TRADOS
Default threshold            11.78                      83.88               0.000
Minimum threshold            11.78                      88.07               0.000

SET B

WORDFAST
Default threshold             2.23                      17.21               0.000
Minimum threshold             6.49                      30.62               0.000

TRADOS
Default threshold             2.71                      23.08               0.000
Minimum threshold            16.83                      46.37               0.000

Table 5: Paired t-test on all experiments.

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching


performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as they are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou

School of Italian Language and Literature

Aristotle University of Thessaloniki

University Campus

54124 Thessaloniki, Greece

chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, and thus cannot retrieve matches that are similar in meaning but differ in wording. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, for the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique is more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent.

The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based fuzzy match score between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based score is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
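For illustration, word-based scoring of this kind can be sketched as follows. The normalization fuzzy = 1 - distance / max(length) is one common formulation and is an assumption here; commercial tools use undisclosed variants, so the exact percentages quoted above need not be reproduced:

```python
# Word-level Levenshtein distance plus the common fuzzy match formulation
# fuzzy = 1 - distance / max(len). Commercial CAT tools do not disclose
# their exact variants, so this is an illustrative formulation only.
def edit_distance(a, b):
    # classic dynamic programming over two token sequences
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]

def fuzzy(a, b):
    wa, wb = a.lower().split(), b.lower().split()
    return 1.0 - edit_distance(wa, wb) / max(len(wa), len(wb))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(round(100 * fuzzy(s1, s2)))  # 60
```

Under this tokenization the distance between (1) and (2) is 4 (one substitution, three deletions), giving 60% rather than the 70% quoted above; the difference comes only from the word count and normalization term used.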

In this case, according to the methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score and would then be post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear with a bare direct object (e.g. to make attention) and in others with a determined direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008) machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
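Entries of this kind are easy to consume programmatically. A minimal sketch, assuming the comma-separated lemma,category+features layout of Figure 1 (real NooJ dictionaries carry more structure than this):

```python
# Parse NooJ-style dictionary lines such as "artist,N+FLX=TABLE+Hum" into
# (lemma, category, features); the exact layout is assumed from Figure 1.
def parse_entry(line):
    lemma, rest = line.split(",", 1)
    category, *features = rest.split("+")
    props = {}
    for feat in features:
        key, _, value = feat.partition("=")
        props[key] = value or True  # flags like +Hum carry no value
    return lemma, category, props

lemma, cat, props = parse_entry("artist,N+FLX=TABLE+Hum")
print(lemma, cat, props)  # artist N {'FLX': 'TABLE', 'Hum': True}
```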

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, comprises three steps.

Figure 2: Pipeline of the paraphrase framework

The first step is the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
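To illustrate the shape of the rule (support verb + optional determiner + nominalization + optional preposition, rewritten to a full verb), here is a toy regex-based analogue. The verb table, the pattern, and the single-tense coverage are deliberate simplifications of what the NooJ dictionaries and local grammars actually provide:

```python
import re

# Toy paraphraser for support verb constructions (SVCs), one tense only.
# The noun-to-verb table and the pattern are illustrative stand-ins for
# NooJ's dictionaries and local grammars, which cover far more cases.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install",
                   "decision": "decide", "reservation": "reserve"}
SUPPORT_VERBS = r"(?:make|makes|take|takes|have|has|give|gives)"
DETERMINERS = r"(?:a|an|the)"

PATTERN = re.compile(
    rf"\b{SUPPORT_VERBS}\s+(?:{DETERMINERS}\s+)?(\w+)(?:\s+of)?\b")

def paraphrase(sentence):
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(1).lower())
        return verb if verb else match.group(0)  # leave non-SVCs untouched
    return PATTERN.sub(repl, sentence)

print(paraphrase("Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```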

The last step comprises the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units.
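The three steps can be sketched end to end as follows; tokenize, detokenize, and paraphrase_svc below are illustrative stand-ins for the Berkeley Tokenizer and the NooJ grammars used in the actual framework:

```python
# End-to-end sketch of the three-step TM expansion: extract source units,
# paraphrase them, and append the new pairs to the original memory.
# tokenize/detokenize/paraphrase_svc are placeholders for the real tools.

def tokenize(text):
    return text.split()  # stands in for the Berkeley Tokenizer

def detokenize(tokens):
    return " ".join(tokens)

def paraphrase_svc(tokens):
    # stands in for NooJ: rewrite one hard-coded SVC for illustration
    joined = " ".join(tokens).replace("make the cancellation of", "cancel")
    return joined.split()

def expand_tm(tm):
    """tm: list of (source, target) pairs; targets are never modified."""
    expanded = list(tm)
    for source, target in tm:
        new_source = detokenize(paraphrase_svc(tokenize(source)))
        if new_source != source:  # keep only genuine paraphrases
            expanded.append((new_source, target))
    return expanded

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
print(len(expand_tm(tm)))  # 2: the original unit plus its paraphrase
```

Note that the target segment is simply carried over unchanged, which mirrors the safety property claimed above: every retrieved translation remains a human translation.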

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit that has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014 [1]. The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), but we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

[1] http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only increases retrieval but also ensures that the proposed translation is a human translation, so no machine translation errors will appear, and less post-editing is required in cases where the match is not 100%.

While there are some drops in terms of fuzzy match improvement, our system presents few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased, and 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM across fuzzy match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and support for other source languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of the translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CATsystem with provided or self-built translationmemories (TM) The translation memories areusually in-house costly and manually created re-sources of varying sizes of thousands to millionsof translation pairs

Only recently some TMs have been made pub-licly available DGT (Steinberger et al 2013) orMyMemory (Trombetti 2009) to mention just afew But there is still a heavy demand on enlarg-ing and improving TMs and their coverage of new

input documents to be translatedObviously the aim is to have a TM that best fits

to the content of a new document as this is crucialfor speeding up the translation process when alarger part of a document can be pre-translated bya CAT system the translation itself can be cheaperCoverage of TMs is directly translatable to savingsby translation companies and their customers

We present two methods for expanding TMssubsegment generation and subsegment combina-tion The idea behind these methods is based onthe fact that even if the topic of the new documentis well covered by the memory only very rarelythe memory includes exact sentences (segments)as they appear in the document The differencesbetween known and new segments often consist ofsubstitutions or combinations of particular knownsubsegments

The presented methods concentrate on increas-ing the coverage of the content of an existing TMwith regard to a new document and at the sametime try to keep a reasonable quality of newlygenerated segment pairs We work with English-Czech data but the procedures are mostly languageindependent

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities                alignment
být větší   be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption to our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
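Where a subsegment has several candidate translations, the four scores can drive a simple selection. The sketch below is our own illustration (combining the scores by their product is an assumption, not a procedure stated in the paper), using the values from Figure 1:

```python
# Four Moses phrase-table scores per candidate translation, in the order
# given above: inverse phrase probability, inverse lexical weighting,
# direct phrase probability, direct lexical weighting (values from Figure 1).
candidates = {
    "be greater": (0.538, 0.053, 0.538, 0.136),
    "be larger": (0.170, 0.054, 0.019, 0.148),
}

def best_translation(candidates):
    """Pick the candidate whose four scores have the largest product
    (combining by product is our assumption, not the paper's formula)."""
    def product(scores):
        result = 1.0
        for s in scores:
            result *= s
        return result
    return max(candidates, key=lambda t: product(candidates[t]))

print(best_translation(candidates))  # be greater
```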

The alignment points determine the word alignment between a subsegment and its translation; i.e., 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
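A hypothetical helper for reading such alignment strings and checking for the three phenomena might look as follows (the function names and the crossing-link test for opposite orientation are our own):

```python
def parse_alignment(points):
    """Parse a Moses alignment string like '0-0 1-1' into index pairs."""
    return [tuple(int(i) for i in p.split("-")) for p in points.split()]

def alignment_properties(pairs, src_len, tgt_len):
    """Check the three phenomena listed above: empty alignment,
    one-to-many alignment, and opposite orientation (crossing links)."""
    aligned_src = {s for s, _ in pairs}
    aligned_tgt = {t for _, t in pairs}
    empty = len(aligned_src) < src_len or len(aligned_tgt) < tgt_len
    one_to_many = len(pairs) > len(aligned_src) or len(pairs) > len(aligned_tgt)
    crossing = any(s1 < s2 and t1 > t2 for s1, t1 in pairs for s2, t2 in pairs)
    return empty, one_to_many, crossing

pairs = parse_alignment("0-0 1-1")  # the 'byt vetsi' / 'be greater' example
print(pairs)                              # [(0, 0), (1, 1)]
print(alignment_properties(pairs, 2, 2))  # (False, False, False)
```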

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
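The overlap/neighbour distinction for two subsegments can be illustrated with token index spans (a minimal sketch in our own formulation; the tool's actual data structures are not described at this level of detail):

```python
def relation(span_a, span_b):
    """Classify two half-open token spans (start, end) in one segment."""
    (a_start, a_end), (b_start, b_end) = sorted([span_a, span_b])
    if b_start == a_end:
        return "neighbour"  # JOINN / SUBSTITUTEN style combination
    if b_start < a_end:
        return "overlap"    # JOINO / SUBSTITUTEO style combination
    return "disjoint"       # a gap remains between the subsegments

print(relation((0, 3), (2, 5)))  # overlap
print(relation((0, 3), (3, 5)))  # neighbour
```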

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2
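As a toy stand-in for the KenLM language model mentioned in the footnote, a combined n-gram score over n = 1..5 could be sketched like this (the count-based scoring below is our simplification, not the model the paper used):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_counts(corpus, max_n=5):
    """Count all n-grams (n = 1..max_n) over a tokenized corpus."""
    counts = Counter()
    for sent in corpus:
        for n in range(1, max_n + 1):
            counts.update(ngrams(sent, n))
    return counts

def fluency(tokens, counts, max_n=5):
    """Combined n-gram score: mean over n of the average log(1 + count)."""
    per_n = []
    for n in range(1, max_n + 1):
        grams = ngrams(tokens, n)
        if grams:
            per_n.append(sum(math.log(1 + counts[g]) for g in grams) / len(grams))
    return sum(per_n) / len(per_n)

corpus = [
    "can be divided into the following categories".split(),
    "can be divided into these categories".split(),
]
counts = train_counts(corpus)
fluent = fluency("can be divided".split(), counts)
broken = fluency("divided can be".split(), counts)
assert fluent > broken  # the well-ordered phrase scores higher
```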

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments; in each iteration it generates new (longer) subsegments and discards one processed subsegment.
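The greedy loop just described can be sketched as follows (our own simplification: subsegments are half-open token spans, and a join succeeds when two spans are adjacent or overlap):

```python
def join_method(subsegments):
    """Greedy JOIN sketch: repeatedly merge the longest span with any
    joinable span. Spans are half-open (start, end) token positions
    within one tokenized segment."""
    # List I, ordered by subsegment length in descending order
    I = sorted(subsegments, key=lambda s: s[1] - s[0], reverse=True)
    results = []
    while I:
        current = I.pop(0)
        T = []  # temporary list of newly created (longer) subsegments
        for other in list(I):
            a, b = sorted([current, other])
            if b[0] <= a[1]:  # adjacent or overlapping: the join succeeds
                current = (a[0], max(a[1], b[1]))
                T.append(current)
                I.remove(other)
        if T:
            I = T + I  # prepend the new subsegments and continue
        else:
            results.append(current)
    return results

print(join_method([(0, 3), (2, 5), (7, 9)]))  # [(0, 5), (7, 9)]
```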

2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into ac-

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

count just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed for evaluating MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to translation quality; however, in this case we have measured just fully (100%) matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence, using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences, similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric, and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25 5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0], ["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0], ["window","window",C,0], [".",".",C,0]

we   want   to      have   a   table   near   the   window   .
|    D      S       S      |   |       S      |     |        |
we   -      would   like   a   table   by     the   window   .

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment


2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of the 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
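The effect of a cheap deletion cost can be reproduced with a small weighted edit distance (the weights 0.1 and 1.0 below are illustrative assumptions; the excerpt does not publish the actual values used):

```python
def weighted_edits(test, tm, del_cost=0.1, other_cost=1.0):
    """Levenshtein-style cost where deleting TM-only words is cheap.

    test, tm: lists of tokens. Insertions and substitutions cost
    other_cost; deletions (dropping TM words) cost del_cost.
    """
    rows, cols = len(tm) + 1, len(test) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * del_cost
    for j in range(1, cols):
        d[0][j] = j * other_cost
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0.0 if tm[i - 1] == test[j - 1] else other_cost
            d[i][j] = min(d[i - 1][j] + del_cost,    # delete a TM word
                          d[i][j - 1] + other_cost,  # insert a missing word
                          d[i - 1][j - 1] + sub)     # match / substitute
    return d[rows - 1][cols - 1]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
# Six cheap deletions beat three expensive insertions:
assert weighted_edits(test, tm1) < weighted_edits(test, tm2)
```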

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions imply matched fragments and red portions imply a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair, along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
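Projecting the source-side TER matches onto the target through the word alignment can be sketched as follows (our own simplified reading of the procedure; the indices follow the example alignment above):

```python
def color_target(ter_ops, alignment):
    """Mark target positions green when their aligned source token matched.

    ter_ops: one TER tag per TM source token ('C' = match, else edit).
    alignment: (source_index, target_index) word-alignment pairs.
    """
    matched_src = {i for i, op in enumerate(ter_ops) if op == "C"}
    green = {t for s, t in alignment if s in matched_src}
    red = {t for s, t in alignment if s not in matched_src} - green
    return green, red

# TM source "we want to have a table near the window ." with the TER tags
# from the example above, and the GIZA++ links converted to 0-based
# (source index, target index) pairs:
ter_ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
alignment = [(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1), (9, 6)]

green, red = color_target(ter_ops, alignment)
print(sorted(green))  # [0, 1, 3, 4, 6] -> shown in green
print(sorted(red))    # [2, 5]          -> shown in red
```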

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me

39

5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming: we want to store the words in their surface form, so that a word counts as a match only if it appears in exactly the same form in an input sentence. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free and open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme.

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira and Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodoroou, Konstantinos 24
Horak, Ales 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

4.3 Quotations and text within tags

Quotations and double tags appearing in the text were handled differently. As the text within the quotations or tags might be part of the sentence where it appears, it could not be removed without adding too much noise to the data. Thus, we identified sentences with quotations and/or tags; we then removed the quotations and/or tags and kept the same sentence without them as a new segment. Finally, we also kept the text within the quotation marks or tags as new segments. Examples 11–12 illustrate this kind of segments.

(11) EN: You can only set the values of settings that the policy allows to be modified, that is, unlocked settings.
ES: Solo se pueden establecer los valores de los parámetros que la directiva permite modificar, es decir, los parámetros desbloqueados.

(12) EN: If you clear the <1>Inherit settings from parent policy</1> check box in the <2>Settings inheritance</2> section of the <3>General</3> section in the properties window of an inherited policy, the lock is lifted for that policy.
ES: Si anula la selección de la casilla <1>Heredar configuración de la directiva primaria</1> en la sección <2>Herencia de configuración</2> que aparece en la sección <3>General</3> de la ventana de propiedades de una directiva heredada, se abrirá el candado para esa directiva.
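The segment-generation strategy for tagged text can be sketched with a short regular-expression script. This is an illustrative reconstruction, not the authors' actual code, and it assumes memoQ-style paired tags of the form <1>…</1>:

```python
import re

# Assumed paired tags such as <1>...</1>, as in Example (12).
TAG = re.compile(r"<(\d+)>(.*?)</\1>")

def expand_segment(segment):
    """Keep the original segment, add the segment with tags stripped,
    and add each tagged span as a new segment of its own."""
    new_segments = [segment]
    if TAG.search(segment):
        new_segments.append(TAG.sub(r"\2", segment))
        new_segments.extend(m.group(2) for m in TAG.finditer(segment))
    return new_segments
```

The same idea applies to quotations, with a pattern matching paired quote characters instead of numbered tags; applying the function to both sides of a bi-segment yields the aligned new segments.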

5 Pilot test

Before integrating our system in a CAT tool and in our normal production workflows, we deemed it better to run a pilot test. The aim of this test was to measure to which extent the new segments retrieved an increased number of 100% and fuzzy matches.

5.1 Test set

We used as a test set a real translation project coming from one of our clients. It is a software manual written in English and to be translated into Spanish. We selected memoQ 2015 as the CAT tool for our testing because it is one of the common CAT tools used by our translators and because we also wanted to measure the impact of our approach when using its fragment assembly functionality.

The project had in total 425 segments, accounting for 6,280 words according to memoQ. Table 1 shows the project statistics as provided by memoQ's analysis tool using the project TM provided by the client. Additionally, memoQ identified 36 segments (418 words) which could be translated benefiting from its fragment assembly functionality, which uses fragments of segments to create new translations.

TM match      Words  Segments
Repetitions    1064        80
100%              0         0
95-99%            4         2
85-94%            0         0
75-84%          285        14
50-74%         2523       187
No Match       2404       142
Total          6280       425

Table 1: Project statistics according to memoQ using the project TM.

Taking this analysis as the starting point of our pilot test, we generated new segments using the approach described in Section 4. We used three different TMs to further assess whether the size of the translation memory matters for generating translations of segments and retrieving more translations. The first TM (Project TM) was the project TM provided by the client. The second TM (Product TM) was a TM including all projects done for the same product of the client. Finally, the third TM (Client TM) included all projects of that client and thus was the biggest one. Table 2 summarises the size of the three TMs.

TM          Segments  Words (EN)  Words (ES)  Words (total)
Project TM     16842      212472      244159         456631
Product TM     20923      274542      317797         592339
Client TM     256099     3427861     3951732        7379593

Table 2: Size of the different TMs used.

We then generated new TM segments and stored them as new TMs. Table 3 shows the number of new segments generated using our approach.

Table 4 breaks down the number of segmentsgenerated using each strategy and for each TM


New segments from  Segments  Words (EN)  Words (ES)  Words (total)
Project TM             6776       56973       66297         123270
Product TM             7760       71034       83125         154159
Client TM             74041      662714      769705        1432419

Table 3: Size of the new TMs generated using our approach.

used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client's TM, the text between parentheses was more prolific.

Type             Proj. TM  Prod. TM  Client TM
Ellipsis                7         6         50
Colon                1801      2094      17894
Parenthesis          1361      1621      29637
Square brackets        78        73        598
Curly braces            0         0          0
Quotations           1085      1146       8797
Tags                 2523      2892      17792

Table 4: Number of newly generated segments per type and TM used.

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality from memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters and that the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8-TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client's TM (TM9) or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201) while the number of segments translated using fragments increases slightly (194 → 198).

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

          TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%        0    5    5    5    5    5    5    5    5     5     5
95-99%      2    3    3    4    4    4    9    9    9     9     9
85-94%      0    1    1    1    1    1    2    2    2     2     2
75-84%     14    9   14    9   15   15   12   16   16    16    16
50-74%    187  130  198  136  202  202  180  223  226   226   226
No match  142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95-99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7-TM11), and the 85-94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50-74% fuzzy band (from 142 segments in TM1 up to 226 in TM9-TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this was proven true, a productivity increase may be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on the TM fuzzy match retrieval as well as on the generation of translations

from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together and pre-translating files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher


number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory – perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the Workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory¹ (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

¹ https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench². Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somehow similar to translation memory cleaning as envisioned in Section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task, where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.

² http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield C.C." is translated into Italian only partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.


It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length³. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in Equation 1:

CG = (l_s − l_d) / √(3.4 (l_s + l_d))    (1)
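The score in equation 1 is straightforward to compute. The sketch below assumes, as in Gale and Church's original work, that l_s and l_d are measured in characters:

```python
import math

def church_gale(source: str, target: str) -> float:
    """Modified Church-Gale length difference (equation 1).

    Assumes segment length is measured in characters; a score far from 0
    suggests the two segments are unlikely to be mutual translations.
    """
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# A faithful translation pair yields a score close to zero:
print(church_gale("How are you?", "Come stai?"))
```

Equal-length segments score exactly 0, and the score grows as the length gap widens relative to the combined length.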

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.4 The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat set the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

[Figure 2 (histogram): The distribution of Church-Gale scores in the Matecat Set.]

[Figure 3 (histogram): The distribution of Church-Gale scores in the Collaborative Set.]


[Figure 4 (histogram): The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.]

Church-Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile (Q-Q) plots in figures 5 and 6.

In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments whose scores are away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: contains 1243 bi-segments, 373 of which are negative examples.

• Test Set: contains 309 bi-segments, 87 of which are negative examples.

The proportion of negative examples in both sets is approximately 30%.

[Figure 5 (Q-Q plot, sample quantiles vs. theoretical quantiles): The Q-Q plot for the Matecat set.]

[Figure 6 (Q-Q plot, sample quantiles vs. theoretical quantiles): The Q-Q plot for the Collaborative set.]


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test set are the following:

• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url: The value of the feature is 1 if the source or target segment contains a URL address, otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segment contains a tag, otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segment contains an email address, otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segment contains a number, otherwise it is 0.

• has capital letters: The value of the feature is 1 if the source or target segment contains words that have at least one capital letter, otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. a tag exists/does not exist, and if it exists, it is present/not present in both the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for tag similarity.

• number similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for tag similarity.


• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
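Given the declared and detected code pairs, the mismatch count behind lang dif reduces to a few lines (a sketch; the detector outputs are taken as given here, while the paper performs detection with the Cybozu detector):

```python
def lang_dif(declared, detected):
    """Count mismatches between declared and detected language codes.

    declared, detected: (source_code, target_code) pairs, e.g. ("en", "it").
    """
    return sum(1 for want, got in zip(declared, detected) if want != got)

print(lang_dif(("en", "it"), ("en", "it")))  # 0: en-en / it-it
print(lang_dif(("en", "it"), ("en", "fr")))  # 1: en-en / it-fr
print(lang_dif(("en", "it"), ("de", "fr")))  # 2: en-de / it-fr
```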

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment into the target language lacks punctuation.
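As an illustration, the punctuation similarity feature can be sketched as a cosine over punctuation-count vectors (the punctuation inventory below is an assumption, not taken from the paper):

```python
from collections import Counter
import math

PUNCT = ".,;:!?\"'()-"  # assumed punctuation inventory

def punct_vector(segment: str) -> Counter:
    # Count only punctuation characters of the segment.
    return Counter(ch for ch in segment if ch in PUNCT)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Identical punctuation profiles give similarity 1.0:
print(cosine(punct_vector("How are you?"), punct_vector("Come stai?")))  # 1.0
```

The same cosine construction carries over to the tag, email, URL and number vectors described above.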

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.

• Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support Vector Machines are among the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among those instances. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
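The six classifiers above can be compared in a few lines with scikit-learn, the package cited in the paper. The features and labels here are synthetic stand-ins for the real bi-segment features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 5))                  # 200 bi-segments, 5 normalized features
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # toy labels standing in for correct/false

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM (linear kernel)": LinearSVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Gaussian Naive Bayes": GaussianNB(),
}
for name, clf in classifiers.items():
    print(name, clf.fit(X, y).score(X, y))  # training accuracy, illustration only
```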

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.
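The two baselines map closely onto scikit-learn's DummyClassifier strategies "uniform" and "stratified" (an assumption about how they were implemented; the labels below are a synthetic 70/30 split like the training set's):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1243, 5))
y = np.array([0] * 373 + [1] * 870)   # ~30% negative examples, as in the paper

cv = StratifiedKFold(n_splits=3)      # three-fold stratified evaluation
for strategy in ("uniform", "stratified"):
    baseline = DummyClassifier(strategy=strategy, random_state=0)
    scores = cross_val_score(baseline, X, y, cv=cv, scoring="f1")
    print(strategy, round(float(scores.mean()), 2))
```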

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all the algorithms have excellent results, and all beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results for the second evaluation are worse than those for the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets; obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. From the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, those corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning where the precision is increased even if the recall drops. We make some suggestions in the next section.
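The rates implied by these confusion counts can be checked directly; in particular, the 32 false negatives are indeed roughly 10% of the 309 test examples:

```python
# SVM confusion counts quoted above.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn          # 309 test examples

precision = tp / (tp + fp)
recall = tp / (tp + fn)
discarded = fn / total             # good bi-segments wrongly thrown away

print(total)                       # 309
print(f"{discarded:.1%}")          # about 10%, as the text states
```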

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments containing different proportions of positive and negative examples are dissimilar: the distribution is closer to the normal distribution for the MateCat set and more dispersed for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that differently behaving classifiers can compensate for each other's errors. If this solution does not give the expected results, we will focus on the subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
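This high-precision mode can be sketched by thresholding Logistic Regression's probability output (synthetic data; the 0.9/0.1 cut-offs are an assumption, not a value from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]       # probability of the positive class

# Decide automatically only when the classifier is very confident
# (proba > 0.9 or proba < 0.1); leave the rest for manual review.
confident = np.abs(proba - 0.5) > 0.4
print(f"decided automatically: {confident.mean():.0%}")
```

Raising the cut-off trades coverage for precision, which is exactly the trade-off the text argues for.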

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought", therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
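A toy Levenshtein-based similarity (a sketch, not any tool's actual scoring formula) makes the effect concrete: a clause scores well below typical default thresholds against the full stored sentence, but perfectly once the stored sentence has been split and the clause itself is a TM segment:

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # Toy fuzzy-match score in [0, 1].
    return 1 - levenshtein(a, b) / max(len(a), len(b))

tm_segment = ("the ministry of defense once indicated that about 20000 soldiers "
              "were missing in the korean war and that the ministry of defense "
              "believes there may still be some survivors")
query = ("the ministry of defense once indicated that about 20000 soldiers "
         "were missing in the korean war")

print(similarity(query, tm_segment))  # below typical 70-75% default thresholds
print(similarity(query, query))       # 1.0 once the clause is stored separately
```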

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1; the underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and the TM stores the component clauses of the original longer segment; we test whether the clause that is a paraphrase of the input can be retrieved. An example is shown in Table 2.

Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   23.33%                 90.00%
Retrieved (minimum threshold)   38.00%                 92.22%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   14.00%                 88.89%
Retrieved (minimum threshold)   14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When we analysed the corresponding segments that could not be retrieved even with the minimum threshold, we found that in Set A all failures were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.
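The retrieval scheme evaluated here — split the input segment into clauses and look each clause up in the TM — can be sketched as follows. This is a minimal illustration with a toy word-overlap similarity; the paper relies on the fuzzy matchers built into Wordfast and Trados and on a dedicated multilingual clause splitter, so both `split_clauses` and `similarity` are stand-ins:

```python
def clause_match_retrieval(tm_segments, input_segment, split_clauses,
                           similarity, threshold=0.7):
    """Retrieve TM segments per clause of the input, not per whole segment.

    tm_segments: list of source-side TM segments (already clause-split).
    split_clauses: callable splitting a segment into clauses (stand-in
    for the clause splitter used in the paper).
    """
    hits = []
    for clause in split_clauses(input_segment):
        scored = [(similarity(clause, seg), seg) for seg in tm_segments]
        score, best = max(scored)
        if score >= threshold:
            hits.append((clause, best, score))
    return hits


def jaccard(a, b):
    # Toy word-overlap similarity, standing in for a real fuzzy match score.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)
```

With a clause-split TM, an input whose clauses each appear as separate TM segments is retrieved clause by clause, which is exactly the effect measured in Tables 3 and 4.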

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set we observed that, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause-splitting errors, such as a segment not being split at all although it has more than one clause, or being split incorrectly. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and count it as one case, in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.

SET A – WORDFAST
                     Mean (w/o splitting)   Mean (w/ splitting)   p-value
Default threshold         19.23                85.30888891          0.000
Minimum threshold         29.28                86.48888891          0.000

SET A – TRADOS
Default threshold         11.78                83.87777781          0.000
Minimum threshold         11.78                88.07111113          0.000

SET B – WORDFAST
Default threshold          2.23                17.21111111          0.000
Minimum threshold          6.49                30.62111112          0.000

SET B – TRADOS
Default threshold          2.71                23.083               0.000
Minimum threshold         16.83                46.36777779          0.000

Table 5: Paired t-test on all experiments
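The testing protocol just described can be sketched with the standard library alone (the paper used SPSS, which additionally supplies the p-value from the t distribution; the scores below are made up):

```python
import statistics as st
from math import sqrt


def segment_score(clause_scores):
    # A segment that was split into several clauses counts as one case:
    # its score is the average of its clauses' match scores.
    return sum(clause_scores) / len(clause_scores)


def paired_t_statistic(scores_without, scores_with):
    # Paired t-test over per-segment match scores; a score of 0 stands
    # for "no match retrieved". Returns the t statistic.
    diffs = [w - wo for wo, w in zip(scores_without, scores_with)]
    return st.mean(diffs) / (st.stdev(diffs) / sqrt(len(diffs)))


# Example: three segments, the second of which was split into two clauses.
without = [0.0, 10.0, 20.0]
with_splitting = [85.0, segment_score([80.0, 90.0]), 90.0]
t = paired_t_statistic(without, with_splitting)
```

Looking the resulting t statistic up in a t distribution with n−1 degrees of freedom gives the significance level reported in Table 5.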

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

Table 4: Percentage of correctly retrieved segments in Set B

                                  W/o clause splitting   W/ clause splitting
WORDFAST
  Retrieved (default threshold)          2.67%                 17.84%
  Retrieved (minimum threshold)         10.00%                 41.08%
TRADOS
  Retrieved (default threshold)          3.33%                 25.95%
  Retrieved (minimum threshold)         36.00%                 70.67%

It is worth mentioning that the data used in these experiments were not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, in which NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting on the source side alone. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin/Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece

chatzik@itl.auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, and do not allow the retrieval of matches that are merely similar. In this paper we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, selecting the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface-form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.

3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions and substitutions needed to make the two sentences identical (Hirschberg, 1997). For instance, the word-based fuzzy match score between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words) and the character-based score is 76% (14 deletions for 60 characters), not counting whitespace, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the suggestion and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
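A word-level edit-distance fuzzy match of the kind discussed above can be sketched as follows. This is a minimal sketch that normalizes by the length of the longer sentence; commercial tools use undisclosed variants, so the exact percentages quoted in the text need not be reproduced by this formula:

```python
def levenshtein(a, b):
    # Edit distance over any two sequences (characters of a string, or
    # word tokens of a sentence): deletions, insertions, substitutions.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]


def fuzzy_match(s1, s2):
    # Word-based fuzzy match: 1 - edit_distance / length of the longer sentence.
    w1, w2 = s1.lower().split(), s2.lower().split()
    return 1 - levenshtein(w1, w2) / max(len(w1), len(w2))
```

Under this normalization, sentences (1) and (2) score 0.6 (1 substitution and 3 deletions over the 10 words of sentence (1)), illustrating how low a score two sentences with identical meaning can receive.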

In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation, given the low fuzzy match score, and would then be post-edited; otherwise, the translator would have to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC while sentence (2) contains its verbal counterpart. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003). SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear with an indirect object (e.g. to make attention) and in others with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units; hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct. In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipeline stages.

Figure 2: Pipeline of the paraphrase framework

The first stage extracts the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for de-tokenization in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and there will therefore be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
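The behaviour of the local grammar can be imitated with a toy sketch. This is an assumed, heavily simplified stand-in for NooJ: a hand-made nominalization lexicon replaces the dictionary's derivational paradigms, and only base-form SVCs are handled, whereas the NooJ grammars preserve tense and person:

```python
import re

# Hypothetical mini-lexicon: predicate noun -> derived full verb.
# In NooJ this mapping comes from derivational paradigms in the dictionary.
NOMINALIZATIONS = {
    "cancellation": "cancel",
    "reservation": "reserve",
    "installation": "install",
}

SUPPORT_VERBS = r"(?:make|makes|take|takes|have|has|give|gives)"
# support verb + optional determiner + predicate noun + optional "of"
SVC = re.compile(rf"\b{SUPPORT_VERBS}\s+(?:(?:a|an|the)\s+)?(\w+)(?:\s+of)?\b")


def paraphrase_svc(sentence):
    # Replace a recognized SVC by its full-verb paraphrase; any match
    # whose predicate noun is not in the lexicon is left untouched.
    return SVC.sub(lambda m: NOMINALIZATIONS.get(m.group(1), m.group(0)),
                   sentence)
```

For instance, `paraphrase_svc("Press Cancel to make the cancellation of your personal information")` yields sentence (2), "Press Cancel to cancel your personal information", while sentences without a known predicate noun pass through unchanged.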

The last stage performs de-tokenization and concatenates the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units. The resulting TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
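The three-stage generation process can be sketched end-to-end as follows. This is an assumed simplification: `paraphrase` stands in for the NooJ grammar run, and tokenization/de-tokenization are elided:

```python
def expand_tm(units, paraphrase):
    """Expand a TM with paraphrased source sides.

    units: list of (source, target) translation pairs.
    paraphrase: callable returning a paraphrased source, or None when
    no SVC was recognized. Targets are never touched ("as is"): each
    paraphrase is added as a new unit reusing the human translation.
    """
    expanded = list(units)
    for src, tgt in units:
        p = paraphrase(src)
        if p and p != src:
            expanded.append((p, tgt))
    return expanded
```

With a paraphrase function that maps "make the cancellation of" to "cancel", a one-unit TM grows to two units, both pointing at the same Italian target; this mirrors how the 1,025-unit TM of Section 5 grows to 1,612 units.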

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually, to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹. The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%), although the numbers in the other categories are also very high. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14     48
95%-99%                  23     32
85%-94%                  18     29
75%-84%                  51     38
50%-74%                  32     18
No match                 62     35
Total                   200    200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only increases retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is below 100%.

Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased, and 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION: You must make the installation of the version 6 of the software
TMsl:     CAUTION: You must install the version 6 of the software
TMtg:     ATTENZIONE: Si deve installare la versione 6 del software
parTMsl:  CAUTION: You must make the installation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same even though the lexical items are not. The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches. In the future we will continue to explore paraphrasing of other support verbs, as well as support for other source languages. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of translations even more.

References

Athayde M F 2001 Construccedilotildees com verbo-suporte

(funktionsverbgefuumlge) do portuguecircs e do alematildeo In

Cadernos do CIEG Centro Interuniver-sitaacuterio de Es-

tudos Germaniacutesticos n 1 Coimbra Portugal Uni-

versidade de Coimbra

Barreiro A 2008 Make it simple with paraphrases

Automated paraphrasing for authoring aids and

machine translation Ph D thesis Universidate do

Porto

Biccedilici E and Dymetman M 2008 Dynamic Transla-

tion Memory Using Statistical Machine Transla-

tion to improve Translation Memory Fuzzy

Matches In Proceedings of the 9th International

Conference on Intelligent Text Processing and

Computational Linguistics (CICLing 2008) LNCS

Haifa Israel February 2008

But M 2003 The Light Verb Jungle In Harvard Work-

ing Papers in Linguistics ed G Aygen C Bowern

and C. Quinn, 1–49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T., and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J., and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A., and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensivite Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R., and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C., and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly, manually created resources varying in size from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly into savings for translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ        | EN         | probabilities            | alignment
být větší | be greater | 0.538 0.053 0.538 0.136  | 0-0 1-1
být větší | be larger  | 0.170 0.054 0.019 0.148  | 0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in cases where there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
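The selection among alternative translations can be sketched as follows. This is our own minimal illustration, not the authors' code; the field layout follows the standard Moses phrase-table convention (source ||| target ||| four scores ||| alignment points), matching the rows shown in Figure 1, and the choice of the direct phrase translation probability as the ranking score is an assumption for the example.

```python
def parse_phrase_line(line):
    """Parse one Moses-style phrase-table line:
    source ||| target ||| p1 p2 p3 p4 ||| alignment points."""
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    p_inv_phrase, p_inv_lex, p_dir_phrase, p_dir_lex = map(float, probs.split())
    return {"src": src, "tgt": tgt,
            "probs": (p_inv_phrase, p_inv_lex, p_dir_phrase, p_dir_lex),
            "align": align.split()}

def best_translation(lines):
    """Pick the candidate with the highest direct phrase
    translation probability (the third score)."""
    entries = [parse_phrase_line(l) for l in lines]
    return max(entries, key=lambda e: e["probs"][2])

# the two alternatives from Figure 1
table = [
    "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1",
    "být větší ||| be larger ||| 0.170 0.054 0.019 0.148 ||| 0-0 1-1",
]
print(best_translation(table)["tgt"])  # "be greater" (0.538 > 0.019)
```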

The alignment points determine the word alignment between a subsegment and its translation; i.e., 0-0 1-1 means that the first word "být" of the source language is translated to the first word of the translation, "be", and the second word, "větší", to the second, "greater". These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
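The three phenomena can be detected mechanically from the alignment points. The following is a sketch under our own naming (the function and its return keys are illustrative, not part of the paper's implementation):

```python
def classify_alignment(points, src_len, tgt_len):
    """Classify Moses-style alignment points such as '0-0 1-1'.
    Detects the three phenomena mentioned in the text: unaligned
    (empty) words, one-to-many links, and reordered translation."""
    pairs = [tuple(map(int, p.split("-"))) for p in points.split()]
    src_aligned = {s for s, _ in pairs}
    tgt_aligned = {t for _, t in pairs}
    # empty alignment: some source or target word has no link at all
    empty = (src_aligned != set(range(src_len))) or (tgt_aligned != set(range(tgt_len)))
    # one-to-many: more links than distinct words on one side
    one_to_many = len(pairs) > len(src_aligned) or len(pairs) > len(tgt_aligned)
    # opposite orientation: target indices not monotone in source order
    tgt_seq = [t for _, t in sorted(pairs)]
    reordered = tgt_seq != sorted(tgt_seq)
    return {"empty": empty, "one_to_many": one_to_many, "reordered": reordered}

print(classify_alignment("0-0 1-1", 2, 2))
# the example from Figure 1: no empty links, no one-to-many, no reordering
```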

4 Subsegment combination

The output of the subsegment generation step is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered by an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1..5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
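The combined n-gram fluency idea can be illustrated with a toy scorer. The paper uses a KenLM model trained on the enTenTen corpus; the count-based scorer below is only our stand-in for a real language model (corpus, function names and the averaging scheme are illustrative assumptions), showing how n-gram evidence for n = 1..5 separates a fluent candidate combination from a scrambled one:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(corpus, max_n=5):
    """Collect n-gram counts (n = 1..5) from a tokenized target-language corpus."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sent in corpus:
        toks = sent.split()
        for n in counts:
            counts[n].update(ngrams(toks, n))
    return counts

def fluency(candidate, counts, max_n=5):
    """Combined n-gram score: fraction of candidate n-grams attested in
    the corpus, averaged over n = 1..5 (a crude stand-in for an LM score)."""
    toks = candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        grams = ngrams(toks, n)
        if grams:
            scores.append(sum(g in counts[n] for g in grams) / len(grams))
    return sum(scores) / len(scores)

corpus = ["it can be divided into the following categories",
          "the following categories are discussed below"]
counts = build_counts(corpus)
# the fluent combination outscores a scrambled one
assert fluency("divided into the following categories", counts) > \
       fluency("categories following the into divided", counts)
```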

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments; in each iteration it generates new (longer) subsegments and discards one processed subsegment.
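The greedy joining loop can be sketched as follows, assuming subsegments are represented by their (start, end) token positions with an exclusive end index; variable names and the exact merge bookkeeping are ours, not the authors':

```python
def join_subsegments(spans):
    """Greedy JOIN over subsegment positions in a tokenized segment.
    spans: (start, end) token indexes, end exclusive. Longer subsegments
    are preferred; two spans are joined when they overlap (JOINO) or
    neighbour each other (JOINN). Returns the longest span reachable."""
    # deterministic order: by position, then by length, longest first
    work = sorted(sorted(set(spans)), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(work)
    while True:
        head, rest = work[0], work[1:]
        new = []
        for other in rest:
            # overlap or adjacency (end-exclusive: touching ends neighbour)
            if head[1] >= other[0] and other[1] >= head[0]:
                merged = (min(head[0], other[0]), max(head[1], other[1]))
                if merged not in seen:
                    new.append(merged)
                    seen.add(merged)
        if not new:
            return head           # nothing joinable: the head is the result
        # prepend the new (longer) spans, discard the processed head
        work = sorted(new, key=lambda s: s[1] - s[0], reverse=True) + rest

print(join_subsegments([(0, 3), (3, 5), (4, 8)]))  # covers the whole segment
```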

2In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software; the same evaluation is used by translation companies for assessing actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation where a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
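The match categories used throughout the analysis can be reproduced with a simple bucketing function. This is purely illustrative (the band boundaries follow Table 2; the function name and the "no match" label below 50% are our assumptions):

```python
def match_band(percent):
    """Map a fuzzy-match percentage to the MemoQ-style analysis bands
    reported in Table 2; values below 50% count as no match."""
    if percent == 100:
        return "100%"
    if 95 <= percent <= 99:
        return "95-99%"
    if 85 <= percent <= 94:
        return "85-94%"
    if 75 <= percent <= 84:
        return "75-84%"
    if 50 <= percent <= 74:
        return "50-74%"
    return "no match"

print(match_band(87))  # falls into the 85-94% band
```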


Table 2: MemoQ analysis, TM coverage in %.

Match   | TM    | SUB   | OJ    | NJ    | all
100%    | 0.41  | 0.12  | 0.10  | 0.17  | 0.46
95-99%  | 0.84  | 0.91  | 0.64  | 0.90  | 1.37
85-94%  | 0.07  | 0.05  | 0.25  | 0.76  | 0.81
75-84%  | 0.80  | 0.91  | 1.71  | 3.78  | 4.40
50-74%  | 8.16  | 10.05 | 25.09 | 40.95 | 42.58
any     | 10.28 | 12.04 | 27.79 | 46.56 | 49.62

Table 3: MemoQ analysis, DGT-TM.

Match   | SUB   | OJ    | NJ    | all
100%    | 0.08  | 0.07  | 0.11  | 0.28
95-99%  | 0.75  | 0.44  | 0.49  | 0.66
85-94%  | 0.05  | 0.08  | 0.49  | 0.61
75-84%  | 0.46  | 0.96  | 3.67  | 3.85
50-74%  | 10.24 | 27.77 | 41.90 | 44.47
all     | 11.58 | 29.32 | 46.66 | 49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match   | SUB   | OJ    | NJ    | all
100%    | 0.15  | 0.13  | 0.29  | 0.57
95-99%  | 0.98  | 0.59  | 1.24  | 1.45
85-94%  | 0.09  | 0.22  | 1.34  | 1.37
75-84%  | 1.03  | 2.26  | 6.35  | 7.07
50-74%  | 12.15 | 34.84 | 49.82 | 51.62
all     | 14.40 | 38.04 | 59.04 | 61.51

We have also compared the results with the output of a function called Fragment Assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment Assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the assets of particular methods with regard to the translation quality; in this case, however, we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that human evaluation is always needed.

3http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature | SUB  | OJ   | NJ   | NS
prec    | 0.60 | 0.63 | 0.70 | 0.66
recall  | 0.67 | 0.74 | 0.74 | 0.71
F1      | 0.64 | 0.68 | 0.72 | 0.68
METEOR  | 0.31 | 0.37 | 0.38 | 0.38

DGT
prec    | 0.76 | 0.93 | 0.91 | 0.81
recall  | 0.78 | 0.86 | 0.88 | 0.85
F1      | 0.77 | 0.89 | 0.89 | 0.83
METEOR  | 0.40 | 0.50 | 0.51 | 0.45

Table 6: Error examples, Czech → English.

source seg.:    Oblast dat může mít libovolný tvar.
reference:      The data area may have an arbitrary shape.
generated seg.: Area data may have any shape.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they appear in the source segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1Jadavpur University, India
2Saarland University, Germany
3German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also to improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool, and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or by translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in

1www.matecat.com
2The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries over the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TMs, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors to use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantly how similar the five suggested segments are to the input segment and which of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric, and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0], ["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0], ["window","window",C,0], [".",".",C,0]

we  want  to     have  a  table  by   the  window  .
|   D     S      S     |  |      S    |    |       |
we  -     would  like  a  table  near the  window  .

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could use the TER score directly for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift; however, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results; these edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score, TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
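The preference for cheap deletions can be sketched as a standard edit-distance computation with asymmetric operation weights. This is our own illustration, not the tool's implementation: the actual system works from tercom alignments, shift operations are omitted here, and the concrete weight values are assumptions.

```python
def weighted_edit_cost(tm_words, test_words, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Cost of turning a TM segment into the test segment, with a very
    low deletion weight (shifts, which TER also allows, are omitted)."""
    m, n = len(tm_words), len(test_words)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm_words[i - 1] == test_words[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,      # delete a TM word
                          d[i][j - 1] + w_ins,      # insert a test word
                          d[i - 1][j - 1] + sub)    # match / substitute
    return d[m][n]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
# with cheap deletions TM segment 1 wins; with equal weights plain TER
# would prefer TM segment 2
assert weighted_edit_cost(tm1, test) < weighted_edit_cost(tm2, test)
assert weighted_edit_cost(tm1, test, w_del=1.0) > weighted_edit_cost(tm2, test, w_del=1.0)
```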

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments, and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red in the selected source sentences and their corresponding translations.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
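A minimal sketch of this coloring step, assuming the matched word positions have already been derived from the two alignment sets. The HTML `span`-based rendering and the `color_code` helper are illustrative assumptions, not CATaLog's actual implementation:

```python
# Sketch of the color-coding step. `matched_positions` stands for the
# word indices linked (via the GIZA++ word alignment and the TER
# alignment) to words that match the input sentence. The HTML rendering
# below is an assumption for illustration only.
def color_code(words, matched_positions):
    out = []
    for i, word in enumerate(words):
        color = "green" if i in matched_positions else "red"
        out.append('<span style="color:%s">%s</span>' % (color, word))
    return " ".join(out)

tm_target = "you gave me the wrong change".split()
matched = {0, 1, 2, 4}        # "you", "gave", "me", "wrong" are matched
html = color_code(tm_target, matched)
```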

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me


5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval based on an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words: we want to store the words in their surface form, so that only if they appear in exactly that form in an input sentence do we consider them a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
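The posting-list filtering could look roughly like this; the stopword list and helper names are assumptions for the sketch:

```python
# Minimal inverted-index sketch for candidate filtering: only TM
# sentences sharing at least one (lowercased, non-stopword) surface
# form with the input are scored. The stopword list is illustrative.
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "to", "of", "is", "it"}

def build_posting_lists(tm_sentences):
    postings = defaultdict(set)
    for idx, sent in enumerate(tm_sentences):
        for word in sent.lower().split():
            if word not in STOPWORDS:
                postings[word].add(idx)   # no stemming: surface forms only
    return postings

def candidates(query, postings):
    ids = set()
    for word in query.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["you gave me the wrong change",
      "i think you 've got the wrong number",
      "we want a table near the window"]
postings = build_posting_lists(tm)
# Only the first two sentences share content words with the query.
assert candidates("you gave me wrong number", postings) == {0, 1}
```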

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, the popular keystroke logging among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semicolon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces

                 Segments   EN       ES
Project TM       6776       56973    66297
  new segments   123270
Product TM       7760       71034    83125
  new segments   154159
new Client TM    74041      662714   769705
  new segments   1432419

Table 3: Size of the new TMs generated using our approach.

used. As can be observed, some types are more productive than others. When using the smaller TMs (project and product), the most prolific segment generator category was the one which extracted text surrounded by tags. However, when using the whole client TM, the text between parentheses was more prolific.

Type             TM Proj   TM Prod   TM client
Ellipsis         7         6         50
Colon            1801      2094      17894
Parenthesis      1361      1621      29637
Square brackets  78        73        598
Curly braces     0         0         0
Quotations       1085      1146      8797
Tags             2523      2892      17792

Table 4: Number of newly generated segments per type and TM used.
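The extraction types above suggest a simple pattern-based generator. The regular expressions below are assumptions (the paper does not give its exact patterns); they sketch how fragments between parentheses, quotation marks and tags could be pulled out of a segment. Applying the same extraction to the aligned target segment would yield the paired translation:

```python
import re

# Pattern-based fragment extractor (illustrative; the paper's script
# uses punctuation marks and tags, but its exact patterns are not given).
PATTERNS = [
    re.compile(r"\(([^()]+)\)"),             # text between parentheses
    re.compile(r'"([^"]+)"'),                # text between quotation marks
    re.compile(r"<[^>]+>([^<]+)</[^>]+>"),   # text surrounded by tags
]

def extract_fragments(segment):
    """Return candidate sub-segments found inside the delimiters."""
    fragments = []
    for pattern in PATTERNS:
        fragments.extend(match.strip() for match in pattern.findall(segment))
    return fragments

src = "Click <b>Save</b> to keep your changes (recommended)"
assert extract_fragments(src) == ["recommended", "Save"]
```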

We then tested how many segments would be retrieved using our newly created TMs, both alone and in combination with the TMs we previously had. MemoQ offers the possibility of activating and deactivating different TMs when preparing a file for translation. We thus prepared the project file using 11 combinations to assess which combination performed better, as well as whether the new TMs were having any impact on the project. These 11 scenarios were the following:

1. TM1: The project TM as provided by the client.

2. TM2: Only the new segments generated from the project TM provided by the client.

3. TM3: A combination of the project TM and the new segments retrieved from it.

4. TM4: Only the new segments generated from the product TM.

5. TM5: The project TM and the new segments generated from the product TM.

6. TM6: The project TM combined with the new segment TMs generated from the project TM and the ones from the product TM.

7. TM7: Only the new segments generated from the client TM.

8. TM8: The project TM combined with the new segments generated from the client TM.

9. TM9: The project TM combined with the new TMs generated from the client TM and the ones from the project TM.

10. TM10: The project TM combined with the new segments generated from the client TM and the ones from the product TM.

11. TM11: The project TM combined with the new segments generated from the client TM, the ones from the project TM, and the ones from the product TM.

The preparation of a file for translation typically includes both analysing the file and pre-translating it. When the fragment assembly functionality of memoQ is activated, information about how many segments could be translated using fragments is also provided. Tables 5 and 6 summarise the results we obtained for each TM environment when preparing the project for translation.

As can be observed, using the TMs with new segments decreased in all cases the number of segments not started and increased the number of segments translated using fragments (cf. Table 5). It also seems clear that size matters: the bigger the TM with new fragments, the higher the number of segments that benefit from fragments (cf. TM2, TM4 and TM7 in Table 5).

However, this does not hold true for the pre-translation. In all cases in which the new TMs were used in isolation (TM2, TM4 and TM7), the number of pretranslated segments decreases. This may be due to the fact that those TMs only contain fragments of the original segments present in the different TMs used to generate the segments.

When combined with the project TM, the number of pretranslated segments increases (cf. TM3, TM5, TM6 and TM8–TM11). The best overall results are obtained when using the project TM either in combination with both the new TM generated from the project TM and the new TM generated from the client TM (TM9), or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pretranslated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143  38   27   26   26    26
Pre-trans    155  111  178  108  179  183  139  199  205  201   205
Fragments    36   110  97   115  103  99   248  199  194  198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started.

           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%       0    5    5    5    5    5    5    5    5    5     5
95–99%     2    3    3    4    4    4    9    9    9    9     9
85–94%     0    1    1    1    1    1    2    2    2    2     2
75–84%     14   9    14   9    15   15   12   16   16   16    16
50–74%     187  130  198  136  202  202  180  223  226  226   226
No match   142  197  124  190  118  118  137  90   87   87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band.

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches in the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on the TM fuzzy match retrieval, as well as on the generation of translations from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach can be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together, and to pre-translate files implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects, to measure whether translators using this approach translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory: perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory¹ (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in Figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations produced by the first method is high, and the errors are relatively few.

¹ https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories that are part of MyMemory, and shows how we built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning; we would like to mention ApSIC Xbench². ApSIC Xbench implements a series of syntactic checks for the segments: it checks, for example, whether an opened tag is closed, whether a word is repeated, or whether a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g., tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning, as envisioned in Section 1, is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task, where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from ours.

² http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g., the English part of the corpus contains segments in French); the aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield C.C." is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.


It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993): in a parallel corpus, the corresponding segments have roughly the same length.³ To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = \frac{l_s - l_d}{\sqrt{3.4\,(l_s + l_d)}}    (1)
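Equation 1 is straightforward to compute; a minimal sketch:

```python
import math

def church_gale(ls, ld):
    """Modified Church-Gale length difference (Tiedemann, 2011):
    character-length difference of the source (ls) and destination (ld)
    segments, normalised by the expected deviation of that difference."""
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# Equal lengths score 0; a target much shorter than its source yields a
# large positive score, flagging a suspicious (e.g. partial) translation.
assert church_gale(50, 50) == 0.0
assert church_gale(200, 20) > 4.0
```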

In Figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.⁴ The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different dis-tributions then the plots should be different Thisis indeed the case The plot for the Matecat Set isa little bit skewed to the right but close to a normalplot In figure 2 we plot the Church Gale scoreobtained for the bi-segments of the Matecat setadding a normal curve over the histogram to bettervisualize the difference from the gaussian curveFor the Matecat set the Church Gale score variesin the interval minus418 426

The plot for the Collaborative Set has the distri-bution of scores concentrated in the center as canbe seen in 3 In figure 4 we add a normal curve tothe the previous histogram The relative frequencyof the scores away from the center is much lowerthan the scores in the center Therefore to get abetter wiew of the distribution the y axis is reducedto the interval 001 For the Collaborative set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [-131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:
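The tail-biased draw from the Collaborative Set can be sketched as follows (the exact sampling weights are not given in the paper; weighting by |CG score| is our assumption):

```python
import random

def sample_biased(bisegments, scores, k, seed=0):
    """Draw k bi-segments, favoring those whose Church-Gale score
    lies far from the center of the distribution (large |score|)."""
    rng = random.Random(seed)
    weights = [abs(s) + 1e-6 for s in scores]  # small epsilon keeps zero-score items drawable
    return rng.choices(bisegments, weights=weights, k=k)

# The bi-segment with an extreme score dominates the sample:
pairs = ["ok-1", "ok-2", "suspicious"]
sample = sample_biased(pairs, scores=[0.1, -0.2, 50.0], k=10)
```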

• Training Set: It contains 1,243 bi-segments and has 373 negative examples.

• Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.

Figure 5: The Q-Q plot for the Matecat Set.

Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg_score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has_url: The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has_tag: The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has_email: The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has_number: The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.

• has_capital_letters: The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has_words_capital_letters: The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one is activated only when there exist whole words in capital letters.

• punctuation_similarity: The value of this feature is the cosine similarity between the source and destination segment punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag_similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/is not present in the source and the target segments).

• email_similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity. This feature combines with the feature has_email to exhaust all possibilities.

• url_similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.

• number_similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.


• bisegment_similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital_letters_word_difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.

• only_capletters_dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.

• lang_dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
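A hedged sketch of how lang_dif could count the mismatching language codes described above (the paper does not show its implementation):

```python
def lang_dif(declared: tuple, detected: tuple) -> int:
    """Number of segments (source, target) whose detected language
    disagrees with the declared language code."""
    return sum(1 for want, got in zip(declared, detected) if want != got)

print(lang_dif(("en", "it"), ("en", "it")))  # 0  (en-en, it-it)
print(lang_dif(("en", "it"), ("en", "fr")))  # 1  (en-en, it-fr)
print(lang_dif(("en", "it"), ("de", "fr")))  # 2  (en-de, it-fr)
```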

All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
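The cosine over punctuation vectors used by punctuation_similarity can be sketched as follows (the punctuation inventory here is our assumption; the paper does not list one):

```python
import math
import string

def punctuation_vector(text: str) -> list:
    # One count per punctuation mark; the inventory is an assumption.
    return [text.count(ch) for ch in string.punctuation]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

src = "Hello, world! How are you?"
tgt = "Ciao, mondo! Come stai?"
# Identical punctuation (comma, exclamation mark, question mark) gives similarity 1.0:
print(round(cosine(punctuation_vector(src), punctuation_vector(tgt)), 2))  # 1.0
```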

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.

• Logistic Regression: Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support vector machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
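A minimal sketch of running several of these scikit-learn classifiers over a feature matrix of the kind described in section 4.1 (the data here is synthetic; only the feature semantics are borrowed from the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy feature matrix with three of the paper's features per bi-segment
# (cg_score, lang_dif, bisegment_similarity), already normalized to [0, 1].
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 2] > 0.5).astype(int)  # toy labels driven by bisegment_similarity

for clf in (RandomForestClassifier(random_state=0),
            LogisticRegression(),
            LinearSVC()):
    scores = cross_val_score(clf, X, y, cv=3, scoring="f1")
    print(type(clf).__name__, round(scores.mean(), 2))
```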

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions that respect the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, will be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
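The "around 10%" figure follows directly from the confusion counts (a quick verification sketch):

```python
tp, fp, fn, tn = 175, 42, 32, 60  # SVM confusion counts on the test set
total = tp + fp + fn + tn         # 309 test examples
discarded = fn / total            # good bi-segments that would be wrongly thrown away
print(round(discarded, 3))        # 0.104
```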

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar. The distribution is closer to the normal distribution for the Matecat Set and more sparse for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the


performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments. For example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
rmitkov@wlv.ac.uk

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study, we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore, clause matches are more likely to be in context and to actually be used by the translator than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20

(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
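The arithmetic behind the w1...w20 example can be checked with a simple word-level Levenshtein fuzzy score (a sketch; commercial TM tools use their own, typically character-level, scoring):

```python
def levenshtein(a, b):
    """Edit distance between two token sequences (single-row DP)."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[n]

def fuzzy_score(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b))

sentence_a = [f"w{i}" for i in range(1, 21)]                    # input sentence (a)
sentence_b = sentence_a[:5] + [f"w{i}" for i in range(21, 36)]  # TM sentence (b)
clause_a, clause_b = sentence_a[:5], sentence_b[:5]

print(fuzzy_score(sentence_a, sentence_b))  # 0.25: below typical fuzzy thresholds
print(fuzzy_score(clause_a, clause_b))      # 1.0: the shared clause matches exactly
```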

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contain more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:

Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war

Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                       W/o clause splitting  W/ clause splitting
Retrieved (Default threshold)  23.33%                90.00%
Retrieved (Minimum threshold)  38.00%                92.22%

TRADOS                         W/o clause splitting  W/ clause splitting
Retrieved (Default threshold)  14.00%                88.89%
Retrieved (Minimum threshold)  14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST                       W/o clause splitting  W/ clause splitting
Retrieved (Default threshold)  2.67%                 17.84%
Retrieved (Minimum threshold)  10.00%                41.08%

Table 4: Percentage of correctly retrieved segments in Set B (Wordfast)

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment with more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause, and is thus split by the clause splitter, we take the average match score of the clauses and count this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
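The computation is straightforward to reproduce. Below is our own sketch (with made-up scores, not the paper's data) that averages the clause scores of a split segment into a single case and computes the paired t statistic, mirroring what the authors report doing in SPSS.

```python
import math
import statistics

def segment_score(clause_scores):
    # A segment split into several clauses contributes one averaged score.
    return sum(clause_scores) / len(clause_scores)

def paired_t(scores_without, scores_with):
    # Paired t statistic: mean of the per-case differences over its standard error.
    diffs = [w - wo for wo, w in zip(scores_without, scores_with)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

without_split = [19.0, 29.0, 12.0]                      # hypothetical match scores
with_split = [segment_score([84.0, 86.0]), 86.0, 84.0]  # first case is a split segment
t = paired_t(without_split, with_split)                 # large t -> significant
```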

SET A
WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   19.23                       85.30888891                0.000
Minimum threshold   29.28                       86.48888891                0.000
TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   11.78                       83.87777781                0.000
Minimum threshold   11.78                       88.07111113                0.000

SET B
WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.23                        17.21111111                0.000
Minimum threshold   6.49                        30.62111112                0.000
TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.71                        23.083                     0.000
Minimum threshold   16.83                       46.36777779                0.000

Table 5: Paired t-test on all experiments

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   2.67                   17.84
Retrieved (minimum threshold)   10.00                  41.08

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)   3.33                   25.95
Retrieved (minimum threshold)   36.00                  70.67

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments were not imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting solely on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which prevents the retrieval of matches that are similar in meaning but different in wording. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, based on the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). In concept, this technique can be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only have an effect on improving the raw MT output, not on improving the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach relies on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, without counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
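Since the commercial algorithms are undisclosed, the following is only one common formulation (our own sketch in the spirit of Koehn's formula, not any vendor's implementation): the fuzzy score is one minus the word edit distance divided by the length of the longer segment.

```python
def edit_distance(a, b):
    # Classic Levenshtein dynamic programming over token lists.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(s1, s2):
    t1, t2 = s1.lower().split(), s2.lower().split()
    return 1.0 - edit_distance(t1, t2) / max(len(t1), len(t2))

# Sentences (2) and (4) differ in one word out of seven.
score = fuzzy_match("Press Cancel to cancel your personal information",
                    "Press Cancel to cancel your booking information")
```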

In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score, and then post-edited; otherwise the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains its full-verb counterpart. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed; apart from some exceptional cases, they appear either without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent (sentence (3)) are included in the TM. Likewise, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008) machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
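The entries follow a compact `lemma,CATEGORY+property...` pattern. A toy parser (our own, covering only the flat entries of Figure 1, not NooJ's full dictionary syntax) shows how the fields decompose:

```python
def parse_entry(entry):
    # Split "artist,N+FLX=TABLE+Hum" into lemma, category and properties.
    lemma, rest = entry.split(",", 1)
    fields = rest.split("+")
    info = {"lemma": lemma, "category": fields[0]}
    for prop in fields[1:]:
        key, _, value = prop.partition("=")
        info[key] = value if value else True   # valueless properties become flags
    return info

entry = parse_entry("artist,N+FLX=TABLE+Hum")
```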

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed by a determiner, optionally an adjective or adverb, a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVCs, where applicable.
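As a rough approximation of what such a grammar does (plain-regex toy code of ours, not NooJ; the two-entry SVC lexicon is a hypothetical stand-in for NooJ's dictionaries, and the tense/person agreement that the real grammar preserves is ignored): a support verb plus determiner plus predicate noun, with an optional preposition, is rewritten as the corresponding full verb.

```python
import re

# Hypothetical mapping from predicate nouns to their full-verb derivers.
SVC_LEXICON = {"cancellation": "cancel", "installation": "install"}

# Support verb + determiner + predicate noun, with an optional preposition.
PATTERN = re.compile(r"\b(?:make|makes)\s+(?:a|an|the)\s+(\w+)(?:\s+of)?\b",
                     re.IGNORECASE)

def paraphrase_svc(sentence):
    def repl(match):
        noun = match.group(1).lower()
        # Rewrite only when the noun is a known nominalization.
        return SVC_LEXICON.get(noun, match.group(0))
    return PATTERN.sub(repl, sentence)

out = paraphrase_svc("Press Cancel to make the cancellation of "
                     "your personal information")
```

Applied to sentence (1), this toy rule yields exactly sentence (2), which is the effect the NooJ grammar achieves with far broader coverage.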

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
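The overall expansion can be sketched as follows (our own simplification; `paraphrase` stands in for the whole tokenize → NooJ → de-tokenize pipeline): paraphrased source units are appended with their original human-translated targets untouched.

```python
def expand_tm(tm_units, paraphrase):
    # Keep all original units; add a new unit for each paraphrasable source,
    # reusing the human-translated target side unchanged.
    expanded = list(tm_units)
    for source, target in tm_units:
        new_source = paraphrase(source)
        if new_source != source:
            expanded.append((new_source, target))
    return expanded

tm = [("Press Cancel to make the cancellation of your data",
       "Premere Cancel per cancellare i propri dati"),
      ("Close the window", "Chiudere la finestra")]
toy_paraphrase = lambda s: s.replace("make the cancellation of", "cancel")
expanded = expand_tm(tm, toy_paraphrase)   # 2 original + 1 paraphrased unit
```

At the scale reported in Section 5, a 1025-unit TM with 587 recognized SVCs grows to 1612 units in exactly this way.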

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). The test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations taken from a larger TM based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014 [1].

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84% and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.

[1] http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No match                 62      35
Total                   200     200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no machine errors will appear and less post-editing is required when the match is not 100%.

Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION You must make the istallation of the version 6 of the software
TMsl:    CAUTION You must install the version 6 of the software
TMtg:    ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs, and support for other languages as well. Last but not least, a paraphrasing framework applied to the target sentence may improve the quality of translations even further.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The benefit of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available, DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009) to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly into savings for translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


Figure 1: An example of generated subsegments (consistent phrases):

CZ: být větší | EN: be greater | probabilities: 0.538 0.053 0.538 0.136 | alignment: 0-0 1-1
CZ: být větší | EN: be larger  | probabilities: 0.170 0.054 0.019 0.148 | alignment: 0-0 1-1

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability, and the direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in cases where there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
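One simple way to pick among alternative translations is to rank them by the product of the four phrase-table scores. The paper does not specify the exact selection rule, so the product criterion below is an illustrative assumption, using the values from Figure 1:

```python
# Sketch: choosing among alternative subsegment translations using the four
# Moses phrase-table scores. Ranking by their product is an assumption for
# illustration; the paper does not state the authors' exact selection rule.
from math import prod

# translation -> [inverse phrase prob., inverse lexical weighting,
#                 direct phrase prob., direct lexical weighting]
candidates = {
    "be greater": [0.538, 0.053, 0.538, 0.136],  # values from Figure 1
    "be larger":  [0.170, 0.054, 0.019, 0.148],
}

def best_translation(cands):
    # Rank alternatives by the product of their four probabilities.
    return max(cands, key=lambda t: prod(cands[t]))

print(best_translation(candidates))  # "be greater"
```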

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
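The three alignment phenomena above can be detected directly from the alignment-point string. A minimal sketch (function names are illustrative, not from the paper):

```python
# Sketch: reading Moses-style alignment points (e.g. "0-0 1-1") and checking
# for the three phenomena listed above. Function names are illustrative.

def parse_alignment(points):
    # "0-0 1-1" -> [(0, 0), (1, 1)]
    return [tuple(map(int, p.split("-"))) for p in points.split()]

def analyse(pairs, src_len):
    src_of = {}
    for s, t in pairs:
        src_of.setdefault(s, []).append(t)
    return {
        # 1) empty alignment: a source word with no aligned target word
        "empty": [i for i in range(src_len) if i not in src_of],
        # 2) one-to-many: a source word aligned to several target words
        "one_to_many": [s for s, ts in src_of.items() if len(ts) > 1],
        # 3) opposite orientation: alignment links that cross each other
        "reordered": any(s1 < s2 and t1 > t2
                         for s1, t1 in pairs for s2, t2 in pairs),
    }

print(analyse(parse_alignment("0-0 1-1"), 2))  # monotone, fully aligned
print(analyse(parse_alignment("0-1 1-0"), 2))  # opposite orientation
```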

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords
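The two JOIN variants can be sketched as a relation check on subsegment occurrences, represented here (as an assumption) by (start, end) token spans in the input segment:

```python
# Sketch: deciding whether two subsegment occurrences in a tokenized input
# segment overlap (JOIN-O) or are immediate neighbours (JOIN-N). Spans are
# (start, end) token indexes with end exclusive; this encoding is assumed
# for illustration, not taken from the paper.

def join_relation(a, b):
    (s1, e1), (s2, e2) = sorted([a, b])
    if s2 < e1:
        return "overlap"      # candidate for the JOIN-O method -> OJ
    if s2 == e1:
        return "neighbour"    # candidate for the JOIN-N method -> NJ
    return None               # a gap between the spans: JOIN does not apply

# e.g. "we would like a table": "we would like" and "like a table" overlap,
# "we would like" and "a table" are neighbours
print(join_relation((0, 3), (2, 5)))  # overlap
print(join_relation((0, 3), (3, 5)))  # neighbour
```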


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
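The combined n-gram fluency score can be illustrated with a toy model. The paper uses a KenLM model trained on enTenTen; here a tiny in-memory corpus stands in for the language model, so the scoring function and its numbers are purely illustrative:

```python
# Toy illustration of a combined n-gram fluency score for n = 1..5. This is
# NOT the KenLM model used in the paper: the "model" here is just n-gram
# counts over a two-sentence corpus, and the score is the fraction of the
# candidate's n-grams that were seen in the data.
from collections import Counter

corpus = ["a table by the window", "a table near the door"]
counts = Counter()
for sent in corpus:
    toks = sent.split()
    for n in range(1, 6):
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1

def fluency(sentence):
    toks = sentence.split()
    ngrams = [tuple(toks[i:i + n]) for n in range(1, 6)
              for i in range(len(toks) - n + 1)]
    return sum(1 for g in ngrams if g in counts) / len(ngrams)

print(fluency("a table by the window"))   # 1.0: every n-gram was seen
print(fluency("window the by table a"))   # lower: only unigrams match
```

A fluent combination of subsegments scores higher than a disfluent reordering of the same words, which is exactly the property the method relies on when ranking candidate combinations.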

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
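The greedy loop above can be sketched on index spans alone. This is a simplified reading of the algorithm, assuming (start, end) token spans and ignoring the translation-side checks and fluency filtering that the full method applies:

```python
# Simplified sketch of the greedy JOIN loop: subsegments are (start, end)
# token spans kept in a list I ordered by length (descending); joinable
# spans (overlapping or adjacent) are merged, the merged results are
# collected in a temporary list T and prepended to I for the next round.
# The real method additionally checks the translation side and fluency.

def join_all(spans):
    # list I, ordered by subsegment length, descending
    I = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    while True:
        head, rest = I[0], I[1:]
        T = []
        for s in rest:
            (a1, b1), (a2, b2) = sorted([head, s])
            joined = (a1, max(b1, b2))
            if a2 <= b1 and joined != head:   # joinable and strictly longer
                T.append(joined)
        if not T:
            return head                        # nothing left to join
        # prepend the new, longer subsegments and continue with the longest
        I = sorted(T, key=lambda s: s[1] - s[0], reverse=True) + rest

# "we would like" (0,3), "like a table" (2,5), "by the window" (5,8)
print(join_all([(0, 3), (2, 5), (5, 8)]))  # (0, 8): the whole segment
```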

2 In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the benefit of the particular methods with regard to translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Table 6: Error example, Czech → English

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1 5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviors on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0], ["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0], ["window","window",C,0], [".",".",C,0]

we   want   to      have   a   table   near   the   window   .
|    D      S       S      |   |       S      |     |        |
we   -      would   like   a   table   by     the   window   .
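The alignment above can be reproduced with a standard word-level edit-distance computation. The sketch below covers insertions, deletions and substitutions only; TER's block shifts are omitted, so this is a simplification of what tercom actually computes:

```python
# Sketch: word-level edit alignment in the spirit of tercom (insertions,
# deletions, substitutions only; TER's block-shift operation is omitted).

def edit_ops(hyp, ref):
    # hyp = TM match tokens, ref = input tokens; returns (cost, op string)
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i-1][j-1] + (hyp[i-1] != ref[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    # backtrace: C = match, S = substitution, D = deletion, I = insertion
    i, j, ops = m, n, []
    while i or j:
        if i and j and d[i][j] == d[i-1][j-1] + (hyp[i-1] != ref[j-1]):
            ops.append("C" if hyp[i-1] == ref[j-1] else "S")
            i, j = i - 1, j - 1
        elif i and d[i][j] == d[i-1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return d[m][n], "".join(reversed(ops))

tm = "we want to have a table near the window .".split()
inp = "we would like a table by the window .".split()
cost, ops = edit_ops(tm, inp)
print(cost, ops)  # 4 edits: one deletion and three substitutions
```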

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results, and these edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
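The weighted re-scoring above can be sketched directly on the TER operation string. The paper does not give the exact weight values, so the low deletion cost below is an illustrative assumption:

```python
# Sketch: re-scoring TM matches with per-operation edit weights. The exact
# weights are not given in the paper; the low deletion cost used here is an
# illustrative assumption.

WEIGHTS = {"C": 0.0,            # match: free
           "S": 1.0, "I": 1.0,  # substitution and insertion: full cost
           "H": 1.0,            # shift: full cost
           "D": 0.1}            # deletion: very cheap

def weighted_score(ops):
    # ops: TER-style operation string for a TM segment vs. the test segment
    return sum(WEIGHTS[o] for o in ops)

# test segment: "how much does it cost"
seg1_ops = "CCCCC" + "D" * 6   # TM segment 1: 5 matches, delete 6 extra words
seg2_ops = "CC" + "I" * 3      # TM segment 2: 2 matches, insert 3 words

# segment 1 gets the lower (better) score despite needing more edits
print(weighted_score(seg1_ops), weighted_score(seg2_ops))
```

With uniform weights, segment 2 would score 3 against segment 1's 6; with a cheap deletion, segment 1 wins, matching the preference argued for in the text.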

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

bull Bengali আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red in the selected source sentences and their corresponding translations.
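Propagating the two alignments into a display color can be sketched as follows; this is a simplified illustration, and the function and data layout are our own rather than CATaLog's internals:

```python
def color_target(n_target, matched_src, src2tgt):
    """Color target words green when aligned to a matched source word.

    matched_src: set of source-word indices that the TER alignment marked
                 as matching the input sentence.
    src2tgt:     dict mapping a source-word index to the list of
                 target-word indices it is aligned to (e.g. parsed from
                 a GIZA++ alignment file).
    """
    colors = ["red"] * n_target          # unmatched by default
    for s, targets in src2tgt.items():
        if s in matched_src:
            for t in targets:
                colors[t] = "green"      # covered by a matched source word
    return colors

# Toy example with an assumed 4-word target sentence: source words
# 0 and 1 match the input sentence, source word 2 does not.
alignment = {0: [0], 1: [2, 3], 2: [1]}
print(color_target(4, {0, 1}, alignment))
# -> ['green', 'red', 'green', 'green']
```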

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you've got the wrong number

3 you are wrong

4 you pay me


5 you're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one

or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
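A minimal sketch of the posting-list filter (the stop-word list and function names below are illustrative):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of"}  # illustrative stop-word list

def build_index(tm_sources):
    """Map each vocabulary word to the ids of TM sentences containing it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(index, query):
    """TM sentences sharing at least one vocabulary word with the query."""
    ids = set()
    for word in query.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["how much does it cost", "you are wrong", "near the window"]
index = build_index(tm)
print(sorted(candidates(index, "you gave me the wrong number")))
# -> [1]
```

Only the candidate set, rather than the whole TM, then needs the expensive TER alignment step.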

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36



             TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
Not started  234  204  150  202  143  143   38   27   26    26    26
Pre-trans    155  111  178  108  179  183  139  199  205   201   205
Fragments     36  110   97  115  103   99  248  199  194   198   194

Table 5: Overview of the number of segments pre-translated, translated using fragments, or not started

           TM1  TM2  TM3  TM4  TM5  TM6  TM7  TM8  TM9  TM10  TM11
100%         0    5    5    5    5    5    5    5    5     5     5
95–99%       2    3    3    4    4    4    9    9    9     9     9
85–94%       0    1    1    1    1    1    2    2    2     2     2
75–84%      14    9   14    9   15   15   12   16   16    16    16
50–74%     187  130  198  136  202  202  180  223  226   226   226
No match   142  197  124  190  118  118  137   90   87    87    87

Table 6: Overview of the number of segments retrieved using the different TMs, classified by fuzzy band

erated from the client's TM (TM9) or combining the three new TMs (TM11). When using the new TM generated from the client TM together with the new product TM (TM10), the number of pre-translated segments decreases slightly (205 → 201), while the number of segments translated using fragments increases slightly (194 → 198).

If we now look at the results obtained in terms of fuzzy matches (cf. Table 6), the same tendencies can be observed. From the very beginning, the number of 100% matches increases from 0 to 5 when using the new TM generated from the project TM. Similarly, the number of matches for the 95–99% fuzzy band also increases (from 2 segments for TM1 up to 9 segments for TM7–TM11), and the 85–94% fuzzy band retrieves a new segment when using the new TM generated from the client TM. The greatest increase in fuzzy matches is experienced by the 50–74% fuzzy band (from 142 segments in TM1 up to 226 in TM9–TM11). Although this band is usually discarded in translation projects, as no productivity gains are achieved, an analysis of the new segments retrieved is needed. This would give us potential hints as to what to improve in our new segment generator so that the fuzzy scores are higher. At the same time, it could be the case that our segments are reusable although the rest of the sentence is not. If this were proven true, a productivity increase might be observed in this band.

In general, it seems that the generation of new TMs using fragments of previously existing segments has a positive effect on TM fuzzy match retrieval, as well as on the generation of translations

from fragments that memoQ offers. These positive results indicate that working further on this approach may improve the fuzzy matches and thus enhance the productivity of our translators.

6 Conclusion and future work

In this paper we have explored a new way of generating segments out of previously existing segments. Although we use a naive approach and only make use of punctuation marks and tags to generate such segments, positive results have been obtained in our pilot test. Moreover, as we do not use any type of linguistic information, our approach could be considered language independent.

We are currently working on improving the script that extracts the fragments of segments and generates new ones. This is being done by also analysing in more detail the segments currently retrieved and the segments that could additionally be retrieved. The next step will be to generate yet newer segments by combining the retrieved fragments together and pre-translating files, implementing our own fragment assembly approach prior to translation. Once this has been done, we will test the final result of our TM population and pre-processing in real projects to measure whether, by using this approach, translators do translate faster than translating from scratch. This final test will additionally serve as a quality evaluation of the newly produced segments, as we will be able to compare them with the final output produced by translators.

We have also envisioned the combination of already existing methods for retrieving a higher


number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases, the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory: perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya (Turkey), May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

1 https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory

However, the second method, and especially the third one, produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill which interest the research community: the accuracy of retrieval and translation memory cleaning. While for improving the accuracy of retrieved segments there is a fair amount of work (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if the opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that the translation memory bi-segments contain accidental errors (e.g., tags not closed) or that the translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in this approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kinds of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments is/are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you" translates in Italian as "Come stai". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g., the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The errors Random Text and Chat take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.


It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993). The idea is that in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (ls − ld) / √(3.4 (ls + ld))    (1)
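Equation 1 translates directly into code; the sketch below takes segment length in characters, which is one common choice (the paper does not spell out the length unit):

```python
import math

def church_gale(source, target):
    """Modified Church-Gale length-difference score (Tiedemann, 2011).

    ls and ld are the lengths of the source and destination segments;
    true translations should score close to zero.
    """
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# A plausible translation pair scores near 0 ...
print(church_gale("How are you?", "Come stai?"))
# ... while a chat answer instead of a translation scores further away.
print(church_gale("How are you?", "Bene"))
```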

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right, but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set, the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: It contains 1243 bi-segments and has 373 negative examples.

• Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.

Figure 5: The Q-Q plot for the Matecat Set.

Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g., when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg_score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has_url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has_tag: The value of the feature is 1 if the source or target segment contains a tag; otherwise it is 0.

• has_email: The value of the feature is 1 if the source or target segment contains an email address; otherwise it is 0.

• has_number: The value of the feature is 1 if the source or target segment contains a number; otherwise it is 0.

• has_capital_letters: The value of the feature is 1 if the source or target segment contains words that have at least one capital letter; otherwise it is 0.

• has_words_capital_letters: The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation_similarity: The value of this feature is the cosine similarity between the source and destination segment punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag_similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g., the tag exists/does not exist, and if it exists, it is present/is not present in both the source and the target segments).

• email_similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity. This feature combines with the feature has_email to exhaust all possibilities.

• url_similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.

• number_similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.


• bisegment_similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital_letters_word_difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.

• only_capletters_dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.

• lang_dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
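As an illustration, the vector-based features above can be sketched in plain Python. The punctuation inventory, the helper names, and the exact mismatch counting in lang_dif are our own assumptions, not the authors' implementation:

```python
from collections import Counter
import math

PUNCT = list(".,;:!?()[]\"'")

def punct_vector(segment):
    """Count occurrences of each punctuation mark in the segment."""
    counts = Counter(ch for ch in segment if ch in PUNCT)
    return [counts[p] for p in PUNCT]

def cosine(u, v):
    """Cosine similarity of two count vectors; 0.0 if either is empty."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def punctuation_similarity(src, tgt):
    return cosine(punct_vector(src), punct_vector(tgt))

def lang_dif(declared, detected):
    """Number of mismatches between declared and detected language codes."""
    return sum(1 for d, a in zip(declared, detected) if d != a)

print(punctuation_similarity("Hello, world!", "Ciao, mondo!"))  # 1.0
print(lang_dif(("en", "it"), ("de", "fr")))                     # 2
```

The same cosine helper would serve the tag, email, URL and number vectors described above.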

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the rules they infer are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.

• Logistic Regression: Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines (with the linear kernel): Support vector machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
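The six classifiers above map directly onto scikit-learn estimators. A minimal sketch follows; default hyperparameters are assumed throughout, since the paper does not report its settings:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# The six classifiers evaluated in the paper, with default settings.
classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

def train_all(X, y):
    """Fit every classifier on the feature matrix X and labels y."""
    return {name: clf.fit(X, y) for name, clf in classifiers.items()}
```

In the paper's setting, X would be the 16 normalized feature values per bi-segment and y the positive/negative label.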

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline is called Baseline Uniform and generates predictions randomly. The second baseline is called Baseline Stratified and generates predictions by respecting the training set class distribution. The results of the first evaluation are given in Table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
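The two baselines correspond to scikit-learn's DummyClassifier strategies, so the evaluation setup can be sketched as follows. The data here is synthetic, since the real feature matrix is not available; only the setup mirrors the paper:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the training set (~30% negative examples).
rng = np.random.RandomState(0)
X = rng.rand(120, 4)
y = (rng.rand(120) < 0.7).astype(int)
X[y == 1] += 1.0  # shift positives so the classes are separable

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {}
for name, clf in [
    ("SVM", SVC(kernel="linear")),
    ("Baseline Uniform", DummyClassifier(strategy="uniform", random_state=0)),
    ("Baseline Stratified", DummyClassifier(strategy="stratified", random_state=0)),
]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```

On separable data the real classifier clears both dummy baselines by a wide margin, mirroring the pattern in Table 1.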

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in Table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
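These conclusions can be checked directly from the reported confusion counts:

```python
# Confusion counts reported for the SVM on the 309-example test set.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)                           # ~0.81
recall = tp / (tp + fn)                              # ~0.85
f1 = 2 * precision * recall / (precision + recall)   # ~0.83, "around 0.8"

# Share of the whole test set wrongly discarded (false negatives).
discarded = fn / (tp + fp + fn + tn)                 # ~0.10, i.e. around 10%

print(precision, recall, f1, discarded)
```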

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat Set and more sparse for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
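The suggested confidence-based filtering can be sketched with Logistic Regression probabilities. The data and the 0.9 cut-off below are illustrative assumptions; the paper only states the goal of a high-precision subset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic bi-segment features with a clean decision boundary.
rng = np.random.RandomState(1)
X = rng.rand(200, 3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Act only on bi-segments the model is confident about;
# leave the uncertain middle band for manual review.
CONFIDENT = 0.9
keep = proba >= CONFIDENT           # confidently positive: keep
discard = proba <= 1 - CONFIDENT    # confidently negative: discard
undecided = ~(keep | discard)
print(keep.sum(), discard.sum(), undecided.sum())
```

Raising the cut-off shrinks the acted-upon subset but raises its precision, which is exactly the trade-off proposed above.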

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
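A minimal sketch of that dictionary-based alternative: gloss the source word by word with an indexed bilingual dictionary and measure overlap with the target segment. The toy dictionary and the overlap measure are our own assumptions:

```python
# Toy English-Italian dictionary; in practice it would be
# extracted from aligned bi-segments with a word aligner.
en_it = {"the": "il", "dog": "cane", "eats": "mangia", "bread": "pane"}

def gloss(source):
    """Word-by-word gloss of the source segment (unknown words kept)."""
    return [en_it.get(w, w) for w in source.lower().split()]

def bisegment_similarity(source, target):
    """Jaccard overlap between the glossed source and the target segment."""
    glossed = set(gloss(source))
    tgt = set(target.lower().split())
    return len(glossed & tgt) / max(len(glossed | tgt), 1)

print(bisegment_similarity("The dog eats bread", "il cane mangia pane"))  # 1.0
```

A hash-indexed dictionary lookup is O(1) per word, which is what makes this approach scale to the full database where a per-segment MT call does not.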

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, Sept. 2015

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20

(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
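A word-level Levenshtein similarity of the kind commonly used for fuzzy matching reproduces this example: the full sentences score far below typical thresholds, while the shared clauses alone score 1.0. Commercial tools differ in their exact scoring formulas; this is a standard approximation:

```python
def levenshtein(a, b):
    """Word-level edit distance between two token lists."""
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1, d[j - 1] + 1,
                       prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return d[n]

def match_score(a, b):
    """Fuzzy match score: 1 minus the normalized edit distance."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

sent_a = [f"w{i}" for i in range(1, 21)]                 # input (a): w1..w20
sent_b = sent_a[:5] + [f"w{i}" for i in range(21, 36)]   # TM sentence (b)

print(match_score(sent_a, sent_b))       # 0.25, below a 70-75% threshold
print(match_score(sent_a[:5], sent_b[:5]))  # clause vs clause: 1.0
```

Splitting both sides into clauses turns a hopeless 25% whole-sentence match into an exact clause-level match.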

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version, only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and the TM stores the component clauses of the original longer segment; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:

Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    23.33%                 90.00%
Retrieved (Minimum threshold)    38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    14.00%                 88.89%
Retrieved (Minimum threshold)    14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A, all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados, around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
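The same paired t-test can be reproduced outside SPSS, for example with scipy. The match scores below are invented for illustration (0 encodes a missing or incorrect match, as described above):

```python
from scipy import stats

# Illustrative paired match scores per input segment:
# one score without clause splitting, one with.
without_cs = [0, 0, 75, 0, 80, 0, 0, 72, 0, 0]
with_cs = [100, 95, 100, 88, 100, 92, 100, 100, 85, 90]

t, p = stats.ttest_rel(with_cs, without_cs)
print(f"t = {t:.2f}, p = {p:.6f}")
```

With differences this large and consistent, the p-value falls well below conventional significance levels, matching the pattern reported in Table 5.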

SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   19.23                       85.30888891                0.000
Minimum threshold   29.28                       86.48888891                0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   11.78                       83.87777781                0.000
Minimum threshold   11.78                       88.07111113                0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.23                        17.21111111                0.000
Minimum threshold   6.49                        30.62111112                0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.71                        23.083                     0.000
Minimum threshold   16.83                       46.36777779                0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    2.67%                  17.84%
Retrieved (Minimum threshold)    10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    3.33%                  25.95%
Retrieved (Minimum threshold)    36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B


performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side, what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou

School of Italian Language and Literature

Aristotle University of Thessaloniki

University Campus

54124 Thessaloniki Greece

chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance and do not allow the retrieval of matches that are merely similar. In this paper we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique can be very useful to a translator, because all previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the resulting translation is more consistent.

The fuzzy match level reflects the corrections a professional translator must make to the retrieved suggestion so that it meets the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage as well in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent translation of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort of tackling them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses past related work; Section 3 the theoretical background; Section 4 the conceptual background as well as the architecture of the framework; Section 5 details the experimental results; and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse human translations for a given sentence as much as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have used different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work they generate additional segments based on paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed to make the two sentences identical (Hirschberg, 1997).

For instance, the word-based fuzzy match between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based fuzzy match is 76% (14 deletions for 60 characters), without counting whitespace, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
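The edit-distance-based fuzzy match described above can be sketched in a few lines. This is a hedged illustration: commercial tools do not disclose their exact formula, so the normalization below (Levenshtein distance divided by the length of the longer segment) is one common choice. For sentences (1) and (2) it yields 60% at the word level rather than the 70% reported above, whose 13-word denominator suggests a different normalization.

```python
# Word- and character-level fuzzy match scores based on Levenshtein edit
# distance. The normalization (dividing by the longer segment's length)
# is an assumption; the paper's exact formula is not fully specified.

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance over sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(source, candidate, unit="word"):
    """Return a 0-100 fuzzy match percentage at word or character level."""
    a = source.split() if unit == "word" else source.replace(" ", "")
    b = candidate.split() if unit == "word" else candidate.replace(" ", "")
    return 100 * (1 - edit_distance(a, b) / max(len(a), len(b)))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(round(fuzzy_match(s1, s2), 1))  # -> 60.0
```

The word-level distance between (1) and (2) is 4 (substitute "make" with "cancel", delete "the", "cancellation", "of"), which is what makes the pair score poorly despite identical meaning.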

In this case, according to the methodologies proposed by researchers in this field, this sentence will be sent for machine translation, given the low fuzzy match score, and then it should be post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This is due to their syntax: sentence (1) contains an SVC, while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except for some cases, they appear either without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent (sentence (3)) are included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules for paraphrasing the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part of speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
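As an illustration of how such entries can be consumed programmatically, the sketch below parses a NooJ-style dictionary line of the form lemma,CATEGORY+Feature... into a structured record. The comma-separated layout follows the usual NooJ convention, and only the attributes visible in Figure 1 are assumed.

```python
# Hedged sketch: parse a NooJ-style dictionary entry such as
# "artist,N+FLX=TABLE+Hum" into lemma, category and a feature map.
# Flag features (e.g. +Hum) become True; key=value features keep both.

def parse_entry(line):
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    category, features = parts[0], {}
    for feat in parts[1:]:
        if "=" in feat:
            key, value = feat.split("=", 1)
            features[key] = value
        else:
            features[feat] = True
    return {"lemma": lemma, "category": category, "features": features}

print(parse_entry("artist,N+FLX=TABLE+Hum"))
# -> {'lemma': 'artist', 'category': 'N', 'features': {'FLX': 'TABLE', 'Hum': True}}
```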

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipeline stages.

Figure 2: Pipeline of the paraphrase framework

The first stage is the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for de-tokenization in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and there will therefore be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
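A rough approximation of this local grammar's behavior can be given as a toy rewrite rule: a support verb, an optional article, an optional modifier, a predicate noun and an optional preposition are replaced by the noun's full-verb deriver. The tiny noun-to-verb table below is hypothetical, and unlike the NooJ grammars this sketch ignores tense and person agreement.

```python
import re

# Toy approximation of the SVC-paraphrasing local grammar: support verb +
# optional article + optional modifier + predicate noun (+ preposition)
# is rewritten as the noun's full-verb deriver. The lexicon here is a
# hypothetical stand-in for NooJ's large dictionaries.

NOMINALIZATIONS = {          # predicate noun -> full verb
    "cancellation": "cancel",
    "installation": "install",
    "reservation": "reserve",
}
SUPPORT_VERBS = r"(?:make|makes|take|takes|give|gives|have|has)"
MODIFIER = r"(?:\w+\s+)?"    # optional adjective/adverb
NOUNS = "|".join(NOMINALIZATIONS)

PATTERN = re.compile(
    rf"\b{SUPPORT_VERBS}\s+(?:the\s+|a\s+|an\s+)?{MODIFIER}({NOUNS})\b(\s+of)?",
    re.IGNORECASE,
)

def paraphrase_svc(sentence):
    return PATTERN.sub(lambda m: NOMINALIZATIONS[m.group(1).lower()], sentence)

print(paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information"))
# -> Press Cancel to cancel your personal information
```

Sentences with no recognized SVC pass through unchanged, mirroring the "where applicable" behavior described above.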

The last stage consists of de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units. This TM can then be imported and used in all CAT tools in the same way as before. As of now, our approach can only be applied to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.
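The three stages just described can be condensed into a sketch that treats the TM as a plain list of (source, target) pairs rather than a real CAT-tool memory, with the NooJ paraphrasing step stubbed out. The key property, that the target side is never modified, is made explicit.

```python
# Hedged sketch of the three-stage TM expansion pipeline. A real
# implementation would read/write the CAT tool's TM format and call the
# NooJ module; here `paraphrase` is a placeholder for the SVC step.

def expand_tm(tm, paraphrase):
    """Return the original TM plus paraphrased source units.

    Target units are never touched: each paraphrased unit keeps the
    original human translation as its target side.
    """
    expanded = list(tm)                      # originals kept intact
    for source, target in tm:
        new_source = paraphrase(source)      # paraphrase source side only
        if new_source != source:             # append only if an SVC was found
            expanded.append((new_source, target))
    return expanded

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
toy = lambda s: s.replace("make the cancellation of", "cancel")
print(len(expand_tm(tm, toy)))  # -> 2
```

This mirrors the reported numbers: 1,025 original units plus 587 recognized SVC paraphrases yield a 1,612-unit parTM.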

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process, where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹ The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%). However, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and finally 27% in category 0%-74% (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
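A quick way to read Table 1 is to express each category's change as a share of the 200 analyzed segments. This reproduces the +17% reported for the 100% category and gains of roughly 5% and 6% for the next two bands (the figures quoted for the lower bands appear to follow a different calculation, so only the top three are checked here).

```python
# Per-category gain of parTM over plTM from Table 1, expressed as a
# percentage-point share of the 200 analyzed segments.

counts = {               # category: (plTM, parTM)
    "100%":     (14, 48),
    "95-99%":   (23, 32),
    "85-94%":   (18, 29),
    "75-84%":   (51, 38),
    "50-74%":   (32, 18),
    "No match": (62, 35),
}
TOTAL = 200
for cat, (pl, par) in counts.items():
    print(f"{cat:9s} {100 * (par - pl) / TOTAL:+.1f} points")
```

Since every segment falls in exactly one band, the gains and losses across all six categories necessarily sum to zero: segments lost by the low bands are gained by the high ones.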

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.

Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION You must make the istallation of the version 6 of the software
TMsl:    CAUTION You must install the version 6 of the software
TMtg:    ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM across all fuzzy-match ranges, especially the low fuzzy match categories. In addition, translators' satisfaction and trust are much higher compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs, and support for other languages as well. Last but not least, applying the paraphrase framework to the target sentence may improve the quality of the translations even more.

References

Athayde M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici E. and Dymetman M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong M., Cheng Y., Liu Y., Xu J., Sun M., Izuha T. and Hao J. 2014. Query lattice for translation retrieval. In COLING.

Gupta R. and Orasan C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He Y., Ma Y., van Genabith J. and Way A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He Y., Ma Y., Way A. and van Genabith J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász G. and Pohl G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn P. and Senellart J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar V. and Mitkov R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov S., Leon B., Romain T. and Dan K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas E. and Furuse O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott B. and Barreiro A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein M. 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith J. and Clark S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang K., Zong C. and Su K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev V. and van Genabith J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available, DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009) to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated. Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN           probabilities                 alignment
být větší   be greater   0.538 0.053 0.538 0.136      0-0 1-1
být větší   be larger    0.170 0.054 0.019 0.148      0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases

within CAT systems (Planas and Furuse 2000)In (Deacutesilets et al 2008) the authors have

attempted to build translation memories fromWeb since they found that human translators inCanada use Google search results even more of-ten than specialized translation memories That iswhy they developed system WeBiText for extract-ing possible segments and their translations frombilingual web pages

In the study (Nevado et al 2004) the authorsexploited two methods of segmentation of transla-tion memories Their approach starts with a sim-ilar assumption as our subsegment combinationmethods presented below ie that a TM cover-age can be increased by splitting the TM segmentsto smaller parts (subsegments) In both cases thesubsegments are generated via the phrase-basedmachine translation (PBMT) technique (Koehn etal 2003) However our methods do not presentthe subsegments as the results The subsegmentsare used in segment combination methods to ob-tain new larger translational phrases or full seg-ments in the best case

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data are used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability, and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translation in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
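As an illustration, the following sketch parses Moses-style phrase-table entries (mirroring Figure 1) and picks the candidate with the highest combined score. Combining the four probabilities by their product is our simplifying assumption for illustration, not necessarily the selection rule used in the experiments.

```python
# Sketch: pick the best translation for a subsegment from Moses-style
# phrase-table entries.  The four scores are, in order: inverse phrase
# translation probability, inverse lexical weighting, direct phrase
# translation probability and direct lexical weighting.
from math import prod

def parse_entry(line):
    """Parse one 'src ||| tgt ||| p1 p2 p3 p4 ||| alignment' line."""
    src, tgt, scores, align = [f.strip() for f in line.split("|||")]
    return src, tgt, [float(s) for s in scores.split()], align

def best_translation(lines):
    """Choose the candidate with the highest product of the four scores
    (an assumed combination heuristic, for illustration only)."""
    return max((parse_entry(l) for l in lines), key=lambda c: prod(c[2]))

entries = [
    "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1",
    "být větší ||| be larger ||| 0.170 0.054 0.019 0.148 ||| 0-0 1-1",
]
print(best_translation(entries)[1])  # be greater
```

Under this heuristic, "be greater" wins because its score product exceeds that of "be larger".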

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
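A minimal sketch of how these three properties can be read off the alignment points (the function name and interface are illustrative, not from the paper):

```python
# Sketch: derive the three properties mentioned above from Moses-style
# alignment points such as "0-0 1-1" (source index - target index).
def alignment_properties(points, src_len, tgt_len):
    pairs = [tuple(map(int, p.split("-"))) for p in points.split()]
    aligned_src = {s for s, _ in pairs}
    aligned_tgt = {t for _, t in pairs}
    # 1) empty alignment: some source or target word is left unaligned
    has_empty = len(aligned_src) < src_len or len(aligned_tgt) < tgt_len
    # 2) one-to-many: one source word aligned to several target words
    one_to_many = any(sum(1 for s, _ in pairs if s == i) > 1
                      for i in aligned_src)
    # 3) opposite orientation: word order is reversed between the sides
    opposite = any(s1 < s2 and t1 > t2
                   for s1, t1 in pairs for s2, t2 in pairs)
    return has_empty, one_to_many, opposite

print(alignment_properties("0-0 1-1", 2, 2))  # (False, False, False)
print(alignment_properties("0-1 1-0", 2, 2))  # (False, False, True)
```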

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: the joined subsegments are neighbours in a segment from the document; output = NJ.

1http://www.statmt.org/moses?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
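The overlap and neighbour conditions underlying the O and N variants can be sketched on token spans as follows (the span representation and function names are our illustration, not the paper's implementation):

```python
# Sketch: the two combination conditions on token spans (start, end) of
# subsegment occurrences inside a tokenized input segment.
def overlap(a, b):
    """O condition: the two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def neighbour(a, b):
    """N condition: one span starts exactly where the other ends."""
    return a[1] == b[0] or b[1] == a[0]

print(overlap((0, 3), (2, 5)))    # True  - token 2 is shared
print(neighbour((0, 3), (3, 5)))  # True  - adjacent, no gap
print(overlap((0, 3), (3, 5)))    # False - adjacency is not overlap
```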

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2
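The paper scores fluency with a KenLM model trained on enTenTen; the toy stand-in below only illustrates the idea of a combined n-gram score for n = 1..5, using raw corpus counts instead of a real language model (the averaging scheme is our assumption):

```python
# Toy stand-in for the combined n-gram fluency score described above.
from collections import Counter

def ngram_counts(tokens, n_max=5):
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def fluency(candidate, counts, n_max=5):
    """Average, over n = 1..5, of the fraction of the candidate's n-grams
    attested in the corpus (a simple combination, assumed here)."""
    scores = []
    for n in range(1, n_max + 1):
        ngrams = [tuple(candidate[i:i + n])
                  for i in range(len(candidate) - n + 1)]
        if ngrams:
            scores.append(sum(g in counts for g in ngrams) / len(ngrams))
    return sum(scores) / len(scores)

corpus = "can be divided into the following categories".split()
counts = ngram_counts(corpus)
fluent = "divided into the following categories".split()
scrambled = "categories following the into divided".split()
print(fluency(fluent, counts) > fluency(scrambled, counts))  # True
```

A scrambled candidate shares all unigrams with the corpus but almost no higher-order n-grams, so its combined score drops sharply.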

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
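A simplified sketch of this greedy joining on token spans; the bookkeeping with lists I and T is condensed into a single loop that repeatedly merges the longest span with any overlapping or adjacent span:

```python
# Simplified sketch of the greedy JOIN loop: spans are (start, end) token
# positions of subsegment occurrences in the input segment.
def greedy_join(spans):
    spans = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    best = spans[0]                      # start from the longest subsegment
    changed = True
    while changed:
        changed = False
        for other in spans:
            # JOIN condition: overlap or adjacency with the current best span
            if other != best and best[0] <= other[1] and other[0] <= best[1]:
                merged = (min(best[0], other[0]), max(best[1], other[1]))
                if merged != best:       # a new, longer subsegment was built
                    best, changed = merged, True
    return best

# three subsegments covering tokens 0-1, 4-5 and 1-3 of a 6-token segment
print(greedy_join([(0, 2), (4, 6), (1, 4)]))  # (0, 6): the whole segment
```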

2In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company's TM by the same topic as the tested documents. For a comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95–99%   0.84    0.91    0.64    0.90    1.37
85–94%   0.07    0.05    0.25    0.76    0.81
75–84%   0.80    0.91    1.71    3.78    4.40
50–74%   8.16    10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95–99%   0.75    0.44    0.49    0.66
85–94%   0.05    0.08    0.49    0.61
75–84%   0.46    0.96    3.67    3.85
50–74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95–99%   0.98    0.59    1.24    1.45
85–94%   0.09    0.22    1.34    1.37
75–84%   1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1www.matecat.com
2The tool will be released as free, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments against the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4Work in progress.


of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to instantly visualize how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using the tercom-7.25.1 package5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM match: we want to have a table near the window

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0]
["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0]
["window","window",C,0] [".",".",C,0]

we   want  to     have  a  table  near  the  window
|    D     S      S     |  |      S     |    |
we   -     would  like  a  table  by    the  window
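The alignment above can be approximated with a standard Levenshtein trace; this is only a sketch, since real TER additionally allows block shifts, which are omitted here:

```python
# Sketch: Levenshtein alignment with an operation trace (C/S/D/I),
# approximating the TER alignment used above; shifts are not modeled.
def edit_trace(src, tgt):
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    # backtrack to recover the C / S / D / I operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])):
            ops.append("C" if src[i - 1] == tgt[j - 1] else "S")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    # edit rate relative to the first sentence's length, as in the text
    return list(reversed(ops)), d[n][m] / n

tm = "we want to have a table near the window".split()
inp = "we would like a table by the window".split()
ops, rate = edit_trace(tm, inp)
print("".join(ops), rate)
```

For this pair, any minimal trace contains 5 matches, 3 substitutions and 1 deletion, giving an edit rate of 4/9.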

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score, TM segment


2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
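The re-ranking effect of cheap deletions can be sketched as follows; the concrete weights are illustrative, since the text only states that deletion gets a very low weight and the other operations equal weights:

```python
# Sketch: re-rank TM matches by a weighted edit cost instead of plain TER.
# Weights are illustrative assumptions, not the system's actual values.
WEIGHTS = {"C": 0.0, "D": 0.1, "S": 1.0, "I": 1.0}

def weighted_cost(ops):
    return sum(WEIGHTS[op] for op in ops)

# Edit traces of the two TM segments against "how much does it cost":
tm1_ops = ["C"] * 5 + ["D"] * 6  # "... cost to the holiday inn by taxi"
tm2_ops = ["C"] * 2 + ["I"] * 3  # "how much"
print(weighted_cost(tm1_ops) < weighted_cost(tm2_ops))  # True: TM segment 1 wins
```

Under plain TER, TM segment 2 would win (3 edits vs. 6), but with cheap deletions TM segment 1 is preferred, matching the reasoning above.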

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
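A sketch of this projection; the data structures and the example alignment are illustrative, and the word alignment is shown as a plain mapping rather than GIZA++'s file format:

```python
# Sketch: project matched source tokens (from the TER alignment) onto the
# target side through the word alignment, to decide green/red coloring.
def color_target(matched_src, src2tgt, tgt_len):
    """matched_src: 0-based indices of TM source tokens matched by the input;
    src2tgt: word alignment {src_index: [tgt_indices]} (GIZA++-style)."""
    green = set()
    for s in matched_src:
        green.update(src2tgt.get(s, []))
    return ["green" if t in green else "red" for t in range(tgt_len)]

# "we want to have a table near the window" vs. an input sentence where
# TER marks tokens 0 (we), 4 (a), 5 (table), 7 (the), 8 (window) as matched
src2tgt = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1]}
print(color_target({0, 4, 5, 7, 8}, src2tgt, 6))
```

Target tokens aligned only to unmatched source words (here, the translations of "want" and "near") come out red.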

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change . i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me

5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form as in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option of whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
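A minimal sketch of the posting-list filtering described above; the stop-word list and data are illustrative:

```python
# Sketch: posting lists (inverted index) used to narrow down the TM
# segments worth scoring against an input sentence.
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of"}  # illustrative stop-word list

def build_index(tm_segments):
    index = defaultdict(set)
    for seg_id, segment in enumerate(tm_segments):
        for word in segment.lower().split():
            if word not in STOP_WORDS:   # skip frequent, uninformative tokens
                index[word].add(seg_id)  # no stemming: surface forms only
    return index

def candidates(index, sentence):
    """TM segments sharing at least one vocabulary word with the input."""
    ids = set()
    for word in sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["you gave me the wrong change", "you are wrong", "we want a table"]
index = build_index(tm)
print(candidates(index, "you gave me wrong number"))  # {0, 1}
```

Only the returned segments are then passed to the (much more expensive) TER-based similarity computation.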

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved to a log file which the post-editors can later work with directly in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme.

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. Ph.D. thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. Ph.D. thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

number of TM matches with our system. Among other approaches, it will be interesting to test the inclusion of paraphrasing and semantic similarity methods to create new TM segments (Gupta and Orasan, 2014; Gupta et al., 2015).

Finally, another potential application of our approach would be the extraction of terminology databases. In many cases the segments we extract correspond to terms in the source and target text. A closer analysis of them may give us hints about their properties, so that we can filter candidate terms and create terminology databases that can be used in combination with the TMs to translate new projects.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Kevin Flanagan. 2015. Subsegment recall in Translation Memory: perceptions, expectations and reality. The Journal of Specialised Translation, (23):64–88, January.

Rohit Gupta and Constantin Orasan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the European Association of Machine Translation (EAMT-2014).

Rohit Gupta, Constantin Orasan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.

Gábor Hodász and Gábor Pohl. 2005. MetaMorpho TM: a linguistically enriched translation memory. In Proceedings of the workshop Modern Approaches in Translation Technologies 2005, pages 25–30, Borovets, Bulgaria.

Ruslan Mitkov and Gloria Corpas. 2008. Improving Third Generation Translation Memory systems through identification of rhetorical predicates. In Proceedings of LangTech 2008.

Viktor Pekar and Ruslan Mitkov. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas.

Tom Vanallemeersch and Vincent Vandeghinste. 2015. Assessing linguistically aware fuzzy matching in translation memories. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, May. EAMT.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net
eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms at finding segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations obtained with the first method is high and the errors are relatively few.

1 https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or contain low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem: given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors contained in the translation memories which are part of MyMemory, and shows how we built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill which interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g., (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning; we would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, whether an opened tag is closed, whether a word is repeated, or whether a word is misspelled. It also integrates terminological dictionaries and verifies whether terms are translated accurately. The main assumption behind these validations seems to be that translation memory bi-segments contain accidental errors (e.g., tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in that approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases where one or both segments are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates into Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments; we would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases where a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g., the English part of the corpus contains segments in French), because the aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated into Italian only partially: "Primi anni 1980" ("Early 1980s").

The Random Text and Chat errors arise from the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations are pervasive errors.


It is relatively easy to find positive examples, because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993): in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (l_s − l_d) / √(3.4 (l_s + l_d))    (1)
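In code, with l_s and l_d taken as character lengths (the unit is not stated explicitly here, so this is an assumption), equation 1 becomes:

```python
import math

def church_gale_score(source, target):
    """Modified Church-Gale length-difference score of equation (1)."""
    ls, ld = len(source), len(target)  # assuming character lengths
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# Segments of equal length score exactly 0; a much longer source
# segment pushes the score up, a much longer target pushes it down.
print(church_gale_score("abcd", "abcd"))  # -> 0.0
```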

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center; therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we plotted these distributions against the normal distribution using Quantile-Quantile plots, shown in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set; as said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: contains 1,243 bi-segments, of which 373 are negative examples.

• Test Set: contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.
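The biased sampling from the Collaborative Set can be sketched as weighted random drawing, with weights growing with the absolute Church-Gale score. The weighting scheme below is our illustrative choice, since the exact bias used is not specified:

```python
import random

def biased_sample(bisegments, scores, k, rng=random.Random(0)):
    """Draw k bi-segments, favouring those whose scores lie far from the center."""
    weights = [abs(s) + 1e-9 for s in scores]  # epsilon keeps zero-score items drawable
    return rng.choices(bisegments, weights=weights, k=k)

# Toy example: the two length outliers dominate the sample.
segments = ["ok pair", "suspicious pair", "ok pair 2", "very suspicious pair"]
cg_scores = [0.1, 5.0, -0.2, 8.0]
print(biased_sample(segments, cg_scores, k=5))
```

Sampling proportionally to the absolute score concentrates annotation effort on likely errors, which is exactly why enough negative examples can be collected this way.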

Figure 5: The Q-Q plot for the Matecat Set.

Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g., when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.

• has url. The value of the feature is 1 if the source or target segment contains a URL address; otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segment contains a tag; otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segment contains an email address; otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segment contains a number; otherwise it is 0.

• has capital letters. The value of the feature is 1 if the source or target segment contains words that have at least one capital letter; otherwise it is 0.

• has words capital letters. The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters; otherwise it is 0. Unlike the previous feature, this one activates only when there are whole words in capital letters.

• punctuation similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g., the tag exists/does not exist, and if it exists, it is present/not present in the source and target segments).

• email similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.

• url similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.

• number similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.


• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails, or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
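A few of the features above can be sketched directly. The regular expressions and the punctuation alphabet below are our simplifications, not the authors' exact definitions:

```python
import math
import re

PUNCT = ".,;:!?"  # illustrative punctuation alphabet

def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 when either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def punctuation_similarity(source, target):
    """Cosine similarity between the punctuation-count vectors of the two segments."""
    return cosine([source.count(p) for p in PUNCT],
                  [target.count(p) for p in PUNCT])

def features(source, target):
    """A small subset of the binary and similarity features described above."""
    text = source + " " + target
    return {
        "same": int(source == target),
        "has_url": int(bool(re.search(r"https?://\S+", text))),
        "has_number": int(bool(re.search(r"\d", text))),
        "punctuation_similarity": punctuation_similarity(source, target),
    }

print(features("How are you?", "Come stai?"))
```

The same count-vector-plus-cosine pattern extends to the tag, email, URL, and number similarity features.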

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases where the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. Random Forests have a lower probability of overfitting the data than Decision Trees.

• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The predicted output is the majority label among those instances. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
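The comparison can be reproduced in outline with scikit-learn. The feature matrix below is a synthetic stand-in for the real feature vectors, so the scores are illustrative only (this sketch assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real bi-segment feature vectors and labels.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear kernel)": LinearSVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

# Three-fold stratified cross-validation, mirroring the paper's first evaluation.
for name, clf in classifiers.items():
    f1 = cross_val_score(clf, X, y, cv=3, scoring="f1").mean()
    print(f"{name}: F1 = {f1:.2f}")
```

With an integer `cv` on a classification target, `cross_val_score` applies stratified folds automatically, which matches the stratified setup described in the next section.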

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all algorithms have excellent results, and all beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed on the test set. The baselines are the same as in the three-fold evaluation above, and the results are shown in table 2.

The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, about half the difference observed between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets; obviously this variety is not fully captured by the training set.

Algorithm               Precision  Recall  F1
Random Forest             0.85      0.63  0.72
Decision Tree             0.82      0.69  0.75
SVM                       0.82      0.81  0.81
K-Nearest Neighbors       0.83      0.66  0.74
Logistic Regression       0.80      0.80  0.80
Gaussian Naive Bayes      0.76      0.61  0.68
Baseline Uniform          0.71      0.72  0.71
Baseline Stratified       0.70      0.51  0.59

Table 2: The results of the classification on the test set.

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning where the precision is increased, even if the recall drops. We make some suggestions in the next section.
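The trade-off described above can be checked directly from the reported confusion counts (175 true positives, 42 false positives, 32 false negatives, 60 true negatives):

```python
# Recomputing the SVM test-set metrics from the confusion counts above.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)   # 175/217
recall = tp / (tp + fn)      # 175/207
f1 = 2 * precision * recall / (precision + recall)

# Share of ALL test examples that are good bi-segments wrongly discarded.
false_negative_rate = fn / (tp + fp + fn + tn)   # 32/309, roughly 10%
```

The recomputed figures come out close to the scores in Table 2 (small differences are expected from rounding), and the last line reproduces the "around 10%" of good bi-segments that would be lost.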

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and more sparse for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the


performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
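A minimal sketch of this precision-oriented filtering, assuming a hypothetical list of predicted positive-class probabilities and invented thresholds:

```python
def confident_decisions(probs, low=0.1, high=0.9):
    """Act only on predictions whose positive-class probability is extreme.

    Returns (index, label) pairs; everything between the two thresholds
    is left for a human to review instead of being deleted automatically.
    """
    decisions = []
    for i, p in enumerate(probs):
        if p >= high:
            decisions.append((i, 1))   # confidently positive: keep
        elif p <= low:
            decisions.append((i, 0))   # confidently negative: delete
    return decisions
```

Tightening `low` and `high` trades recall for precision, which is exactly the direction argued for above: fewer automatic decisions, but fewer good bi-segments wrongly removed.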

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jorg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study, we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, and hence a “complete thought”; therefore, clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2, we discuss related work on TM matching. In section 3, we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called ‘second generation’ TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the ‘third-generation translation memory’, which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20

(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
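The effect can be illustrated on the (a)/(b) example above. The sketch below uses Python's difflib similarity ratio as a stand-in for the proprietary edit-distance-based scores of commercial TM tools:

```python
from difflib import SequenceMatcher

a = [f"w{i}" for i in range(1, 21)]                # input (a): w1 .. w20
b = a[:5] + [f"w{i}" for i in range(21, 36)]       # TM (b): w1..w5 w21..w35

# Whole-sentence comparison: only 5 of 20 words are shared.
score_full = SequenceMatcher(None, a, b).ratio()   # 0.25, i.e. the 25% above

# After clause splitting, the shared clause is compared on its own.
score_clause = SequenceMatcher(None, a[:5], b[:5]).ratio()   # exact match
```

At 25%, the whole sentence falls below any realistic fuzzy match threshold, while the split clause scores 100% and would always be retrieved.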

In this paper, we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
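For illustration only, a toy rule-based splitter in the spirit of (but far cruder than) the Puscasu (2004) rules might look like this; the marker list is an invented simplification and does not reflect the actual rule set:

```python
import re

# Split points: a few conjunctions and complementisers, as a crude
# stand-in for proper finite-clause detection.
MARKERS = r'\b(and that|that|and|but|because|although|while)\b'

def split_clauses(sentence):
    """Split a sentence into candidate clauses at conjunction markers."""
    parts = re.split(MARKERS, sentence)
    # re.split with a capturing group keeps the markers; reattach each
    # marker to the clause that follows it.
    clauses, current = [], parts[0].strip()
    for i in range(1, len(parts), 2):
        marker, rest = parts[i], parts[i + 1].strip()
        if current:
            clauses.append(current)
        current = f"{marker} {rest}"
    if current:
        clauses.append(current)
    return clauses
```

On the Set A example below ("the ministry of defense once indicated that about 20000 soldiers were missing in the korean war"), this heuristic yields the same two-clause split shown in Table 1, but a real system needs verb detection to avoid splitting at every conjunction.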

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one), and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.

19

The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:
  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:
  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:
  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)          23.33%                90.00%
Retrieved (Minimum threshold)          38.00%                92.22%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)          14.00%                88.89%
Retrieved (Minimum threshold)          14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
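The paired t-test itself (computed with SPSS in the paper) can be sketched as follows: the statistic is the mean of the per-case score differences divided by its standard error, and the p-value is then read off a t-distribution with n-1 degrees of freedom. The example scores below are invented, not the paper's data:

```python
import math

def paired_t_test(scores_without, scores_with):
    """Paired t-statistic for two lists of per-segment match scores.

    Returns (t, degrees of freedom); the p-value is then looked up in a
    t-distribution table.
    """
    diffs = [b - a for a, b in zip(scores_without, scores_with)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1
```

Because the same segments are scored under both conditions, the paired form is the appropriate test here rather than an independent two-sample t-test.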

SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            19.23                      85.31              0.000
Minimum threshold            29.28                      86.49              0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            11.78                      83.88              0.000
Minimum threshold            11.78                      88.07              0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold             2.23                      17.21              0.000
Minimum threshold             6.49                      30.62              0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold             2.71                      23.08              0.000
Minimum threshold            16.83                      46.37              0.000

Table 5: Paired t-test on all experiments (mean match scores)

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)           2.67%                17.84%
Retrieved (Minimum threshold)          10.00%                41.08%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)           3.33%                25.95%
Retrieved (Minimum threshold)          36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent this helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou

School of Italian Language and Literature

Aristotle University of Thessaloniki

University Campus

54124 Thessaloniki Greece

chatzikitl@auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses the past related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.
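The core idea can be sketched independently of NooJ: paraphrases of each source unit are added as extra keys that point to the same human translation, so an input that paraphrases a stored source can still reach a 100% source-side match. All entries below, including the Italian target and the SVC paraphrase ("took a decision" → "decided"), are invented for illustration:

```python
from difflib import SequenceMatcher

# A toy TM: source segment -> human translation.
tm = {
    "the committee took a decision on the budget":
        "il comitato ha deciso sul bilancio",
}

def expand_with_paraphrases(tm, paraphrases):
    """Add paraphrased source keys that map to the original target."""
    expanded = dict(tm)
    for source, paraphrase in paraphrases:
        if source in tm:
            expanded[paraphrase] = tm[source]
    return expanded

def best_fuzzy_match(tm, query):
    """Return the TM target with the highest source-side word similarity."""
    score, target = max(
        (SequenceMatcher(None, src.split(), query.split()).ratio(), tgt)
        for src, tgt in tm.items()
    )
    return target, score

expanded = expand_with_paraphrases(
    tm, [("the committee took a decision on the budget",
          "the committee decided on the budget")])
target, score = best_fuzzy_match(expanded, "the committee decided on the budget")
```

The key point, as in the paper, is that the target side is never touched: the translator always receives an unmodified human translation, only reached through a paraphrased source key.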

2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments while matching, based on the paraphrases in a database. Their approach uses greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
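The fuzzy match formula discussed above can be sketched in a few lines of Python. This is an illustrative implementation of the standard Levenshtein-based score (1 minus edit distance over the length of the longer sentence), not the exact formula of any commercial product, which, as noted, is undisclosed; the exact percentages also depend on the tokenization used:

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance over two token lists
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(s1, s2, unit="word"):
    # Koehn-style fuzzy match: 1 - edit_distance / length of longer sentence
    t1 = s1.split() if unit == "word" else list(s1.replace(" ", ""))
    t2 = s2.split() if unit == "word" else list(s2.replace(" ", ""))
    return 1 - edit_distance(t1, t2) / max(len(t1), len(t2))
```

Running `fuzzy_match` on sentences (1) and (2) with simple whitespace tokenization gives a score of 0.6 (1 substitution and 3 deletions over 10 words), in the same low band as the 70% reported above.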

In this case, according to methodologies proposed by researchers in this field, the sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This is due to their syntax: sentence (1) contains an SVC (make the cancellation), while sentence (2) contains its verbal paraphrase (cancel). An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: apart from some exceptional cases, they appear with a bare-noun direct object (e.g. to make attention) or with a determiner preceding the noun (e.g. to make a reservation) (Athayde 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent (sentence (3)) are included in the TM. In addition to this, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, created for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derivation and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
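As an illustration only (not the actual NooJ grammar), the pattern described above can be approximated over POS-tagged tokens. The tag names and the tiny nominalization lexicon below are assumptions of this sketch; the real module relies on the OpenLogos-derived NooJ dictionaries and local grammars:

```python
# Hypothetical mini-lexicon mapping predicate nouns to their full verbs;
# the real resources come from the NooJ/OpenLogos dictionaries.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install"}
SUPPORT_VERBS = {"make", "take", "give", "have"}

def paraphrase_svc(tokens):
    # tokens: (word, POS) pairs; pattern: support verb + optional
    # DET/ADJ/ADV run + predicate noun + optional PREP -> full verb
    out, i = [], 0
    while i < len(tokens):
        word, pos = tokens[i]
        if pos == "VERB" and word in SUPPORT_VERBS:
            j = i + 1
            while j < len(tokens) and tokens[j][1] in {"DET", "ADJ", "ADV"}:
                j += 1
            if j < len(tokens) and tokens[j][0] in NOMINALIZATIONS:
                out.append((NOMINALIZATIONS[tokens[j][0]], "VERB"))
                j += 1
                if j < len(tokens) and tokens[j][1] == "PREP":
                    j += 1  # the SVC's preposition is dropped
                i = j
                continue
        out.append((word, pos))
        i += 1
    return out
```

On the tagged tokens of sentence (1), this rewrites "make the cancellation of" to "cancel", yielding sentence (2); tense and person handling, which the NooJ grammars cover, is omitted here.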

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.
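The three pipelines can be sketched as a thin driver around the components described above. The helper names here are stand-ins (`tokenize`/`detokenize` for the Berkeley tools, `nooj_paraphrase` for the NooJ module), implemented trivially only so the sketch runs:

```python
def tokenize(text):
    # stand-in for the Berkeley Tokenizer
    return text.split()

def detokenize(tokens):
    # stand-in for the Berkeley de-tokenizer
    return " ".join(tokens)

def nooj_paraphrase(tokens):
    # stand-in for the NooJ SVC grammar: rewrites one known SVC pattern
    text = " ".join(tokens).replace("make the cancellation of", "cancel")
    return text.split()

def expand_tm(tm_pairs):
    # Original units are kept; a paraphrased copy of each source unit is
    # added whenever the SVC grammar fired, with the target side left "as is".
    expanded = list(tm_pairs)
    for source, target in tm_pairs:
        paraphrased = nooj_paraphrase(tokenize(source))
        if paraphrased != tokenize(source):
            expanded.append((detokenize(paraphrased), target))
    return expanded
```

The key design point visible here is that only the source side is ever rewritten: every retrieved target segment remains a human translation.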

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in the fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.1

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we achieve very high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                   14     48
95-99%                 23     32
85-94%                 18     29
75-84%                 51     38
50-74%                 32     18
No Match               62     35
Total                  200    200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the matches without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.

Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level for sentence pairs whose meaning is the same but which do not share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and will support other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of translations even more.

References

Athayde M F 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1, Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra, Coimbra, Portugal.

Barreiro A 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. PhD thesis, Universidade do Porto.

Biçici E and Dymetman M 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt M 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G Aygen, C Bowern and C Quinn, 1-49, Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong M, Cheng Y, Liu Y, Xu J, Sun M, Izuha T and Hao J 2014. Query lattice for translation retrieval. In COLING.

Gupta R and Orasan C 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He Y, Ma Y, van Genabith J and Way A 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He Y, Ma Y, Way A and van Genabith J 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg Daniel S 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász G and Pohl G 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay M 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn P and Senellart J 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein Vladimir I 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar V and Mitkov R 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov S, Barrett L, Thibaux R and Klein D 2006. Learning accurate, compact and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas E and Furuse O 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott B and Barreiro A 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pages 19-26.

Silberztein M 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith J and Clark S 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang K, Zong C and Su K-Y 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev V and van Genabith J 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al 2013) or MyMemory (Trombetti 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases

within CAT systems (Planas and Furuse 2000).

In (Désilets et al 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text-chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel 2008) (a parallel version of GIZA++ (Och and Ney 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation: 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
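A record like the ones in Figure 1 can be parsed with a few lines of Python. The field layout (source phrase, target phrase, four probabilities, alignment points) follows the description above; the tab separator is an assumption of this sketch, not a documented format:

```python
def parse_subsegment(line):
    # Fields: source phrase, target phrase, the four Moses scores,
    # and word-alignment points such as "0-0 1-1".
    src, tgt, probs, align = line.split("\t")
    scores = [float(p) for p in probs.split()]
    points = [tuple(int(i) for i in pair.split("-"))
              for pair in align.split()]
    return {"src": src, "tgt": tgt, "scores": scores, "alignment": points}

entry = parse_subsegment(
    "být větší\tbe greater\t0.538 0.053 0.538 0.136\t0-0 1-1")
```

The parsed alignment pairs make the three cases above (empty, one-to-many and reversed alignments) easy to detect programmatically.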

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

- JOIN: new segments are built by concatenating two segments from SUB; output is J.
  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories

- SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.

2 In the current experiments we have trained a language model using the KenLM (Heafield 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al 2013).
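The greedy joining loop described above can be sketched as follows, under the simplifying assumptions that subsegments are (start, end) token spans within one segment (end exclusive) and that two spans may be joined exactly when they overlap or adjoin; the paper's language-model fluency filter is omitted:

```python
def try_join(a, b):
    # spans (start, end), end exclusive: joinable when they overlap or adjoin
    if a[0] <= b[1] and b[0] <= a[1]:
        return (min(a[0], b[0]), max(a[1], b[1]))
    return None

def join_subsegments(spans):
    # Process a list I ordered by span length (descending); joins with the
    # current biggest span are collected in T and prepended to I, so the
    # algorithm always prefers to extend the longest subsegments first.
    I = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    seen = set(I)
    results = []
    while I:
        biggest = I.pop(0)
        T = []
        for other in I:
            joined = try_join(biggest, other)
            if joined is not None and joined not in seen:
                seen.add(joined)
                T.append(joined)
        I = sorted(T, key=lambda s: s[1] - s[0], reverse=True) + I
        results.append(biggest)
    return results
```

The `seen` set guarantees termination: each candidate span is generated at most once, so the loop discards one subsegment per iteration, as in the description above.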

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company, and two real-world documents of nearly 5000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al 2013), with a size of over 300,000 translation pairs.

For measuring coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95-99%   0.84   0.91   0.64   0.90   1.37
85-94%   0.07   0.05   0.25   0.76   0.81
75-84%   0.80   0.91   1.71   3.78   4.40
50-74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM coverage in %

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95-99%   0.75   0.44   0.49   0.66
85-94%   0.05   0.08   0.49   0.61
75-84%   0.46   0.96   3.67   3.85
50-74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM coverage in %

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95-99%   0.98   0.59   1.24   1.45
85-94%   0.09   0.22   1.34   1.37
75-84%   1.03   2.26   6.35   7.07
50-74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Table 6: Error examples, Czech → English

source seg      Oblast dat může mít libovolný tvar
reference       The data area may have an arbitrary shape
generated seg   Area data may have any shape

When we take into account just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of particular methods with regard to the translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.
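The "mixed" translation scheme described above can be sketched as follows. The greedy longest-match lookup and the `phrase_table` mapping are illustrative assumptions for this sketch, not the exact subsegment selection used by MemoQ or by our methods.

```python
def mixed_translation(segment_tokens, phrase_table):
    """Translate known subsegments; keep unknown tokens 'as is' in the
    source language, so the output may mix source and target words.
    phrase_table maps source-token tuples to target strings (an assumed
    structure for this sketch)."""
    out, i = [], 0
    while i < len(segment_tokens):
        match = None
        # greedily try the longest source phrase starting at position i
        for j in range(len(segment_tokens), i, -1):
            phrase = tuple(segment_tokens[i:j])
            if phrase in phrase_table:
                match = (phrase_table[phrase], j)
                break
        if match:
            out.append(match[0])
            i = match[1]
        else:
            out.append(segment_tokens[i])  # untranslated subsegment, kept as is
            i += 1
    return " ".join(out)
```

METEOR is then computed on such mixed outputs, so any source-language words left in place are simply scored as mismatches against the reference.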


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of a language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
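A minimal illustration of such a fluency filter, using a toy add-one-smoothed bigram model in place of the KenLM models (Heafield, 2011) cited in the references; the threshold value is an assumption for this sketch:

```python
import math
from collections import Counter

def train_bigram_lm(corpus_sentences):
    """Toy add-one-smoothed bigram LM returning a per-token log-probability
    function; a stand-in for a real KenLM model."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(set(unigrams)) + 1  # +1 for unseen words
    def logprob(sentence):
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(toks, toks[1:])) / (len(toks) - 1)
    return logprob

def fluent_enough(logprob_fn, phrase, threshold=-4.0):
    """Keep only phrases whose per-token log-probability clears a threshold."""
    return logprob_fn(phrase) >= threshold
```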

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, m.vela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows

it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simple, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
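The ranking step can be sketched with a plain word-level edit rate. Real TER (as computed by tercom) also allows block shifts; the shift-free Levenshtein approximation below is only an illustration, not the tool's implementation.

```python
def edit_rate(hyp, ref):
    """Word-level Levenshtein distance divided by the reference length:
    a shift-free approximation of TER."""
    h, r = hyp.split(), ref.split()
    prev_row = list(range(len(r) + 1))
    for i in range(1, len(h) + 1):
        cur_row = [i]
        for j in range(1, len(r) + 1):
            cur_row.append(min(prev_row[j] + 1,        # delete a hyp word
                               cur_row[j - 1] + 1,     # insert a ref word
                               prev_row[j - 1] + (h[i - 1] != r[j - 1])))  # substitute/match
        prev_row = cur_row
    return prev_row[-1] / max(len(r), 1)

def top_matches(test_seg, tm_sources, k=5):
    """Return the k TM source segments with the lowest edit rate
    (i.e., the highest similarity) to the test segment."""
    return sorted(tm_sources, key=lambda s: edit_rate(test_seg, s))[:k]
```

On the example discussed later in this section, "we would like a table by the window" versus "we want to have a table near the window" yields 4 edits (one deletion, three substitutions).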

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |   |        |
we  -     would  like  a  table  by    the  window  .

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score, TM segment


2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
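The effect of cheap deletions can be sketched with a weighted edit distance. The specific weight values below are illustrative assumptions, not the ones used in CATaLog.

```python
def weighted_cost(tm_seg, test_seg, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Weighted cost of editing tm_seg into test_seg, where deleting a TM
    word is far cheaper than inserting or substituting a translation."""
    a, b = tm_seg.split(), test_seg.split()
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * w_del]
        for j in range(1, len(b) + 1):
            cur.append(min(prev[j] + w_del,       # delete a TM word
                           cur[j - 1] + w_ins,    # insert a test word
                           prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else w_sub)))
        prev = cur
    return prev[-1]
```

With these weights, TM segment 1 from the example above costs 6 deletions at 0.1 each (0.6), while TM segment 2 costs 3 insertions at 1.0 each (3.0), so segment 1 is preferred; equal weights would invert the preference.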

3.2 Color Coding

From among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. A green portion implies a matched fragment and a red portion implies a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
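Combining the two alignment sets can be sketched as follows: `parse_giza` reads an alignment line in the format shown above, and `color_target` marks a target token green only if it is linked to a matched source position. Treating unaligned target tokens as red, and letting later links overwrite earlier ones, are simplifying assumptions of this sketch, not necessarily the tool's exact logic (positions are 1-based).

```python
import re

def parse_giza(line):
    """Parse a GIZA++-style line such as
    'NUL ( ) we ( 1 ) want ( 6 ) ...' into {source_pos: [target_pos, ...]}."""
    links, pos = {}, 0
    for word, targets in re.findall(r'(\S+)\s*\(\s*([\d\s]*)\)', line):
        if word != 'NUL':
            pos += 1
            links[pos] = [int(t) for t in targets.split()]
    return links

def color_target(tgt_tokens, matched_src_positions, links):
    """Color each target token green if aligned to a matched source
    position, red otherwise (unaligned tokens stay red)."""
    colors = ['red'] * len(tgt_tokens)
    for s, targets in links.items():
        for t in targets:
            colors[t - 1] = 'green' if s in matched_src_positions else 'red'
    return list(zip(tgt_tokens, colors))
```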

Color coding the TM source sentences makes explicit which portions of matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you've got the wrong number

3 you are wrong

4 you pay me


5 you're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen, ami ashi dollar diyechi.)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3 আপিন ভল (Gloss: apni vul.)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
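The posting-list lookup described in this section can be sketched as follows; the stop-word list is an illustrative placeholder for the actual filtering:

```python
from collections import defaultdict

STOP = {"the", "a", "an", "to", "of", "and", "is"}  # illustrative stop list

def build_index(tm_sources):
    """Map each lowercased non-stopword surface form (no stemming) to the
    ids of the TM segments containing it."""
    index = defaultdict(set)
    for sid, seg in enumerate(tm_sources):
        for w in seg.lower().split():
            if w not in STOP:
                index[w].add(sid)
    return index

def candidates(test_seg, index):
    """Only TM segments sharing at least one vocabulary word with the
    input sentence are later scored with TER."""
    ids = set()
    for w in test_seg.lower().split():
        ids |= index.get(w, set())
    return ids
```

The index is built offline once; at query time only the union of a few posting lists is scored, instead of the whole TM.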

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 9–16, Hissar, Bulgaria, September 2015.

Spotting false translation segments in translation memories

Eduard Barbu
Translated.net

eduard@translated.net

Abstract

The problem of spotting false translations in the bi-segments of translation memories can be thought of as a classification task. We test the accuracy of various machine learning algorithms to find segments that are not true translations. We show that the Church-Gale scores in two large bi-segment sets extracted from MyMemory can be used for finding positive and negative training examples for the machine learning algorithms. The performance of the winning classification algorithms, though high, is not yet sufficient for automatic cleaning of translation memories.

1 Introduction

MyMemory1 (Trombetti, 2009) is the biggest translation memory in the world. It contains more than 1 billion bi-segments in approximately 6,000 language pairs. MyMemory is built using three methods. The first method is to aggregate the memories contributed by translators. The second method is to use translation memories extracted from corpora, glossaries, or data mined from the web. The current distribution of the automatically acquired translation memories is given in figure 1. Approximately 50% of the distribution is occupied by the DGT-TM (Steinberger et al., 2013), a translation memory built for 24 EU languages from aligned parallel corpora. The glossaries are represented by the Unified Medical Language System (UMLS) (Humphreys and Lindberg, 1993), a terminology released by the National Library of Medicine. The third method is to allow anonymous contributors to add source segments and their translations through a web interface.

The quality of the translations using the first method is high and the errors are relatively few.

1 https://mymemory.translated.net


Figure 1: The distribution of automatically acquired memories in MyMemory.

However, the second method and especially the third one produce a significant number of erroneous translations. The automatically aligned parallel corpora have alignment errors, and the collaborative translation memories are spammed or have low-quality contributions.

The problem of finding bi-segments that are not true translations can be stated as a typical classification problem. Given a bi-segment, a classifier should return yes if the segments are true translations and no otherwise. In this paper we test various classification algorithms at this task.

The rest of the paper has the following structure. Section 2 puts our work in the larger context of research focused on translation memories. Section 3 explains the typical errors that the translation memories which are part of MyMemory contain and shows how we have built the training and test sets. Section 4 describes the features chosen to represent the data and briefly describes the classification algorithms employed. Section 5 presents and discusses the results. In the final section we draw the conclusions and plan further developments.


2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill that interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning. We would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments. It checks, for example, if an opened tag is closed, if a word is repeated, or if a word is misspelled. It also integrates terminological dictionaries and verifies if the terms are translated accurately. The main assumptions behind these validations seem to be that translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore the features that help to discriminate between good and bad translations in this approach are different from those in ours.

2 http://www.xbench.net

3 The data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memories that are part of MyMemory contain. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases when one or both segments are a random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass to the classifier the correct source and target segments. There are also cases when a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French). The aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The Random Text and Chat errors take place in the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.


It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993): in a parallel corpus the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (l_s − l_d) / √(3.4 (l_s + l_d))        (1)
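For illustration, equation 1 can be computed as follows (a sketch; we assume character lengths, as in the original Gale-Church work, since the unit is not stated here):

```python
import math

def church_gale_score(source: str, target: str) -> float:
    """Modified Church-Gale length difference (Tiedemann, 2011):
    near zero when the two segments have comparable lengths,
    far from zero when their lengths diverge."""
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

print(church_gale_score("How are you?", "Come stai?"))
```

Segments of equal length score exactly 0; a grossly truncated or padded translation pushes the score into the tails of the distribution.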

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is a little bit skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center. Therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01].

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

For the Collaborative Set the Church-Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set. As said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples we created a training set and a test set, distributed as follows:
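The biased sampling step could be sketched as follows (illustrative only: weighting draws by the absolute Church-Gale score is our assumption about one way of favouring the tails; the paper does not specify its sampling weights):

```python
import math
import random

def church_gale_score(src: str, tgt: str) -> float:
    """Modified Church-Gale length difference (equation 1), on characters."""
    ls, ld = len(src), len(tgt)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

def sample_biased(bisegments, k, rng=random):
    """Draw k bi-segments, oversampling pairs whose Church-Gale score
    lies away from the centre of the distribution (likely errors)."""
    weights = [abs(church_gale_score(s, t)) + 1e-6 for s, t in bisegments]
    return rng.choices(bisegments, weights=weights, k=k)

pairs = [("hello world", "ciao mondo"), ("hi", "x" * 200)]
picked = sample_biased(pairs, k=5, rng=random.Random(0))
```

The grossly mismatched pair carries almost all of the sampling weight, so candidates for negative examples dominate the draw.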

• Training Set. It contains 1243 bi-segments and has 373 negative examples.

• Test Set. It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.

Figure 5: The Q-Q plot for the Matecat Set.

Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky because there are cases when a high Church-Gale score is perfectly legitimate, for example when the acronyms in the source language are expanded in the target language.

• has url. The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has tag. The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has email. The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has number. The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.

• has capital letters. The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has words capital letters. The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity. The value of this feature is the cosine similarity between the source and destination segments' punctuation vectors. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if the source segment and the target segment are true translations.

• tag similarity. The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).

• email similarity. The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity. The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity. The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

13

• bisegment similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference. The value of the feature is the ratio between the difference of the number of words containing at least a capital letter in the source segment and the target segment, and the sum of the capital letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif. The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only capital letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
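Two of the features above can be sketched as follows (illustrative: the exact vector construction and the set of punctuation marks counted are our assumptions, not stated in the paper):

```python
import math
from collections import Counter
from string import punctuation

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def punctuation_similarity(source: str, target: str) -> float:
    """Cosine similarity of the two segments' punctuation-mark counts."""
    counts = lambda s: Counter(ch for ch in s if ch in punctuation)
    return cosine(counts(source), counts(target))

def lang_dif(declared: tuple, detected: tuple) -> int:
    """Number of segments whose detected language code disagrees with
    the declared one: 0, 1 or 2, as in the en-en/it-fr example above."""
    return sum(int(a != b) for a, b in zip(declared, detected))

assert abs(punctuation_similarity("Hello, world!", "Ciao, mondo!") - 1.0) < 1e-9
assert lang_dif(("en", "it"), ("en", "fr")) == 1
```

The same cosine helper would serve the tag, email, URL and number similarity features, each over its own count vector.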

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the rules they infer are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. Random Forests have a lower probability of overfitting the data than Decision Trees.

• Logistic Regression. Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel. Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes. If the conditional independence that the Naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
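The evaluation setup of the next section can be sketched with scikit-learn as follows (a toy random matrix stands in for the real feature vectors of section 4.1; hyperparameters are library defaults, which the paper does not discuss):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: in the paper, X holds the normalized features of
# section 4.1 and y marks true (1) vs false (0) translations.
rng = np.random.default_rng(0)
X = rng.random((120, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM (linear kernel)": LinearSVC(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Three-fold stratified cross-validation, as in the first evaluation.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {name: cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
           for name, clf in classifiers.items()}
for name, f1 in results.items():
    print(f"{name}: mean F1 = {f1:.2f}")
```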

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified cross-validation on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10%, twice smaller than the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case we inspect the confusion table for the SVM algorithm. From the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
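The arithmetic behind the 10% figure can be checked directly from the reported confusion counts (a sketch; precision and recall recomputed this way for the positive class come out close to, though not identical with, the Table 2 values, which may be averaged differently):

```python
# SVM confusion counts reported for the 309-example test set.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_negative_share = fn / (tp + fp + fn + tn)  # good bi-segments wrongly deleted

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"share of examples lost as false negatives: {false_negative_share:.1%}")
```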

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the Matecat Set and more spread out for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
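The probability-thresholding idea can be sketched as follows (toy data; the class encoding and the 0.9 threshold are illustrative assumptions, not values from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in data: in practice X holds the section 4.1 feature
# vectors and y the validated labels (assumed here: 0 = false translation).
rng = np.random.default_rng(1)
X = rng.random((200, 4))
y = (X[:, 0] > 0.5).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 0]   # estimated P(bi-segment is a false translation)

threshold = 0.9                      # act only on very confident predictions
flagged = np.where(proba >= threshold)[0]
print(f"{len(flagged)} of {len(X)} bi-segments flagged for deletion")
```

Raising the threshold trades recall for precision, which is exactly the direction the cleaning task requires.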

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.
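A minimal sketch of the sentence-level Levenshtein matching such tools rely on (the exact scoring formula varies by tool; normalising the edit distance by the longer segment's length is one common choice, not a specific product's):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(query: str, segment: str) -> float:
    """Similarity in [0, 1]: 1 - edit distance / longer length."""
    if not query and not segment:
        return 1.0
    return 1.0 - levenshtein(query, segment) / max(len(query), len(segment))
```

A TM retrieves a segment only when this score exceeds a configured fuzzy threshold, which is why a sentence sharing just one clause with the input usually falls below the cut-off.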

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorpho TM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 … w20
(b) w1 w2 w3 w4 w5 w21 … w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high for them to be retrieved by the TM system.
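The intuition above can be made concrete with a small sketch (not the internal algorithm of any particular TM tool): assuming a word-level Levenshtein distance normalised by the length of the longer segment, the whole-sentence comparison of (a) and (b) scores only 25%, while the shared clause scores 100% once both sides are split.

```python
def edit_distance(a, b):
    # Word-level Levenshtein distance via dynamic programming.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def fuzzy_match(a, b):
    # Similarity normalised by the longer segment (one common convention).
    return 1 - edit_distance(a, b) / max(len(a), len(b))

sent_a = ["w%d" % i for i in range(1, 21)]                # (a): w1 .. w20
sent_b = sent_a[:5] + ["w%d" % i for i in range(21, 36)]  # (b): w1..w5 w21..w35

print(fuzzy_match(sent_a, sent_b))          # 0.25: below any usable threshold
print(fuzzy_match(sent_a[:5], sent_b[:5]))  # 1.0: exact match after splitting
```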

In this paper we examine the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is applied both to the segments in the input file and to the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when the TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008), as used in Macken (2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. In this study, however, we argue it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of that segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and the component clauses of the original longer segment are stored in the TM; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example (Table 2).

Without clause splitting:

Input: the ministry of defense once indicated that about 20,000 soldiers were missing in the korean war

Segment in TM: the ministry of defense once indicated that about 20,000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

Input:
- the ministry of defense once indicated
- that about 20,000 soldiers were missing in the korean war

Segments in TM:
- the ministry of defense once indicated
- that about 20,000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                         W/o clause splitting    W/ clause splitting
Retrieved (Default threshold)    23.33%                  90.00%
Retrieved (Minimum threshold)    38.00%                  92.22%

TRADOS                           W/o clause splitting    W/ clause splitting
Retrieved (Default threshold)    14.00%                  88.89%
Retrieved (Minimum threshold)    14.00%                  96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case, in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
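The mechanics of the test can be sketched as follows, with invented per-segment scores rather than the paper's data (a score of 0 stands for a correct match that was not retrieved):

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    # t statistic of a paired t-test: mean of the per-pair differences
    # divided by the standard error of those differences.
    diffs = [b - a for a, b in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-segment match scores for the same 10 inputs.
without_cs = [0, 0, 75, 0, 82, 0, 0, 71, 0, 0]
with_cs = [100, 95, 100, 88, 100, 92, 0, 100, 85, 96]

print("t = %.2f" % paired_t(without_cs, with_cs))
```

A statistics package (such as SPSS, as used in this study) then converts the t statistic into a p-value using the t distribution with n-1 degrees of freedom.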

SET A

WORDFAST             Mean w/o clause splitting    Mean w/ clause splitting    p-value
Default threshold    19.23                        85.30888891                 0.000
Minimum threshold    29.28                        86.48888891                 0.000

TRADOS               Mean w/o clause splitting    Mean w/ clause splitting    p-value
Default threshold    11.78                        83.87777781                 0.000
Minimum threshold    11.78                        88.07111113                 0.000

SET B

WORDFAST             Mean w/o clause splitting    Mean w/ clause splitting    p-value
Default threshold    2.23                         17.21111111                 0.000
Minimum threshold    6.49                         30.62111112                 0.000

TRADOS               Mean w/o clause splitting    Mean w/ clause splitting    p-value
Default threshold    2.71                         23.083                      0.000
Minimum threshold    16.83                        46.36777779                 0.000

Table 5: Paired t-test on all experiments

WORDFAST                         W/o clause splitting    W/ clause splitting
Retrieved (Default threshold)    2.67%                   17.84%
Retrieved (Minimum threshold)    10.00%                  41.08%

TRADOS                           W/o clause splitting    W/ clause splitting
Retrieved (Default threshold)    3.33%                   25.95%
Retrieved (Minimum threshold)    36.00%                  70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzik@itl.auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, and do not allow the retrieval of matches that are similar but not identical. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of the Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by matching the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections a professional translator needs to make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.

The rest of the paper is organized as follows: Section 2 discusses past related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or on in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and the post-editing effort might therefore be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.

3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, without counting whitespaces), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information

In this case, according to methodologies proposed by researchers in this field, the sentence will be sent for machine translation, given the low fuzzy match score, and then post-edited. Otherwise, the translator has to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.
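The word-level arithmetic for sentences (1) and (2) can be checked with a short sketch; the exact normalisation used by commercial tools and by Koehn's formula may differ, so only the edit operations themselves are verified here.

```python
def word_edit_distance(a, b):
    # Standard Levenshtein distance over word tokens.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

s1 = "Press Cancel to make the cancellation of your personal information".split()
s2 = "Press Cancel to cancel your personal information".split()

# "make" -> "cancel" (1 substitution); "the", "cancellation", "of" dropped
# (3 deletions): 4 edit operations in total.
print(word_edit_distance(s1, s2))  # 4
```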

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly carried by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except in some cases, they appear without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases the SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and there will therefore be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its verbal derivation, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
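A toy approximation of this step (a single regular expression and a hypothetical nominalization table, far simpler than the NooJ local grammar, and limited to one tense) might look like this:

```python
import re

# Hypothetical nominalization-to-verb mappings; the real mapping comes
# from the NooJ dictionary's derivational paradigms.
NOMINALIZATIONS = {"cancellation": "cancel", "reservation": "reserve"}

# Support verb, optional determiner, candidate nominalization, optional "of".
SVC_PATTERN = re.compile(
    r"\b(make|makes)\s+(?:the\s+|a\s+)?(\w+)\s+(?:of\s+)?", re.IGNORECASE
)

def paraphrase_svc(sentence):
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(2).lower())
        # Replace only when the noun has a known verbal derivation.
        return (verb + " ") if verb else match.group(0)
    return SVC_PATTERN.sub(repl, sentence)

print(paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information"))
# -> Press Cancel to cancel your personal information
```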

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.
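Fuzzy match scores of the kind reported below are conventionally derived from word-level edit distance (Levenshtein, 1966). The exact formula used by commercial CAT tools such as Trados is proprietary, so the following is only a minimal illustrative sketch:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                          # deletion
                         cur[j - 1] + 1,                       # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1])) # substitution
        prev = cur
    return prev[n]

def fuzzy_match(segment: str, tm_source: str) -> int:
    """Percentage similarity between a new segment and a TM source segment."""
    a, b = segment.lower().split(), tm_source.lower().split()
    if not a and not b:
        return 100
    dist = edit_distance(a, b)
    return round(100 * (1 - dist / max(len(a), len(b))))

print(fuzzy_match("You must install the software",
                  "You must make the installation of the software"))  # 50
```

The example shows why paraphrasing matters: a sentence with the same meaning but a different SVC realisation only reaches a low fuzzy band, whereas a paraphrased TM entry would give a near-exact match.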

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 originals + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that each unit at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.1

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0–74%), although we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95–99%, 6% in category 85–94%, 28% in category 75–84% and, finally, 27% in category 0–74% (No match + 50–74%). This is a clear indication that the paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95–99%                   23      32
85–94%                   18      29
75–84%                   51      38
50–74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.

Despite the fuzzy match improvements, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses and will be addressed in future projects.

Figure 4: Accepted translations.

Example 1:
  Seg:     Make sure that the brake hose is not twisted
  TMsl:    Ensure that the brake hose is not twisted
  TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
  parTMsl: Make sure that the brake hose is not twisted

Example 2:
  Seg:     CAUTION You must make the istallation of the version 6 of the software
  TMsl:    CAUTION You must install the version 6 of the software
  TMtg:    ATTENZIONE Si deve installare la versione 6 del software
  parTMsl: CAUTION You must make the istallation of the version 6 of the software

6 Conclusion

In this paper we have presented a method that improves the fuzzy matching of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible in cases where the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, translators' satisfaction and trust are high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and will support other source languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of the translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1, Centro Interuniversitário de Estudos Germanísticos. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Bicici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS. Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

Figure 1: An example of generated subsegments – consistent phrases.
  CZ: být větší   EN: be greater   probabilities: 0.538 0.053 0.538 0.136   alignment: 0-0 1-1
  CZ: být větší   EN: be larger    probabilities: 0.170 0.054 0.019 0.148   alignment: 0-0 1-1

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases as subsegments from it. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
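The alignment-point notation can be handled with a few lines of code. The function names and the returned flags below are illustrative, not part of the Moses toolkit:

```python
def parse_alignment(points: str):
    """Parse Moses-style alignment points like '0-0 1-1' into index pairs."""
    return [tuple(map(int, p.split("-"))) for p in points.split()]

def alignment_properties(pairs, src_len, tgt_len):
    """Flag the three cases distinguished in the text (illustrative names)."""
    src_aligned = {s for s, _ in pairs}
    tgt_aligned = {t for _, t in pairs}
    per_source = {}
    for s, t in pairs:
        per_source.setdefault(s, set()).add(t)
    return {
        # 1) empty alignment: some word on either side has no counterpart
        "empty": len(src_aligned) < src_len or len(tgt_aligned) < tgt_len,
        # 2) one-to-many: a source word aligned to several target words
        "one_to_many": any(len(ts) > 1 for ts in per_source.values()),
        # 3) opposite orientation: alignment links cross each other
        "reordered": any(
            (s1 < s2 and t1 > t2)
            for (s1, t1) in pairs for (s2, t2) in pairs
        ),
    }

pairs = parse_alignment("0-0 1-1")  # 'byt vetsi' -> 'be greater'
print(alignment_properties(pairs, src_len=2, tgt_len=2))
```

For the monotone example from Figure 1 all three flags are False; a crossing alignment such as `0-1 1-0` would set the reordering flag.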

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:      "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
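The joining procedure can be sketched over token-index spans. This is a simplified re-implementation of the description above (greedy, longest-first), with illustrative names; it omits the translation side and the fluency filtering:

```python
def try_join(a, b, mode="neighbour"):
    """Join two subsegment spans (start, end) over a tokenized segment.

    'neighbour' joins adjacent spans (JOINN); 'overlap' joins spans that
    share tokens (JOINO). Returns the merged span, or None if not joinable.
    """
    (s1, e1), (s2, e2) = sorted([a, b])
    if mode == "neighbour" and e1 == s2:
        return (s1, e2)
    if mode == "overlap" and s2 < e1 <= e2:
        return (s1, e2)
    return None

def join_subsegments(spans, mode="neighbour"):
    """Greedy joining preferring longer subsegments, as described above."""
    I = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    results = set(I)
    while I:
        head, rest = I[0], I[1:]
        T = []  # newly created (longer) subsegments
        for other in rest:
            joined = try_join(head, other, mode)
            if joined and joined not in results:
                T.append(joined)
                results.add(joined)
        # prepend new subsegments, discard the processed head
        I = sorted(T, key=lambda s: s[1] - s[0], reverse=True) + rest
    return sorted(results)

# spans over a 6-token segment: [0,2) joins [2,5); then [0,5) joins [5,6)
print(join_subsegments([(0, 2), (2, 5), (5, 6)]))
# [(0, 2), (0, 5), (0, 6), (2, 5), (2, 6), (5, 6)]
```

The set of results is finite (spans are bounded by the segment length), so the loop terminates even though joins feed new spans back into the list.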

2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
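The combined n-gram fluency score can be approximated without KenLM. The counting and averaging scheme below is an illustrative stand-in for the actual language-model query, trained here on a toy two-sentence corpus:

```python
import math
from collections import Counter

def train_counts(corpus, max_n=5):
    """Collect n-gram counts (n = 1..max_n) from a tokenized corpus."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sent in corpus:
        toks = sent.split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[n][tuple(toks[i:i + n])] += 1
    return counts

def fluency(sentence, counts, max_n=5):
    """Combined n-gram score: average log-frequency over all n-gram orders."""
    toks = sentence.split()
    score, total = 0.0, 0
    for n in range(1, max_n + 1):
        for i in range(len(toks) - n + 1):
            score += math.log(1 + counts[n][tuple(toks[i:i + n])])
            total += 1
    return score / total if total else 0.0

corpus = ["can be divided into the following categories",
          "the following categories are described below"]
counts = train_counts(corpus)
good = fluency("divided into the following categories", counts)
bad = fluency("categories following the into divided", counts)
print(good > bad)  # True
```

A scrambled candidate shares the unigrams but none of the higher-order n-grams, so it scores lower and would be discarded when ordering candidate combinations.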

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company, and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment Assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment Assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, compared with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT:
feature   SUB    OJ     NJ     NS
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error example, Czech → English.

source seg.:     Oblast dat může mít libovolný tvar
reference:       The data area may have an arbitrary shape
generated seg.:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting the retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).
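The matching-and-retrieval loop described above can be sketched with the standard library's difflib. The similarity measure, toy TM entries and threshold here are illustrative, not the matching algorithm of any particular CAT tool:

```python
import difflib

# Toy TM: (source, target) pairs; illustrative stand-ins, not real tool data.
TM = [
    ("Ensure that the brake hose is not twisted",
     "Assicurarsi che il tubo flessibile freni non sia attorcigliato"),
    ("Switch off the engine", "Spegnere il motore"),
]

def retrieve(segment, tm, threshold=0.5):
    """Return TM suggestions ranked by character-level similarity."""
    scored = []
    for src, tgt in tm:
        sim = difflib.SequenceMatcher(None, segment.lower(), src.lower()).ratio()
        if sim >= threshold:
            scored.append((sim, src, tgt))
    return sorted(scored, reverse=True)

for sim, src, tgt in retrieve("Make sure that the brake hose is not twisted", TM):
    print(f"{sim:.2f}  {src}  ->  {tgt}")
```

A color coding scheme of the kind CATaLog proposes would then highlight which parts of the best-scoring source segment differ from the input sentence, marking the spans most likely to need post-editing.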

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the previous section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each input segment to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, relative to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1, see footnote 5) to find the edit rate between a test sentence and the TM reference sentences.
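A minimal sketch of such an edit-rate computation is given below (a simplified, hypothetical illustration: the actual tercom tool additionally searches for block shifts, which this word-level Levenshtein sketch omits).

```python
# Word-level edit rate in the spirit of TER (sketch only: real TER, as
# computed by tercom, also allows block shifts, which are omitted here).

def edit_rate(hyp, ref):
    """Minimum insertions/deletions/substitutions needed to turn hyp into
    ref, divided by the reference length."""
    h, r = hyp.split(), ref.split()
    # Standard dynamic-programming edit distance over words, one row at a time.
    dist = list(range(len(r) + 1))
    for i in range(1, len(h) + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, len(r) + 1):
            cur = min(dist[j] + 1,                       # delete hyp word
                      dist[j - 1] + 1,                   # insert ref word
                      prev + (h[i - 1] != r[j - 1]))     # substitute / match
            prev, dist[j] = dist[j], cur
    return dist[len(r)] / max(len(r), 1)

# The example from the text: 1 deletion + 3 substitutions over 9 reference words.
print(edit_rate("we would like a table by the window",
                "we want to have a table near the window"))
```

Lower values mean a closer TM match; the score for identical segments is 0.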

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
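The weighted scoring mechanism can be sketched as a weighted edit distance (the weights below are illustrative; the paper does not state the exact values used in CATaLog):

```python
# Sketch of edit-operation weighting: deletions are made much cheaper than
# insertions and substitutions (W_DEL = 0.1 is an illustrative choice).
W_INS, W_DEL, W_SUB = 1.0, 0.1, 1.0

def weighted_edits(tm_seg, test_seg):
    """Weighted cost of editing tm_seg into test_seg (no shifts)."""
    s, t = tm_seg.split(), test_seg.split()
    dist = [j * W_INS for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        prev, dist[0] = dist[0], i * W_DEL
        for j in range(1, len(t) + 1):
            cur = min(dist[j] + W_DEL,       # delete a TM word
                      dist[j - 1] + W_INS,   # insert a test word
                      prev + (0 if s[i - 1] == t[j - 1] else W_SUB))
            prev, dist[j] = dist[j], cur
    return dist[len(t)]

test = "how much does it cost"
tm1 = "how much does it cost to the holiday inn by taxi"
tm2 = "how much"
# With cheap deletions, tm1 (6 deletions) scores lower, i.e. better, than
# tm2 (3 insertions), reversing the equal-weight ranking.
print(weighted_edits(tm1, test), weighted_edits(tm2, test))
```

With equal weights the same function would give tm1 a cost of 6 and tm2 a cost of 3, reproducing the counter-intuitive ranking discussed above.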

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color-code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is fed directly into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, in both the selected source sentences and their corresponding translations.
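The projection of source-side matches onto the target side can be sketched as follows (a hypothetical simplification: the matched positions and the toy word alignment are illustrative, while the real tool reads them from the TER and GIZA++ alignment files):

```python
# Sketch: TER tells us which TM source tokens match the input sentence, and
# the source-target word alignment projects that information onto the target.

def color_target(matched_src_positions, word_alignment, tgt_len):
    """word_alignment is a list of 0-based (src_pos, tgt_pos) pairs.
    A target token is green if it is aligned to a matched source token."""
    colors = ["red"] * tgt_len
    for s, t in word_alignment:
        if s in matched_src_positions:
            colors[t] = "green"
    return colors

# TM source: "we want to have a table near the window" -- suppose TER marked
# positions 0, 4, 5, 7, 8 ("we", "a", "table", "the", "window") as matched.
matched = {0, 4, 5, 7, 8}
# Toy alignment of those source words to a 5-token target translation.
alignment = [(0, 0), (4, 2), (5, 3), (7, 1), (8, 1)]
print(color_target(matched, alignment, 5))
# → ['green', 'green', 'green', 'green', 'red']
```

Target tokens aligned only to unmatched source words stay red, signalling the fragments a post-editor will have to change.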

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for many non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me


5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi )

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen )

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )

For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval, using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to switch these posting lists on or off; this feature exists to compare results with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
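The posting-list filtering described above can be sketched as follows (the stop-word list and whitespace tokenization here are illustrative placeholders, not the ones used in the actual tool):

```python
# Minimal inverted-index sketch: map each content word to the set of TM
# sentence ids containing it, then retrieve candidates by set union.
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of"}  # illustrative stop-word list

def build_index(tm_sources):
    index = defaultdict(set)
    for sent_id, sent in enumerate(tm_sources):
        for word in sent.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sent_id)
    return index

def candidates(index, query):
    """TM sentences sharing at least one content word with the query;
    only these are passed on to the (expensive) TER computation."""
    ids = set()
    for word in query.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
idx = build_index(tm)
print(sorted(candidates(idx, "we would like a table by the window")))
# → [0]
```

Only the candidate subset is then scored with TER, which is what saves time on a large TM database.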

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to the different edit operations. We believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira and Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodoroou, Konstantinos 24

Horak, Ales 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces

2 Related Work

Translation memory systems are extensively used today. The main tasks they help accomplish are the localization of digital information and translation (Reinke, 2013). Because translation memories are stored in databases, the principal optimization from a technical point of view is the speed of retrieval.

There are two non-technical requirements that translation memory systems should fulfill which interest the research community: the accuracy of retrieval and translation memory cleaning. While there is a fair amount of work on improving the accuracy of retrieved segments (e.g. (Zhechev and van Genabith, 2010), (Koehn and Senellart, 2010)), to the best of our knowledge memory cleaning is a neglected research area. To be fair, there are software tools that incorporate basic methods of data cleaning; we would like to mention Apsic X-Bench2. Apsic X-Bench implements a series of syntactic checks for the segments: it checks, for example, whether an opened tag is closed, whether a word is repeated, or whether a word is misspelled. It also integrates terminological dictionaries and verifies that terms are translated accurately. The main assumptions behind these validations seem to be that the translation memory bi-segments contain accidental errors (e.g. tags not closed) or that translators sometimes use inaccurate terms that can be spotted with a bilingual terminology. These assumptions hold for translation memories produced by professional translators, but not for collaborative memories and memories derived from parallel corpora.

A task somewhat similar to translation memory cleaning as envisioned in section 1 is Quality Estimation in Machine Translation. Quality Estimation can also be modeled as a classification task where the goal is to distinguish between accurate and inaccurate translations (Li and Khudanpur, 2009). The difference is that the sentences whose quality should be estimated are produced by Machine Translation systems and not by humans. Therefore, the features that help to discriminate between good and bad translations in that approach are different from those in ours.

2 http://www.xbench.net

3 The Data

In this section we describe the process of obtaining the data for training and testing the classifiers. The positive training examples are segments where the source segment is correctly translated by the target segment. The negative training examples are translation memory segments that are not true translations. Before explaining how we collected the examples, it is useful to understand what kind of errors the translation memory part of MyMemory contains. They can be roughly classified into four types:

1. Random Text. Random Text errors are cases where one or both segments is/are random text. They occur when a malevolent contributor uses the platform to copy and paste random texts from the web.

2. Chat. This type of error occurs when the translation memory contributors exchange messages instead of providing translations. For example, the English text "How are you?" translates in Italian as "Come stai?". Instead of providing the translation, the contributor answers "Bene" ("Fine").

3. Language Error. This kind of error occurs when the languages of the source or target segments are mistaken. The contributors accidentally interchange the languages of the source and target segments. We would like to recover from this error and pass the correct source and target segments to the classifier. There are also cases where a different language code is assigned to the source or target segment. This happens when the parallel corpora contain segments in multiple languages (e.g. the English part of the corpus contains segments in French) and the aligner does not check the language code of the aligned segments.

4. Partial Translations. This error occurs when the contributors translate only a part of the source segment. For example, the English source segment "Early 1980s Muirfield CC" is translated in Italian only partially: "Primi anni 1980" ("Early 1980s").

The Random Text and Chat errors arise from the collaborative strategy of enriching MyMemory. The Language Error and Partial Translations errors are pervasive.


It is relatively easy to find positive examples, because the large majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Gale and Church (Gale and Church, 1993): in a parallel corpus, the corresponding segments have roughly the same length3. To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1:

CG = (ls − ld) / √(3.4 (ls + ld))    (1)
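Equation 1 is straightforward to transcribe (a sketch assuming, as in Gale and Church's original work, that the segment lengths ls and ld are measured in characters):

```python
# Plain transcription of the modified Church-Gale length difference:
# CG = (ls - ld) / sqrt(3.4 * (ls + ld)), with character lengths.
import math

def church_gale(src, tgt):
    ls, ld = len(src), len(tgt)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))

# The chat example from section 3: a legitimate pair gives a small score.
print(church_gale("How are you?", "Come stai?"))
```

Bi-segments whose lengths diverge strongly, such as a long source paired with a one-word reply, receive scores far from zero in either direction.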

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat4. The bi-segments of this set were produced by professional translators and have few errors. The other set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different. This is indeed the case. The plot for the Matecat Set is slightly skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [−4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center; therefore, to get a better view of the distribution, the y axis is restricted to the interval [0, 0.01]. For the Collaborative Set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [−131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set; as said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set, distributed as follows:

• Training Set: it contains 1243 bi-segments, of which 373 are negative examples.

• Test Set: it contains 309 bi-segments, of which 87 are negative examples.

The proportion of negative examples in both sets is approximately 30%.
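The score-biased sampling step described above can be sketched as follows (the weighting scheme and toy data are illustrative, not those used to build the actual sets):

```python
# Sketch of score-biased sampling: bi-segments far from the center of the
# Church-Gale distribution are drawn with higher probability.
import random

def biased_sample(bisegments, cg_scores, k, seed=0):
    rng = random.Random(seed)
    # Favour extreme scores; the small constant keeps every weight positive.
    weights = [abs(s) + 0.1 for s in cg_scores]
    return rng.choices(bisegments, weights=weights, k=k)

pairs = [("ok", "va bene"),
         ("spam spam spam spam spam", "no"),
         ("hi", "ciao")]
scores = [0.2, 45.0, 0.1]  # the second pair is a length-divergence outlier
sample = biased_sample(pairs, scores, k=5)
print(sample)
```

The heavily weighted outlier dominates the sample, which is exactly the behavior needed to surface candidate negative examples for manual validation.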

Figure 5: The Q-Q plot for the Matecat Set.

Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same. This feature takes two values: 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, where the source segment is copied into the target segment. Of course, there are perfectly legitimate cases where the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases where a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.

• has_url: The value of the feature is 1 if the source or target segments contain a URL address, otherwise it is 0.

• has_tag: The value of the feature is 1 if the source or target segments contain a tag, otherwise it is 0.

• has_email: The value of the feature is 1 if the source or target segments contain an email address, otherwise it is 0.

• has_number: The value of the feature is 1 if the source or target segments contain a number, otherwise it is 0.

• has_capital_letters: The value of the feature is 1 if the source or target segments contain words that have at least one capital letter, otherwise it is 0.

• has_words_capital_letters: The value of the feature is 1 if the source or target segments contain words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation_similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag_similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/is not present in both the source and the target segments).

• email_similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity. This feature combines with the feature has_email to exhaust all possibilities.

• url_similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.

• number_similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag_similarity.


• bisegment_similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital_letters_word_difference: The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.

• only_capletters_dif: The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.

• lang_dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
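The lang_dif computation described above can be sketched as follows (a minimal illustration; the function name is ours, not the paper's):

```python
# Sketch of the lang_dif feature: count how many of the two declared language
# codes disagree with the codes returned by a language detector.
def lang_dif(declared, detected):
    # declared/detected are (source_code, target_code) pairs, e.g. ("en", "it")
    return sum(1 for d, t in zip(declared, detected) if d != t)

print(lang_dif(("en", "it"), ("en", "it")))  # 0 (en-en/it-it)
print(lang_dif(("en", "it"), ("en", "fr")))  # 1 (en-en/it-fr)
print(lang_dif(("en", "it"), ("de", "fr")))  # 2 (en-de/it-fr)
```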

All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
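As an illustration, a punctuation-vector cosine similarity such as punctuation_similarity might be computed as below (the helper names, the punctuation inventory, and the zero-vector convention are our assumptions, not the paper's implementation):

```python
import math
from collections import Counter

# Assumed punctuation inventory; the paper does not specify one.
PUNCTUATION = list(".,;:!?()[]{}\"'-")

def punctuation_vector(segment):
    # Count each punctuation mark occurring in the segment.
    counts = Counter(ch for ch in segment if ch in PUNCTUATION)
    return [counts[p] for p in PUNCTUATION]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Our convention: two segments with no punctuation count as similar.
        return 1.0 if norm_u == norm_v else 0.0
    return dot / (norm_u * norm_v)

source = "Hello, world! How are you?"
target = "Ciao, mondo! Come stai?"
print(cosine_similarity(punctuation_vector(source), punctuation_vector(target)))
```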

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if this situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A Random Forest has a lower probability of overfitting the data than a Decision Tree.

• Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on the distance it has to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
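The classifier line-up above can be sketched with scikit-learn roughly as follows. Synthetic data stands in for the paper's feature matrix, and the hyperparameters are defaults, so this is an illustration of the setup rather than the authors' actual code:

```python
# Sketch of the classifier comparison described above, using scikit-learn.
# X and y are synthetic stand-ins for the paper's 1243-example training set
# with its roughly 70/30 positive/negative split.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1243, weights=[0.7, 0.3], random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

for name, clf in classifiers.items():
    # Three-fold stratified cross-validation, as in the paper's first evaluation.
    scores = cross_val_score(clf, X, y, cv=3, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```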

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions by respecting the training set class distribution. The results of the first evaluation are given in Table 1.
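The two baselines correspond to scikit-learn's DummyClassifier strategies; a sketch, again on synthetic stand-in data rather than the paper's training set:

```python
# The two baselines can be reproduced with scikit-learn's DummyClassifier:
# "uniform" predicts randomly, "stratified" respects the class distribution.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1243, weights=[0.7, 0.3], random_state=0)

for strategy in ("uniform", "stratified"):
    baseline = DummyClassifier(strategy=strategy, random_state=0)
    scores = cross_val_score(baseline, X, y, cv=3, scoring="f1")
    print(f"Baseline {strategy}: mean F1 = {scores.mean():.2f}")
```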

Algorithm              Precision  Recall  F1
Random Forest          0.95       0.97    0.96
Decision Tree          0.98       0.97    0.97
SVM                    0.94       0.98    0.96
K-Nearest Neighbors    0.94       0.98    0.96
Logistic Regression    0.92       0.98    0.95
Gaussian Naive Bayes   0.86       0.96    0.91
Baseline Uniform       0.69       0.53    0.60
Baseline Stratified    0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification

Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in Table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets. Obviously, this variety is not fully captured by the training set.

Algorithm              Precision  Recall  F1
Random Forest          0.85       0.63    0.72
Decision Tree          0.82       0.69    0.75
SVM                    0.82       0.81    0.81
K-Nearest Neighbors    0.83       0.66    0.74
Logistic Regression    0.80       0.80    0.80
Gaussian Naive Bayes   0.76       0.61    0.68
Baseline Uniform       0.71       0.72    0.71
Baseline Stratified    0.70       0.51    0.59

Table 2: The results of the classification on the test set

Unlike in the first evaluation, in the second one we have two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.
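The "around 10%" figure can be checked directly from the quoted confusion counts:

```python
# Sanity-checking the confusion-table figures quoted for the SVM classifier.
tp, fp, fn, tn = 175, 42, 32, 60
total = tp + fp + fn + tn
assert total == 309  # all test-set examples accounted for

# Fraction of all examples that are false negatives (good bi-segments lost).
false_negative_rate_overall = fn / total
print(round(false_negative_rate_overall, 2))  # ≈ 0.10, the "around 10%" in the text
```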

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and more sparse for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
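The proposed precision-oriented filtering could look roughly like this; the 0.9 threshold and the synthetic data are our illustrative choices, not values from the paper:

```python
# Sketch of the proposed high-confidence filtering: rank bi-segments by the
# Logistic Regression probability score and act only on confident predictions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

probs = clf.predict_proba(X)[:, 1]   # probability of the positive class
keep = probs >= 0.9                  # assumed confidence threshold
print(int(keep.sum()), "of", len(y), "bi-segments kept as confident positives")
```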

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments. For example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
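A toy sketch of the suggested dictionary-based alternative (the entries are illustrative; a real dictionary would be built with a word aligner, as the text suggests):

```python
# Toy word-by-word glossing via an indexed bilingual dictionary, as a
# lightweight stand-in for the full machine translation module.
en_it = {"ministry": "ministero", "of": "di", "defense": "difesa"}

def gloss(segment):
    # Unknown words are passed through unchanged.
    return " ".join(en_it.get(word, word) for word in segment.lower().split())

print(gloss("Ministry of defense"))  # -> "ministero di difesa"
```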

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bulletin of the Medical Library Association, 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jorg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, Sept 2015

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study, we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence, and at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than phrase matches, for example.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper, we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
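The (a)/(b) example above can be illustrated with a simple sequence-similarity score; difflib's ratio stands in here for the Levenshtein-based metrics of real TM tools:

```python
# Word-level similarity of the (a)/(b) example, before and after clause
# splitting, using difflib as a stand-in for a TM tool's fuzzy match score.
from difflib import SequenceMatcher

a = ["w%d" % i for i in range(1, 21)]            # input sentence: w1 ... w20
b = a[:5] + ["w%d" % i for i in range(21, 36)]   # TM sentence: w1..w5 w21..w35

full_score = SequenceMatcher(None, a, b).ratio()
print(round(full_score, 2))  # 0.25: the shared 25%, far below a 70-75% threshold

# After clause splitting, the shared clause is its own segment and matches fully.
clause_score = SequenceMatcher(None, a[:5], b[:5]).ratio()
print(clause_score)  # 1.0
```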

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008; cited in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1. The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting

  Input:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

  Segment in TM:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting

  Input:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war

  Segments in TM:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors

Table 1: Set A Example

Without clause splitting

  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segment in TM:
    a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting

  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    23.33                  90.00
Retrieved (Minimum threshold)    38.00                  92.22

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    14.00                  88.89
Retrieved (Minimum threshold)    14.00                  96.67

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
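The paired t-test described above (run in SPSS by the authors) can be sketched by hand on illustrative per-segment scores; the two score lists below are made up for demonstration, not the paper's data:

```python
# Paired t-test statistic on per-segment match scores, computed manually.
# A score of 0 means no match (or the wrong match) was retrieved.
import math
from statistics import mean, stdev

without_split = [0, 0, 75, 0, 0, 80, 0, 0, 0, 0]        # illustrative only
with_split = [100, 95, 100, 85, 0, 100, 90, 100, 95, 88]  # illustrative only

diffs = [a - b for a, b in zip(without_split, with_split)]
n = len(diffs)
# t statistic with n-1 degrees of freedom; a large |t| indicates significance.
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
print(round(t_stat, 2))
```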

SET A

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    19.23                       85.30888891                0.000
Minimum threshold    29.28                       86.48888891                0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    11.78                       83.87777781                0.000
Minimum threshold    11.78                       88.07111113                0.000

SET B

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    2.23                        17.21111111                0.000
Minimum threshold    6.49                        30.62111112                0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold    2.71                        23.083                     0.000
Minimum threshold    16.83                       46.36777779                0.000

Table 5: Paired t-test on all experiments

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    2.67                   17.84
Retrieved (Minimum threshold)    10.00                  41.08

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)    3.33                   25.95
Retrieved (Minimum threshold)    36.00                  70.67

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent this helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance and so cannot retrieve matches that are similar in meaning but worded differently. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to those of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only improve the raw MT output; they do not improve fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.
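A minimal sketch of the TM-first routing strategy described above (the `overlap_score` similarity, the 0.7 threshold and the `machine_translate` stub are illustrative placeholders, not details of the cited systems):

```python
def overlap_score(a, b):
    """Toy similarity: proportion of shared tokens (stand-in for a fuzzy score)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta), len(tb))

def best_tm_match(segment, tm, score_fn):
    """Return the best-scoring (source, target) pair from the TM."""
    return max(tm, key=lambda pair: score_fn(segment, pair[0]))

def translate(segment, tm, score_fn, machine_translate, threshold=0.7):
    """Use the TM suggestion when the match is good enough,
    otherwise fall back to machine translation (cf. He et al., 2010a)."""
    src, tgt = best_tm_match(segment, tm, score_fn)
    if score_fn(segment, src) >= threshold:
        return tgt  # human-translated suggestion, to be post-edited
    return machine_translate(segment)  # MT fallback

tm = [("Press Cancel to cancel your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
print(translate("Press Cancel to cancel your personal information",
                tm, overlap_score, lambda s: "<MT output>"))
```

Here an exact source-side match scores 1.0, so the stored human translation is returned and the MT stub is never called.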

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the suggestion and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
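As an illustrative sketch (not the exact algorithm of any commercial CAT tool), a word-based fuzzy match score in the style of Koehn's (2010) formula can be computed from the Levenshtein distance over tokens:

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences
    (minimum number of insertions, deletions and substitutions)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(s1, s2):
    """Word-based fuzzy match: 1 - edit_distance / max segment length."""
    w1, w2 = s1.split(), s2.split()
    return 1 - edit_distance(w1, w2) / max(len(w1), len(w2))

score = fuzzy_match(
    "Press Cancel to make the cancellation of your personal information",
    "Press Cancel to cancel your personal information")
print(round(score, 2))  # 1 substitution + 3 deletions over 10 tokens -> 0.6
```

With this simple formula the pair (1)/(2) scores around 60%; the exact percentages reported in the paper may differ, since commercial tools use undisclosed variants of the computation.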

In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC, while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb, maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except for some cases, they appear either without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as: the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.); one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.); one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.); one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics); and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
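For illustration only, a NooJ-style dictionary line of the kind shown in Figure 1 can be split into lemma, category and properties with a few lines of code (the comma-separated "lemma,CAT+prop" layout is an assumption based on the NooJ dictionary format; the entries are the paper's examples):

```python
def parse_entry(line):
    """Split a NooJ-style dictionary entry 'lemma,CAT+prop1+prop2=value'
    into a lemma, a category and a dict of properties."""
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    category = parts[0]
    props = {}
    for p in parts[1:]:
        key, _, value = p.partition("=")
        props[key] = value or True  # flag-style properties map to True
    return lemma, category, props

lemma, cat, props = parse_entry("artist,N+FLX=TABLE+Hum")
print(lemma, cat, props)  # artist N {'FLX': 'TABLE', 'Hum': True}
```

The same parse applies to all five entries in Figure 1, e.g. man,N+FLX=MAN+Hum yields the inflectional paradigm MAN and the semantic property +Human.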

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization and, optionally, a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units.

This TM should be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
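As an illustrative sketch of the overall idea (not the actual NooJ-based implementation), the expansion step can be reduced to a toy SVC paraphraser over an English TM; the NOMINALIZATIONS lexicon and the regular-expression pattern below are hypothetical stand-ins for the NooJ dictionaries and local grammars, but the key property of the pipeline is visible: paraphrased source units are appended while the target side is left untouched.

```python
import re

# Hypothetical mini-lexicon mapping predicate nouns to full verbs;
# the real system derives these from NooJ derivational paradigms.
NOMINALIZATIONS = {"cancellation": "cancel", "reservation": "reserve"}

# Toy stand-in for the local grammar: support verb + determiner +
# predicate noun + optional preposition.
SVC_PATTERN = re.compile(
    r"\b(make|makes)\s+(?:a|an|the)\s+(\w+)(?:\s+of)?\b", re.IGNORECASE)

def paraphrase_svc(source):
    """Replace a support-verb construction with its full-verb paraphrase."""
    def repl(m):
        verb = NOMINALIZATIONS.get(m.group(2).lower())
        return verb if verb else m.group(0)  # unknown noun: leave unchanged
    return SVC_PATTERN.sub(repl, source)

def expand_tm(tm):
    """Append paraphrased source units; target units remain unchanged."""
    expanded = list(tm)
    for source, target in tm:
        para = paraphrase_svc(source)
        if para != source:
            expanded.append((para, target))
    return expanded

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
print(expand_tm(tm)[1][0])  # Press Cancel to cancel your personal information
```

The expanded TM contains both the original unit and the paraphrased unit, each paired with the same human-translated Italian segment.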

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, such that at least the minimum threshold of 50% is met. To create the analysis report logs we used Trados Studio 2014.1

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and finally 27% in category 0%-74% (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category    plTM    parTM
100%                      14       48
95%-99%                   23       32
85%-94%                   18       29
75%-84%                   51       38
50%-74%                   32       18
No Match                  62       35
Total                    200      200

Table 1: Statistics for the experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required in cases where the match is not equal to 100%.

While there are some drops in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Example 1:
  Seg:     Make sure that the brake hose is not twisted
  TMsl:    Ensure that the brake hose is not twisted
  TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
  parTMsl: Make sure that the brake hose is not twisted

Example 2:
  Seg:     CAUTION You must make the istallation of the version 6 of the software
  TMsl:    CAUTION You must install the version 6 of the software
  TMtg:    ATTENZIONE Si deve installare la versione 6 del software
  parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same even though the lexical items are not shared.

The strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs, and the support of other languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.

References

Athayde, M. F. (2001). Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. (2008). Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. (2008). Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. (2003). The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. (2014). Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. (2014). Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. (2010a). Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. (2010b). Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, D. S. (1997). Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford: Oxford University Press.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. (1980). The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. (2010). Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. (1999). Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. (2009). OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. (2003). NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. (2009). EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. (2014). Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. (2010). Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated. Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithmsfor searching matching and suggesting segments

31

CZ EN probabilities alignmentbyacutet vetšiacute be greater 0538 0053 0538 0136 0-0 1-1byacutet vetšiacute be larger 0170 0054 0019 0148 0-0 1-1

Figure 1 An example of generated subsegments ndash consistent phrases

within CAT systems (Planas and Furuse 2000)In (Deacutesilets et al 2008) the authors have

attempted to build translation memories fromWeb since they found that human translators inCanada use Google search results even more of-ten than specialized translation memories That iswhy they developed system WeBiText for extract-ing possible segments and their translations frombilingual web pages

In the study (Nevado et al 2004) the authorsexploited two methods of segmentation of transla-tion memories Their approach starts with a sim-ilar assumption as our subsegment combinationmethods presented below ie that a TM cover-age can be increased by splitting the TM segmentsto smaller parts (subsegments) In both cases thesubsegments are generated via the phrase-basedmachine translation (PBMT) technique (Koehn etal 2003) However our methods do not presentthe subsegments as the results The subsegmentsare used in segment combination methods to ob-tain new larger translational phrases or full seg-ments in the best case

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments of already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.
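As an illustration, entries like those in Figure 1 can be read with a few lines of Python. The ` ||| `-separated layout below is an assumption modelled on the Moses phrase-table format, not necessarily the exact file format produced by the authors' pipeline:

```python
def parse_subsegment(line):
    # split a Moses-style phrase-table line into its four fields
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    scores = [float(p) for p in probs.split()]
    points = [tuple(map(int, p.split("-"))) for p in align.split()]
    return src, tgt, scores, points

def best_translation(lines):
    # pick the candidate with the highest direct phrase translation
    # probability (the third of the four Moses scores)
    return max(map(parse_subsegment, lines), key=lambda e: e[2][2])

entries = [
    "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1",
    "být větší ||| be larger ||| 0.170 0.054 0.019 0.148 ||| 0-0 1-1",
]
```

On the two Figure 1 entries, `best_translation` selects "be greater", whose direct phrase translation probability (0.538) exceeds that of "be larger" (0.019).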

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination, with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
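A minimal sketch of one plausible reading of the SUBSTITUTEO operation, using the Table 1 example: the subsegment is anchored at an overlapping token sequence in the segment and spliced in, replacing the words around the gap. The function name and the anchoring heuristic are our own, not the paper's implementation:

```python
def substitute_o(segment, subseg):
    # one plausible reading of SUBSTITUTE-O: anchor the subsegment at the
    # longest (rightmost) overlapping token sequence inside the segment
    # and splice it in, replacing the words it covers around the gap
    n, m = len(segment), len(subseg)
    for k in range(min(n, m), 0, -1):          # try the longest overlap first
        tail = subseg[m - k:]
        for i in range(n - k, -1, -1):
            if segment[i:i + k] == tail:
                start = max(i - (m - k), 0)
                return segment[:start] + subseg + segment[i + k:]
    return None  # no overlap: SUBSTITUTE-O does not apply
```

Applied to the Table 1 pair, the overlap token "kategorií" anchors "následujících kategorií" so that "těchto" is replaced, yielding "lze rozdělit do následujících kategorií".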

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
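The combined n-gram fluency score can be sketched with a toy count-based model standing in for the KenLM model used in the paper; the add-one term and the averaging are our assumptions, kept only to make the sketch self-contained:

```python
from collections import Counter
import math

def ngram_counts(corpus, max_n=5):
    # count all n-grams (n = 1..max_n) in a tokenized corpus
    counts = Counter()
    for sent in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                counts[tuple(sent[i:i + n])] += 1
    return counts

def fluency(tokens, counts, max_n=5):
    # combined n-gram score: mean log(1 + count) over every n-gram of the
    # candidate for n = 1..5; a stand-in for the KenLM model scores
    scores = [math.log(1 + counts[tuple(tokens[i:i + n])])
              for n in range(1, max_n + 1)
              for i in range(len(tokens) - n + 1)]
    return sum(scores) / max(len(scores), 1)

corpus = [["a", "table", "near", "the", "window"],
          ["a", "table", "by", "the", "window"]]
counts = ngram_counts(corpus)
```

A candidate whose higher-order n-grams appear in the target corpus scores better than a scrambled variant of the same words, which is the property the combination step relies on.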

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
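The greedy joining loop described above can be approximated as follows. Subsegments are reduced to (start, end) token spans in the input segment, which simplifies away the bookkeeping of the lists I and T; this is a sketch of the preference for long subsegments, not the authors' exact algorithm:

```python
def join_subsegments(spans):
    # spans: (start, end) token positions of subsegments found in one
    # input segment, end exclusive; merge overlapping (JOIN-O) or
    # neighbouring (JOIN-N) spans, preferring the longest subsegments
    work = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    merged = True
    while merged:
        merged = False
        for i in range(len(work)):
            for j in range(i + 1, len(work)):
                a, b = work[i], work[j]
                if a[1] >= b[0] and b[1] >= a[0]:  # overlap or adjacency
                    new = (min(a[0], b[0]), max(a[1], b[1]))
                    work = [s for k, s in enumerate(work) if k not in (i, j)]
                    work.insert(0, new)  # the new longer span is tried first
                    merged = True
                    break
            if merged:
                break
    return work
```

Overlapping spans such as (0, 3) and (2, 5) merge into (0, 5), adjacent spans (0, 2) and (2, 4) merge into (0, 4), while spans separated by a gap are left for the SUBSTITUTE methods.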

2 In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company, and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the company's complete TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document against segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM; a 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).

33

Table 2: MemoQ analysis, TM coverage in %

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95-99%   0.84   0.91   0.64   0.90   1.37
85-94%   0.07   0.05   0.25   0.76   0.81
75-84%   0.80   0.91   1.71   3.78   4.40
50-74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95-99%   0.75   0.44   0.49   0.66
85-94%   0.05   0.08   0.49   0.61
75-84%   0.46   0.96   3.67   3.85
50-74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87

For comparison, we have also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95-99%   0.98   0.59   1.24   1.45
85-94%   0.09   0.22   1.34   1.37
75-84%   1.03   2.26   6.35   7.07
50-74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, compared to a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations to the target language. When we take into ac-

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape

count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and to measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language-pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments, and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
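The ranking step can be sketched as follows. We use a plain word-level edit rate as a simplification of TER, which additionally allows block shifts; the sentence data is illustrative:

```python
def edit_rate(test_tokens, ref_tokens):
    # word-level Levenshtein distance divided by the reference length;
    # a simplification of TER (the real metric also allows block shifts)
    m, n = len(test_tokens), len(ref_tokens)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(row[j] + 1,       # deletion
                         row[j - 1] + 1,   # insertion
                         prev + (test_tokens[i - 1] != ref_tokens[j - 1]))
            prev = cur
    return row[n] / max(n, 1)

def top_matches(test_sentence, tm_sources, k=5):
    # lower edit rate means higher similarity: return the k best TM entries
    return sorted(tm_sources,
                  key=lambda ref: edit_rate(test_sentence.split(),
                                            ref.split()))[:k]

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
```

For the input "we would like a table by the window", the first TM sentence is ranked highest, since only a few words need to be edited.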

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric, and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. The allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0], ["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0], ["window","window",C,0], [".",".",C,0]

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score, TM segment


2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
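The effect of a cheap deletion weight can be reproduced with a word-level weighted edit distance; this is a simplification of the system's TER-alignment-based scoring (block shifts are ignored, and the weight values are illustrative):

```python
def edit_cost(tm_tokens, test_tokens, w_del=1.0, w_ins=1.0, w_sub=1.0):
    # weighted edit distance for converting a TM segment into the test
    # segment; a simplification of the TER-alignment-based scoring in
    # the paper (TER's block-shift operation is ignored)
    m, n = len(tm_tokens), len(test_tokens)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm_tokens[i - 1] == test_tokens[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,    # delete a TM word
                          d[i][j - 1] + w_ins,    # insert a test word
                          d[i - 1][j - 1] + sub)  # match or substitute
    return d[m][n]

test_seg = "how much does it cost".split()
seg1 = "how much does it cost to the holiday inn by taxi".split()
seg2 = "how much".split()
```

With equal weights, segment 1 costs 6 deletions against segment 2's 3 insertions, so segment 2 wins; with a deletion weight of, say, 0.1, segment 1's cost drops to 0.6 and it is preferred, exactly as argued above.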

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to do the post-editing task. To make that decision process easy, we color code the matched parts and unmatched parts in each reference translation: green portions imply matched fragments, and red portions imply a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
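Putting the two alignments together, the coloring step can be sketched as follows. The dictionary-based representation of the GIZA++ and TER alignments is our own simplification; the example data encodes the running "table near the window" sentence pair given above, with 0-based indexes:

```python
def color_target(ter_ops, s2t, tgt_len):
    # ter_ops[i]: TER operation for source word i of the TM match
    # ("C" = match, "S"/"D" = edited); s2t: source word index -> target
    # word indexes (from the GIZA++ alignment); a target word is colored
    # green only if every source word aligned to it is a TER match
    good, bad = set(), set()
    for i, op in enumerate(ter_ops):
        for t in s2t.get(i, []):
            (good if op == "C" else bad).add(t)
    green = good - bad
    red = set(range(tgt_len)) - green
    return green, red

# TER ops for "we want to have a table near the window ." vs the input
ter_ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
# source index -> target indexes, from the GIZA++ alignment above
s2t = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1], 9: [6]}
```

Target words aligned to the deleted "want" or the substituted "near" come out red, while those aligned to matched source words come out green, mirroring the display described above.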

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me


5. you 're overcharging me

Target Matches

1. আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপনি ভুল (Gloss: apni vul)

4. আপনি আমাকে টাকা দিন (Gloss: apni amake taka din)

5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form as in some input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

It is relatively easy to find positive examples because the high majority of bi-segments are correct. Finding good negative examples is not so easy, as it requires reading a lot of translation segments. Inspecting small samples of bi-segments corresponding to the three methods, we noticed that the highest percentage of errors comes from the collaborative web interface. To verify that this is indeed the case, we make use of an insight first articulated by Church and Gale (Gale and Church, 1993): in a parallel corpus, the corresponding segments have roughly the same length.3 To quantify the difference between the lengths of the source and destination segments, we use a modified Church-Gale length difference (Tiedemann, 2011), presented in equation 1.

$$CG = \frac{l_s - l_d}{\sqrt{3.4\,(l_s + l_d)}} \qquad (1)$$
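Equation 1 can be computed directly. In the sketch below the segment lengths are taken in characters; this is an assumption, since the paper does not state the length unit.

```python
import math

def church_gale_score(source, target):
    """Modified Church-Gale length difference (Tiedemann, 2011):
    CG = (ls - ld) / sqrt(3.4 * (ls + ld)),
    where ls and ld are the lengths (here: in characters, an assumption)
    of the source and destination segments."""
    ls, ld = len(source), len(target)
    return (ls - ld) / math.sqrt(3.4 * (ls + ld))
```

A bi-segment whose target is simply a copy of the source scores 0, while a target much shorter than its source gets a large positive score, which is why extreme scores flag suspicious bi-segments.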

In figures 2 and 3 we plot the distribution of the relative frequency of Church-Gale scores for two sets of bi-segments with source segments in English and target segments in Italian. The first set, from now on called the Matecat Set, is a set of segments extracted from the output of Matecat.4 The bi-segments of this set are produced by professional translators and have few errors. The other bi-segment set, from now on called the Collaborative Set, is a set of collaborative bi-segments.

If it is true that the sets come from different distributions, then the plots should be different, and this is indeed the case. The plot for the Matecat Set is a little bit skewed to the right but close to a normal plot. In figure 2 we plot the Church-Gale scores obtained for the bi-segments of the Matecat Set, adding a normal curve over the histogram to better visualize the difference from the Gaussian curve. For the Matecat Set the Church-Gale score varies in the interval [-4.18, 4.26].

The plot for the Collaborative Set has the distribution of scores concentrated in the center, as can be seen in figure 3. In figure 4 we add a normal curve to the previous histogram. The relative frequency of the scores away from the center is much lower than that of the scores in the center; therefore, to get a better view of the distribution, the y axis is reduced to the interval [0, 0.01]. For the Collaborative Set the

3 This simple idea is implemented in many sentence aligners.

4 Matecat is a free web-based CAT tool that can be used at the following address: https://www.matecat.com

Figure 2: The distribution of Church-Gale scores in the Matecat Set.

Figure 3: The distribution of Church-Gale scores in the Collaborative Set.


Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set.

Church-Gale score varies in the interval [-131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set we first draw random bi-segments from the Matecat Set; as said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling toward the bi-segments that have scores away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: It contains 1243 bi-segments and has 373 negative examples.

• Test Set: It contains 309 bi-segments and has 87 negative examples.

The proportion of negative examples in both sets is approximately 30%.
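The biased sampling step described above can be sketched as follows. This is a hypothetical implementation: weighting by the absolute Church-Gale score, and the small floor constant that keeps central segments drawable, are our assumptions and are not specified in the paper.

```python
import random

def biased_sample(bisegments, scores, k, seed=0):
    """Draw k bi-segments, biasing the sampling toward segments whose
    Church-Gale scores lie away from the center of the distribution.
    The weight |score| + 0.1 is an illustrative choice."""
    rng = random.Random(seed)
    weights = [abs(s) + 0.1 for s in scores]
    return rng.choices(bisegments, weights=weights, k=k)
```

Segments with extreme scores are drawn far more often, so the manually validated sample is enriched in likely negatives.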

Figure 5: The Q-Q plot for the Matecat Set.

Figure 6: The Q-Q plot for the Collaborative Set.


4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g., when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.

• has url: The value of the feature is 1 if the source or target segment contains a URL address, otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segment contains a tag, otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segment contains an email address, otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segment contains a number, otherwise it is 0.

• has capital letters: The value of the feature is 1 if the source or target segment contains words that have at least one capital letter, otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the source segment and destination segment tag vectors. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g., the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the source segment and destination segment email vectors. The reasoning for introducing this feature is the same as for the feature tag similarity. This feature combines with the feature has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the source segment and destination segment URL address vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

• number similarity: The value of the feature is the cosine similarity between the source segment and destination segment number vectors. The reasoning for introducing this feature is the same as for the feature tag similarity.

13

• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference of the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference of the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.
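The lang dif feature can be illustrated as follows. This is a hypothetical helper: the paper gives no code, and the language detector call is abstracted away as an already-computed pair of codes.

```python
def lang_dif(declared, detected):
    """Count mismatches between the declared (source, target) language
    codes and the codes returned by a language detector.
    E.g. ("en", "it") vs ("en", "fr") -> 1."""
    return sum(1 for dec, det in zip(declared, detected) if dec != det)
```

The three worked examples in the feature description correspond to values 0, 1 and 2.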

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
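As an illustration, the punctuation similarity feature can be sketched as follows. The punctuation inventory PUNCT is an assumption, since the paper does not specify which marks are counted.

```python
import math
from collections import Counter

# Illustrative punctuation inventory; the paper does not enumerate one.
PUNCT = list(".,;:!?()\"'")

def punctuation_similarity(source, target):
    """Cosine similarity between the punctuation count vectors of the
    source and target segments; 1.0 means identical punctuation profiles,
    0.0 means no overlap (or no punctuation at all on one side)."""
    vs = Counter(ch for ch in source if ch in PUNCT)
    vt = Counter(ch for ch in target if ch in PUNCT)
    dot = sum(vs[p] * vt[p] for p in PUNCT)
    ns = math.sqrt(sum(v * v for v in vs.values()))
    nt = math.sqrt(sum(v * v for v in vt.values()))
    if ns == 0 or nt == 0:
        return 0.0
    return dot / (ns * nt)
```

The tag, email, URL and number similarity features follow the same pattern, with counts over tags, email addresses, URLs and numbers instead of punctuation marks.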

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu.5

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. The Random Forest has a lower probability of overfitting the data than Decision Trees.

• Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support Vector Machines are one of the most used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, the training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on the distance it has to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
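The comparison of these classifiers can be sketched with scikit-learn as follows. This is a minimal sketch under the assumption that the feature vectors X and labels y have already been computed; hyperparameters are library defaults, not necessarily those used in the paper.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": LinearSVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

def evaluate(X, y):
    """Mean F1 score per classifier under three-fold stratified
    cross-validation (stratification is scikit-learn's default for
    classifiers when an integer cv is given)."""
    return {name: cross_val_score(clf, X, y, cv=3, scoring="f1").mean()
            for name, clf in CLASSIFIERS.items()}
```

The same six classifiers appear in tables 1 and 2, evaluated with exactly this three-fold stratified protocol on the training set.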

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all other algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in table 2.

The results for the second evaluation are worse than those for the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half of the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the Matecat and Web Sets; obviously this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. Both produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better methods of cleaning, where the precision is increased even if the recall drops. We make some suggestions in the next section.

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar: this distribution is closer to the normal distribution for the MateCat Set and more sparse for the Collaborative Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case the


performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments: for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable, it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation

translation memory project of which the

ultimate goal is to produce more intelligent TM

systems using NLP techniques To the best of

our knowledge clause splitting has not

previously been investigated as a possible

method for increasing the number of relevant

retrieved matches

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
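The arithmetic behind this example can be made concrete. Commercial TM tools do not publish their exact scoring formulas, but a common approximation is one minus the Levenshtein distance normalised by segment length; the sketch below (an illustration, not any particular tool's algorithm) reproduces the 25% figure for the (a)/(b) pair and shows that the shared clause alone is an exact match:

```python
def levenshtein(a, b):
    """Edit distance between two token sequences via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_score(source, candidate):
    """Percentage score: 1 minus word-level edit distance,
    normalised by the longer segment."""
    s, c = source.split(), candidate.split()
    return 100 * (1 - levenshtein(s, c) / max(len(s), len(c)))

# (a) is w1..w20; (b) shares the first five words, then diverges.
a = " ".join(f"w{i}" for i in range(1, 21))
b = " ".join(f"w{i}" for i in [*range(1, 6), *range(21, 36)])

full_score = fuzzy_score(a, b)                         # 25.0
clause_score = fuzzy_score("w1 w2 w3 w4 w5",
                           "w1 w2 w3 w4 w5")           # 100.0
```

With the whole sentences compared, (b) scores only 25% against (a) and sits below any usable threshold; once both are split, the shared clause scores 100%.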

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
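This setup can be sketched as a small retrieval loop. The similarity function below uses Python's difflib as a stand-in for a TM tool's Levenshtein-style scorer, and the clause boundaries are pre-marked by hand in place of the Puscasu (2004) splitter, so this is only an illustration of where splitting enters the pipeline:

```python
import difflib

def similarity(a, b):
    """Percentage similarity between two word sequences; a stand-in for a
    TM tool's fuzzy match score (real tools use Levenshtein-style edit
    distance, which behaves comparably for this illustration)."""
    return 100 * difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

def retrieve(query, tm_segments, threshold):
    """Return (score, segment) pairs clearing the fuzzy match threshold."""
    scored = sorted(((similarity(query, s), s) for s in tm_segments),
                    reverse=True)
    return [(sc, s) for sc, s in scored if sc >= threshold]

# TM sentence with '|' marking the boundaries a clause splitter would
# find; boundaries are hand-placed here purely for illustration.
marked = ("the ministry of defense once indicated | "
          "that about 20000 soldiers were missing in the korean war | "
          "and that the ministry of defense believes | "
          "there may still be some survivors")
sentence = marked.replace("| ", "")
query = "that about 20000 soldiers were missing in the korean war"

without_split = retrieve(query, [sentence], threshold=70)    # nothing clears 70%
with_split = retrieve(query, [c.strip() for c in marked.split("|")],
                      threshold=70)                          # exact clause match
```

Against the unsplit sentence the query scores roughly 51% and is not retrieved at the default threshold; against the clause inventory it is an exact match.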

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause or part of it as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:

  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war

  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A example

Without clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B example

5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33%                 90.00%
Retrieved (minimum threshold)          38.00%                 92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00%                 88.89%
Retrieved (minimum threshold)          14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67%                 17.84%
Retrieved (minimum threshold)          10.00%                 41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33%                 25.95%
Retrieved (minimum threshold)          36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and take this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
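The test itself is standard; a stdlib-only sketch of the t statistic on hypothetical score vectors (not the study's data) looks like this:

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic for a paired t-test on matched score pairs; the
    two-tailed p-value is then read off Student's t distribution with
    n-1 degrees of freedom (the paper does this step in SPSS)."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical match scores for six segments without and with clause
# splitting (0 = correct match not retrieved) -- illustrative only.
without_cs = [0, 0, 40, 0, 55, 0]
with_cs = [100, 90, 95, 100, 100, 85]
t = paired_t(without_cs, with_cs)  # large positive t => significant gain
```

A consistently large gain per segment yields a large t and hence a p-value far below any conventional significance level.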

SET A

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            19.23                       85.31              0.000
Minimum threshold            29.28                       86.49              0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            11.78                       83.88              0.000
Minimum threshold            11.78                       88.07              0.000

SET B

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold             2.23                       17.21              0.000
Minimum threshold             6.49                       30.62              0.000

TRADOS               Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold             2.71                       23.08              0.000
Minimum threshold            16.83                       46.37              0.000

Table 5: Paired t-test on all experiments (mean match scores; p-values as reported by SPSS)

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause, as in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitl@auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been human-translated in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, via the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
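A minimal word-level version of such a score can be sketched as follows. The normalisation here (by source length) is one of several possibilities, so it yields 60% for the (1)/(2) pair rather than the 70% quoted above, which is computed over a different token count; the edit operations themselves (1 substitution, 3 deletions) are the same:

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)
    between two token sequences, by dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"

dist = edit_distance(s1.split(), s2.split())   # 4: 1 substitution + 3 deletions
score = 100 * (1 - dist / len(s1.split()))     # 60.0 under this normalisation
```

Whatever the exact normalisation, a match in this range falls near or below typical default thresholds, which is the problem the paraphrasing approach addresses.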

In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then it would be post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear with a direct object without a determiner (e.g. to make attention) or with a determiner (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
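Entries of this shape (lemma, category, then "+"-separated properties) are easy to read mechanically; the following small parser is our own illustration, not part of NooJ:

```python
def parse_nooj_entry(line):
    """Parse a NooJ-style dictionary line such as
    'artist,N+FLX=TABLE+Hum' into (lemma, category, properties).
    Flag-like properties without '=' are stored as True."""
    lemma, _, rest = line.partition(",")
    fields = rest.split("+")
    category = fields[0]
    props = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        props[key] = value if value else True
    return lemma, category, props

lemma, cat, props = parse_nooj_entry("artist,N+FLX=TABLE+Hum")
# lemma == "artist", cat == "N", props == {"FLX": "TABLE", "Hum": True}
```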

42 Paraphrasing the source translation

units

The generation of the TM that contains the para-

phrased translation units is straightforward The

architecture of the process which is summarized

in Figure 2 is performed in three pipelines

Figure 2 Pipeline of the paraphrase framework

The first pipeline includes the extraction of the

source translation units of a given TM The target

translation units are protected so that they will not

be parsed by the framework This step also in-

cludes the tokenization process Tokenization of

the English data is done using Berkeley Tokenizer

(Petrov et al 2006) The same tool is also used

for the de-tokenization process in the last step

Then all the source translation units pass

through NooJ to identify the SVCs using the local

grammar of Figure 3 To do so NooJ first pre-pro-

cesses and analyses the text based on specific dic-

tionaries and grammars attached in the module

This is a crucial step because if the text is not cor-

rectly analyzed the local grammar will not iden-

tify all the potential SVCs and therefore there will

not be any gain in terms of fuzzy matching Once

the text is analyzed all the possible SVCs are

identified and hence paraphrased

Figure 3 Local grammar for identification and para-

phrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then by a nominalization and, optionally, by a preposition, and generates the verbal paraphrases in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars created for all the other grammatical tenses and moods that follow the same structure. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
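The rule just described (support verb + optional determiner + nominalization + optional preposition, rewritten as the derived full verb) can be sketched outside NooJ with a toy token-based transformation. The mini-lexicons and the `paraphrase_svc` helper below are illustrative assumptions for English, not the authors' NooJ dictionaries or grammars:

```python
# Toy sketch of the SVC-paraphrasing rule: detect
# "support verb (+ det) + nominalization (+ preposition)" and replace the
# whole construction with the derived full verb. The lexicons here are
# tiny illustrative stand-ins for the NooJ dictionaries.

SUPPORT_VERBS = {"make", "makes", "take", "takes", "give", "gives"}
DETERMINERS = {"a", "an", "the"}
# nominalization -> derived verb (the "deriver" mapping)
NOMINALIZATIONS = {"decision": "decide", "installation": "install",
                   "walk": "walk", "offer": "offer"}
PREPOSITIONS = {"of", "to", "on"}

def paraphrase_svc(tokens):
    """Return tokens with simple-present SVCs paraphrased into full verbs."""
    out, i = [], 0
    while i < len(tokens):
        t = tokens[i].lower()
        if t in SUPPORT_VERBS:
            j = i + 1
            if j < len(tokens) and tokens[j].lower() in DETERMINERS:
                j += 1                          # optional determiner
            if j < len(tokens) and tokens[j].lower() in NOMINALIZATIONS:
                verb = NOMINALIZATIONS[tokens[j].lower()]
                k = j + 1
                if k < len(tokens) and tokens[k].lower() in PREPOSITIONS:
                    k += 1                      # optional preposition
                out.append(verb)                # predicate noun -> full verb
                i = k                           # drop support verb, det, prep
                continue
        out.append(tokens[i])
        i += 1
    return out

print(" ".join(paraphrase_svc(
    "You must make the installation of the software".split())))
# -> You must install the software
```

A real implementation would also need morphological generation to keep tense and person agreement, which is what the NooJ grammars provide.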

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties (tags, etc.) as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores in cases where the TM contains a translation unit that has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. We analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹.

The results of both analyses are given in Table 1. Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95-99% category, 6% in the 85-94% category, 28% in the 75-84% category and, finally, 27% in the 0-74% category (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.
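The fuzzy match scores discussed above are typically computed from a word-level edit distance; a minimal sketch of such a scorer, under the assumption of a plain Levenshtein ratio (commercial CAT tools use proprietary variants):

```python
def levenshtein(a, b):
    """Word-level edit distance between token lists a and b."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n]

def fuzzy_match(source, tm_segment):
    """Fuzzy match score in percent; 100 means identical."""
    s, t = source.split(), tm_segment.split()
    if not s and not t:
        return 100.0
    dist = levenshtein(s, t)
    return round(100.0 * (1 - dist / max(len(s), len(t))), 1)

new = "Make sure that the brake hose is not twisted"
plain = "Ensure that the brake hose is not twisted"
print(fuzzy_match(new, plain))  # below 100: paraphrase differences cost
print(fuzzy_match(new, new))    # 100.0: a paraphrased unit can match exactly
```

This illustrates why a meaning-preserving paraphrase of the TM source side raises the score: once the stored unit matches the input wording, the edit distance drops to zero.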

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval and, secondly, to assess the translation quality of those segments whose fuzzy match score improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear, and less post-editing is required when the match is below 100%.

Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Seg:        Make sure that the brake hose is not twisted
TM(sl):     Ensure that the brake hose is not twisted
TM(tg):     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTM(sl):  Make sure that the brake hose is not twisted

Seg:        CAUTION You must make the istallation of the version 6 of the software
TM(sl):     CAUTION You must install the version 6 of the software
TM(tg):     ATTENZIONE Si deve installare la versione 6 del software
parTM(sl):  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the sentences have the same meaning but do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM across all fuzzy match ranges, especially the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher than with MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and additional source languages. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of the translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept. 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper, we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly, manually created resources of sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated. Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ         EN          probabilities               alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses toolkit (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
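The record layout shown in Figure 1 (phrase pair, four probabilities, alignment points) closely resembles a Moses phrase-table entry. A sketch of parsing such a line follows; the `|||`-delimited field layout is an assumption based on the standard Moses format, not a description of the authors' exact files:

```python
def parse_subsegment(line):
    """Parse 'src ||| tgt ||| p1 p2 p3 p4 ||| alignment' (assumed layout)."""
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    probs = [float(p) for p in probs.split()]
    # alignment points: '0-0 1-1' -> [(0, 0), (1, 1)]
    alignment = [tuple(map(int, pt.split("-"))) for pt in align.split()]
    return {
        "source": src,
        "target": tgt,
        "inv_phrase_prob": probs[0],   # inverse phrase translation probability
        "inv_lex_weight": probs[1],    # inverse lexical weighting
        "dir_phrase_prob": probs[2],   # direct phrase translation probability
        "dir_lex_weight": probs[3],    # direct lexical weighting
        "alignment": alignment,
    }

entry = parse_subsegment(
    "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1")
print(entry["source"], entry["alignment"])
```

The alignment list makes it easy to detect the three cases mentioned above: a source index missing from the pairs (empty alignment), one index paired with several (one-to-many), or decreasing target indexes (opposite orientation).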

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOIN-O: the joined subsegments overlap in a segment from the document; the output is OJ.
  2. JOIN-N: the joined subsegments neighbour each other in a segment from the document; the output is NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTE-O method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  "can be divided into the following categories"

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTE-O: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); the output is OS.
  2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; the output is NS.

During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
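The combined n-gram fluency score can be illustrated with a toy stand-in that counts n-gram hits against a tiny corpus instead of querying a real language model such as KenLM; the scoring function below is our simplified assumption, not the authors' exact formula:

```python
from collections import Counter

def ngram_counts(tokens_list, max_n=5):
    """Collect n-gram counts (n = 1..max_n) from a tokenized corpus."""
    counts = Counter()
    for tokens in tokens_list:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def fluency_score(candidate, counts, max_n=5):
    """Fraction of the candidate's n-grams (n = 1..5) seen in the corpus."""
    tokens = candidate.split()
    hits = total = 0
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            total += 1
            hits += tuple(tokens[i:i + n]) in counts
    return hits / total if total else 0.0

corpus = [s.split() for s in
          ["can be divided into these categories",
           "can be divided into the following categories"]]
counts = ngram_counts(corpus)
good = fluency_score("can be divided into the following categories", counts)
bad = fluency_score("categories following the into divided be can", counts)
print(good > bad)  # the fluent word order scores higher
```

A real language model would return smoothed log-probabilities rather than hit ratios, but the ranking role in the combination step is the same: candidates below a fluency threshold are discarded.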

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.

² In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), whose size is over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16   10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system's score of 0.03, when computed over all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; in this case, however, we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

Company in-house translation memory:
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT:
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, Sept. 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²³, Mihaela Vela², Josef van Genabith²³
¹ Jadavpur University, India
² Saarland University, Germany
³ German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, m.vela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 IntroductionThe use of translation and text processing soft-ware is an important part of the translationworkflow Terminology and computer-aidedtranslation tools (CAT) are among the mostwidely used software that professional trans-lators use on a regular basis to increase theirproductivity and also improve consistency intranslation

The core component of the vast majorityof CAT tools are translation memories (TM)TMs work under the assumption that previ-ously translated segments can serve as goodmodels for new translations specially whentranslating technical or domain specific textsTranslators input new texts into the CAT tooland these texts are divided into shorter seg-ments The TM engine then checks whetherthere are segments in the memory which areas similar as possible to those from the inputtext Every time the software finds a simi-lar segment in the memory the tool shows

it to the translator as a suitable suggestionusually through a graphical interface In thisscenario translators work as post-editors bycorrecting retrieved segments suggested by theCAT tool or translating new segments fromscratch This process is done iteratively andevery new translation increases the size of thetranslation memory making it both more use-ful and more helpful to future translations

Although in the first place it might soundvery simplistic the process of matching sourceand target segments and retrieving translatedsegments from the TM is far from trivial Toimprove the retrieval engines researchers havebeen working on different ways of incorporat-ing semantic knowledge such as paraphras-ing (Utiyama et al 2011 Gupta and Orăsan2014 Gupta et al 2015) as well as syntax(Clark 2002 Gotti et al 2005) in this pro-cess Another recent direction that researchin CAT tools is taking is the integration ofboth TM and machine translation (MT) out-put (He et al 2010 Kanavos and Kartsak-lis 2010) With the improvement of state-of-the-art MT systems MT output is no longerused just for gisting it is now being used inreal-world translation projects Taking advan-tage of these improvements CAT tools such asMateCat1 have been integrating MT outputalong TMs in the list of suitable suggestions(Cettolo et al 2013)

In this paper we are concerned both with the retrieval engine and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as free, open-source software. For more information see the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, which should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). These user studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences, similar to those usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to see instantly how similar the five suggested segments are to the input segment and which one of them requires the least effort to post-edit.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate): how much editing is required to convert one sentence into another, relative to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM match: we want to have a table near the window

TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we want to    have a table near the window .
|  D    S     S    | |     S    |   |      |
we -    would like a table by   the window .
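The edit rate and alignment above can be reproduced with a simplified word-level Levenshtein sketch. This is an illustration, not tercom itself: TER proper also handles block shifts, which plain Levenshtein alignment does not.

```python
def edit_rate(tm_match, inp):
    """Word-level Levenshtein alignment between a TM match and an input
    segment. Returns (rate, ops), where ops is a list of C/S/D/I labels
    and rate = edits / len(tm_match). Simplification of TER: unlike
    tercom, block shifts are not modelled."""
    m, n = len(tm_match), len(inp)
    # dp[i][j] = edits needed to turn tm_match[:i] into inp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (tm_match[i - 1] != inp[j - 1]),
                           dp[i - 1][j] + 1,   # delete a TM word
                           dp[i][j - 1] + 1)   # insert an input word
    # backtrace to recover the operation sequence
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i and j and dp[i][j] == dp[i - 1][j - 1] + (tm_match[i - 1] != inp[j - 1]):
            ops.append('C' if tm_match[i - 1] == inp[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i and dp[i][j] == dp[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    ops.reverse()
    return sum(op != 'C' for op in ops) / m, ops

rate, ops = edit_rate("we want to have a table near the window .".split(),
                      "we would like a table by the window .".split())
# 1 deletion + 3 substitutions over 10 TM words -> rate 0.4
```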

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
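A minimal sketch of such a weighted scoring scheme follows. The exact weights used in CATaLog are not published, so the values below are assumptions for illustration only:

```python
# Hypothetical operation weights: deletions are made very cheap, the
# other edit operations keep equal weight, matches cost nothing.
WEIGHTS = {'C': 0.0, 'D': 0.1, 'S': 1.0, 'I': 1.0}
UNIFORM = {'C': 0.0, 'D': 1.0, 'S': 1.0, 'I': 1.0}

def post_edit_cost(ops, weights):
    """Estimated post-editing cost of a TM match from its TER-style
    operation sequence (C = match, D/S/I = delete/substitute/insert)."""
    return sum(weights[op] for op in ops)

# Test segment: "how much does it cost"
tm1_ops = ['C'] * 5 + ['D'] * 6   # "... to the holiday inn by taxi"
tm2_ops = ['C'] * 2 + ['I'] * 3   # "how much" + insert "does it cost"

# Uniform weights prefer TM segment 2 (3 edits vs 6 edits) ...
assert post_edit_cost(tm1_ops, UNIFORM) > post_edit_cost(tm2_ops, UNIFORM)
# ... while cheap deletions prefer TM segment 1, as argued above.
assert post_edit_cost(tm1_ops, WEIGHTS) < post_edit_cost(tm2_ops, WEIGHTS)
```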

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color-code the matched and unmatched parts of each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
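A sketch of how the two alignment sets can be combined to color a TM target segment. This is a simplification of the idea, not the authors' exact rule: here a target word is green if it is aligned to at least one matched source word.

```python
def color_target(src_ops, src2tgt, tgt_len):
    """Color the words of a TM target segment.

    src_ops : one TER label per TM source word ('C' = matches the input)
    src2tgt : {source index: set of 0-based target indices}, e.g. read
              from a GIZA++ alignment file
    Returns one color per target word: green if aligned to a matched
    source word, red otherwise (unaligned words stay red)."""
    colors = ['red'] * tgt_len
    for s, targets in src2tgt.items():
        if src_ops[s] == 'C':
            for t in targets:
                colors[t] = 'green'
    return colors

# TM source: "we want to have a table near the window ."
# TER ops against the input "we would like a table by the window ."
src_ops = ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C', 'C']
# The GIZA++ alignment of the example above, converted to 0-based indices
src2tgt = {0: {0}, 1: {5}, 4: {3}, 5: {4}, 6: {2}, 8: {1}, 9: {6}}
colors = color_target(src_ops, src2tgt, 7)
```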

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you've got the wrong number

3 you are wrong

4 you pay me


5 you're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) required to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a long time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
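A minimal posting-list sketch of this filtering step. The stop-word list here is an illustrative stand-in, not the one used in CATaLog:

```python
from collections import defaultdict

STOP_WORDS = {'a', 'an', 'the', 'to', 'of', 'is'}   # illustrative only

def build_postings(tm_sources):
    """Map each lowercased surface-form word (stop words excluded, no
    stemming) to the set of TM segment ids containing it."""
    postings = defaultdict(set)
    for sid, segment in enumerate(tm_sources):
        for word in segment.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidates(postings, input_segment):
    """TM segments sharing at least one vocabulary word with the input;
    only these candidates are then scored with TER."""
    ids = set()
    for word in input_segment.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me"]
postings = build_postings(tm)
assert candidates(postings, "we would like a table by the window") == {0}
assert candidates(postings, "how much does it cost to the holiday inn") == {1}
```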

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface called CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are also integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semicolons as good indicators. It should be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to different edit operations.


Figure 1: Screenshot with the color coding scheme

We believe these weights can be used to opti-mize system performance

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Figure 4: The normal curve added to the distribution of Church-Gale scores in the Collaborative Set (x axis: Church-Gale score; y axis: probability).

The Church-Gale score varies in the interval [-131.51, 60.15].

To see how close the distribution of Church-Gale scores is to a normal distribution, we have plotted these distributions against the normal distribution using Quantile-Quantile (Q-Q) plots in figures 5 and 6.

In the Collaborative Set, the scores that have a low probability could be a source of errors. To build the training set, we first draw random bi-segments from the Matecat Set; as said before, the bi-segments in the Matecat Set should contain mainly positive examples. Second, we draw random bi-segments from the Collaborative Set, biasing the sampling towards the bi-segments whose scores lie away from the center of the distribution. In this way we hope to draw enough negative segments. After manually validating the examples, we created a training set and a test set distributed as follows:

• Training Set: It contains 1243 bi-segments, 373 of which are negative examples.

• Test Set: It contains 309 bi-segments, 87 of which are negative examples.

The proportion of negative examples in both sets is approximately 30%.
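The biased draw from the Collaborative Set can be sketched as follows. This is a hypothetical helper: the paper does not give the exact sampling scheme, so here each bi-segment is simply weighted by the absolute z-score of its Church-Gale score.

```python
import random

def biased_sample(bisegments, cg_scores, k, mean, sd):
    """Sample k bi-segments with replacement, weighting each one by how
    far its Church-Gale score lies from the centre of the distribution,
    so that likely false translations are over-represented."""
    weights = [abs((s - mean) / sd) + 1e-6 for s in cg_scores]
    return random.choices(bisegments, weights=weights, k=k)

random.seed(0)
scores = [0.1, -0.2, 25.0, 0.3, -30.0]          # two clear outliers
sample = biased_sample(list(range(5)), scores, k=1000, mean=0.0, sd=5.0)
# segments 2 and 4 (the outliers) dominate the sample
```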

Figure 5: The Q-Q plot for the Matecat Set (theoretical quantiles vs. sample quantiles).

Figure 6: The Q-Q plot for the Collaborative Set (theoretical quantiles vs. sample quantiles).


4 Machine Learning

In this section we discuss the features computed for the training and test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test sets are the following:

• same: This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course, there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg score: This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and target segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.

• has url: The value of the feature is 1 if the source or target segment contains a URL address, otherwise it is 0.

• has tag: The value of the feature is 1 if the source or target segment contains a tag, otherwise it is 0.

• has email: The value of the feature is 1 if the source or target segment contains an email address, otherwise it is 0.

• has number: The value of the feature is 1 if the source or target segment contains a number, otherwise it is 0.

• has capital letters: The value of the feature is 1 if the source or target segment contains words with at least one capital letter, otherwise it is 0.

• has words capital letters: The value of the feature is 1 if the source or target segment contains words consisting entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when whole words are in capital letters.

• punctuation similarity: The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.

• tag similarity: The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/not present in the source and the target segments).

• email similarity: The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity. This feature combines with has email to exhaust all possibilities.

• url similarity: The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.

• number similarity: The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag similarity.


• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the machine translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments and the total number of capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the total number of all-capital words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments; however, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.
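A few of the features above can be sketched directly; the cosine is taken over sparse count vectors. This is a minimal illustration under assumed tokenization rules, not the authors' exact implementation:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(c * v[k] for k, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

PUNCTUATION = set('.,;:!?()"\'-')   # assumed inventory, for illustration

def same(src, tgt):
    """The 'same' feature: 1 if source and target segments are equal."""
    return int(src == tgt)

def punctuation_similarity(src, tgt):
    """Cosine similarity of the punctuation count vectors."""
    return cosine(Counter(c for c in src if c in PUNCTUATION),
                  Counter(c for c in tgt if c in PUNCTUATION))

def number_similarity(src, tgt):
    """Cosine similarity of the number token count vectors."""
    nums = lambda s: Counter(w for w in s.split() if w.isdigit())
    return cosine(nums(src), nums(tgt))

assert same("IBM", "IBM") == 1
assert abs(punctuation_similarity("Hello, world!", "Ciao, mondo!") - 1.0) < 1e-9
assert abs(number_similarity("room 42 floor 3", "camera 42 piano 3") - 1.0) < 1e-9
```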

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments; nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation occurs. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

bull Decision Tree The decision trees are oneof the oldest classification algorithms Evenif they are known to overfit the training datathey have the advantage that the rules inferredare readable by humans This means that wecan tamper with the automatically inferredrules and at least theoretically create a betterdecision tree

bull Random Forest Random forests are ensem-ble classifiers that consist of multiple deci-sion trees The final prediction is the mode ofindividual tree predictions The Random For-est has a lower probability to overfit the datathan the Decision Trees

bull Logistic Regression The Logistic Regres-sion works particularly well when the fea-tures are linearly separable In addition theclassifier is robust to noise avoids overfittingand its output can be interpreted as probabil-ity scores

bull Support Vector Machines with the linearkernel Support Vector Machines are one ofthe most used classification algorithms

bull Gaussian Naive Bayes If the conditionalindependence that the naive Bayes class ofalgorithm postulates holds the training con-verges faster than logistic regression and thealgorithm needs less training instances

• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among those instances. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

[5] https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
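The pre-processing step described at the start of this subsection, inverting a bi-segment whose declared language codes disagree with the detected ones, can be sketched as follows. The dictionary keys are hypothetical, since the paper does not give its data format:

```python
def fix_inverted(seg):
    """Swap source and target text when language detection shows the
    segments were entered the wrong way round relative to the declared
    language codes."""
    if (seg["detected_src"] == seg["target_lang"]
            and seg["detected_tgt"] == seg["source_lang"]):
        seg["source"], seg["target"] = seg["target"], seg["source"]
    return seg

# Declared EN->IT, but the texts were submitted swapped
seg = {"source": "ciao", "target": "hello",
       "source_lang": "en", "target_lang": "it",
       "detected_src": "it", "detected_tgt": "en"}
fixed = fix_inverted(seg)  # fixed["source"] == "hello"
```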

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in Table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all algorithms have excellent results. All algorithms beat the baselines by a significant margin (at least 20 points).
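The two baselines can be sketched in a few lines (a minimal stand-in for scikit-learn's DummyClassifier with the "uniform" and "stratified" strategies; the labels below are synthetic):

```python
import random

def baseline_uniform(n, rng):
    """Predict 0 or 1 uniformly at random."""
    return [rng.choice([0, 1]) for _ in range(n)]

def baseline_stratified(train_labels, n, rng):
    """Predict positives with the same frequency as in the training set."""
    p_pos = sum(train_labels) / len(train_labels)
    return [1 if rng.random() < p_pos else 0 for _ in range(n)]

rng = random.Random(0)
train = [1, 1, 1, 0]  # 75% positive examples
preds = baseline_stratified(train, 1000, rng)  # roughly 75% ones
```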

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in Table 2.

The results for the second evaluation are worse than the results for the first evaluation. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the MateCat and Web sets. Obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike the first evaluation, the second has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. They produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, those corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
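The metrics behind this analysis can be recomputed directly from the reported confusion counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reported for the SVM on the 309-example test set
tp, fp, fn, tn = 175, 42, 32, 60
p, r, f1 = precision_recall_f1(tp, fp, fn)
fn_share = fn / (tp + fp + fn + tn)  # the ~10% of examples lost as false negatives
```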

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar. This distribution is closer to the normal distribution for the MateCat set and sparser for the Collective set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
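The second idea, keeping only the bi-segments the classifier is very sure about, can be sketched as a search for a probability threshold; the scores below are made up for illustration:

```python
def high_precision_threshold(scored, target_precision=0.9):
    """Return the lowest score threshold at which the surviving positive
    predictions reach the target precision, or None if none does.
    `scored` holds (probability_of_positive, true_label) pairs."""
    for threshold in sorted({p for p, _ in scored}):
        kept = [label for p, label in scored if p >= threshold]
        if kept and sum(kept) / len(kept) >= target_precision:
            return threshold
    return None

scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0)]
t = high_precision_threshold(scored)  # 0.9: both kept predictions are positive
```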

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable, it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.
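A dictionary-based check of this kind could look like the sketch below. The tiny EN-IT dictionary is invented for illustration, where a real one would come from a word aligner run over the TM; the overlap measure is likewise an assumption, not the paper's method:

```python
# Hypothetical miniature EN->IT dictionary
EN_IT = {"the": "il", "translation": "traduzione", "memory": "memoria"}

def gloss(segment, dictionary):
    """Word-for-word gloss of the source side, used only to compare
    against the stored target segment; unknown words pass through."""
    return " ".join(dictionary.get(w, w) for w in segment.lower().split())

def overlap(glossed, target):
    """Fraction of glossed words that appear in the target segment."""
    tgt = set(target.lower().split())
    words = glossed.split()
    return sum(w in tgt for w in words) / len(words)

g = gloss("The translation memory", EN_IT)      # "il traduzione memoria"
score = overlap(g, "la memoria di traduzione")  # 2 of 3 words found
```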

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought", so clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.
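The Levenshtein-based matching that existing tools rely on can be sketched as follows. The percentage formula is a common convention for deriving a fuzzy score from edit distance, not necessarily the exact one used by any particular tool:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(a, b):
    """Similarity as a percentage, normalised by the longer string."""
    if not a and not b:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / max(len(a), len(b)))
```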

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas in order to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorpho TM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20

(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39 (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well the tools perform if clause splitting is used in pre-processing.
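The intended effect can be illustrated with a clause-level query against a long TM sentence. Here difflib's character-similarity ratio stands in for a TM tool's Levenshtein-based fuzzy score; the example sentence is the Set A example used later in the paper:

```python
from difflib import SequenceMatcher

def match_score(a, b):
    """Character-level similarity in percent (a stand-in for a TM
    tool's fuzzy match score)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

tm_sentence = ("the ministry of defense once indicated that about 20000 "
               "soldiers were missing in the korean war and that the "
               "ministry of defense believes there may still be some survivors")
clauses = [
    "the ministry of defense once indicated",
    "that about 20000 soldiers were missing in the korean war",
    "and that the ministry of defense believes",
    "there may still be some survivors",
]
query = "the ministry of defense once indicated"

whole = match_score(query, tm_sentence)              # low: clause vs full sentence
split = max(match_score(query, c) for c in clauses)  # exact clause match
```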

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, we argue that in this study it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still yields a considerable improvement over the Levenshtein distance-based matching of most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we work with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of them contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B, there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:

Input:
the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

Segment in TM:
the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war

Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A example.

Without clause splitting:

Input:
a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM:
a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

Input:
a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B example.

5 Results

WORDFAST                        W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   23.33                 90.00
Retrieved (minimum threshold)   38.00                 92.22

TRADOS                          W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   14.00                 88.89
Retrieved (minimum threshold)   14.00                 96.67

Table 3: Percentage of correctly retrieved segments in Set A.

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When we analysed the corresponding segments that could not be retrieved even with the minimum threshold, we found that in Set A all instances were due to errors in clause splitting, more specifically to the clause splitter failing to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as a segment with more than one clause not being split at all, or a segment being split incorrectly. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches, or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
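The paired t statistic used here can also be computed by hand; the two score lists below are invented stand-ins for per-segment match scores with and without clause splitting:

```python
from math import sqrt

def paired_t(xs, ys):
    """Paired t statistic (what SPSS or scipy.stats.ttest_rel computes)
    for equal-length lists of per-case scores."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / sqrt(var / n)

with_split = [80.0, 90.0, 70.0]    # hypothetical match scores
without = [20.0, 30.0, 40.0]
t = paired_t(with_split, without)  # 5.0
```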

SET A

WORDFAST             Without clause splitting  With clause splitting  p-value
Default threshold    19.23                     85.31                  0.000
Minimum threshold    29.28                     86.49                  0.000

TRADOS               Without clause splitting  With clause splitting  p-value
Default threshold    11.78                     83.88                  0.000
Minimum threshold    11.78                     88.07                  0.000

SET B

WORDFAST             Without clause splitting  With clause splitting  p-value
Default threshold    2.23                      17.21                  0.000
Minimum threshold    6.49                      30.62                  0.000

TRADOS               Without clause splitting  With clause splitting  p-value
Default threshold    2.71                      23.08                  0.000
Minimum threshold    16.83                     46.37                  0.000

Table 5: Paired t-test on all experiments (mean match scores and p-values).

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

WORDFAST                        W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   2.67                  17.84
Retrieved (minimum threshold)   10.00                 41.08

TRADOS                          W/o clause splitting  W/ clause splitting
Retrieved (default threshold)   3.33                  25.95
Retrieved (minimum threshold)   36.00                 70.67

Table 4: Percentage of correctly retrieved segments in Set B.

It is worth mentioning that the data used in these experiments are not imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method can only retrieve the original target segment, given that we perform clause splitting on the source side alone. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitl@auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology supporting and facilitating the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance and cannot retrieve matches that are merely similar. In this paper we present an innovative approach to matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue growing for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). This technique can be very useful for a translator, because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases also penalizing the match percentage. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the fuzzy match level, given that the paraphrases share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVC widespread enough to merit the effort of tackling them? These questions are answered using human-centred evaluations.

The rest of the paper is organized as follows. Section 2 discusses related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results, and section 6 the plans for further work.
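The core idea, matching a query against paraphrase variants of the stored source segments, can be sketched as below. The paraphrase rules here are invented for illustration; the paper generates SVC paraphrases with NooJ grammars, and difflib's ratio stands in for the CAT tool's fuzzy score:

```python
from difflib import SequenceMatcher

def fuzzy(a, b):
    """Character-level similarity in percent."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

# Hypothetical support-verb-construction paraphrase rules
PARAPHRASES = {
    "take a decision": "decide",
    "give an answer": "answer",
}

def expand(segment):
    """Yield the segment plus its SVC paraphrase variants."""
    variants = [segment]
    for svc, verb in PARAPHRASES.items():
        if svc in segment:
            variants.append(segment.replace(svc, verb))
    return variants

def best_match(query, tm_source_segments):
    """Best fuzzy score of the query against any paraphrase variant
    of any TM source segment."""
    return max(fuzzy(query, v)
               for s in tm_source_segments for v in expand(s))

tm = ["the committee will take a decision tomorrow"]
score = best_match("the committee will decide tomorrow", tm)  # 100.0
```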

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output; they do not improve the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse human translations for a given sentence as much as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments during matching, based on the paraphrases in a database. Their approach relies on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
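Most tools implement a variant of the word-level computation sketched below: a fuzzy match score of 1 minus the Levenshtein distance divided by the length of the longer segment. This is a minimal illustration in the spirit of Koehn's (2010) formula; tokenization and length normalization differ between tools, which is why the exact percentages reported above may differ from what this sketch computes:

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]


def fuzzy_match(s1, s2):
    """Word-level fuzzy match: 1 - edit_distance / length of longer segment."""
    w1, w2 = s1.split(), s2.split()
    return 1.0 - edit_distance(w1, w2) / max(len(w1), len(w2))


s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
# 1 substitution + 3 deletions over 10 whitespace tokens
print(round(fuzzy_match(s1, s2), 2))  # 0.6
```

With plain whitespace tokenization the score comes out around 60%, in the same "too low to be useful" band discussed above.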

In this case, according to the methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: apart from some exceptional cases, they appear with a direct object (e.g. to make a reservation) or with a prepositional complement (e.g. to pay attention to) (Athayde, 2001).

Our approach paraphrases the SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008) machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrases the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as: the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.); one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.); one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.); one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics); and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
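Each entry consists of a lemma followed by its category and a list of +-separated properties. A small parser sketch of this entry shape (the comma-separated lemma and the field names used here are our illustrative reading of the format, not NooJ's official terminology):

```python
def parse_entry(line):
    """Split a NooJ-style dictionary line into lemma, category and properties.
    Field names are illustrative; 'FLX=TABLE' is an inflectional paradigm,
    'Hum' a semantic property, per the Figure 1 examples."""
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    return {"lemma": lemma, "category": parts[0], "properties": parts[1:]}


entry = parse_entry("artist,N+FLX=TABLE+Hum")
print(entry)  # {'lemma': 'artist', 'category': 'N', 'properties': ['FLX=TABLE', 'Hum']}
```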

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is carried out in three pipeline steps.

Figure 2: Pipeline of the paraphrase framework

The first step extracts the source translation units of a given TM; the target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006); the same tool is also used for de-tokenization in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, optionally followed by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derivation and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.

The last step contains the de-tokenization, as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
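The three steps can be sketched as follows. The regex-based paraphraser is a toy stand-in for the NooJ local grammar (the real system handles tense, person and open-vocabulary SVCs), and all function names and the tiny SVC table are ours, purely illustrative:

```python
import re

# Toy SVC table standing in for the NooJ lexicon: SVC pattern -> full verb.
SVC_PARAPHRASES = {
    r"make the cancellation of": "cancel",
    r"make a reservation for": "reserve",
}


def paraphrase(unit):
    """Step 2 (toy): rewrite the first recognized SVC as its full verb."""
    for pattern, verb in SVC_PARAPHRASES.items():
        if re.search(pattern, unit):
            return re.sub(pattern, verb, unit)
    return None  # no SVC recognized, nothing to add


def expand_tm(tm):
    """Steps 1 and 3: walk source units only (targets stay protected) and
    append paraphrased units that reuse the original target side as is."""
    expanded = list(tm)
    for src, tgt in tm:
        new_src = paraphrase(src)
        if new_src is not None:
            expanded.append((new_src, tgt))
    return expanded


tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
for src, tgt in expand_tm(tm):
    # prints the original unit and the added paraphrase, same Italian target
    print(src, "=>", tgt)
```

The key property of the real pipeline is preserved here: the target side is never touched, so every suggestion the translator sees remains a human translation.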

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually, to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM, based on a fuzzy match degree that at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.1

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results, and hence the productivity.

1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No match                 62      35
Total                   200     200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only increases retrieval but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.

While there are some drops in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased, and a further 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words, or due to their syntactic complexity. This weakness of our approach will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION: You must make the installation of the version 6 of the software
TMsl:     CAUTION: You must install the version 6 of the software
TMtg:     ATTENZIONE: Si deve installare la versione 6 del software
parTMsl:  CAUTION: You must make the installation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items differ.

The implemented strategy has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM across all fuzzy match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs, and we will support other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pp. 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand on enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ         EN          probabilities                alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments themselves as results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: a text chunker is used to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The correspondingly extended TM is denoted SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations when there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation: 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments and 3) opposite orientation.

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered by an overlap with the second subsegment (see the example in Table 1); output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
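A minimal sketch of the JOIN idea over token-index spans, keeping the longest-first preference and the overlap (JOINO) / neighbour (JOINN) merge conditions; it deliberately omits the restart-with-T bookkeeping and the language-model ordering described above, and all names besides I are illustrative:

```python
def join_spans(spans):
    """Merge (start, end) token spans that overlap or touch, longest-first.
    Each span marks where a subsegment covers the tokenized segment."""
    I = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)  # longest first
    merged = []
    for s, e in I:
        for i, (ms, me) in enumerate(merged):
            if s <= me and e >= ms:  # overlapping (JOINO) or adjacent (JOINN)
                merged[i] = (min(s, ms), max(e, me))
                break
        else:
            merged.append((s, e))
    return merged


# tokens 0-3 and 3-5 are adjacent and join; 7-9 stays separate
print(join_spans([(0, 3), (3, 5), (7, 9)]))  # [(0, 5), (7, 9)]
```

A longer joined span means a longer candidate phrase whose translation can be assembled from the SUB entries covering it.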

2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95-99%   0.84   0.91   0.64   0.90   1.37
85-94%   0.07   0.05   0.25   0.76   0.81
75-84%   0.80   0.91   1.71   3.78   4.40
50-74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95-99%   0.75   0.44   0.49   0.66
85-94%   0.05   0.08   0.49   0.61
75-84%   0.46   0.96   3.67   3.85
50-74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95-99%   0.98   0.59   1.24   1.45
85-94%   0.09   0.22   1.34   1.37
75-84%   1.03   2.26   6.35   7.07
50-74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations into the target language. When we count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error example, Czech -> English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.
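The construction of such a "mixed" segment can be illustrated with a minimal sketch; the function name, the toy subsegment TM, and the Czech translations are ours, purely for illustration:

```python
def assemble_mixed(subsegments, tm):
    """Build a 'mixed' output segment: subsegments covered by the TM are
    replaced by their stored translations, while unknown subsegments pass
    through 'as is' in the source language."""
    return " ".join(tm.get(phrase, phrase) for phrase in subsegments)

# Toy English->Czech subsegment TM (illustrative data only).
tm = {"a table": "stůl", "by the window": "u okna"}

mixed = assemble_mixed(["we would like", "a table", "by the window"], tm)
print(mixed)  # a mix of source- and target-language words
```

A metric such as METEOR then scores this mixed string against the reference, so untranslated source-language material is penalised naturally.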


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
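The planned fluency thresholding could look roughly as follows; a toy unigram table stands in for the KenLM model here, and all names, scores, and the threshold value are illustrative assumptions, not the system's actual configuration:

```python
def lm_score(phrase, logprob, oov=-6.0):
    """Average per-token log10 probability, in the style of an n-gram
    language-model score; `logprob` is a toy unigram table here."""
    tokens = phrase.lower().split()
    return sum(logprob.get(t, oov) for t in tokens) / len(tokens)

def filter_fluent(phrases, logprob, threshold=-3.0):
    """Keep only candidate phrase translations whose language-model
    score passes the fluency threshold."""
    return [p for p in phrases if lm_score(p, logprob) >= threshold]

# Illustrative unigram log-probabilities.
logprob = {"the": -1.0, "data": -2.5, "area": -2.8, "shape": -3.2}
candidates = ["the data area", "area data any shape"]
print(filter_fluent(candidates, logprob))
```

With a real KenLM model, the unigram lookup would be replaced by the model's full n-gram query, but the thresholding logic stays the same.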

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated techniques for filtering out phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.1)5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0]
["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0]
["window","window",C,0] [".",".",C,0]

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs, or weights, can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
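The effect of cheap deletions on ranking can be sketched with a word-level weighted edit distance. This is a simplification of TER that ignores the shift operation, and the weights are illustrative, not the values used in the actual system:

```python
def weighted_edit_cost(tm_seg, test_seg, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Word-level edit cost of turning a TM match into the test segment.
    Deletions are cheap, mirroring the observation that deleting
    highlighted words takes less post-editing effort than inserting."""
    s, t = tm_seg.split(), test_seg.split()
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = d[i - 1][0] + w_del
    for j in range(1, len(t) + 1):
        d[0][j] = d[0][j - 1] + w_ins
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,    # delete a TM word
                          d[i][j - 1] + w_ins,    # insert a test word
                          d[i - 1][j - 1] + sub)  # match / substitute
    return d[len(s)][len(t)]

test = "how much does it cost"
tm1 = "how much does it cost to the holiday inn by taxi"
tm2 = "how much"
ranked = sorted([tm1, tm2], key=lambda m: weighted_edit_cost(m, test))
print(ranked[0])  # the longer match wins: 6 cheap deletions beat 3 insertions
```

With equal weights the ranking would flip, which is exactly the behaviour the paragraph above argues against.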

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. A green portion indicates a matched fragment and a red portion indicates a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
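Reading such an alignment line back into a usable structure is straightforward; a sketch for the parenthesised format shown above (the function name is ours, and the brace-less "( ... )" notation of the example is assumed):

```python
import re

def parse_alignment(line):
    """Parse an alignment line of the form
    'NUL ( ) we ( 1 ) want ( 6 ) ...' into (source_word,
    [aligned_target_positions]) pairs; NUL collects the target
    positions left unaligned."""
    return [(word, [int(p) for p in positions.split()])
            for word, positions in re.findall(r"(\S+)\s*\(([\d\s]*)\)", line)]

line = "NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 )"
pairs = parse_alignment(line)
print(pairs[:3])
```

Each pair then tells the tool which target-side words a given source word projects onto when propagating the green/red tags.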

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
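In plain-text form, deriving the green/red tags for a TM match from its TER edit operations can be sketched as follows; this is a stand-in for the tool's actual interface rendering, with names of our own choosing:

```python
def color_code(tm_words, ops):
    """Assign a color to each word of the TM match from the TER edit
    operations: 'C' (correct) -> green, 'S'/'D' -> red. 'I' (insertion)
    consumes no TM word, so it is skipped here."""
    tagged, i = [], 0
    for op in ops:
        if op == "I":
            continue
        tagged.append((tm_words[i], "green" if op == "C" else "red"))
        i += 1
    return tagged

tm_match = "we want to have a table near the window".split()
ops = list("CDSSCCSCC")  # TER operations from the example in Section 3.1
print(color_code(tm_match, ops))
```

The target side is colored analogously, by projecting these tags through the GIZA++ word alignment.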

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me


5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for the similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists. This feature is there to compare the results with and without posting lists: in the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
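A minimal version of this candidate pre-selection could look like the following; the stop list and the toy TM data are illustrative, and, as described above, only lowercased surface forms are indexed, with no stemming:

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "to", "of", "is"}  # illustrative stop list

def build_index(tm_sources):
    """Inverted index: surface form (lowercased) -> posting list of
    TM sentence ids. Built offline, once, over the TM database."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOPWORDS:
                index[word].add(sid)
    return index

def candidates(index, query):
    """TM sentences sharing at least one vocabulary word with the
    query; only these go on to the expensive TER comparison."""
    ids = set()
    for word in query.lower().split():
        ids |= index.get(word, set())
    return sorted(ids)

tm = ["how much does it cost",
      "you gave me the wrong change",
      "a table near the window"]
index = build_index(tm)
print(candidates(index, "we would like a table by the window"))
```

Only the returned sentence ids are scored with TER, which is what makes the large-TM case tractable.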

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semicolon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to the different edit operations.


Figure 1: Screenshot with the color coding scheme.

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. Pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).

42

Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodoroou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

4 Machine Learning

In this section we discuss the features computed for the training and the test sets. Moreover, we briefly present the algorithms used for classification and the rationale for using them.

4.1 Features

The features computed for the training and test set are the following:

• same. This feature takes two values, 0 and 1. It has value 1 if the source and target segments are equal. There are cases, specifically in the collaborative part of MyMemory, when the source segment is copied into the target segment. Of course there are perfectly legitimate cases when the source and target segments are the same (e.g. when the source segment is a named entity that has the same form in the target language), but many times the value 1 indicates a spam attempt.

• cg_score. This feature is the Church-Gale score described in equation 1. This score reflects the idea that the lengths of source and destination segments that are true translations are correlated. We expect the classifiers to learn the threshold that separates the positive and negative examples. However, relying exclusively on the Church-Gale score is tricky, because there are cases when a high Church-Gale score is perfectly legitimate, for example when acronyms in the source language are expanded in the target language.
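As a rough sketch of such a length-based score, the snippet below uses the standard Gale and Church (1993) statistic with the constants from their original paper (expected target/source character-length ratio c = 1, variance s² = 6.8); the paper's equation 1 may differ in detail:

```python
import math

def church_gale_score(source: str, target: str,
                      c: float = 1.0, s2: float = 6.8) -> float:
    """Length-based score in the style of Gale and Church (1993).

    delta = |l2 - c * l1| / sqrt(l1 * s2), where l1 and l2 are the
    character lengths of the source and target segments.  True
    translations of similar length score close to 0; mismatched
    lengths score high.
    """
    l1, l2 = len(source), len(target)
    return abs(l2 - c * l1) / math.sqrt(max(l1, 1) * s2)

# A plausible translation pair scores low ...
low = church_gale_score("the cat sleeps", "il gatto dorme")
# ... while a wildly longer target scores high.
high = church_gale_score(
    "OK", "this target segment is far too long to be a translation")
```

A classifier can then learn a decision threshold over this score instead of us fixing one by hand.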

• has_url. The value of the feature is 1 if the source or target segment contains a URL address, otherwise it is 0.

• has_tag. The value of the feature is 1 if the source or target segment contains a tag, otherwise it is 0.

• has_email. The value of the feature is 1 if the source or target segment contains an email address, otherwise it is 0.

• has_number. The value of the feature is 1 if the source or target segment contains a number, otherwise it is 0.

• has_capital_letters. The value of the feature is 1 if the source or target segment contains words with at least one capital letter, otherwise it is 0.

• has_words_capital_letters. The value of the feature is 1 if the source or target segment contains words that consist entirely of capital letters, otherwise it is 0. Unlike the previous feature, this one activates only when there exist whole words in capital letters.

• punctuation_similarity. The value of this feature is the cosine similarity between the punctuation vectors of the source and destination segments. The intuition behind this feature is that the source and target segments should have similar punctuation vectors if they are true translations.
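The punctuation_similarity feature can be computed from simple count vectors; a minimal sketch (the paper does not specify the punctuation inventory, so we assume Python's string.punctuation; the same machinery applies to the tag, email, URL and number vectors introduced below):

```python
from collections import Counter
import math
import string

def punctuation_vector(segment: str) -> Counter:
    """Count each punctuation character occurring in the segment."""
    return Counter(ch for ch in segment if ch in string.punctuation)

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

src = "Hello, world: how are you?"
tgt = "Ciao, mondo: come stai?"
# Both segments contain exactly one ",", ":" and "?", so the vectors align.
sim = cosine_similarity(punctuation_vector(src), punctuation_vector(tgt))
```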

• tag_similarity. The value of this feature is the cosine similarity between the tag vectors of the source and destination segments. The reason for introducing this feature is that the source and target segments should contain very similar tag vectors if they are true translations. This feature combines with has_tag to exhaust all possibilities (e.g. the tag exists/does not exist, and if it exists, it is present/is not present in both the source and the target segments).

• email_similarity. The value of the feature is the cosine similarity between the email vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag_similarity. This feature combines with has_email to exhaust all possibilities.

• url_similarity. The value of the feature is the cosine similarity between the URL address vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag_similarity.

• number_similarity. The value of the feature is the cosine similarity between the number vectors of the source and destination segments. The reasoning for introducing this feature is the same as for tag_similarity.

• bisegment_similarity. The value of the feature is the cosine similarity between the destination segment and the translation of the source segment into the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital_letters_word_difference. The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source and target segments and the total number of capital-letter words in the bi-segment. It is complementary to the feature has_capital_letters.

• only_capletters_dif. The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source and target segments and the total number of only-capital-letter words in the bi-segment. It is complementary to the feature has_words_capital_letters.

• lang_dif. The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en/it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en/it-fr), and if it detects "de" and "fr" (en-de/it-fr), then the value is 2.
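A sketch of how lang_dif might be computed (the feature name follows the paper; the implementation is our assumption):

```python
def lang_dif(declared, detected):
    """Count how many of the two language codes disagree with the
    detector's output.

    `declared` and `detected` are (source, target) pairs of ISO
    language codes, e.g. ("en", "it")."""
    return sum(1 for want, got in zip(declared, detected) if want != got)

# The three cases from the text:
lang_dif(("en", "it"), ("en", "it"))  # 0: en-en/it-it
lang_dif(("en", "it"), ("en", "fr"))  # 1: en-en/it-fr
lang_dif(("en", "it"), ("de", "fr"))  # 2: en-de/it-fr
```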

All feature values are normalized between 0 and 1. The most important features are bisegment_similarity and lang_dif. The other features are either sparse (e.g. relatively few bi-segments contain URLs, emails or tags) or they do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu⁵.

4.2 Algorithms

As we showed in section 3, there are cases when the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation applies. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree. Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the rules they infer are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest. Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. A random forest has a lower probability of overfitting the data than a decision tree.

• Logistic Regression. Logistic regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with a linear kernel. Support vector machines are one of the most used classification algorithms.

• Gaussian Naive Bayes. If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors. This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the majority label among those neighbours. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

⁵ https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
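To make the voting mechanics of the K-Nearest Neighbors classifier concrete, here is a minimal pure-Python sketch (in practice the paper uses scikit-learn's KNeighborsClassifier; this toy version only illustrates the distance-plus-majority-vote idea):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training
    points under Euclidean distance.

    `train` is a list of (feature_vector, label) pairs."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "neg"), ((0.1, 0.2), "neg"),
         ((0.9, 1.0), "pos"), ((1.0, 0.8), "pos"), ((0.8, 0.9), "pos")]

knn_predict(train, (0.95, 0.9))  # "pos": all three nearest are positive
knn_predict(train, (0.05, 0.1))  # "neg": the two negatives outvote one positive
```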

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in Table 1.
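"Three-fold stratified" means each fold preserves the class ratio of the full training set. A minimal sketch of the fold construction (in practice scikit-learn's StratifiedKFold does this; the helper below is illustrative only):

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=3):
    """Partition example indices into n_folds folds while preserving
    the class distribution: each class's indices are dealt round-robin
    across the folds."""
    folds = [[] for _ in range(n_folds)]
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

# A 2:1 positive/negative set: every fold keeps that 2:1 ratio.
labels = ["pos"] * 6 + ["neg"] * 3
folds = stratified_folds(labels)
```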

Algorithm              Precision   Recall   F1
Random Forest            0.95       0.97    0.96
Decision Tree            0.98       0.97    0.97
SVM                      0.94       0.98    0.96
K-Nearest Neighbors      0.94       0.98    0.96
Logistic Regression      0.92       0.98    0.95
Gaussian Naive Bayes     0.86       0.96    0.91
Baseline Uniform         0.69       0.53    0.60
Baseline Stratified      0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification

Except for Gaussian Naive Bayes, all algorithms have excellent results, and all beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in Table 2.

The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the corresponding difference in the first evaluation. This fact might be explained partially by the great variety of the bi-segments in the MateCat and Web Sets. Obviously this variety is not fully captured by the training set.

Algorithm              Precision   Recall   F1
Random Forest            0.85       0.63    0.72
Decision Tree            0.82       0.69    0.75
SVM                      0.82       0.81    0.81
K-Nearest Neighbors      0.83       0.66    0.74
Logistic Regression      0.80       0.80    0.80
Gaussian Naive Bayes     0.76       0.61    0.68
Baseline Uniform         0.71       0.72    0.71
Baseline Stratified      0.70       0.51    0.59

Table 2: The results of the classification on the test set

Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression, both producing F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where the precision is increased, even if the recall drops. We make some suggestions in the next section.
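The arithmetic behind this argument, using the confusion counts quoted above:

```python
# SVM confusion counts reported for the 309 test examples.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)                        # 175/217, about 0.81
recall = tp / (tp + fn)                           # 175/207, about 0.85
f1 = 2 * precision * recall / (precision + recall)

# Fraction of ALL bi-segments that are good pairs wrongly deleted:
lost_good_pairs = fn / (tp + fp + fn + tn)        # 32/309, about 10%
```

This is why the paper argues for trading recall for precision: the 10% of false negatives are valid translations that automatic cleaning would destroy.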

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples are dissimilar: the distribution is closer to the normal distribution for the MateCat Set and sparser for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
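A sketch of this precision-oriented ranking idea: act only on bi-segments about which the classifier is very confident and leave the middle band for human review (the threshold and all names below are our own, not from the paper):

```python
def high_confidence_labels(probabilities, threshold=0.9):
    """Triage bi-segments by classifier confidence.

    `probabilities` maps a bi-segment id to the classifier's estimated
    probability that the pair is a true translation (e.g. taken from
    scikit-learn's predict_proba).  Only very confident predictions are
    kept or deleted; uncertain pairs go to manual review."""
    keep, delete, review = [], [], []
    for seg_id, p in probabilities.items():
        if p >= threshold:
            keep.append(seg_id)
        elif p <= 1.0 - threshold:
            delete.append(seg_id)
        else:
            review.append(seg_id)
    return keep, delete, review

keep, delete, review = high_confidence_labels(
    {"a": 0.97, "b": 0.05, "c": 0.55})
# "a" is kept, "b" is deleted, "c" is left for review
```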

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull. Med. Libr. Assoc., 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM, but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types, such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Clause matches are therefore more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas and so allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of the segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions to a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 … w20

(b) w1 w2 w3 w4 w5 w21 … w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
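The effect can be sketched with any string-similarity scorer; below, difflib's ratio stands in for a TM fuzzy score to show that the clause-level query only surfaces once the TM segment is stored clause by clause (a toy illustration, not the tools' actual algorithm):

```python
import difflib

def best_match(query, tm_segments, threshold=75.0):
    """Return (segment, score) for the best fuzzy match at or above
    `threshold`, or None.  difflib's ratio (scaled to percent) stands
    in for a TM tool's fuzzy match score."""
    scored = [(seg, 100.0 * difflib.SequenceMatcher(None, query, seg).ratio())
              for seg in tm_segments]
    seg, score = max(scored, key=lambda pair: pair[1])
    return (seg, score) if score >= threshold else None

query = "w1 w2 w3 w4 w5"                     # the input clause
whole = ("w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 "
         "w11 w12 w13 w14 w15 w16 w17 w18 w19 w20")
split = ["w1 w2 w3 w4 w5",
         "w6 w7 w8 w9 w10 w11 w12 w13 w14 w15 w16 w17 w18 w19 w20"]

miss = best_match(query, [whole])  # None: the whole sentence scores ~33%
hit = best_match(query, split)     # the stored clause is retrieved at 100%
```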

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Nonfinite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is applied to both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue that it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one), and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1; the underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); the TM stores the component clauses of the original longer segment, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is given in Table 2.

Without clause splitting

Input:
  the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

Segment in TM:
  the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting

Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war

Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting

Input:
  a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM:
  a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting

Input:
  a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           23.33%                90.00%
Retrieved (minimum threshold)           38.00%                92.22%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           14.00%                88.89%
Retrieved (minimum threshold)           14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set we observed that although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)            2.67%                17.84%
Retrieved (minimum threshold)           10.00%                41.08%

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)            3.33%                25.95%
Retrieved (minimum threshold)           36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment, we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments, the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
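The averaging-then-testing procedure described above can be sketched as follows. The scores below are invented placeholders rather than the paper's data, and the t-statistic uses the standard paired formula rather than SPSS:

```python
import math
from statistics import mean, stdev

def case_score(clause_scores):
    """A segment split into several clauses contributes one case:
    the average of its clauses' match scores (0 = not retrieved)."""
    return mean(clause_scores)

def paired_t(scores_a, scores_b):
    """Paired t-statistic over per-segment match scores."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Invented example: fuzzy match scores (%) per segment, with and
# without clause splitting; the first segment splits into 2 clauses.
without_split = [0, 0, 72, 0, 68]
with_split = [case_score([85, 85]), 90, 95, 78, 88]

t = paired_t(without_split, with_split)
```

The significance level would then be read off the t-distribution with n-1 degrees of freedom, which is what a statistics package such as SPSS reports as the p-value.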

SET A

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   19.23                       85.30888891                0.000
Minimum threshold   29.28                       86.48888891                0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   11.78                       83.87777781                0.000
Minimum threshold   11.78                       88.07111113                0.000

SET B

WORDFAST            Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.23                        17.21111111                0.000
Minimum threshold   6.49                        30.62111112                0.000

TRADOS              Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold   2.71                        23.083                     0.000
Minimum threshold   16.83                       46.36777779                0.000

Table 5: Paired t-test on all experiments

Table 4: Percentage of correctly retrieved segments in Set B

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)   2.67%                  17.84%
Retrieved (Minimum threshold)   10.00%                 41.08%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)   3.33%                  25.95%
Retrieved (Minimum threshold)   36.00%                 70.67%

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as they are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work, we wish to incorporate alignment so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University. [unpublished]

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature, Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been human translated in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human centralized evaluations.

The rest of the paper is organized as follows: section 2 discusses past related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the MT raw output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.

3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
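Word-level edit distance and the resulting fuzzy match score can be sketched as follows. Note that normalizations differ between tools and are usually undisclosed; this variant divides by the length of the longer segment, so its numbers need not match the percentages quoted above:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def fuzzy_match(s1, s2):
    """1 - distance / length of the longer segment
    (one of several normalizations used in practice)."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    return 1 - edit_distance(w1, w2) / max(len(w1), len(w2))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
score = fuzzy_match(s1, s2)  # 1 substitution + 3 deletions over 10 words
```

Character-based scoring works the same way, with `list(s1)` and `list(s2)` in place of the word lists.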

In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains its verbal counterpart. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, as the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: they appear either without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
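Read back, an entry of this simplified form splits into a lemma, a category, and a set of properties. The sketch below only illustrates the notation of Figure 1; it is not NooJ's actual dictionary loader:

```python
def parse_entry(entry):
    """Split a simplified NooJ-style entry 'lemma,CAT+Key=Value+Flag'
    into (lemma, category, features)."""
    lemma, rest = entry.split(",", 1)
    category, *props = rest.split("+")
    features = {}
    for prop in props:
        if "=" in prop:
            key, value = prop.split("=", 1)
            features[key] = value   # e.g. FLX=TABLE (inflection paradigm)
        else:
            features[prop] = True   # bare flag such as +Hum
    return lemma, category, features
```

For example, `parse_entry("artist,N+FLX=TABLE+Hum")` yields the lemma `artist`, category `N`, and the features `FLX=TABLE` and `Hum`.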

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs, and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
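NooJ local grammars are finite-state transducers over dictionary analyses; as a rough illustration only, a single simple-present pattern of this kind can be mimicked with a regular expression and a toy nominalization lexicon. Both the pattern and the lexicon below are invented for illustration and are not taken from the authors' module:

```python
import re

# Toy lexicon mapping predicate nouns to their full verbs (invented)
NOMINALIZATIONS = {
    "cancellation": "cancel",
    "installation": "install",
    "reservation": "reserve",
}

# support verb + optional determiner + predicate noun + optional "of"
SVC_PATTERN = re.compile(
    r"\bmakes?\s+(?:(?:an|a|the)\s+)?(\w+)(?:\s+of)?\b", re.IGNORECASE
)

def paraphrase_svc(sentence):
    """Replace a recognized SVC with its full-verb paraphrase;
    unknown verb-noun combinations are left untouched."""
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(1).lower())
        return verb if verb else match.group(0)
    return SVC_PATTERN.sub(repl, sentence)
```

Applied to sentence (1) above, this sketch produces sentence (2); a real grammar must also handle other support verbs, tenses, moods and agreement.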

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units.

This TM should be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, such that at least the minimum threshold of 50% is met. To create the analysis report logs we used Trados Studio 2014.1

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%); however, we achieve very high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14     48
95%-99%                  23     32
85%-94%                  18     29
75%-84%                  51     38
50%-74%                  32     18
No match                 62     35
Total                   200    200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use a suggestion without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required in case the match is not equal to 100%.

Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Segment:      Make sure that the brake hose is not twisted
TM source:    Ensure that the brake hose is not twisted
TM target:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTM source: Make sure that the brake hose is not twisted

Segment:      CAUTION You must make the istallation of the version 6 of the software
TM source:    CAUTION You must install the version 6 of the software
TM target:    ATTENZIONE Si deve installare la versione 6 del software
parTM source: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same although the sentences don't share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is abundant compared to MT approaches.

In the future, we will continue to explore the paraphrasing of other support verbs, and to support other languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes of thousands to millions of translation pairs.

Only recently, some TMs have been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ         EN          probabilities                  alignment
být větší  be greater  0.538 0.053 0.538 0.136       0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148       0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) toolkit directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
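As a minimal sketch of such a selection step (ours, not the authors' code): the paper only states that the four Moses scores are used to pick the best translation, so we assume a uniform log-linear combination; the candidate values are taken from the example in Figure 1.

```python
import math

# Hypothetical phrase-table entries for one source subsegment ("být větší"):
# each candidate carries the four Moses scores (inverse phrase probability,
# inverse lexical weighting, direct phrase probability, direct lexical weighting).
candidates = {
    "be greater": (0.538, 0.053, 0.538, 0.136),
    "be larger":  (0.170, 0.054, 0.019, 0.148),
}

def score(probs):
    """Combine the four probabilities in log space (uniform weights assumed)."""
    return sum(math.log(p) for p in probs)

def best_translation(candidates):
    """Pick the candidate translation with the highest combined score."""
    return max(candidates, key=lambda t: score(candidates[t]))

print(best_translation(candidates))  # -> 'be greater'
```

With these example scores, "be greater" wins on every component except the direct lexical weighting, so any reasonable weighting would select it.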

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
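The three cases can be read off the alignment points mechanically. A small sketch (ours, under the assumption that points are Moses-style `src-tgt` index pairs):

```python
def parse_alignment(s):
    """Parse Moses-style alignment points such as '0-0 1-1'."""
    return [tuple(map(int, p.split("-"))) for p in s.split()]

def alignment_properties(points, src_len, tgt_len):
    """Detect the three cases mentioned in the text: 1) empty alignment
    (some word is unaligned), 2) one-to-many links, 3) opposite
    orientation (reordering). A sketch, not the authors' actual filter."""
    src_covered = {s for s, _ in points}
    tgt_covered = {t for _, t in points}
    empty = len(src_covered) < src_len or len(tgt_covered) < tgt_len
    one_to_many = (
        any(sum(1 for s, _ in points if s == i) > 1 for i in src_covered)
        or any(sum(1 for _, t in points if t == j) > 1 for j in tgt_covered)
    )
    # Walk targets left to right; source indices must not decrease.
    by_target = sorted(points, key=lambda p: (p[1], p[0]))
    sources = [s for s, _ in by_target]
    reordered = any(a > b for a, b in zip(sources, sources[1:]))
    return {"empty": empty, "one_to_many": one_to_many, "reordered": reordered}

# 'být větší' -> 'be greater' with points 0-0 1-1: monotone, fully aligned.
print(alignment_properties(parse_alignment("0-0 1-1"), 2, 2))
```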

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: "can be divided into the following categories"

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
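The overlap/neighbour distinction that separates the O and N variants can be decided from the token spans where the two subsegments occur in the document segment. A small sketch (our illustration; spans and the example offsets are hypothetical):

```python
def join_type(span_a, span_b):
    """Classify two subsegment occurrences in a document segment.
    Spans are (start, end) token indices, end exclusive.
    Returns 'overlap' (the O variants), 'neighbour' (the N variants),
    or None when the spans are neither overlapping nor adjacent."""
    (a0, a1), (b0, b1) = sorted([span_a, span_b])
    if b0 < a1:       # the later span starts inside the earlier one
        return "overlap"
    if b0 == a1:      # the later span starts right where the earlier ends
        return "neighbour"
    return None

# e.g. tokens 0..3 and 2..5 of a segment share token 2 -> overlap
print(join_type((0, 3), (2, 5)))   # -> overlap
print(join_type((0, 2), (2, 5)))   # -> neighbour
print(join_type((0, 2), (3, 5)))   # -> None (a one-token gap remains)
```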

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.²

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and it discards one processed subsegment.
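Our reading of this loop can be sketched as follows (a sketch only: the exact control flow is not fully specified in the text, and we represent subsegments simply as token-index spans, with overlap-or-adjacency as the join condition):

```python
def span_len(s):
    return s[1] - s[0]

def can_join(a, b):
    """Two occurrences can be joined when they overlap or neighbour."""
    (a0, a1), (b0, b1) = sorted([a, b])
    return b0 <= a1

def join(a, b):
    """Merge two joinable spans into one longer span."""
    (a0, a1), (b0, b1) = sorted([a, b])
    return (a0, max(a1, b1))

def greedy_join(spans):
    """Longest-first JOIN loop: take the biggest span, join it with every
    compatible partner, and prepend the newly created (longer) spans to
    the work list (temporary list T prepended to list I)."""
    work = sorted(spans, key=span_len, reverse=True)   # list I
    seen = set(work)
    while work:
        current = work.pop(0)
        new = [join(current, o) for o in work if can_join(current, o)]
        new = [n for n in new if n not in seen]        # temporary list T
        if new:
            seen.update(new)
            work = sorted(new, key=span_len, reverse=True) + work
    return max(seen, key=span_len)

# Three subsegment occurrences inside a 6-token document segment grow
# into full coverage of tokens 0..6.
print(greedy_join([(0, 3), (2, 5), (5, 6)]))  # -> (0, 6)
```

The `seen` set keeps the loop from re-deriving the same span, which guarantees termination since only finitely many spans exist for a segment.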

2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013) with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
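The banding itself is easy to reproduce; only MemoQ's fuzzy-match score is proprietary. The sketch below uses `difflib`'s token-level ratio purely as a stand-in for that score, so the concrete percentages it produces are illustrative, not MemoQ's:

```python
import difflib

# Match bands as used in the analysis tables (en dashes as in the paper).
BANDS = [(100, 100, "100%"), (95, 99, "95–99%"), (85, 94, "85–94%"),
         (75, 84, "75–84%"), (50, 74, "50–74%")]

def match_percent(doc_seg, tm_seg):
    """Rough token-level similarity in %, a stand-in for MemoQ's
    proprietary fuzzy-match score."""
    sm = difflib.SequenceMatcher(None, doc_seg.split(), tm_seg.split())
    return round(100 * sm.ratio())

def band(score):
    """Place a match score into its analysis category."""
    for lo, hi, name in BANDS:
        if lo <= score <= hi:
            return name
    return "no match"

print(band(match_percent("we would like a table by the window",
                         "we want to have a table near the window")))
```

Under this stand-in metric the example pair scores 59 and lands in the 50–74% band.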


Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For a comparison, we have also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 of MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the assets of particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filtering for removing phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri23,
Mihaela Vela2, Josef van Genabith23

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System DescriptionWe demonstrate the functionalities and fea-tures of CATaLog in an English - Bengalitranslation task The TM database consistsof English sentences taken from BTEC3 (Ba-sic Travel Expression Corpus) corpus and theirBengali translations4 Unseen input or testsegments are provided to the post-editing tooland the tool matches each of the input seg-ments to the most similar segments containedin the TM TM segments are then ranked ac-cording their the similarity to the test sen-tence using the popular Translation ErrorRate (TER) metric (Snover et al 2009) Thetop 5 most similar segments are chosen andpresented to the translator ordered by theirsimilarity

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match with the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we want to    have a table near the window
|  D    S     S    | |     S    |   |
we -    would like a table by   the window
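The C/D/S/I labels in the TER alignment shown above can be reproduced with standard edit-distance dynamic programming. The sketch below is a simplified stand-in for tercom (it omits TER's shift operation, which this example does not need):

```python
def edit_alignment(hyp, ref):
    """Levenshtein-style alignment over tokens, emitting one op per step:
    C (match), S (substitute), D (delete from hyp), I (insert from ref).
    A simplified stand-in for tercom's TER alignment, without shifts."""
    h, r = hyp.split(), ref.split()
    n, m = len(h), len(r)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (h[i - 1] != r[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace, preferring match/substitute moves.
    ops, i, j = [], n, m
    while i or j:
        if i and j and cost[i][j] == cost[i - 1][j - 1] + (h[i - 1] != r[j - 1]):
            ops.append("C" if h[i - 1] == r[j - 1] else "S")
            i, j = i - 1, j - 1
        elif i and cost[i][j] == cost[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return "".join(reversed(ops))

print(edit_alignment("we want to have a table near the window",
                     "we would like a table by the window"))
# -> CDSSCCSCC
```

This reproduces the C D S S | | S | | pattern of the example: one deletion ("want"), three substitutions, and five matches.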

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment


2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
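The effect of cheap deletions can be sketched with a weighted edit distance. The weights below (0.1 for deletion, 1.0 for the rest) are hypothetical; the paper does not publish its actual values, only that deletion gets a very low weight:

```python
def weighted_edit_cost(tm_tokens, test_tokens, w_del=0.1, w_other=1.0):
    """Hypothetical weighted edit distance: deleting a word from the TM
    match is cheap, insertion and substitution cost full weight
    (mirroring the argument in the text, not the paper's exact weights)."""
    n, m = len(tm_tokens), len(test_tokens)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + w_del      # delete TM word
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + w_other    # insert test word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = tm_tokens[i - 1] == test_tokens[j - 1]
            d[i][j] = min(d[i - 1][j - 1] + (0.0 if same else w_other),
                          d[i - 1][j] + w_del,
                          d[i][j - 1] + w_other)
    return round(d[n][m], 2)

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
print(weighted_edit_cost(tm1, test))  # 6 cheap deletions -> 0.6
print(weighted_edit_cost(tm2, test))  # 3 full-weight insertions -> 3.0
```

With uniform weights, TM segment 2 (cost 3) would beat TM segment 1 (cost 6); with cheap deletions the ranking flips, matching the argument above.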

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched parts and unmatched parts in each reference translation. Green portions imply that they are matched fragments and red portions imply a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
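A simplified reading of this projection step (our sketch, not the tool's code): a target token is colored green exactly when it is word-aligned to a TM-source token that the TER alignment marked as a match (C). The example values below are taken from the alignment example in Section 3.1 and the GIZA++ links above, restricted to the six Bengali word tokens and re-indexed from 0:

```python
def color_target(tgt_len, ter_ops, src2tgt):
    """Project TER match/mismatch info onto the TM target sentence.
    ter_ops: one of C/S/D per TM-source token (I consumes no source
    token and is skipped); src2tgt: dict src index -> set of tgt indices."""
    matched_src = set()
    s = 0
    for op in ter_ops:
        if op == "I":          # insertion: no TM-source token consumed
            continue
        if op == "C":
            matched_src.add(s)
        s += 1
    colors = ["red"] * tgt_len
    for src, tgts in src2tgt.items():
        if src in matched_src:
            for t in tgts:
                colors[t] = "green"
    return colors

# TER ops over "we want to have a table near the window" (C D S S C C S C C)
# and the GIZA++ links to the Bengali tokens, 0-indexed.
ops = "CDSSCCSCC"
links = {0: {0}, 1: {5}, 4: {3}, 5: {4}, 6: {2}, 8: {1}}
print(color_target(6, ops, links))
# -> ['green', 'green', 'red', 'green', 'green', 'red']
```

Tokens aligned to "want" (deleted) and "near" (substituted) come out red, i.e. they are the parts the post-editor has to touch.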

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of green to red. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as top candidates. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color coded top 5 TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a long time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval based on an inverted index.

We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and are of little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store words in their surface form: only if a word appears in exactly the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario, the TM output should be identical in both cases, though the time taken to produce it will differ significantly.
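The posting-list filter can be sketched as follows (an illustrative re-implementation; the stop-word list and segment ids are stand-ins):

```python
from collections import defaultdict

STOP_WORDS = {'the', 'a', 'to', 'is', 'i', 'you'}  # illustrative stand-in

def build_postings(tm_sources):
    """Map each lowercased, unstemmed content word to the ids of the
    TM source segments containing it (an inverted index)."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidates(input_sentence, postings):
    """Keep only segments sharing at least one vocabulary word with the
    input; only these go through the expensive TER-based scoring."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["you gave me the wrong change",
      "we want a table near the window",
      "you are wrong"]
p = build_postings(tm)
print(sorted(candidates("you gave me wrong number", p)))  # segments 0 and 2
```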

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at once. In this case, the TM output is saved in a log file which post-editors can later work with directly in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the calculation of similarity scores to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodoroou, Konstantinos, 24
Horak, Ales, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

• bisegment similarity: The value of the feature is the cosine similarity between the destination segment and the source segment's translation in the destination language. It formalizes the idea that if the target segment is a true translation of the source segment, then a machine translation of the source segment should be similar to the target segment.

• capital letters word difference: The value of the feature is the ratio between the difference in the number of words containing at least one capital letter in the source segment and the target segment, and the sum of the capital-letter words in the bi-segment. It is complementary to the feature has capital letters.

• only capletters dif: The value of the feature is the ratio between the difference in the number of words containing only capital letters in the source segment and the target segment, and the sum of the only-capital-letter words in the bi-segment. It is complementary to the feature has words capital letters.

• lang dif: The value of the feature is calculated from the language codes declared in the segment and the language codes detected by a language detector. For example, if we expect the source segment language code to be "en" and the target segment language code to be "it", and the language detector detects "en" and "it", then the value of the feature is 0 (en-en, it-it). If instead the language detector detects "en" and "fr", then the value of the feature is 1 (en-en, it-fr), and if it detects "de" and "fr" (en-de, it-fr), then the value is 2.

All feature values are normalized between 0 and 1. The most important features are bisegment similarity and lang dif. The other features are either sparse (e.g., relatively few bi-segments contain URLs, emails or tags) or do not describe the translation process very accurately. For example, we assumed that the punctuation in the source and target segments should be similar, which is true for many bi-segments. However, there are also many bi-segments where the translation of the source segment in the target language lacks punctuation.
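The bisegment similarity feature amounts to a cosine similarity between the machine translation of the source segment and the stored target segment. A minimal bag-of-words sketch (the Bing API translation step is stood in for by a precomputed string; the segments are invented):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two segments."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# mt_of_source stands in for the MT output for an English source segment
mt_of_source = "il gatto dorme sul divano"
target_segment = "il gatto dorme sul divano rosso"
print(cosine_similarity(mt_of_source, target_segment))
```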

The translation of the source English segment into Italian is performed with the Bing API. The computation of the language codes for the bi-segment is done with the highly accurate language detector Cybozu5.

4.2 Algorithms

As we showed in section 3, there are cases where the contributors mistake the language codes of the source and target segments. Nevertheless, the segments might be true translations. Therefore, before applying the machine learning algorithms, we first invert the source and target segments if the above situation is detected. We tested the following classification algorithms from the package scikit-learn (Pedregosa et al., 2011):

• Decision Tree: Decision trees are one of the oldest classification algorithms. Even if they are known to overfit the training data, they have the advantage that the inferred rules are readable by humans. This means that we can tamper with the automatically inferred rules and, at least theoretically, create a better decision tree.

• Random Forest: Random forests are ensemble classifiers that consist of multiple decision trees. The final prediction is the mode of the individual tree predictions. The Random Forest has a lower probability of overfitting the data than the Decision Tree.

• Logistic Regression: Logistic Regression works particularly well when the features are linearly separable. In addition, the classifier is robust to noise, avoids overfitting, and its output can be interpreted as probability scores.

• Support Vector Machines with the linear kernel: Support Vector Machines are one of the most widely used classification algorithms.

• Gaussian Naive Bayes: If the conditional independence that the naive Bayes class of algorithms postulates holds, training converges faster than logistic regression and the algorithm needs fewer training instances.

• K-Nearest Neighbors: This algorithm classifies a new instance based on its distance to k training instances. The prediction output is the label of the majority. Because it is a non-parametric method, it can give good results in classification problems where the decision boundary is irregular.

5 https://github.com/shuyo/language-detection/blob/wiki/ProjectHome.md
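The classifiers listed above, together with the two baselines used in the evaluation, can be compared in a few lines of scikit-learn (a sketch on synthetic data; the real bi-segment feature vectors are omitted, so the scores printed here are not the paper's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for the bi-segment feature vectors (7 features, 2 classes).
X, y = make_classification(n_samples=600, n_features=7, random_state=0)

classifiers = {
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(random_state=0),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM (linear)': SVC(kernel='linear'),
    'Gaussian NB': GaussianNB(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    # Analogues of the paper's random and class-distribution baselines.
    'Baseline Uniform': DummyClassifier(strategy='uniform', random_state=0),
    'Baseline Stratified': DummyClassifier(strategy='stratified', random_state=0),
}

cv = StratifiedKFold(n_splits=3)  # three-fold stratified evaluation
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring='f1')
    print(f'{name}: F1 = {scores.mean():.2f}')
```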

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline, called Baseline Uniform, generates predictions randomly. The second baseline, called Baseline Stratified, generates predictions respecting the training set class distribution. The results of the first evaluation are given in Table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all algorithms have excellent results, and all beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are in Table 2.

The results of the second evaluation are worse than those of the first. For example, the difference between the F1-scores of the best performing algorithm, SVM, and the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This fact might be partially explained by the great variety of the bi-segments in the Matecat and Web Sets; obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike the first evaluation, the second has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression. Both produce F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why this is the case, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would result in the elimination of many good bi-segments. We should therefore search for better cleaning methods where precision is increased, even if recall drops. We make some suggestions in the next section.
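The arithmetic behind this assessment can be checked directly from the reported counts (precision and recall derived this way may differ marginally from the rounded figures in Table 2):

```python
# Confusion counts for the SVM on the 309-example test set, as reported.
tp, fp, fn, tn = 175, 42, 32, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
discarded_good = fn / (tp + fp + fn + tn)  # good bi-segments thrown away

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
print(f'share of examples lost as false negatives: {discarded_good:.1%}')
```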

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distributions of the Church-Gale scores in two sets of bi-segments containing different proportions of positive and negative examples are dissimilar: the distribution is closer to normal for the MateCat set and sparser for the Collective Set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negatives. In this case, the performance in finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bulletin of the Medical Library Association, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, September 2015.

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence. At the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; clause matches are therefore more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas and thus allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs, like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions to a single representation. A retrieval mechanism operating on these generalised representations is then used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, the ultimate goal of which is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20

(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
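The retrieval failure described above can be illustrated with a simple character-level fuzzy match score (one common Levenshtein-based formulation; commercial TM tools use their own variants, and the example segments are invented):

```python
def levenshtein(a, b):
    """Standard character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_score(input_seg, tm_seg):
    """Similarity in [0, 1]: 1 minus the normalized edit distance."""
    return 1 - levenshtein(input_seg, tm_seg) / max(len(input_seg), len(tm_seg))

clause = "the committee approved the proposal"
full_sentence = clause + " although several members voted against it in the second reading"
# The clause scores well below a typical ~70% threshold against the full
# TM sentence, but is an exact match against the split-off clause.
print(fuzzy_score(clause, full_sentence))
print(fuzzy_score(clause, clause))
```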

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, namely that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting still results in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
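Commercial TM tools do not disclose their exact scoring formulas, but the word-level Levenshtein-based matching discussed above can be sketched as follows. This is a minimal illustration following the common formulation (score = 1 - edit distance / longer segment length), not Wordfast's or Trados's actual algorithm; the example pair is taken from the Set A data used later.

```python
# Sketch of word-based Levenshtein fuzzy matching. Real TM tools use
# proprietary variants; this follows the common formulation
# score = 1 - edit_distance / max(segment lengths).

def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_score(input_seg, tm_seg):
    a, b = input_seg.split(), tm_seg.split()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# A single clause used as input against a multi-clause TM segment:
inp = "the ministry of defense once indicated"
tm = ("the ministry of defense once indicated that about 20000 "
      "soldiers were missing in the korean war")
score = fuzzy_score(inp, tm)
# The extra clause counts as insertions, dragging the score far below
# the 70-75% default thresholds and between the 30-40% minimums.
print(round(score, 2))
```

This illustrates why, without clause splitting, a segment whose extra clauses inflate the edit distance may only surface at the minimum threshold, if at all.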

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. An example is presented in Table 2.

Without clause splitting

  Input:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting

  Input:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
  Segments in TM:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors

Table 1: Set A Example

Without clause splitting

  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM:
    a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting

  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B Example
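The pre-processing step behind these examples can be sketched as follows. The paper uses Puscasu's (2004) rule-based clause splitter; the naive marker-based splitter below is only a hypothetical stand-in (a real splitter detects finite verbs and handles coordination) used to show how both the input and TM sides are expanded into clause-level segments before import.

```python
# Illustrative stand-in for the clause-splitting pre-processing step.
# This toy splitter breaks before a few clause-introducing markers; the
# actual system uses Puscasu's (2004) rule-based splitter instead.

import re

# Zero-width split before common clause markers (toy approximation).
CLAUSE_BOUNDARY = re.compile(
    r"(?=\b(?:that|which|because|although|while|when)\b)")

def naive_clause_split(segment):
    """Split a segment before each clause marker."""
    return [p.strip() for p in CLAUSE_BOUNDARY.split(segment) if p.strip()]

def preprocess_tm(input_segments, tm_segments):
    """Expand both sides into clause-level segments, as in Sets A and B."""
    split_inputs = [c for s in input_segments for c in naive_clause_split(s)]
    split_tm = [c for s in tm_segments for c in naive_clause_split(s)]
    return split_inputs, split_tm

tm = ["the ministry of defense once indicated that about 20000 soldiers "
      "were missing in the korean war and that the ministry of defense "
      "believes there may still be some survivors"]
inp = ["the ministry of defense once indicated"]
split_inp, split_tm = preprocess_tm(inp, tm)
```

On this example the toy splitter yields three TM clauses rather than the four in Table 1, which is exactly the kind of splitter error the authors report as a source of retrieval failures.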

5 Results

WORDFAST                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)             23.33%                 90.00%
Retrieved (Minimum threshold)             38.00%                 92.22%

TRADOS                             W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)             14.00%                 88.89%
Retrieved (Minimum threshold)             14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that, although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
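The authors ran the test in SPSS; the same paired t statistic follows the standard formula t = mean(d) / (stdev(d) / sqrt(n)) over per-segment score differences. The sketch below reproduces that computation in plain Python on hypothetical match scores, not the paper's data.

```python
# Paired t-test over per-segment match scores (hypothetical data, not
# the paper's). A score of 0 means the correct match was not retrieved.

import math
from statistics import mean, stdev

def paired_t(scores_without, scores_with):
    """Return the paired t statistic and degrees of freedom."""
    diffs = [a - b for a, b in zip(scores_with, scores_without)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical per-segment fuzzy match scores; a multi-clause input
# would contribute the average of its clauses' scores as one case.
without = [0, 0, 76, 0, 81, 0, 0, 74, 0, 0]
with_cs = [92, 88, 95, 100, 97, 85, 0, 96, 90, 83]
t, df = paired_t(without, with_cs)
```

The resulting t would then be compared against the t distribution with n - 1 degrees of freedom to obtain the p-value reported by SPSS.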

SET A

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold             19.23                       85.31              0.000
Minimum threshold             29.28                       86.49              0.000

TRADOS
Default threshold             11.78                       83.88              0.000
Minimum threshold             11.78                       88.07              0.000

SET B

WORDFAST             Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold              2.23                       17.21              0.000
Minimum threshold              6.49                       30.62              0.000

TRADOS
Default threshold              2.71                       23.08              0.000
Minimum threshold             16.83                       46.37              0.000

Table 5: Paired t-test on all experiments

WORDFAST                           W/o clause splitting   W/ clause splitting
Retrieved (Default threshold)              2.67%                 17.84%
Retrieved (Minimum threshold)             10.00%                 41.08%

TRADOS
Retrieved (Default threshold)              3.33%                 25.95%
Retrieved (Minimum threshold)             36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments in which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been human-translated in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques have an effect only on improving the raw MT output and not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information

In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation given the low fuzzy match score, and then it should be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Other than fuzzy match, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
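For readers unfamiliar with the format, an entry like those in Figure 1 can be decomposed mechanically into a lemma, a category and '+'-separated property tags. The parser below is a simplified sketch of that structure only; real NooJ .dic entries support more machinery (inflectional paradigm operators, multiple translations, etc.).

```python
# Simplified parser for NooJ-style dictionary lines as in Figure 1.
# Only lemma, category and '+'-separated tags are handled; the real
# NooJ .dic format is richer.

def parse_entry(line):
    lemma, rest = line.split(",", 1)
    parts = rest.split("+")
    category, tags = parts[0], {}
    for tag in parts[1:]:
        key, _, value = tag.partition("=")
        tags[key] = value or True   # bare flag tags (e.g. Hum) -> True
    return {"lemma": lemma, "category": category, "tags": tags}

entry = parse_entry("artist,N+FLX=TABLE+Hum")
# entry["category"] is "N", entry["tags"]["FLX"] is "TABLE",
# and Hum is recorded as a boolean flag.
```

A matching grammar can then condition on these tags, e.g. restricting a rule to +Hum nouns or selecting the FLX paradigm for inflection.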

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed by a determiner, adjective or adverb (optionally), a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars created for all the other grammatical tenses and moods that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source language translation units.
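As a rough illustration of this expansion step, the sketch below applies a tiny hand-written SVC paraphrase table to the source side of a toy TM and appends the paraphrased copies, leaving the target units untouched. The real system derives these rewrites from NooJ's dictionaries and local grammars; the mapping here is invented for the example.

```python
# Rough illustration of the TM-expansion step. The real system detects
# and paraphrases SVCs with NooJ dictionaries and local grammars; this
# tiny hand-written mapping is a stand-in for that machinery.

import re

# Hypothetical SVC -> full-verb paraphrases (illustrative only).
SVC_PARAPHRASES = {
    r"\bmake the cancellation of\b": "cancel",
    r"\bmake sure\b": "ensure",
}

def expand_tm(units):
    """units: list of (source, target) pairs. Append a paraphrased copy
    of each source unit whose SVC was rewritten; targets stay as is."""
    expanded = list(units)
    for src, tgt in units:
        para = src
        for pattern, verb in SVC_PARAPHRASES.items():
            para = re.sub(pattern, verb, para)
        if para != src:             # only add genuine paraphrases
            expanded.append((para, tgt))
    return expanded

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
expanded = expand_tm(tm)
```

After expansion, a query such as sentence (2) scores as an exact match against the added source unit, while its human-translated Italian target is returned unchanged.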

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in the fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that it at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014 [1].

The results of both analyses are given in Table 1. Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%). However, we also achieve extremely high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and finally 27% in category 0%-74% (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

[1] http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category    plTM    parTM
100%                      14       48
95%-99%                   23       32
85%-94%                   18       29
75%-84%                   51       38
50%-74%                   32       18
No Match                  62       35
Total                    200      200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use it without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches not equal to 100%.

While there are some drops in terms of fuzzy match improvement, our system presents few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Example 1:
  Seg:      Make sure that the brake hose is not twisted
  TMsl:     Ensure that the brake hose is not twisted
  TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
  parTMsl:  Make sure that the brake hose is not twisted

Example 2:
  Seg:      CAUTION You must make the istallation of the version 6 of the software
  TMsl:     CAUTION You must install the version 6 of the software
  TMtg:     ATTENZIONE Si deve installare la versione 6 del software
  parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts in order to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs and will extend the approach to other languages as well. Last but not least, applying a paraphrasing framework to the target sentences may improve the quality of the translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1, Centro Interuniversitário de Estudos Germanísticos, Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The benefits of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources, with sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs directly translates to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities              alignment
být větší   be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases)

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: a text chunker is used to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in cases where there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
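The three properties can be detected mechanically from the alignment points. The sketch below is illustrative only (the `parse_alignment` helper and its dictionary output are our own, not part of Moses); it parses a Moses-style alignment string and flags each property:

```python
from collections import Counter

def parse_alignment(points, src_len, tgt_len):
    """Parse Moses-style alignment points such as '0-0 1-1' and derive
    the three properties discussed in the text (illustrative helper)."""
    pairs = [tuple(map(int, p.split("-"))) for p in points.split()]
    src_aligned = {s for s, _ in pairs}
    tgt_aligned = {t for _, t in pairs}
    fanout = Counter(s for s, _ in pairs)
    targets_in_source_order = [t for _, t in sorted(pairs)]
    return {
        "pairs": pairs,
        # 1) empty alignment: a word on either side has no counterpart
        "empty": len(src_aligned) < src_len or len(tgt_aligned) < tgt_len,
        # 2) one-to-many: one source word aligned to several target words
        "one_to_many": any(c > 1 for c in fanout.values()),
        # 3) opposite orientation: target indices are not monotone
        "reordered": targets_in_source_order != sorted(targets_in_source_order),
    }

# "být větší" -> "be greater": fully aligned, one-to-one, monotone
info = parse_alignment("0-0 1-1", src_len=2, tgt_len=2)
```

For the Figure 1 example all three flags are false; an alignment string like "0-1 1-0" would set the reordering flag instead.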

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

- JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOIN-O: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOIN-N: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTE-O method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

- SUBSTITUTE: new segments are created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTE-O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
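As a rough illustration of such a combined n-gram fluency score, the toy model below averages smoothed log n-gram frequencies for n = 1..5 over a small corpus. It stands in for a real language model such as KenLM, whose API and smoothing differ; the `fluency` helper is hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def fluency(candidate, corpus, max_n=5):
    """Combined n-gram score for n = 1..max_n: mean add-one-smoothed log
    relative frequency of the candidate's n-grams in a toy corpus model."""
    counts = {n: Counter(g for s in corpus for g in ngrams(s.split(), n))
              for n in range(1, max_n + 1)}
    logp, total = 0.0, 0
    for n in range(1, max_n + 1):
        denom = sum(counts[n].values()) + len(counts[n]) + 1
        for g in ngrams(candidate.split(), n):
            logp += math.log((counts[n][g] + 1) / denom)
            total += 1
    return logp / max(total, 1)
```

A candidate combination whose higher-order n-grams have been seen in the target-language corpus scores higher than a scrambled one, which is exactly the signal used to accept or reject a subsegment combination.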

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
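The JOIN loop can be sketched over token index spans. The half-open `(start, end)` representation, the `can_join` classifier and the simplified greedy merge below are our own illustration of the idea, not the authors' actual implementation:

```python
def can_join(a, b):
    """Classify two half-open token spans (start, end):
    'JOIN-N' if they are neighbours, 'JOIN-O' if they overlap, else None."""
    if (a[0], a[1]) > (b[0], b[1]):
        a, b = b, a                      # ensure a starts no later than b
    if a[1] == b[0]:
        return "JOIN-N"                  # adjacent spans
    if b[0] < a[1] <= b[1] or (a[0] <= b[0] and b[1] <= a[1]):
        return "JOIN-O"                  # overlapping (or nested) spans
    return None

def greedy_join(spans):
    """Greedily merge joinable spans, longest first, until no JOIN applies."""
    work = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    out = []
    while work:
        head = work.pop(0)               # current longest subsegment
        changed = True
        while changed:
            changed = False
            for other in list(work):
                if can_join(head, other):
                    head = (min(head[0], other[0]), max(head[1], other[1]))
                    work.remove(other)   # discard the consumed subsegment
                    changed = True
        out.append(head)
    return out
```

For example, subsegments covering tokens 0-3 and 3-5 of a segment are neighbours and join into one span 0-5, while a subsegment at 7-9 stays separate.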

2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM; a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95-99%   0.84    0.91    0.64    0.90    1.37
85-94%   0.07    0.05    0.25    0.76    0.81
75-84%   0.80    0.91    1.71    3.78    4.40
50-74%   8.16    10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62
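The percentages in Table 2 are internally consistent: the "any" row is, column by column, the sum of the five category rows (reading each flattened value as a percentage with two decimal places, e.g. "041" as 0.41). A few lines of Python confirm this, with the values transcribed from the table:

```python
# Table 2 figures; columns: TM, SUB, OJ, NJ, all
rows = {
    "100%":   [0.41,  0.12,  0.10,  0.17,  0.46],
    "95-99%": [0.84,  0.91,  0.64,  0.90,  1.37],
    "85-94%": [0.07,  0.05,  0.25,  0.76,  0.81],
    "75-84%": [0.80,  0.91,  1.71,  3.78,  4.40],
    "50-74%": [8.16, 10.05, 25.09, 40.95, 42.58],
}
any_row = [10.28, 12.04, 27.79, 46.56, 49.62]

# column-wise sums of the five category rows reproduce the "any" row
sums = [round(sum(col), 2) for col in zip(*rows.values())]
assert sums == any_row
```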

Table 3: MemoQ analysis, DGT-TM

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95-99%   0.75    0.44    0.49    0.66
85-94%   0.05    0.08    0.49    0.61
75-84%   0.46    0.96    3.67    3.85
50-74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95-99%   0.98    0.59    1.24    1.45
85-94%   0.09    0.22    1.34    1.37
75-84%   1.03    2.26    6.35    7.07
50-74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segment pairs. We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated methods for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language-pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productive studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.

4Work in progress


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25 5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions (insertion, deletion and substitution), respectively.

5 http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we   want   to      have   a   table   near   the   window
|    D      S       S      |   |       S      |     |
we   -      would   like   a   table   by     the   window
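To make the use of these alignments concrete, the following sketch (not the actual CATaLog code) shows how a tercom-style edit sequence can be grouped into the matched and unmatched TM fragments that are later color-coded:

```python
# Sketch (not the actual CATaLog implementation): derive matched and
# unmatched fragments of a TM segment from a TER-style edit sequence.
# Each alignment entry is (tm_word, input_word, op), where op is
# C (match), S (substitution), D (deletion) or I (insertion).

def split_by_match(alignment):
    """Group consecutive TM-side words into matched/unmatched fragments."""
    fragments = []  # list of (is_match, words)
    for tm, inp, op in alignment:
        if not tm:                 # insertion: no TM-side word to color
            continue
        is_match = (op == "C")
        if fragments and fragments[-1][0] == is_match:
            fragments[-1][1].append(tm)
        else:
            fragments.append((is_match, [tm]))
    return [(m, " ".join(ws)) for m, ws in fragments]

alignment = [
    ("we", "we", "C"), ("want", "", "D"), ("to", "would", "S"),
    ("have", "like", "S"), ("a", "a", "C"), ("table", "table", "C"),
    ("near", "by", "S"), ("the", "the", "C"), ("window", "window", "C"),
    (".", ".", "C"),
]
print(split_by_match(alignment))
```

Green fragments would correspond to the `True` groups and red fragments to the `False` groups.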

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we use our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results; these edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
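A minimal sketch of this weighted ranking is given below; the 0.1 deletion weight and the normalisation by test-segment length are illustrative assumptions, since the paper states only that deletions receive a very low weight:

```python
# Sketch of the weighted edit-cost ranking (not CATaLog's actual code).
# Assumptions: deletion weight 0.1 (the paper says only "very low") and
# normalisation by the test-segment length, as in TER.
WEIGHTS = {"C": 0.0, "D": 0.1, "S": 1.0, "I": 1.0, "H": 1.0}  # H = shift

def weighted_score(edit_ops, test_len):
    """Lower is better: weighted edit cost over the test-segment length."""
    return sum(WEIGHTS[op] for op in edit_ops) / test_len

# Test segment: "how much does it cost" (5 words).
# TM segment 1 needs 6 deletions; TM segment 2 needs 3 insertions.
tm1 = weighted_score(["C"] * 5 + ["D"] * 6, 5)  # 0.12
tm2 = weighted_score(["C"] * 2 + ["I"] * 3, 5)  # 0.60
assert tm1 < tm2  # TM segment 1 wins, unlike with uniform (plain TER) weights
```

With uniform weights the same two candidates would score 1.2 and 0.6, reversing the ranking.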

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair, along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, in the selected source sentences and their corresponding translations.
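A sketch of how such an alignment line can be parsed into word-to-target-position pairs for the color-coding step (a hypothetical helper, assuming the standard GIZA++ `({ ... })` output format):

```python
import re

# Sketch: parse a GIZA++-style alignment line such as
# "NULL ({ }) we ({ 1 }) want ({ 6 }) ..." into (source_word, target_index)
# pairs that the color-coding step can project onto the translation.
def parse_giza(line):
    pairs = []
    for word, targets in re.findall(r"(\S+)\s+\(\{([\d\s]*)\}\)", line):
        for t in targets.split():
            pairs.append((word, int(t)))
    return pairs

line = ("NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) "
        "table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })")
print(parse_giza(line))
# [('we', 1), ('want', 6), ('a', 4), ('table', 5), ('near', 3), ('window', 2), ('.', 7)]
```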

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me

5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phone korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that we consider a word a match only if it appears in the same form in an input sentence. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
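The posting-list filter can be sketched as follows (a simplified illustration with an assumed stop-word list, not the actual CATaLog vocabulary construction):

```python
from collections import defaultdict

# Minimal sketch of the posting-list filter described above; the stop-word
# list is an assumption for illustration.
STOP_WORDS = {"the", "a", "to", "is"}

def build_postings(tm_sentences):
    postings = defaultdict(set)
    for idx, sent in enumerate(tm_sentences):
        for word in sent.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(idx)   # surface forms only, no stemming
    return postings

def candidates(postings, query):
    """TM sentences sharing at least one vocabulary word with the input."""
    ids = set()
    for word in query.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me"]
postings = build_postings(tm)
print(sorted(candidates(postings, "we would like a table by the window")))
# only the first TM sentence shares content words with the input
```

Only the candidate set returned here would then be scored with TER, which is what saves time on a large TM.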

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case, the TM output is saved in a log file which the post-editors can later work with directly in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment for the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases. pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

give good results in classification problems where the decision boundary is irregular.

5 Results and discussion

We performed two evaluations of the machine learning algorithms presented in the previous section. The first evaluation is a three-fold stratified classification on the training set. The algorithms are evaluated against two baselines. The first baseline is called Baseline Uniform and generates predictions randomly. The second baseline is called Baseline Stratified and generates predictions respecting the training set class distribution. The results of the first evaluation are given in Table 1.

Algorithm             Precision  Recall  F1
Random Forest         0.95       0.97    0.96
Decision Tree         0.98       0.97    0.97
SVM                   0.94       0.98    0.96
K-Nearest Neighbors   0.94       0.98    0.96
Logistic Regression   0.92       0.98    0.95
Gaussian Naive Bayes  0.86       0.96    0.91
Baseline Uniform      0.69       0.53    0.60
Baseline Stratified   0.70       0.73    0.71

Table 1: The results of the three-fold stratified classification.

Except for Gaussian Naive Bayes, all algorithms have excellent results, and all beat the baselines by a significant margin (at least 20 points).

The second evaluation is performed against the test set. The baselines are the same as in the three-fold evaluation above, and the results are given in Table 2.

The results of the second evaluation are worse than those of the first. For example, the difference between the F1-score of the best performing algorithm, SVM, and that of the stratified baseline is 10 points, half the difference between the best performing classification algorithm and the same baseline in the first evaluation. This might be partially explained by the great variety of the bi-segments in the MateCat and Web sets; obviously, this variety is not fully captured by the training set.

Algorithm             Precision  Recall  F1
Random Forest         0.85       0.63    0.72
Decision Tree         0.82       0.69    0.75
SVM                   0.82       0.81    0.81
K-Nearest Neighbors   0.83       0.66    0.74
Logistic Regression   0.80       0.80    0.80
Gaussian Naive Bayes  0.76       0.61    0.68
Baseline Uniform      0.71       0.72    0.71
Baseline Stratified   0.70       0.51    0.59

Table 2: The results of the classification on the test set.

Unlike the first evaluation, the second one has two clear winners: Support Vector Machines (with the linear kernel) and Logistic Regression, both producing F1-scores around 0.8. The results might seem impressive, but they are insufficient for automatically cleaning MyMemory. To understand why, we inspect the confusion table for the SVM algorithm. Of the 309 examples in the test set, 175 are true positives, 42 false positives, 32 false negatives and 60 true negatives. This means that around 10% of all examples, those corresponding to the false negatives, would be thrown away. Applying this method to the MyMemory database would thus eliminate many good bi-segments. We should therefore search for better cleaning methods, where the precision is increased even if the recall drops. We make some suggestions in the next section.
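The evaluation protocol can be reproduced with scikit-learn, which the authors use; note that the features below are synthetic stand-ins, not the actual Church-Gale-based features:

```python
# Sketch of the three-fold stratified evaluation protocol with scikit-learn.
# The feature matrix and labels are synthetic stand-ins for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                    # stand-in feature vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # stand-in labels

clf = LogisticRegression()
cv = StratifiedKFold(n_splits=3)                 # three-fold stratified CV
pred = cross_val_predict(clf, X, y, cv=cv)
p, r, f1, _ = precision_recall_fscore_support(y, pred, average="binary")
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```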

6 Conclusions and further work

In this paper we studied the performance of various classification algorithms for identifying false bi-segments in translation memories. We have shown that the distribution of the Church-Gale scores in two sets of bi-segments that contain different proportions of positive and negative examples is dissimilar: this distribution is closer to the normal distribution for the MateCat set and more sparse for the Collective set. The best performing classification algorithms are Support Vector Machines (with the linear kernel) and Logistic Regression. Both algorithms produce a significant number of false negative examples. In this case, the


performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first is to improve the performance of the classifiers. In the future we will study ensemble classifiers, which can potentially boost the performance of the classification task. The idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression output can be interpreted as a probability. Our hope is that these probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
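The probability-ranking idea can be sketched with scikit-learn's `predict_proba`; the 0.9/0.1 thresholds and the synthetic data below are illustrative assumptions:

```python
# Sketch of the precision-oriented filtering idea: act automatically only on
# bi-segments whose Logistic Regression probability is confidently high or
# low; threshold values and data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)          # synthetic stand-in labels

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]     # P(bi-segment is correct)

confident_good = proba >= 0.9          # keep automatically
confident_bad = proba <= 0.1           # delete automatically
undecided = ~(confident_good | confident_bad)   # defer to a human
print(confident_good.sum(), confident_bad.sum(), undecided.sum())
```

Only the two confident bands are handled automatically, which trades recall for the higher precision the cleaning task needs.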

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments; for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable: it costs too much and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bulletin of the Medical Library Association, 81(2):170-177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17-23, Hissar, Bulgaria, September 2015.

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types, such as updated versions of technical manuals.

In this study, we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences, or almost entire sentences, are the most useful type of matches, such matches are also less likely to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought", so clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that

when a sufficiently close match cannot be

found for a new input sentence current TM

systems are unable to retrieve sentences that

contain the same clauses or other major

phrases

For example (a) below is a new input

sentence composed of twenty five-character

words The TM contains the sentence (b)

which shares an identical substring with

sentence (a) However as this substring only

makes up only 25 of the sentences total

number of characters it is unlikely that current

TM tools would be able to retrieve it as a fuzzy

match

(a) w1 w2 w3 w4 w5 w6 w20

(b) w1 w2 w3 w4 w5 w21 w35

If clause splitting were employed clauses

would be treated as separate segments thus

increasing the likelihood that clauses which

are subparts of larger units could have a match

score sufficiently high to be retrieved by the

TM system

In this paper we examine the effect of performing clause splitting before retrieving matches from a TM. In each experiment we measure the difference in matching performance between using a TM tool as is and first running the input file and the translation memory through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian and, to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar: a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is applied both to the segments in the input file and to the translation memory database. After processing the input and the TM segments with the clause splitter, these were imported into existing TM tools to examine how well the tools perform when clause splitting is used in pre-processing.

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008; in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue that it is beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a TM segment matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether clause splitting still yields a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools when the match threshold is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment and store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. An example is given in Table 2.

Without clause splitting:

  Input:
  the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

  Segment in TM:
  the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war

  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A example.

Without clause splitting:

  Input:
  a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segment in TM:
  a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

  Input:
  a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B example.

5 Results

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    23.33                  90.00
Retrieved (minimum threshold)    38.00                  92.22

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    14.00                  88.89
Retrieved (minimum threshold)    14.00                  96.67

Table 3: Percentage of correctly retrieved segments in Set A.

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% involved clause splitting errors, such as when a segment containing more than one clause is not split at all, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
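The pairing logic can be sketched as follows. This is a hypothetical illustration with made-up scores (the paper used SPSS); a plain paired t statistic stands in for the full test, and the averaging step mirrors how split inputs are collapsed into single cases.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    # t = mean(d) / (sd(d) / sqrt(n)) over per-segment score differences.
    diffs = [b - a for a, b in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical match scores, one entry per original input segment;
# 0 means the correct match was not retrieved.
without_split = [0.0, 20.0, 10.0]

# Inputs split into several clauses contribute the average of their clause
# scores, so each original segment remains a single paired case.
with_split = [mean(s) for s in [[80.0], [85.0, 95.0], [100.0]]]

print(paired_t(without_split, with_split))
```

A one-sided or two-sided p-value would then be read off the t distribution with n-1 degrees of freedom.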

SET A

WORDFAST            Mean (w/o splitting)   Mean (w/ splitting)   p-value
Default threshold   19.23                  85.30888891           0.000
Minimum threshold   29.28                  86.48888891           0.000

TRADOS              Mean (w/o splitting)   Mean (w/ splitting)   p-value
Default threshold   11.78                  83.87777781           0.000
Minimum threshold   11.78                  88.07111113           0.000

SET B

WORDFAST            Mean (w/o splitting)   Mean (w/ splitting)   p-value
Default threshold   2.23                   17.21111111           0.000
Minimum threshold   6.49                   30.62111112           0.000

TRADOS              Mean (w/o splitting)   Mean (w/ splitting)   p-value
Default threshold   2.71                   23.083                0.000
Minimum threshold   16.83                  46.36777779           0.000

Table 5: Paired t-test on all experiments.

WORDFAST                         W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    2.67                   17.84
Retrieved (minimum threshold)    10.00                  41.08

TRADOS                           W/o clause splitting   W/ clause splitting
Retrieved (default threshold)    3.33                   25.95
Retrieved (minimum threshold)    36.00                  70.67

Table 4: Percentage of correctly retrieved segments in Set B.

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments were not imported from the translation memories of practicing translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support for the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting on the source side only. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarland University, Saarbrücken [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating and the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou

School of Italian Language and Literature

Aristotle University of Thessaloniki

University Campus

54124 Thessaloniki Greece

chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). In concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections a professional translator must make so that the retrieved suggestion meets all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort of tackling them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. The professional translator is therefore given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
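The gain from matching against a paraphrase can be reproduced with a small word-level edit-distance sketch. The helper names are ours, and the exact scores depend on the formula and tokenization used, so they differ slightly from the percentages quoted in the text.

```python
def edit_distance(a, b):
    # Word-level Levenshtein distance (insertions, deletions, substitutions),
    # computed with a single rolling row.
    m, n = len(a), len(b)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return d[n]

def match_score(a, b):
    ta, tb = a.lower().split(), b.lower().split()
    return 1 - edit_distance(ta, tb) / max(len(ta), len(tb))

original   = "Press Cancel to make the cancellation of your personal information"  # (1)
paraphrase = "Press Cancel to cancel your personal information"                    # (2)
new_input  = "Press Cancel to cancel your booking information"                     # (4)

print(match_score(new_input, original))    # 0.5: low match against the SVC form
print(match_score(new_input, paraphrase))  # ~0.86: high match against the paraphrase
```

Sentence (4) scores poorly against the original SVC form but very highly against its verbal paraphrase, which is exactly the effect the proposed TM augmentation exploits.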

In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs in their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part of speech (e.g. V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns.
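For illustration, entries in this style can be parsed into structured records. This is our own sketch, not part of NooJ; it simply assumes the comma-separated 'lemma,CATEGORY+property[=value]' shape shown in Figure 1.

```python
def parse_entry(line):
    # Split 'lemma,CAT+prop[=value]+...' into lemma, category and a property map.
    lemma, rest = line.split(",", 1)
    category, *props = rest.split("+")
    properties = {}
    for prop in props:
        key, _, value = prop.partition("=")
        properties[key] = value or True  # flag-style properties map to True
    return {"lemma": lemma, "category": category, "properties": properties}

for entry in ["artist,N+FLX=TABLE+Hum", "pen,N+FLX=TABLE+Conc"]:
    print(parse_entry(entry))
```

So "artist" comes out as a noun with the TABLE inflectional paradigm and the +Hum distributional class.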

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework.

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for de-tokenization in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed the local grammar will not identify all the potential SVCs, and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb followed by a determiner, optionally an adjective or adverb, a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriving verb and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
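The behaviour of such a grammar can be approximated in a few lines. This is a deliberately crude stand-in: NooJ works from dictionary paradigms and preserves tense and person, whereas the sketch below uses a tiny hand-made nominalization lexicon and a regular expression, and handles only the base form.

```python
import re

# Toy lexicons; real coverage comes from NooJ's dictionaries, not lists like these.
NOMINALIZATIONS = {"cancellation": "cancel", "decision": "decide", "reservation": "reserve"}
SUPPORT_VERBS = {"make", "makes", "take", "takes", "have", "has", "give", "gives"}

# Support verb + determiner + predicate noun, optionally followed by "of".
SVC = re.compile(r"\b(?P<sv>\w+)\s+(?:the|a|an)\s+(?P<noun>\w+)(?:\s+of\b)?", re.IGNORECASE)

def paraphrase_svc(sentence):
    def repl(match):
        sv, noun = match.group("sv").lower(), match.group("noun").lower()
        if sv in SUPPORT_VERBS and noun in NOMINALIZATIONS:
            return NOMINALIZATIONS[noun]  # no tense/person agreement in this sketch
        return match.group(0)
    return SVC.sub(repl, sentence)

print(paraphrase_svc("Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```

Sentences without a recognized SVC pass through unchanged, mirroring the "where applicable" behaviour described above.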

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can only be applied to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
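The overall TM augmentation can be sketched as a single pass over the units. This is a minimal illustration with hypothetical names, using a look-up in place of the NooJ paraphrasing step and leaving every target untouched, as described above.

```python
def augment_tm(tm_units, paraphrase):
    # Append one extra unit per source whose paraphrase differs from the original;
    # the human-translated target is copied verbatim.
    augmented = list(tm_units)
    for source, target in tm_units:
        rewritten = paraphrase(source)
        if rewritten != source:
            augmented.append((rewritten, target))
    return augmented

tm = [
    ("Press Cancel to make the cancellation of your personal information",
     "Premere Cancel per cancellare i propri dati personali"),
    ("Save your work", "Salvare il lavoro"),
]
# Stand-in paraphraser: a one-entry lookup instead of the NooJ pipeline.
lookup = {tm[0][0]: "Press Cancel to cancel your personal information"}
augmented = augment_tm(tm, lambda s: lookup.get(s, s))
print(len(augmented))  # 3: two original units plus one paraphrased source
```

Applied to the paper's TM, this is the step that grows 1,025 units into 1,612 (587 recognized SVCs yielding 587 extra paraphrased sources).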

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.

Our method was applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, such that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹ The results of both analyses are given in Table 1.

Our paraphrased TM markedly increases the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we achieve considerable gains in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results, and hence productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.

Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words encountered during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Example 1
  Seg:       Make sure that the brake hose is not twisted
  TM (sl):   Ensure that the brake hose is not twisted
  TM (tg):   Assicurarsi che il tubo flessibile freni non sia attorcigliato
  parTM (sl): Make sure that the brake hose is not twisted

Example 2
  Seg:       CAUTION You must make the istallation of the version 6 of the software
  TM (sl):   CAUTION You must install the version 6 of the software
  TM (tg):   ATTENZIONE Si deve installare la versione 6 del software
  parTM (sl): CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the fuzzy match level as much as possible for sentence pairs that have the same meaning but do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM across all fuzzy match ranges, especially in the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and will extend the approach to other source languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of the translations even further.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING/ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept. 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa, hales, xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translation quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time they try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ         EN          probabilities               alignment
být větší  be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases.

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
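Alignment strings of this Moses style can be decoded mechanically; a small sketch (variable names are illustrative):

```python
def parse_alignment(points):
    """Parse Moses-style alignment points such as '0-0 1-1'
    into (source_index, target_index) pairs."""
    return [tuple(map(int, p.split("-"))) for p in points.split()]

src = "být větší".split()
tgt = "be greater".split()
pairs = parse_alignment("0-0 1-1")
for i, j in pairs:
    print(f"{src[i]} -> {tgt[j]}")

# Target words never listed in the pairs have an empty alignment:
unaligned = sorted(set(range(len(tgt))) - {j for _, j in pairs})
print(unaligned)  # []
```

The same pair list also exposes one-to-many alignments (a source index appearing more than once) and opposite orientation (pairs whose target indexes decrease while source indexes increase).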

4 Subsegment combination

The output of the subsegment generation step is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; output is J.

  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = ⟨1,5⟩) from a target language model.² The quality of the subsegment translation can be increased further by filtering the used subsegments on noun phrase boundaries.
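A combined n-gram score of this kind can be sketched as a sum of per-order average log-probabilities. The toy counts-based model below is a stand-in for the KenLM model the authors use, and the combination formula is an assumption (the paper does not spell out its exact form):

```python
import math
from collections import Counter

def ngram_score(tokens, counts, total, max_n=5):
    """Combined n-gram fluency score for n = 1..max_n: for each order,
    the mean add-one-smoothed log-probability of its n-grams, summed
    over orders.  A toy stand-in for a real target language model."""
    score = 0.0
    for n in range(1, max_n + 1):
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if grams:
            score += sum(math.log((counts[g] + 1) / (total + 1))
                         for g in grams) / len(grams)
    return score

# Toy target-language "corpus" to score candidate combinations against:
corpus = "can be divided into the following categories".split()
counts = Counter(tuple(corpus[i:i + n])
                 for n in range(1, 6)
                 for i in range(len(corpus) - n + 1))
total = sum(counts.values())

good = "divided into the following categories".split()
bad = "categories following the into divided".split()
print(ngram_score(good, counts, total) > ngram_score(bad, counts, total))  # True
```

The scrambled candidate shares the same unigrams but none of the higher-order n-grams, so it scores lower, which is the property the filtering step relies on.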

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
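The joining loop described above can be sketched as follows. Spans are (start, end) token indexes within the segment, and the single overlap/neighbour test stands in for the full JOINO/JOINN conditions; this is a simplified illustration, not the authors' exact code:

```python
def join_spans(spans):
    """Greedy JOIN sketch over (start, end) token spans: longest-first,
    merging spans that overlap or directly neighbour each other, and
    discarding each processed span after its iteration."""
    def joinable(a, b):
        # overlap (JOIN-O) or direct neighbourhood (JOIN-N)
        return a[0] <= b[1] + 1 and b[0] <= a[1] + 1

    by_length = lambda s: (s[0] - s[1], s[0])    # longest first, then leftmost
    I = sorted(spans, key=by_length)
    seen, out = set(I), []
    while I:
        head, I = I[0], I[1:]
        T = []                                    # newly built longer spans
        for other in I:
            if joinable(head, other):
                new = (min(head[0], other[0]), max(head[1], other[1]))
                if new not in seen:               # avoid re-deriving a span
                    seen.add(new)
                    T.append(new)
        if T:
            I = sorted(T, key=by_length) + I      # prepend new spans
        else:
            out.append(head)
    # keep only maximal spans
    return [s for s in out
            if not any(o != s and o[0] <= s[0] <= s[1] <= o[1] for o in out)]

print(join_spans([(0, 2), (3, 5), (7, 8)]))  # [(0, 5), (7, 8)]
```

Here the two neighbouring spans (0, 2) and (3, 5) merge into (0, 5), while the isolated span (7, 8) survives on its own, mirroring how the method grows longer candidate subsegments before the fluency filtering step.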

² In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM, and translations of shorter parts of the segment yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
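A pre-translation analysis of this shape can be approximated by bucketing each document segment by its best match score against the TM. This is only an illustration of the banding idea (MemoQ's exact scoring is proprietary; the simple token-overlap matcher here is a placeholder):

```python
def coverage_report(doc_segments, tm_segments, match_fn):
    """Bucket document segments by their best 0-100 match score against
    a TM, mimicking the bands of a pre-translation analysis."""
    bands = [(100, "100%"), (95, "95-99%"), (85, "85-94%"),
             (75, "75-84%"), (50, "50-74%"), (0, "No match")]
    report = {label: 0 for _, label in bands}
    for seg in doc_segments:
        best = max((match_fn(seg, tm) for tm in tm_segments), default=0)
        for lo, label in bands:
            if best >= lo:
                report[label] += 1
                break
    return report

def toy_match(a, b):
    # placeholder matcher: percentage of shared tokens
    aw, bw = set(a.split()), set(b.split())
    return round(100 * len(aw & bw) / max(len(aw), len(bw)))

doc = ["the brake hose is twisted", "install the software"]
tm = ["the brake hose is twisted", "you must install the new software"]
print(coverage_report(doc, tm, toy_match))
```

Expanding the TM with generated subsegments raises the best score available for many document segments, which is exactly what moves counts out of the "No match" band in the tables below.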


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segment pairs. We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to translation quality; however, in this case we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjacent segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with a major Central European translation company, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, Sept. 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²³, Mihaela Vela², Josef van Genabith²³

¹ Jadavpur University, India
² Saarland University, Germany
³ German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators rely on on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005) into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog² which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

¹ www.matecat.com
² The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). These user studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insertion, deletion, substitution and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: [“we”,“we”,C,0], [“want”,“”,D,0], [“to”,“would”,S,0], [“have”,“like”,S,0], [“a”,“a”,C,0], [“table”,“table”,C,0], [“near”,“by”,S,0], [“the”,“the”,C,0], [“window”,“window”,C,0], [“.”,“.”,C,0]

we   want   to      have   a   table   near   the   window   .
|    D      S       S      |   |       S      |     |        |
we   -      would   like   a   table   by     the   window   .

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment (“does it cost”) as opposed to deleting translations for 6 non-matching words (“to the holiday inn by taxi”) in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
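The weighting scheme described above can be sketched as a token-level edit-distance computation. The function below is our own minimal illustration, not the tool's actual code: it omits the shift operation for brevity and uses a cheap deletion weight so that TM segment 1 outranks TM segment 2 in the example just given.

```python
def weighted_edit_cost(test, tm, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Token-level edit distance with per-operation weights (shifts omitted).
    Deleting TM words is cheap, mirroring the low post-editing effort of
    removing text the tool has already highlighted in red."""
    m, n = len(test), len(tm)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_ins      # insert missing test words
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_del      # delete surplus TM words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if test[i - 1] == tm[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_ins,
                          d[i][j - 1] + w_del,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

test_seg = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()

# With a cheap deletion weight, TM segment 1 costs 6 * 0.1 = 0.6 and is
# preferred over TM segment 2, which costs 3 * 1.0 = 3.0; with equal
# weights the ranking would flip (6.0 vs 3.0), as discussed above.
ranked = sorted([tm1, tm2], key=lambda s: weighted_edit_cost(test_seg, s))
```

With equal weights (`w_del=1.0`) the same call reproduces plain TER-style ranking, which is what the adjustable weights are meant to improve on.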

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

bull Bengali আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green, and the unmatched fragments in red.
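A minimal sketch of how the two alignment sets can be combined; the function and data structures are our own illustrative assumptions, not CATaLog's implementation. TER edit labels on the matched TM source words are projected onto the target words through the GIZA++-style word alignment:

```python
def color_target(ter_ops, src2tgt, tgt_len):
    """Project TER edit labels from the matched TM source words onto its
    target words via the word alignment.

    ter_ops : list of (tm_source_word_index, op) pairs, op in {'C', 'S', 'D'}
              (correct, substituted, deleted), from the TER alignment.
    src2tgt : dict mapping a TM source word index to the list of target word
              indices it is aligned to (GIZA++-style alignment).
    """
    colors = ["red"] * tgt_len          # unaligned target words stay red
    for i, op in ter_ops:
        for j in src2tgt.get(i, []):
            colors[j] = "green" if op == "C" else "red"
    return colors

# Toy example: source words 0 and 2 match the input (C), word 1 must be
# substituted (S); the word alignment sends them to target positions 1, 2, 0.
colors = color_target([(0, "C"), (1, "S"), (2, "C")],
                      {0: [1], 1: [2], 2: [0]}, 3)
# colors == ["green", "green", "red"]
```

The target-side coloring thus needs no target-language processing at all: everything is inferred from the two source-side alignments, which is consistent with the tool being language pair independent.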

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me


5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলারিদেয়িছ (Gloss apni amake vul khuchrodiyechen ami ashi dollar diyechi )

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Glossamar dharona apni vul nombore phon ko-rechen )

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )

For the input sentence shown above, the TM system shows the above color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval based on inverted indexes.

We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and are of little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature allows comparing results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
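The indexing and candidate-selection steps above can be sketched as follows; the stop-word list and function names are our own illustration, under the assumptions stated in the text (lowercasing, no stemming, union of posting lists).

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "to", "is", "of"}   # illustrative list

def build_postings(tm_sources):
    """Map each lowercased, non-stop-word surface form to the set of TM
    segment ids containing it; no stemming, so only exact surface matches
    later count as hits."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidate_segments(input_sentence, postings):
    """Union of the posting lists of the input's words: only these TM
    segments are then scored with TER."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you gave me the wrong change"]
postings = build_postings(tm)
# Only segment 0 shares content words with this query, so the costly TER
# alignment is computed for one sentence instead of three.
cands = candidate_segments("we would like a table by the window", postings)
```

Because stop words never enter the index, they contribute no candidates at query time either, which keeps the union of posting lists small.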

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces

performance of finding the true negative examples does not offset the cost of deleting the false negatives from the database.

There are two potential solutions to this problem. The first solution is to improve the performance of the classifiers. In the future we will study ensemble classifiers that can potentially boost the performance of the classification task; the idea behind ensemble classifiers is that, with differently behaving classifiers, one classifier can compensate for the errors of the others. If this solution does not give the expected results, we will focus on a subset of bi-segments for which the classification precision is more than 90%. For example, the Logistic Regression classification output can be interpreted as a probability. Our hope is that the probability scores can be ranked and that higher scores correlate with the confidence that a bi-segment is positive or negative.
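One way the high-confidence-subset idea could be realised is sketched below. This is our own illustration, not the paper's system: the feature matrix, labels and thresholds are invented, and only the use of Logistic Regression with scikit-learn (which the paper cites) follows the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix for bi-segments (hypothetical features, e.g.
# [MT-vs-target similarity, length ratio]); label 1 = true translation pair.
X = np.array([[0.90, 1.0], [0.10, 0.4], [0.80, 0.9],
              [0.20, 1.8], [0.85, 1.1], [0.15, 0.3]])
y = np.array([1, 0, 1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]      # P(bi-segment is a correct pair)

# Act only on confident predictions; route the uncertain middle band
# elsewhere (e.g. to an ensemble vote or human review).
confident_false = proba <= 0.10         # candidates for deletion
confident_true = proba >= 0.90          # keep without further checks
```

Ranking bi-segments by `proba` and sweeping the two thresholds on held-out data would then let one pick the operating point where precision on the flagged subset exceeds the desired 90%.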

Another improvement will be the substitution of the machine translation module with a simpler translation system based on bilingual dictionaries. The machine translation module works well with an average number of bi-segments: for example, the machine translation system we employ can handle 40,000 bi-segments per day. However, this system is not scalable, it costs too much, and it cannot handle the entire MyMemory database. Unlike a machine translation system, a dictionary is relatively easy to build using an aligner. Moreover, a system based on an indexed bilingual dictionary should be much faster than a machine translation system.

Acknowledgments

The research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics.

B. L. Humphreys and D. A. Lindberg. 1993. The UMLS project: making the conceptual connection between users and the information they need. Bull Med Libr Assoc, 81(2):170–177, April.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Zhifei Li and Sanjeev Khudanpur. 2009. Large-scale discriminative n-gram language models for statistical machine translation. In Proceedings of AMTA.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Uwe Reinke. 2013. State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schluter. 2013. DGT-TM: A freely available translation memory in 22 languages. CoRR, abs/1309.5226.

Jörg Tiedemann. 2011. Bitext Alignment. Number 14 in Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael, CA, USA.

Marco Trombetti. 2009. Creating the world's largest translation memory.

Ventsislav Zhechev and Josef van Genabith. 2010. Maximising TM performance through sub-tree alignment and SMT. In Proceedings of the Ninth Conference of the Association for Machine Translation in the Americas.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015.

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence, and at the same time clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought"; therefore clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk


2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007) we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases. For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 … w20

(b) w1 w2 w3 w4 w5 w21 … w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.
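As a rough illustration of this effect (our own sketch: the sentences are invented, and difflib's similarity ratio is used only as a stand-in for a TM tool's Levenshtein-based fuzzy score), a clause-level match can clear a typical 70-75% threshold that the full segment misses:

```python
import difflib

def fuzzy_score(a, b):
    """Token-level similarity in percent, a rough stand-in for a TM tool's
    Levenshtein-based fuzzy match score."""
    return 100 * difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

query = "the printer stops when the paper tray is empty"
tm_segment = ("the printer stops when the paper tray is empty "
              "and an error code appears on the display")
# Hypothetical clause-splitter output for the TM segment:
clauses = ["the printer stops when the paper tray is empty",
           "and an error code appears on the display"]

whole = fuzzy_score(query, tm_segment)                      # ~69%, below a 70% threshold
best_clause = max(fuzzy_score(query, c) for c in clauses)   # 100%, exact clause match
```

Treating each clause as its own segment is exactly what lifts the score over the retrieval threshold here, without changing the matching algorithm itself.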

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004).

The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Nonfinite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata 2008) (in Macken 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools when the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause (or part of it) as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1. The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting
  Input:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM:
    the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting
  Input:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
  Segments in TM:
    - the ministry of defense once indicated
    - that about 20000 soldiers were missing in the korean war
    - and that the ministry of defense believes
    - there may still be some survivors

Table 1: Set A example

Without clause splitting

  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM:
    a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting
  Input:
    a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
    - a member of the rap group so solid crew threw away a loaded gun during a police chase
    - southwark crown court was told yesterday

Table 2: Set B example

5 Results

WORDFAST
                                Without clause splitting   With clause splitting
Retrieved (default threshold)   23.33%                     90.00%
Retrieved (minimum threshold)   38.00%                     92.22%

TRADOS
                                Without clause splitting   With clause splitting
Retrieved (default threshold)   14.00%                     88.89%
Retrieved (minimum threshold)   14.00%                     96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that, although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

WORDFAST
                                Without clause splitting   With clause splitting
Retrieved (default threshold)   2.67%                      17.84%
Retrieved (minimum threshold)   10.00%                     41.08%

TRADOS
                                Without clause splitting   With clause splitting
Retrieved (default threshold)   3.33%                      25.95%
Retrieved (minimum threshold)   36.00%                     70.67%

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
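The paired t statistic used here can be reproduced without SPSS; the sketch below computes it over per-case match scores with the Python standard library (the scores are invented for illustration, not the paper's data; a multi-clause input would first be averaged into a single case as described above).

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    # Paired t statistic: mean of per-case differences over its standard error.
    diffs = [b - a for a, b in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Illustrative match scores per input segment (0 = correct match not retrieved).
without_split = [0.0, 23.0, 0.0, 75.0, 40.0, 0.0]
with_split    = [90.0, 88.0, 76.0, 95.0, 92.0, 81.0]

t = paired_t(without_split, with_split)  # large positive t favours splitting
```

The resulting statistic would then be compared against the t distribution with n-1 degrees of freedom to obtain the p-value.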

SET A

WORDFAST
                    Mean (without clause splitting)   Mean (with clause splitting)   p-value
Default threshold   19.23                             85.30888891                    0.000
Minimum threshold   29.28                             86.48888891                    0.000

TRADOS
                    Mean (without clause splitting)   Mean (with clause splitting)   p-value
Default threshold   11.78                             83.87777781                    0.000
Minimum threshold   11.78                             88.07111113                    0.000

SET B

WORDFAST
                    Mean (without clause splitting)   Mean (with clause splitting)   p-value
Default threshold   2.23                              17.21111111                    0.000
Minimum threshold   6.49                              30.62111112                    0.000

TRADOS
                    Mean (without clause splitting)   Mean (with clause splitting)   p-value
Default threshold   2.71                              23.083                         0.000
Minimum threshold   16.83                             46.36777779                    0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching



performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University. [unpublished]

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Daelemans, Walter and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou

School of Italian Language and Literature

Aristotle University of Thessaloniki

University Campus

54124 Thessaloniki Greece

chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar in meaning but different in wording. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman 2008). As a concept, this technique can be very useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage as well in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows. Section 2 discusses the past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work to improve translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart 2010; He et al. 2010a; Zhechev and van Genabith 2010; Wang et al. 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques have an effect only on the improvement of the raw MT output and not on the improvement of fuzzy matching.

A common methodology that gives priority to human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al. 2010a, 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark 2009; Koehn and Senellart 2010; He et al. 2010a; Zhechev and van Genabith 2010; Wang et al. 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
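The word-based fuzzy match described above can be sketched as 1 minus the token-level Levenshtein distance divided by the length of the longer segment (one common variant of the formula discussed by Koehn (2010); tokenization differences mean the exact percentages reported in the paper may not be reproduced):

```python
def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance over token lists.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(src, tm):
    s, t = src.lower().split(), tm.lower().split()
    return 1 - edit_distance(s, t) / max(len(s), len(t))

score = fuzzy_match(
    "Press Cancel to make the cancellation of your personal information",
    "Press Cancel to cancel your personal information")
# 1 substitution + 3 deletions over the longer segment's token count
```

With whitespace tokenization this pair scores 60%, in the same low range as the figures above: too low for many translators to trust the match.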

In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then it would be post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except in some cases, they appear without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition to this, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Besides fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al. 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step because, if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally by a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars created for all the other grammatical tenses and moods that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived form and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVCs, where applicable.
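A toy version of this rewriting step can be sketched in Python; the nominalization lexicon and the regular expression below are hypothetical stand-ins for NooJ's much larger dictionaries and local grammars, and cover only base-form support verbs with a mandatory determiner.

```python
import re

# Hypothetical nominalization lexicon; NooJ's dictionaries are far larger and
# also encode inflection and derivation paradigms.
NOMINALIZATIONS = {
    "cancellation": "cancel",
    "installation": "install",
    "reservation": "reserve",
}

def paraphrase_svc(sentence):
    # Support verb + determiner + predicate noun + optional "of" is rewritten
    # to the corresponding full verb (simple present, base form only).
    nouns = "|".join(NOMINALIZATIONS)
    pattern = re.compile(
        r"\b(?:make|take|give|have)\s+(?:the|a|an)\s+(%s)\s*(?:of\s+)?" % nouns,
        re.IGNORECASE)
    return pattern.sub(lambda m: NOMINALIZATIONS[m.group(1).lower()] + " ",
                       sentence)

print(paraphrase_svc(
    "Press Cancel to make the cancellation of your personal information"))
```

On sentence (1) from Section 3 this yields sentence (2), the paraphrase whose human translation the TM can then serve as a high or exact match.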

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units.

This TM should be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in the fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). The test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%). However, we also achieve high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                   14     48
95%-99%                23     32
85%-94%                18     29
75%-84%                51     38
50%-74%                32     18
No Match               62     35
Total                  200    200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use it without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required in the case of matches not equal to 100%.

While there are some drops in terms of fuzzy match improvement, our system presents only a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the fuzzy match level as much as possible, given that the meaning is the same even though the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is considerably higher compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and extend the approach to other languages. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of the translations even further.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments

CZ         EN          probabilities                 alignment
být větší  be greater  0.538 0.053 0.538 0.136      0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148      0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases.

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation: i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
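The three phenomena can be read off the alignment points directly. The following small parser is an illustrative helper (not from the paper's implementation) that detects each case from a Moses-style alignment string:

```python
from collections import Counter

def alignment_info(points, src_len, tgt_len):
    # parse Moses-style alignment points such as "0-0 1-1"
    pairs = [tuple(map(int, p.split("-"))) for p in points.split()]
    src_links = Counter(s for s, _ in pairs)
    tgt_links = Counter(t for _, t in pairs)
    return {
        # 1) empty alignment: some word on either side is unaligned
        "empty": len(src_links) < src_len or len(tgt_links) < tgt_len,
        # 2) one-to-many: a single word linked to several words
        "one_to_many": any(c > 1 for c in src_links.values())
                       or any(c > 1 for c in tgt_links.values()),
        # 3) opposite orientation: two alignment links that cross
        "reordered": any(s1 < s2 and t1 > t2
                         for (s1, t1) in pairs for (s2, t2) in pairs),
    }
```

For the Figure 1 example, alignment_info("0-0 1-1", 2, 2) reports none of the three phenomena, since the alignment is monotone and complete.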

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:      "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
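The overlapping variants rest on one primitive: detecting that a suffix of one subsegment equals a prefix of the next and merging the two. A sketch of that primitive (an illustrative reconstruction, not the authors' code):

```python
def merge_with_overlap(a, b):
    """Merge two token lists when some suffix of `a` equals a
    prefix of `b`; return None when they do not overlap (the
    neighbouring variants JOINN/SUBSTITUTEN handle that case)."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None
```

For example, merging the token lists for "be divided into" and "into the following categories" yields "be divided into the following categories", the kind of new, larger translational phrase the methods aim for.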

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
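A toy stand-in for this fluency scoring (the paper uses a KenLM model trained on enTenTen; here a simple add-one-smoothed n-gram model estimated from a small reference corpus illustrates how the per-order scores are combined):

```python
import math
from collections import Counter

def ngram_fluency(tokens, corpus_tokens, max_n=5):
    """Average log-probability of the candidate's n-grams for
    n = 1..max_n, estimated from a reference corpus with add-one
    smoothing. Higher (less negative) means more fluent."""
    score = 0.0
    for n in range(1, max_n + 1):
        counts = Counter(tuple(corpus_tokens[i:i + n])
                         for i in range(len(corpus_tokens) - n + 1))
        total = sum(counts.values()) or 1
        ngrams = [tuple(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
        if not ngrams:
            continue  # candidate shorter than n
        score += sum(math.log((counts[g] + 1) / (total + len(counts) + 1))
                     for g in ngrams) / len(ngrams)
    return score / max_n
```

A fluent combination (word order attested in the corpus) scores higher than a scrambled one, which is exactly the signal used to accept or reject a candidate combination.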

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.

² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
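The greedy queue strategy described above can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions (joins are validated by checking that the merged token sequence still occurs in the document segment), not the authors' implementation:

```python
def _overlap(a, b):
    # suffix/prefix overlap merge of two token lists, None if disjoint
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None

def _occurs_in(segment, sub):
    return any(segment[i:i + len(sub)] == sub
               for i in range(len(segment) - len(sub) + 1))

def greedy_join(segment, subsegments):
    """Process subsegments longest-first; successful joins are
    prepended to the queue so longer combinations are tried first,
    and each processed subsegment is discarded afterwards."""
    queue = sorted(subsegments, key=len, reverse=True)
    seen = {tuple(s) for s in queue}
    results = []
    while queue:
        head = queue.pop(0)
        produced = []
        for other in queue:
            for x, y in ((head, other), (other, head)):
                m = _overlap(x, y)
                if (m and len(m) > max(len(x), len(y))
                        and tuple(m) not in seen
                        and _occurs_in(segment, m)):
                    seen.add(tuple(m))
                    produced.append(m)
        if produced:
            queue = produced + queue
        else:
            results.append(head)
    return results
```

For a segment "a b c d e" with subsegments "a b c", "c d" and "e", the loop produces the longer joined subsegment "a b c d", mirroring the preference for long joins.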

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
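The banding used in the analysis can be expressed as a simple bucketing function. This is an illustrative helper matching the category boundaries used in the tables below, not part of MemoQ itself:

```python
def match_category(score):
    """Map a fuzzy match score (a percentage) to the MemoQ-style
    analysis bands used in the coverage tables."""
    bands = [(100, "100%"), (95, "95-99%"), (85, "85-94%"),
             (75, "75-84%"), (50, "50-74%")]
    for lower, label in bands:
        if score >= lower:
            return label
    return "No match"
```

Counting the segments of a document per band, with and without the expanded TM, reproduces the kind of comparison shown in Tables 2-4.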


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system's score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB     OJ      NJ      NS
prec      0.60    0.63    0.70    0.66
recall    0.67    0.74    0.74    0.71
F1        0.64    0.68    0.72    0.68
METEOR    0.31    0.37    0.38    0.38

DGT
prec      0.76    0.93    0.91    0.81
recall    0.78    0.86    0.88    0.85
F1        0.77    0.89    0.89    0.83
METEOR    0.40    0.50    0.51    0.45

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language, see Table 6.

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²³, Mihaela Vela², Josef van Genabith²³
¹ Jadavpur University, India
² Saarland University, Germany
³ German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog² which is language pair independent and allows users to upload their own memories in

¹ www.matecat.com
² The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25) to find the edit rate between a test sentence and the TM reference sentences.
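As an illustration of these edit operations, the C/I/D/S labels can be recovered with a standard dynamic-programming alignment. This is only a sketch of the idea, not the tercom implementation; real TER additionally handles block shifts.

```python
# Minimal Levenshtein-style alignment producing the C/I/D/S labels used by TER.
# Illustrative only: the actual tercom tool also supports block shifts.

def edit_alignment(hyp, ref):
    """Align two token lists; return (edit rate, operation labels)."""
    m, n = len(hyp), len(ref)
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for j in range(1, n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back to recover the operation sequence.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append('C' if hyp[i - 1] == ref[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append('D')   # token present in the first sentence only
            i -= 1
        else:
            ops.append('I')   # token present in the second sentence only
            j -= 1
    ops.reverse()
    edits = sum(op != 'C' for op in ops)
    return edits / max(len(hyp), 1), ops
```

Applied to the example further below (converting the TM match into the input), this yields one deletion and three substitutions, i.e., an edit rate of 0.4.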

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
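The effect of cheap deletions on ranking can be sketched as follows; the weights and the crude word-overlap heuristic are illustrative assumptions, not the actual values or matching logic used in CATaLog.

```python
# Rank TM matches by a weighted edit cost in which deletions are nearly free.
# Weights are illustrative assumptions, not the exact values used in CATaLog.

WEIGHTS = {'C': 0.0, 'D': 0.1, 'I': 1.0, 'S': 1.0}  # assumed costs

def weighted_cost(test, tm):
    """Approximate word-level cost of turning the TM segment into the test segment."""
    test, tm = test.split(), tm.split()
    matched = sum(1 for w in tm if w in test)      # crude overlap, for illustration
    deletions = len(tm) - matched                  # TM words with no counterpart
    insertions = max(len(test) - matched, 0)       # test words still to be added
    return deletions * WEIGHTS['D'] + insertions * WEIGHTS['I']

test = "how much does it cost"
seg1 = "how much does it cost to the holiday inn by taxi"
seg2 = "how much"
# seg1 needs 6 cheap deletions (cost 0.6); seg2 needs 3 costly insertions
# (cost 3.0), so seg1 is ranked first, matching the intuition in the text.
ranked = sorted([seg1, seg2], key=lambda s: weighted_cost(test, s))
```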

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color-code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই ।

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
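Combining the two alignment sets to color the target side can be sketched as follows; the data structures are hypothetical simplifications of the tercom and GIZA++ output the tool actually reads.

```python
# Project match/mismatch colors from a TM source segment onto its translation.
# ter_ops: per-source-token TER labels ('C' = match, anything else = mismatch).
# word_align: source-position -> target-positions map (GIZA++-style), 0-based here.

def colour_target(ter_ops, word_align, target_tokens):
    colours = ['red'] * len(target_tokens)           # default: needs editing
    for src_pos, op in enumerate(ter_ops):
        if op == 'C':                                # source token matches the input
            for tgt_pos in word_align.get(src_pos, []):
                colours[tgt_pos] = 'green'           # its translation can stay intact
    return list(zip(target_tokens, colours))
```

For instance, with labels `['C', 'D', 'S']` and an alignment mapping source position 0 to target position 1, only that target token is colored green.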

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you've got the wrong number

3 you are wrong

4 you pay me


5 you're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss apni amake vul khuchro diyechen ami ashi dollar diyechi )

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss amar dharona apni vul nombore phon korechen )

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )

For the input sentence shown above, the TM system shows these color-coded top 5 TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) required to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval based on inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists, which allows comparing results with and without them. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
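The posting-list filtering described above can be sketched as follows; the stop-word list and the toy TM segments are illustrative.

```python
from collections import defaultdict

STOP_WORDS = {'the', 'a', 'an', 'to', 'of', 'is'}   # illustrative stop-word list

def build_postings(tm_segments):
    """Map each lowercased, non-stop-word surface form to the ids of segments containing it."""
    postings = defaultdict(set)
    for seg_id, segment in enumerate(tm_segments):
        for word in segment.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(seg_id)
    return postings

def candidate_segments(input_sentence, postings):
    """Only segments sharing at least one vocabulary word are later scored with TER."""
    candidates = set()
    for word in input_sentence.lower().split():
        candidates |= postings.get(word, set())
    return candidates

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me"]
postings = build_postings(tm)
# Only segment 0 shares vocabulary words ("we", "table", "window") with this input,
# so TER alignment is computed against segment 0 alone.
cands = candidate_segments("we would like a table by the window", postings)
```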

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to opti-mize system performance

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Trans-lation (EILMT) is a project funded by theDepartment of Information and Technology(DIT) Government of India

Santanu Pal is supported by the Peo-ple Programme (Marie Curie Actions) ofthe European Unionrsquos Framework Programme(FP72007-2013) under REA grant agreementno 317471

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 17–23, Hissar, Bulgaria, Sept 2015

Abstract

We propose the integration of clause splitting as a pre-processing step for match retrieval in Translation Memory (TM) systems to increase the number of relevant sub-segment matches. Through a series of experiments, we investigate the impact of clause splitting in instances where the input does not match an entire segment in the TM but only a clause from a segment. Our results show that there is a statistically significant increase in the number of retrieved matches when both the input segments and the segments in the TM are first processed with a clause splitter.

1 Rationale

Translation memory tools have had a great impact on the translation industry, as they provide considerable assistance to translators. They allow translators to easily re-use previous translations, providing them with valuable productivity gains in an industry where there is a great demand for quality translation delivered in the shortest possible time. However, existing tools have some shortcomings. The majority of existing tools rely on Levenshtein distance and seek to identify matches only at the sentence level. Semantically similar segments are therefore difficult to retrieve if the string similarity is not high enough, as are sub-segment matches, because if only part of a sentence matches the input, even if this part is an entire clause, it is unlikely that this sentence would be retrieved (Pekar and Mitkov, 2007). As a result, TMs are especially useful only for highly repetitive text types such as updated versions of technical manuals.

In this study, we aim to address the problem of retrieving sub-segment matches by performing clause splitting on the source segment as a pre-processing step for TM match retrieval. While matches for entire sentences or almost entire sentences are the most useful type of matches, it is also less likely for such matches to be found in most text types, and even less so for complex sentences. Retrieving clauses is desirable because there is a higher chance for a match to be found for a clause than for a complex sentence; at the same time, clauses are similar to sentences in that they both contain a subject and a verb, hence a "complete thought". Therefore, clause matches are more likely to be in context and to actually be used by the translator than, for example, phrase matches.

We perform experiments comparing the match retrieval performance of TM tools when they are used as is and when the input file and the TM segments are first processed with a clause splitter before being fed into the TM tool. The paper is organised as follows. In section 2 we discuss related work on TM matching. In section 3 we discuss how clause splitting can be beneficial to TM matching and how clause splitting was implemented for this study. We then describe our experiments in section 4 and discuss the results in section 5. Finally, we present our conclusion and future work in section 6.

Improving Translation Memory Matching through Clause Splitting

Katerina Timonera
Research Group in Computational Linguistics
University of Wolverhampton
krtimonera@gmail.com

Ruslan Mitkov
Research Group in Computational Linguistics
University of Wolverhampton
r.mitkov@wlv.ac.uk

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke, 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007), we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions to a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, whose ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russel (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 ... w20
(b) w1 w2 w3 w4 w5 w21 ... w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper, we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment, we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study, the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
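As a toy illustration of this pre-processing step: the study itself uses the rule-based module of Puscasu (2004), which identifies finite verbs, but the idea of turning one sentence into several clause-level TM segments can be sketched with a naive marker-based splitter (the marker list below is an assumption for demonstration only).

```python
import re

# Toy illustration only: the actual system uses the rule-based clause splitter
# of Puscasu (2004), which detects finite verbs. Here we merely split the
# sentence before a few assumed clause markers for demonstration.
CLAUSE_MARKERS = r'\b(that|which|because|and that|there)\b'

def naive_clause_split(sentence):
    parts, last = [], 0
    for m in re.finditer(CLAUSE_MARKERS, sentence):
        if m.start() > last:
            parts.append(sentence[last:m.start()].strip())
        last = m.start()
    parts.append(sentence[last:].strip())
    return [p for p in parts if p]
```

On the Set A example below, this toy splitter happens to reproduce the four component clauses stored in the TM.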

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.
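Why a clause-level match can fall below even the minimum threshold can be made concrete with the textbook Levenshtein-based fuzzy score; commercial TM scoring is proprietary, so this is only the common approximation, not the exact Wordfast or Trados algorithm.

```python
# Character-level Levenshtein fuzzy score, as commonly approximated:
# score = 1 - distance / max(len(a), len(b)). This is the textbook
# formulation, not the proprietary scoring of any particular TM tool.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def fuzzy_score(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

clause = "southwark crown court was told yesterday"
sentence = ("a member of the rap group so solid crew threw away a loaded gun "
            "during a police chase southwark crown court was told yesterday")
# The clause is contained verbatim, yet the whole-sentence score is well
# below typical fuzzy thresholds, so a sentence-level TM would not surface it.
```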

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A, we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the

corresponding segments that should be

retrieved

In Set B there are also 150 input segments

for the experiments where no clause splitting

is used and the corresponding segment in the

TM is a longer sentence containing a paraphrase of the input segment For the

experiments with clause splitting there are

185 input segments (as in set A some of the

original 150 have more than one clause) and in

the TM the component clauses of the original

longer segment are stored and we test

whether the clause that is a paraphrase of the

input can be retrieved Below is an example

Without clause splitting

Input Segment in TM

the ministry of defense

once indicated

that about 20000

soldiers were missing in

the korean war

the ministry of defense

once indicated that about

20000 soldiers were

missing in the korean war

and that the ministry of

defense believes there

may still be some

survivors

With clause splitting

Input Segments in TM

- the ministry of defense

once indicated

- that about 20000

soldiers were missing in

the korean war

- the ministry of defense

once indicated

- that about 20000

soldiers were missing in

the korean war

- and that the ministry of

defense believes

- there may still be some

survivors

Table 1 Set A ExampleWithout clause

splitting

  Input:          a member of the chart-topping collective so solid crew
                  dumped a loaded pistol in an alleyway
  Segment in TM:  a member of the rap group so solid crew threw away a loaded
                  gun during a police chase southwark crown court was told
                  yesterday

With clause splitting:

  Input:           a member of the chart-topping collective so solid crew
                   dumped a loaded pistol in an alleyway
  Segments in TM:  - a member of the rap group so solid crew threw away a
                     loaded gun during a police chase
                   - southwark crown court was told yesterday

Table 2: Set B example
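The retrieval contrast these examples illustrate can be sketched with a toy scorer. The sketch below uses `difflib.SequenceMatcher` purely as a stand-in for the undisclosed fuzzy formulas of Wordfast and Trados: without clause splitting the short input is scored against the whole long TM segment, while with clause splitting it matches one stored clause exactly.

```python
from difflib import SequenceMatcher

def fuzzy(a, b):
    # stand-in similarity over word sequences (0.0-1.0); commercial TM
    # tools use their own, undisclosed scoring formulas
    return SequenceMatcher(None, a.split(), b.split()).ratio()

tm_segment = ("the ministry of defense once indicated that about 20000 soldiers "
              "were missing in the korean war and that the ministry of defense "
              "believes there may still be some survivors")
tm_clauses = ["the ministry of defense once indicated",
              "that about 20000 soldiers were missing in the korean war",
              "and that the ministry of defense believes",
              "there may still be some survivors"]
input_clause = "that about 20000 soldiers were missing in the korean war"

# without clause splitting: input scored against the whole long segment
whole_score = fuzzy(input_clause, tm_segment)
# with clause splitting: input matches one stored clause exactly
best_clause_score = max(fuzzy(input_clause, c) for c in tm_clauses)
```

Under any reasonable scorer, `best_clause_score` reaches 1.0 while `whole_score` stays well below typical match thresholds, which is the effect the experiments quantify.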

5 Results

                                 W/o clause    W/ clause
                                 splitting     splitting
  WORDFAST
  Retrieved (default threshold)    23.33%        90.00%
  Retrieved (minimum threshold)    38.00%        92.22%
  TRADOS
  Retrieved (default threshold)    14.00%        88.89%
  Retrieved (minimum threshold)    14.00%        96.67%

  Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that, although the percentage of retrieved matches is generally lower than in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all although it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
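The paper computed the test with SPSS; as a minimal sketch of the statistic involved, the paired t value over per-segment match scores (with 0 standing for a failed retrieval) can be derived directly. The score lists below are made up for illustration, not taken from the experiments.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(scores_without, scores_with):
    # paired t statistic over per-segment match scores;
    # a score of 0 stands for "no match retrieved"
    diffs = [a - b for a, b in zip(scores_without, scores_with)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# illustrative scores for five segments, without / with clause splitting
without_cs = [0.0, 35.0, 0.0, 40.0, 25.0]
with_cs = [90.0, 95.0, 88.0, 92.0, 85.0]
t = paired_t(without_cs, with_cs)
```

A strongly negative t here means the "with clause splitting" scores are systematically higher; the significance level is then read off the t distribution with n-1 degrees of freedom.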

SET A

  WORDFAST             Mean w/o clause   Mean w/ clause   p-value
                       splitting         splitting
  Default threshold        19.23             85.31         0.000
  Minimum threshold        29.28             86.49         0.000
  TRADOS
  Default threshold        11.78             83.88         0.000
  Minimum threshold        11.78             88.07         0.000

SET B

  WORDFAST             Mean w/o clause   Mean w/ clause   p-value
                       splitting         splitting
  Default threshold         2.23             17.21         0.000
  Minimum threshold         6.49             30.62         0.000
  TRADOS
  Default threshold         2.71             23.08         0.000
  Minimum threshold        16.83             46.37         0.000

  Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

                                 W/o clause    W/ clause
                                 splitting     splitting
  WORDFAST
  Retrieved (default threshold)     2.67%       17.84%
  Retrieved (minimum threshold)    10.00%       41.08%
  TRADOS
  Retrieved (default threshold)     3.33%       25.95%
  Retrieved (minimum threshold)    36.00%       70.67%

  Table 4: Percentage of correctly retrieved segments in Set B

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as such data are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University. [unpublished]

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second Generation TM Software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, Sept 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzik@itl.auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, which does not allow the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of a natural language, the fuzzy matching level for similar but not identical sentences is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centred evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.

3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters), not counting whitespace, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
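The fuzzy formula itself is not reproduced in the paper; one common variant scores 1 minus the word-level Levenshtein distance divided by the longer segment's length. A sketch of that variant follows (exact percentages depend on tokenization and normalization choices, so the result need not match the 70% quoted above):

```python
def edit_distance(a, b):
    # word-level Levenshtein distance by dynamic programming
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (x != y)))     # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(s1, s2):
    # one common variant: normalize by the longer segment's word count
    w1, w2 = s1.lower().split(), s2.lower().split()
    return 1 - edit_distance(w1, w2) / max(len(w1), len(w2))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
ops = edit_distance(s1.lower().split(), s2.lower().split())  # 4 operations
```

On these two sentences the minimal word-level edit script is indeed 1 substitution plus 3 deletions, matching the operation counts given in the text.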

In this case, according to methodologies proposed by researchers in this field, the sentence will be sent for machine translation given the low fuzzy match score and then post-edited; otherwise the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, the main meaning residing with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003). SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear without a determiner (e.g. to make attention), in others with a determiner (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitive for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
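The entry format in Figure 1 is regular enough to decompose mechanically. The sketch below splits such an entry into lemma, category and a property map; it is a simplification of NooJ's actual dictionary syntax, covering only the comma-plus-equals pattern shown above.

```python
def parse_entry(entry):
    # split "lemma,CAT+prop=value+flag..." into (lemma, category, properties)
    lemma, rest = entry.split(",", 1)
    category, *props = rest.split("+")
    properties = {}
    for prop in props:
        key, _, value = prop.partition("=")
        # flags without "=" (e.g. +Hum) become boolean properties
        properties[key] = value or True
    return lemma, category, properties

lemma, cat, props = parse_entry("artist,N+FLX=TABLE+Hum")
# lemma == "artist", cat == "N", props == {"FLX": "TABLE", "Hum": True}
```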

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, consists of three pipeline stages.

Figure 2: Pipeline of the paraphrase framework

The first stage includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
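The grammar's behaviour on the running example can be approximated in a few lines. The nominalization table below is a toy stand-in for NooJ's derivational lexicon, and the pattern covers only "make" as support verb in the present tense; the real local grammars also preserve tense and person, which this sketch ignores.

```python
import re

# toy derivational lexicon: predicate noun -> full verb (illustrative only)
NOMINALIZATIONS = {"cancellation": "cancel",
                   "installation": "install",
                   "reservation": "reserve"}

# support verb + determiner + predicate noun + optional preposition
SVC = re.compile(r"\b(?:make|makes|made)\s+(?:a|an|the)\s+(\w+)(?:\s+of)?",
                 re.IGNORECASE)

def paraphrase(sentence):
    def repl(match):
        noun = match.group(1).lower()
        # leave the construction untouched if the noun is not in the lexicon
        return NOMINALIZATIONS.get(noun, match.group(0))
    return SVC.sub(repl, sentence)

result = paraphrase("Press Cancel to make the cancellation of your personal information")
# result == "Press Cancel to cancel your personal information"
```

This reproduces the sentence (1) to sentence (2) mapping discussed in Section 3.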

The last stage consists of the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units. This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source language translation units.
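Because the targets are copied unchanged, the whole pipeline reduces to appending one extra unit per paraphrasable source. A schematic version, with a trivial string-replacement stand-in for the NooJ paraphraser, might be:

```python
def expand_tm(units, paraphrase):
    # append a (paraphrased source, original target) unit for every unit
    # whose source changes under paraphrasing; targets are never modified
    expanded = list(units)
    for source, target in units:
        p = paraphrase(source)
        if p != source:
            expanded.append((p, target))
    return expanded

units = [("Press Cancel to make the cancellation of your data",
          "Premere Cancel per cancellare i propri dati")]
# trivial stand-in paraphraser for demonstration only
bigger = expand_tm(units, lambda s: s.replace("make the cancellation of", "cancel"))
```

The same additive bookkeeping underlies the figures reported in Section 5, where 587 recognized SVCs turn a 1025-unit TM into a 1612-unit one.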

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before, in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, so that each at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹. The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), though we achieve high gains in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

  Fuzzy match category   plTM   parTM
  100%                     14     48
  95%-99%                  23     32
  85%-94%                  18     29
  75%-84%                  51     38
  50%-74%                  32     18
  No match                 62     35
  Total                   200    200

  Table 1: Statistics for experimental data
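The per-category improvements quoted in the running text can be recomputed from Table 1; each delta is the change in segment count expressed as a share of the 200-sentence test set (negative values in the low categories mean segments moved up into better categories).

```python
plTM = {"100": 14, "95-99": 23, "85-94": 18, "75-84": 51, "50-74": 32, "No match": 62}
parTM = {"100": 48, "95-99": 32, "85-94": 29, "75-84": 38, "50-74": 18, "No match": 35}
total = sum(plTM.values())  # 200 test sentences

# change per fuzzy category as a percentage of the test set
delta = {cat: 100 * (parTM[cat] - plTM[cat]) / total for cat in plTM}
# delta["100"] == 17.0, matching the 17% gain in exact matches reported above
```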

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with fuzzy match scores >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.

Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased, and 2 further segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Example 1:
  Seg:     Make sure that the brake hose is not twisted
  TMsl:    Ensure that the brake hose is not twisted
  TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
  parTMsl: Make sure that the brake hose is not twisted

Example 2:
  Seg:     CAUTION You must make the istallation of the version 6 of the software
  TMsl:    CAUTION You must install the version 6 of the software
  TMtg:    ATTENZIONE Si deve installare la versione 6 del software
  parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences do not share the same lexical items.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs and support for other languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orăsan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pp. 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19–26.


Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa, hales, xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities              alignment
být větší   be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases).

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
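To make the record format concrete, here is a minimal sketch that parses the subsegment entry from Figure 1 and scores a translation candidate; the "|||"-separated layout is an assumption based on Moses phrase tables (the paper does not specify the on-disk format), and the product-of-scores ranking is one simple illustrative choice, not necessarily the authors' selection rule.

```python
# Sketch: parsing one generated subsegment (cf. Figure 1) into a record.
# The "|||"-separated layout is assumed (Moses phrase-table style); the
# field order follows the description in the text.
from dataclasses import dataclass

@dataclass
class Subsegment:
    source: str
    target: str
    scores: tuple    # inverse phrase prob., inverse lexical weight,
                     # direct phrase prob., direct lexical weight
    alignment: list  # (source index, target index) pairs

def parse_entry(line: str) -> Subsegment:
    source, target, probs, align = [f.strip() for f in line.split("|||")]
    scores = tuple(float(p) for p in probs.split())
    alignment = [tuple(int(i) for i in p.split("-")) for p in align.split()]
    return Subsegment(source, target, scores, alignment)

entry = parse_entry("být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1")

# One simple way to choose among alternative translations of the same
# source subsegment: rank by the product of the four scores.
score = 1.0
for p in entry.scores:
    score *= p
```

Alternative translations of the same source would then be sorted by this score and the best one kept.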

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords

32

Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)

new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
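One possible reading of this combined n-gram score is sketched below: the average, over n = 1..5, of the mean log-probability of the candidate's n-grams. The `lm_logprob` callable is a hypothetical stand-in for a KenLM query, and the exact combination formula is our assumption, as the paper does not spell it out.

```python
# Sketch of a combined n-gram fluency score for a candidate segment
# (illustrative; the paper only says "a combined n-gram score for
# n = 1..5 from a target language model").

def fluency(tokens, lm_logprob, max_n=5):
    """Average over n of the mean log-probability of the n-grams."""
    per_n = []
    for n in range(1, max_n + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams:  # skip orders longer than the segment
            per_n.append(sum(lm_logprob(g) for g in ngrams) / len(ngrams))
    return sum(per_n) / len(per_n)
```

Candidate combinations would then be ordered by this score, and low-scoring ones discarded.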

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
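A simplified sketch of this greedy joining is given below, with subsegments reduced to (start, end) token spans in the input segment. The bookkeeping with the lists I and T is condensed into a single longest-first merge loop, so this illustrates the overlap (JOINO) and neighbour (JOINN) conditions rather than reproducing the authors' exact algorithm.

```python
# Sketch: greedy joining of subsegment spans, longest-first.
# Spans are (start, end) token positions with end exclusive.

def can_join(a, b, allow_overlap=True):
    """JOINO: spans overlap or touch; JOINN: spans strictly adjacent."""
    lo, hi = sorted([a, b])
    return lo[1] >= hi[0] if allow_overlap else lo[1] == hi[0]

def join_spans(spans, allow_overlap=True):
    """Greedily merge joinable spans, processing longer spans first
    (a single pass, for illustration)."""
    merged = []
    for s in sorted(spans, key=lambda x: x[0] - x[1]):  # longest first
        for i, t in enumerate(merged):
            if can_join(s, t, allow_overlap):
                merged[i] = (min(s[0], t[0]), max(s[1], t[1]))
                break
        else:
            merged.append(s)  # not joinable with anything seen so far
    return sorted(merged)
```

For example, spans (0, 3) and (2, 5) overlap and join into (0, 5) under JOINO, while JOINN would only join them if one started exactly where the other ends.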

2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match     TM     SUB    OJ     NJ     all
100%      0.41   0.12   0.10   0.17   0.46
95–99%    0.84   0.91   0.64   0.90   1.37
85–94%    0.07   0.05   0.25   0.76   0.81
75–84%    0.80   0.91   1.71   3.78   4.40
50–74%    8.16  10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB    OJ     NJ     all
100%      0.08   0.07   0.11   0.28
95–99%    0.75   0.44   0.49   0.66
85–94%    0.05   0.08   0.49   0.61
75–84%    0.46   0.96   3.67   3.85
50–74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB    OJ     NJ     all
100%      0.15   0.13   0.29   0.57
95–99%    0.98   0.59   1.24   1.45
85–94%    0.09   0.22   1.34   1.37
75–84%    1.03   2.26   6.35   7.07
50–74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system's score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into ac-

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB   OJ    NJ    NS
prec      0.60  0.63  0.70  0.66
recall    0.67  0.74  0.74  0.71
F1        0.64  0.68  0.72  0.68
METEOR    0.31  0.37  0.38  0.38

DGT
prec      0.76  0.93  0.91  0.81
recall    0.78  0.86  0.88  0.85
F1        0.77  0.89  0.89  0.83
METEOR    0.40  0.50  0.51  0.45

Table 6: Error example, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; in this case, however, we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows

it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25 5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
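Such an edit-operation sequence can be used mechanically to split the TM match into matched and unmatched tokens, as in the following sketch; the helper is illustrative (not part of tercom), and the op string is transcribed from the example above.

```python
# Sketch: splitting a TM match into matched (green) and unmatched (red)
# tokens using a TER-style edit-operation string.

def mark_tokens(tm_tokens, ops):
    """ops has one of C (match), S (substitution), D (deletion) per TM
    token; I (insertion) consumes no TM token and is skipped here."""
    marked = []
    it = iter(tm_tokens)
    for op in ops:
        if op == "I":
            continue  # an inserted word exists only on the input side
        marked.append((next(it), "green" if op == "C" else "red"))
    return marked

tm = "we want to have a table near the window".split()
marked = mark_tokens(tm, list("CDSSCCSCC"))  # ops from the example above
```

Here "we", "a", "table", "the" and "window" come out green, while "want", "to", "have" and "near" come out red.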

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
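The weighted scoring described above can be sketched as follows; the weight values are illustrative assumptions, not CATaLog's actual settings:

```python
# Score a TM match by weighting TER edit operations, giving deletions a very
# low cost because they take post-editors little effort.
# The weights below are illustrative assumptions, not CATaLog's actual values.

WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def edit_cost(ops, test_len):
    """Weighted edit cost normalised by test-segment length (lower = better match)."""
    return sum(WEIGHTS[op] for op in ops) / max(test_len, 1)

test = "how much does it cost".split()
# Edit operations needed to turn each TM segment into the test segment:
seg1_ops = ["C"] * 5 + ["D"] * 6   # "how much does it cost to the holiday inn by taxi"
seg2_ops = ["C"] * 2 + ["I"] * 3   # "how much"

# Uniform weights would prefer TM segment 2 (3 edits vs 6); the weighted cost
# prefers TM segment 1, because its edits are all cheap deletions.
print(edit_cost(seg1_ops, len(test)), edit_cost(seg2_ops, len(test)))
```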

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
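Combining the two alignment sets can be sketched as follows; the function and data-structure names are illustrative assumptions, not taken from the tool:

```python
# Project TER match/mismatch information from the TM source sentence onto its
# translation via word alignments, yielding green/red colour spans.
# A simplified sketch; names and data structures are assumptions.

def color_target(ter_ops, alignment, target_len):
    """ter_ops[i] is the TER op ('C', 'S', 'D', ...) for source word i;
    alignment maps a source word index to a list of target word indices."""
    colors = ["red"] * target_len          # unmatched by default
    for src_idx, op in enumerate(ter_ops):
        if op == "C":                      # source word matched the input sentence
            for tgt_idx in alignment.get(src_idx, []):
                colors[tgt_idx] = "green"
    return colors

# "we want to have a table near the window ." aligned to a 7-word translation
# (0-based indices derived from the GIZA++ alignment shown above).
alignment = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1], 9: [6]}
ter_ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
print(color_target(ter_ops, alignment, 7))
```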

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me


5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )

For the input sentence shown above, the TM system shows the five color-coded topmost TM matches listed above, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists; this feature is there to compare the results obtained with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
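The posting-list filtering can be sketched as follows; the stop-word list is a toy assumption:

```python
# Build posting lists (an inverted index) over TM source segments and use them
# to restrict similarity search to candidate segments sharing a vocabulary word.
# A minimal sketch; the stop-word list is an assumption.
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of", "and", "is", "in"}

def build_index(tm_segments):
    index = defaultdict(set)
    for seg_id, segment in enumerate(tm_segments):
        for word in segment.lower().split():
            if word not in STOP_WORDS:      # surface forms only, no stemming
                index[word].add(seg_id)
    return index

def candidates(index, input_sentence):
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids                               # only these go on to TER scoring

tm = ["we want to have a table near the window",
      "you gave me the wrong change",
      "how much does it cost"]
index = build_index(tm)
print(candidates(index, "we would like a table by the window"))
```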

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate the TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to opti-mize system performance

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper, as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality: A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6345244.

Marcello Federico, Alessandro Cattelan and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira and Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces

2 Related Work

Attempts to address the shortcomings of existing tools include the integration of language processing to break down a sentence into smaller segments. The so-called 'second generation' TM system Similis (Planas, 2005) performs chunking to split sentences into syntagmas to allow sub-sentence matching. However, Reinke (2013) observes that for certain language pairs like English-German, only rather short phrases like simple NPs are identified, and larger syntactic units cannot be retrieved. This can be regarded as a disadvantage, as the processing of larger units would be desirable for the support of professional computer-assisted human translation (Kriele (2006) and Macken (2009), cited in Reinke 2013).

MetaMorphoTM (Hodász and Pohl, 2005) also divides sentences into smaller chunks. Moreover, it uses a multi-level linguistic similarity technique (surface form, lemma, word class) to determine the similarity between two source-language segments.

Other attempts involve deeper linguistic processing techniques. In Pekar and Mitkov (2007) we propose the 'third-generation translation memory', which introduces the concept of semantic matching. We employ syntactic and semantic analysis of segments stored in a TM to produce a generalised representation of segments, which reduces equivalent lexical, syntactic and lexico-syntactic constructions into a single representation. Then a retrieval mechanism operating on these generalised representations is used to search for useful previous translations in the TM.

This study is part of the third-generation translation memory project, of which the ultimate goal is to produce more intelligent TM systems using NLP techniques. To the best of our knowledge, clause splitting has not previously been investigated as a possible method for increasing the number of relevant retrieved matches.

3 Clause splitting for TM matching

Macklovitch and Russell (2000) note that when a sufficiently close match cannot be found for a new input sentence, current TM systems are unable to retrieve sentences that contain the same clauses or other major phrases.

For example, (a) below is a new input sentence composed of twenty five-character words. The TM contains the sentence (b), which shares an identical substring with sentence (a). However, as this substring makes up only 25% of the sentence's total number of characters, it is unlikely that current TM tools would be able to retrieve it as a fuzzy match.

(a) w1 w2 w3 w4 w5 w6 … w20

(b) w1 w2 w3 w4 w5 w21 … w35

If clause splitting were employed, clauses would be treated as separate segments, thus increasing the likelihood that clauses which are subparts of larger units could have a match score sufficiently high to be retrieved by the TM system.

In this paper we compare the effect of performing clause splitting before retrieving matches in a TM. In each experiment we identify the difference in matching performance when a TM tool is used as is and when the input file and the translation memory are first run through a clause splitter. The clause splitter we use in this study is a modified version of the one described in Puscasu (2004). The original version employs both machine learning and linguistic rules to identify finite clauses for both English and Romanian, but in this version only the rule-based module is used. Puscasu (2004) developed a clause splitting method for both English and Romanian, and to maintain consistency between the two languages, her definition of a clause is the one prescribed by the Romanian Academy of Grammar, which is that a clause is a group of words containing a finite verb. Non-finite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools perform if clause splitting is used in pre-processing.
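As a rough illustration of this pre-processing step, a naive splitter might segment before a few clause-introducing words; this toy delimiter rule is far simpler than Puscasu's actual rule set:

```python
# Naive clause splitting for pre-processing input and TM segments.
# The delimiter rule below is a toy assumption, much simpler than the
# rule-based finite-clause splitter of Puscasu (2004).
import re

def split_clauses(segment):
    # Split before a few clause-introducing words, keeping each marker
    # with the clause it introduces.
    parts = re.split(r"\s+(?=(?:that|and|but)\s)", segment)
    return [p.strip() for p in parts if p.strip()]

def preprocess(segments):
    """Split every segment into clauses before importing into the TM tool."""
    return [clause for seg in segments for clause in split_clauses(seg)]

tm = ["the ministry of defense once indicated that about 20000 soldiers "
      "were missing in the korean war"]
print(preprocess(tm))
```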

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
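Threshold-based retrieval can be sketched with a generic similarity measure; here Python's difflib stands in for the tools' proprietary Levenshtein-based scores:

```python
# Retrieve TM segments whose similarity to the input clears a fuzzy threshold.
# difflib's ratio is an assumed stand-in for Wordfast/Trados match scores.
from difflib import SequenceMatcher

def matches(input_seg, tm_segments, threshold):
    scored = [(100 * SequenceMatcher(None, input_seg.split(), seg.split()).ratio(), seg)
              for seg in tm_segments]
    return [(round(score, 1), seg) for score, seg in scored if score >= threshold]

tm = ["the ministry of defense once indicated",
      "the ministry of defense once indicated that about 20000 soldiers "
      "were missing in the korean war and that the ministry of defense "
      "believes there may still be some survivors"]
query = "the ministry of defense once indicated"

print(matches(query, tm, threshold=70))  # only the split-off clause clears 70%
print(matches(query, tm, threshold=30))  # the long unsplit segment appears at 30%
```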

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause, or part of it, as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there are more than one) and the segments in the TM are the component clauses of the original sentence. For the experiments done without clause splitting, there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1; the underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting, there are 185 input segments (as in Set A, some of the original 150 have more than one clause); in the TM the component clauses of the original longer segment are stored, and we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:

  Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
  Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

  Input:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  Segments in TM:
  - the ministry of defense once indicated
  - that about 20000 soldiers were missing in the korean war
  - and that the ministry of defense believes
  - there may still be some survivors

Table 1: Set A Example

Without clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

  Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
  Segments in TM:
  - a member of the rap group so solid crew threw away a loaded gun during a police chase
  - southwark crown court was told yesterday

Table 2: Set B Example

5 Results

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)         23.33%                 90.00%
Retrieved (minimum threshold)         38.00%                 92.22%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)         14.00%                 88.89%
Retrieved (minimum threshold)         14.00%                 96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause. Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that, even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and treat this as one case, in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
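The paired t-test itself is straightforward to reproduce with any statistics package; a standard-library sketch is shown below, where the per-segment scores are made-up placeholders, not the paper's data:

```python
# Paired t-test on per-segment match scores with and without clause splitting.
# Implemented from scratch with the standard library; the score lists are
# illustrative placeholders, not the paper's actual data.
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n))   # t with n-1 degrees of freedom

without_split = [0, 0, 45, 0, 72, 0, 0, 38, 0, 0]   # 0 = correct match not retrieved
with_split = [88, 92, 100, 76, 95, 81, 0, 90, 85, 79]

t = paired_t(with_split, without_split)
print(round(t, 2))   # compare against the t distribution with 9 d.f.
```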

SET A

WORDFAST            Mean (w/o splitting)   Mean (w/ splitting)   p-value
Default threshold         19.23                85.30888891        0.000
Minimum threshold         29.28                86.48888891        0.000

TRADOS              Mean (w/o splitting)   Mean (w/ splitting)   p-value
Default threshold         11.78                83.87777781        0.000
Minimum threshold         11.78                88.07111113        0.000

SET B

WORDFAST            Mean (w/o splitting)   Mean (w/ splitting)   p-value
Default threshold          2.23                17.21111111        0.000
Minimum threshold          6.49                30.62111112        0.000

TRADOS              Mean (w/o splitting)   Mean (w/ splitting)   p-value
Default threshold          2.71                23.083             0.000
Minimum threshold         16.83                46.36777779        0.000

Table 5: Paired t-test on all experiments

WORDFAST                        W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          2.67%                 17.84%
Retrieved (minimum threshold)         10.00%                 41.08%

TRADOS                          W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          3.33%                 25.95%
Retrieved (minimum threshold)         36.00%                 70.67%

Table 4: Percentage of correctly retrieved segments in Set B

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as they are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state, our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS - Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, Sept 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue growing for the foreseeable future. To support this growing demand, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the resulting translations are more consistent.

The fuzzy match level refers to all the corrections a professional translator must make for the retrieved suggestion to meet the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage in some cases. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy match level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the paraphrases share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them reuse human translations for a given sentence as much as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy match scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed to make two sentences identical (Hirschberg, 1997).

For instance, the word-based fuzzy match score between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based score is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
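The word-level scoring described above can be sketched as follows. This is a minimal illustration, assuming Levenshtein distance over tokens normalised by the length of the longer segment; commercial tools do not disclose their exact variant, and the paper's own percentages normalise by a different word count, so the absolute numbers differ.

```python
# Word-level fuzzy match score: a sketch of the edit-distance approach
# attributed to Koehn (2010). The normalisation by the longer segment's
# token count is an assumption, not the paper's exact formula.

def edit_distance(a, b):
    """Levenshtein distance over token lists (deletions, insertions, substitutions)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(s1, s2):
    t1, t2 = s1.split(), s2.split()
    return 1.0 - edit_distance(t1, t2) / max(len(t1), len(t2))

tm_source = "Press Cancel to make the cancellation of your personal information"
new_seg = "Press Cancel to cancel your personal information"
print(fuzzy_match(tm_source, new_seg))  # 1 substitution + 3 deletions over 10 tokens
```

Under this normalisation, sentences (1) and (2) score 0.6; only the constant in the denominator differs from the 70% quoted above.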

In this case, according to methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score, and then post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC (make the cancellation), while sentence (2) contains the corresponding full verb (cancel). An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear without a determiner (e.g. to make attention) and in others with one (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent (sentence (3)) are included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed paraphrasing a SVC can in-

crease the fuzzy match level during the translation

process This section details the pipeline of mod-

ules towards the paraphrase of the TM source

translation units

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and, finally, one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
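Entries of this shape are easy to consume programmatically. The sketch below parses a lemma line into its category and property flags; the comma after the lemma is how NooJ dictionaries are conventionally written (the extraction of the figure may have dropped it), and real NooJ dictionaries carry more fields than this toy parser handles.

```python
# Minimal sketch: parsing NooJ-style dictionary entries such as
# "artist,N+FLX=TABLE+Hum" into (lemma, category, properties).
# Valueless flags like +Hum are stored as boolean True.

def parse_entry(line):
    lemma, rest = line.strip().split(",", 1)
    parts = rest.split("+")
    category = parts[0]                 # e.g. N, V, A
    props = {}
    for p in parts[1:]:
        key, _, value = p.partition("=")
        props[key] = value or True      # FLX=TABLE -> {"FLX": "TABLE"}
    return lemma, category, props

print(parse_entry("artist,N+FLX=TABLE+Hum"))
# ('artist', 'N', {'FLX': 'TABLE', 'Hum': True})
```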

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and there will therefore be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, built on the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
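The rewriting step can be illustrated with a toy stand-in for the NooJ grammar. The tiny nominalization lexicon and the regex pattern below are hypothetical simplifications: they only cover the "support verb + determiner + predicate noun + of" shape in the simple present, whereas the actual NooJ resources also handle adjectives, adverbs, other tenses and derivational morphology.

```python
# Illustrative sketch of the SVC -> full-verb rewriting performed by the
# NooJ local grammar. The lexicon and pattern are hypothetical stand-ins
# for NooJ's dictionaries and grammars.
import re

# predicate noun -> derived full verb (toy entries, assumptions)
NOMINALIZATIONS = {"cancellation": "cancel",
                   "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = "make|makes|take|takes|give|gives|have|has"
DET = "(?:the|a|an)"

PATTERN = re.compile(rf"\b(?:{SUPPORT_VERBS})\s+{DET}\s+(\w+)\s+of\b")

def paraphrase_svc(sentence):
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(1).lower())
        return verb if verb else match.group(0)  # unknown noun: leave as is
    return PATTERN.sub(repl, sentence)

print(paraphrase_svc("Press Cancel to make the cancellation of your personal information"))
# Press Cancel to cancel your personal information
```

Applied to sentence (1) of Section 3, this produces sentence (2), which is exactly the kind of paraphrased source unit added to the TM.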

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

The resulting TM can be imported and used in all CAT tools in the same way as before. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). The test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations taken from a larger TM based on the degree of fuzzy match, with a minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in the 95%-99% category, 6% in the 85%-94% category, 28% in the 75%-84% category and, finally, 27% in the 0-74% category (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category    plTM    parTM
100%                      14       48
95%-99%                   23       32
85%-94%                   18       29
75%-84%                   51       38
50%-74%                   32       18
No Match                  62       35
Total                    200      200

Table 1: Statistics for the experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required when the match is not 100%.

Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION You must make the istallation of the version 6 of the software
TMsl:    CAUTION You must install the version 6 of the software
TMtg:    ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the lexical items are not.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM across all fuzzy match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.

In the future, we will continue to explore the paraphrasing of other support verbs, as well as support for other source languages. Last but not least, a paraphrasing framework applied to the target sentences may further improve the quality of translations.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980, 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly to savings for translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

The evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments

31

CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which follows the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The correspondingly extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in cases where there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
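The entry format and the alignment-point checks described above can be sketched as follows (illustrative Python, not the authors' code; the example entry and its probability values are hypothetical):

```python
# Illustrative sketch: parsing a Moses-style phrase-table entry of the form
#   source ||| target ||| p1 p2 p3 p4 ||| alignment points
# and classifying its alignment points.
def parse_subsegment(line):
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    points = [tuple(map(int, p.split("-"))) for p in align.split()]
    return src.split(), tgt.split(), [float(p) for p in probs.split()], points

def classify_alignment(src_tokens, tgt_tokens, points):
    """Detect the three phenomena mentioned in the text."""
    aligned_src = {s for s, _ in points}
    aligned_tgt = {t for _, t in points}
    empty = (len(aligned_src) < len(src_tokens)
             or len(aligned_tgt) < len(tgt_tokens))    # 1) empty alignment
    one_to_many = (len(points) > len(aligned_src)
                   or len(points) > len(aligned_tgt))  # 2) one-to-many
    tgt_order = [t for _, t in sorted(points)]
    reordered = tgt_order != sorted(tgt_order)         # 3) opposite orientation
    return empty, one_to_many, reordered

# A hypothetical entry for the example pair "být větší" -> "be greater":
entry = "být větší ||| be greater ||| 0.5 0.28 0.42 0.31 ||| 0-0 1-1"
src, tgt, probs, points = parse_subsegment(entry)
print(classify_alignment(src, tgt, points))  # (False, False, False)
```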

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOIN-O: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOIN-N: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTE-O method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)

new subsegment:  "lze rozdělit do následujících kategorií"
its translation: "can be divided into the following categories"

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTE-O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; the output is NS.
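The two JOIN variants above can be sketched as follows (a minimal illustration in Python with our own naming; the real system also carries the target-side translations along):

```python
# Minimal sketch: classify two subsegments found in a tokenized document
# segment as overlapping (JOIN-O) or adjacent neighbours (JOIN-N) and join them.
def find_span(segment, subsegment):
    """Return the (start, end) token span of subsegment inside segment, or None."""
    n = len(subsegment)
    for i in range(len(segment) - n + 1):
        if segment[i:i + n] == subsegment:
            return i, i + n
    return None

def join(segment, sub_a, sub_b):
    """Classify and join two subsegments with respect to the document segment."""
    (a0, a1), (b0, b1) = sorted([find_span(segment, sub_a), find_span(segment, sub_b)])
    if b0 < a1:                  # spans overlap -> JOIN-O
        return "JOIN-O", segment[a0:max(a1, b1)]
    if b0 == a1:                 # spans touch -> JOIN-N
        return "JOIN-N", segment[a0:b1]
    return None                  # a gap remains; a SUBSTITUTE method would apply

seg = "can be divided into the following categories".split()
print(join(seg, "can be divided into".split(), "into the following".split()))
# -> ('JOIN-O', ['can', 'be', 'divided', 'into', 'the', 'following'])
```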

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2
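The combined n-gram scoring idea can be sketched as follows (the paper uses a KenLM model; here a tiny count-based stand-in, entirely our own, makes the idea runnable):

```python
import math
from collections import Counter

# Toy sketch of a combined n-gram fluency score: average the per-order
# log-frequency of a sentence's n-grams for n = 1..5. Illustrative only;
# the actual system queries a KenLM language model instead.
class NGramScorer:
    def __init__(self, corpus_sentences, max_n=5):
        self.max_n = max_n
        self.counts = Counter()
        for sent in corpus_sentences:
            tokens = sent.split()
            for n in range(1, max_n + 1):
                for i in range(len(tokens) - n + 1):
                    self.counts[tuple(tokens[i:i + n])] += 1

    def score(self, sentence):
        """Average per-order log-frequency; higher means more fluent."""
        tokens = sentence.split()
        per_order = []
        for n in range(1, min(self.max_n, len(tokens)) + 1):
            grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            hits = sum(math.log(1 + self.counts[g]) for g in grams)
            per_order.append(hits / len(grams))
        return sum(per_order) / len(per_order)

lm = NGramScorer(["a table by the window", "a table near the door"])
assert lm.score("a table by the window") > lm.score("window the by table a")
```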

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration, it generates new (longer) subsegments and discards one processed subsegment.
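A simplified sketch of this greedy loop (our own reduction: it tracks only token spans and omits the bookkeeping of discarded subsegments and target-side translations):

```python
# Simplified sketch of the greedy JOIN loop: spans are kept ordered by length,
# the longest span is repeatedly joined with compatible partners, and freshly
# built (longer) spans are prepended so they are preferred in later iterations.
def greedy_join(spans):
    """spans: list of (start, end) token positions within one document segment."""
    work = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    while True:
        head = work[0]
        new = []
        for other in work[1:]:
            a, b = sorted([head, other])
            joined = (a[0], max(a[1], b[1]))
            # join only on overlap/adjacency, and only if it makes head longer
            if b[0] <= a[1] and joined[1] - joined[0] > head[1] - head[0]:
                new.append(joined)
        if not new:
            return work[0]  # nothing longer can be built
        work = sorted(new, key=lambda s: s[1] - s[0], reverse=True) + work

print(greedy_join([(0, 4), (3, 6), (6, 7)]))  # -> (0, 7)
```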

2In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company, and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the company's complete TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column, "all", is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).

33

Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16   10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone, and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into ac-

3http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, Sept 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also to improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although in the first place it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in

1www.matecat.com
2The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3The BTEC corpus contains tourism-related sentences, similar to those that are usually found in phrase books for tourists going abroad.

4Work in progress


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using the tercom-7.25 package5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0)

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking the sentences. However, in our present system we have used our own scoring mechanism, based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations; different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score, TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
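This weighting can be sketched as a weighted edit distance (illustrative Python; the weight values below are our own choice, the paper does not publish the exact values):

```python
# Sketch of the weighted edit cost: deletions are made much cheaper than
# insertions and substitutions, so TM segment 1 beats TM segment 2.
def weighted_edit_cost(test, tm, w_ins=1.0, w_del=0.1, w_sub=1.0):
    m, n = len(test), len(tm)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_ins          # test words with no TM material -> insert
    for j in range(1, n + 1):
        d[0][j] = j * w_del          # extra TM words -> just delete
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if test[i - 1] == tm[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_ins,
                          d[i][j - 1] + w_del,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
# TM segment 1 (6 cheap deletions) wins over TM segment 2 (3 insertions):
assert weighted_edit_cost(test, tm1) < weighted_edit_cost(test, tm2)
```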

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments, and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair, along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between the TM source sentences and target sentences; this alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green, and the unmatched fragments in red.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get a nearly 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me

5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare the results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
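The posting-list filter can be sketched as follows (illustrative Python; the stop-word list is our own, the tool's actual list is not published):

```python
from collections import defaultdict

# Sketch of the posting-list filter: an inverted index from vocabulary words
# to the TM sentences containing them restricts expensive TER scoring to the
# candidates sharing at least one content word with the input sentence.
STOP_WORDS = {"the", "a", "to", "is", "in"}   # illustrative list

def build_index(tm_sentences):
    index = defaultdict(set)
    for sid, sent in enumerate(tm_sentences):
        for word in sent.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(index, query):
    """TM sentence ids sharing at least one vocabulary word with the query."""
    ids = set()
    for word in query.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
idx = build_index(tm)
print(sorted(candidates(idx, "we would like a table by the window")))  # [0]
```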

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate the TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.

40

Figure 1 Screenshot with color cording scheme

We believe these weights can be used to opti-mize system performance

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

group of words containing a finite verb. Nonfinite and verbless clauses are therefore not considered. The reported F-measure for identifying complete clauses in English is 81.39% (Marsic, 2011).

In this study the clause splitter is used on both the segments in the input file and the translation memory database. After processing the input and the TM segments with the clause splitter, these were then imported into existing TM tools to examine how well these tools will perform if clause splitting is used in pre-processing.

4 Experiments

Experiments were performed to study the impact of clause splitting when used in pre-processing for the retrieval of segments in a TM. Our hypothesis is that when a clause splitter is used, the number of relevant retrieved matches will increase.

The effect of clause splitting is examined by comparing the number of matches retrieved when TM tools are used as is, and when both the input segments and the segments in the translation memory are first processed with a clause splitter before being imported into the TM tools. The tools used are Wordfast Professional 3 and Trados Studio 2009, which are among the most widely used TM tools (Lagoudaki, 2006).

Segments used as the input were selected from the Edinburgh paraphrase corpus (Cohn, Callison-Burch and Lapata, 2008) (in Macken, 2009). We use a paraphrase corpus because we wish to investigate the effect of using a clause splitter in pre-processing to retrieve both segments that contain the entire input clause and segments that do not contain the exact input clause but may still be relevant, as they contain a clause that shares a considerable degree of similarity with the input.

We examine the segments retrieved using both the default fuzzy match threshold (75% for Wordfast and 70% for SDL Trados) and the minimum threshold (40% for Wordfast and 30% for Trados). It is not normally recommended for translators to set a low fuzzy match threshold, as this might result in the retrieval of too many irrelevant segments if the translation memory is large. However, in this study we argue it would be beneficial to examine matches retrieved with the minimum threshold as well. Given that translation memory match scores are mainly calculated using Levenshtein distance, if only one clause in a segment in the TM matches the input, there is a greater chance of the segment being retrieved with a lower threshold. We therefore wish to examine whether the employment of clause splitting will still result in a considerable improvement over the Levenshtein distance-based matching algorithm used in most TM tools if the match threshold setting is already optimised for the retrieval of sub-segment clauses.
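The effect described here can be illustrated with word-level Levenshtein matching (a sketch; the scoring formula `1 - distance / longer length` is an assumption, since commercial tools do not disclose their exact algorithm, and the example sentences are invented):

```python
def word_edit_distance(a, b):
    """Dynamic-programming Levenshtein distance over word tokens."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

def fuzzy_match(a, b):
    """Word-based fuzzy match score: 1 - distance / longer segment length."""
    return 1 - word_edit_distance(a, b) / max(len(a.split()), len(b.split()))

# A 7-word clause contained verbatim in a 15-word TM segment: the 8
# extra words count as deletions, dragging the score down to about 0.47.
clause = "the report was sent to the committee"
segment = ("the report was sent to the committee "
           "and the members approved it without further discussion")

score = fuzzy_match(clause, segment)  # below a 70-75% default threshold,
                                      # above a 30-40% minimum threshold
```

This is exactly the situation where lowering the threshold lets a segment with one matching clause through.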

It must also be noted that for this study we are working with the source segments only. Therefore, in the TM files used, both the source and target segments are in English.

We conducted two main sets of experiments, referred to as Set A and Set B, which are outlined below. In Set A we selected sentences from the Edinburgh corpus that contained more than one clause. We use one clause or part of it as the input segment, and we store the entire sentence in the TM. In the experiments where no clause splitting is done, the sentence is stored as is. In the experiments with clause splitting, the original input segments are split into clauses (if there is more than one) and the segments in the TM are the component clauses of the original sentence.

For the experiments done without clause splitting there are 150 input segments, and for each one we test whether the longer corresponding segment in the TM can be retrieved. For the experiments where clause splitting is used, the 150 input segments are split into 180 segments, as some of these segments contain more than one clause. We then test whether the corresponding clause from the original longer sentence can be retrieved. An example is presented in Table 1.


The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting

Input: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war
Segment in TM: the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A example

Without clause splitting

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segment in TM: a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting

Input: a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway
Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B example

5 Results

WORDFAST
                                W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          23.33%                90.00%
Retrieved (minimum threshold)          38.00%                92.22%

TRADOS
                                W/o clause splitting   W/ clause splitting
Retrieved (default threshold)          14.00%                88.89%
Retrieved (minimum threshold)          14.00%                96.67%

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and take this as one case in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
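A paired t-test of this kind can be reproduced as follows (a sketch with invented scores, not the paper's data; the t statistic is computed directly from its textbook formula rather than with SPSS):

```python
from math import sqrt

def paired_t(x, y):
    """Paired t statistic: mean per-case difference divided by the
    standard error of the differences (sample variance, n - 1)."""
    d = [b - a for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    return mean / sqrt(var / n)

# Invented match scores for four inputs, without and with clause
# splitting (0 would mean the correct match was not retrieved).
without_split = [20.0, 30.0, 10.0, 40.0]
with_split = [85.0, 90.0, 80.0, 95.0]
t_stat = paired_t(without_split, with_split)  # about 19.36
```

A large t value over many paired cases corresponds to the very small p-values reported in Table 5.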

SET A

WORDFAST
                    Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           19.23                      85.30888891          0.000
Minimum threshold           29.28                      86.48888891          0.000

TRADOS
                    Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold           11.78                      83.87777781          0.000
Minimum threshold           11.78                      88.07111113          0.000

SET B

WORDFAST
                    Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            2.23                      17.21111111          0.000
Minimum threshold            6.49                      30.62111112          0.000

TRADOS
                    Mean w/o clause splitting   Mean w/ clause splitting   p-value
Default threshold            2.71                      23.083               0.000
Minimum threshold           16.83                      46.36777779          0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching

performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

WORDFAST
                                W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           2.67%                17.84%
Retrieved (minimum threshold)          10.00%                41.08%

TRADOS
                                W/o clause splitting   W/ clause splitting
Retrieved (default threshold)           3.33%                25.95%
Retrieved (minimum threshold)          36.00%                70.67%

Table 4: Percentage of correctly retrieved segments in Set B

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A linguistically enriched translation memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS [Comparison of the two translation-memory systems TRADOS and SIMILIS]. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' perceptions around TM use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In search of the recurrent units of translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been forgotten in translation memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer, Berlin, Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New generation translation memory: Content-sensitive matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A multilingual method for clause splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the art in translation memory technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou

School of Italian Language and Literature

Aristotle University of Thessaloniki

University Campus

54124 Thessaloniki Greece

chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing as well the match percentage in some cases. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations as much as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.

25

3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) below is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information

In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation given the low fuzzy match score, and will then be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1). SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: in some cases they appear with a bare direct object (e.g. to make attention) and in others with a determined direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example an EN-IT translator will re-

ceive an exact match during his performance

when translating the sentence (1) given the Eng-

lish sentence (2) and its Italian equivalent (sen-

tence (3)) that is included in the TM In addition

to this in case of translating the sentence (4) the

fuzzy match score would be around 90 (1 sub-

stitution for 10 words) comparing to 61 with no-

paraphrase (2 substitution and 3 deletions for 13

words) Other than fuzzy match according to Bar-

reiro (2008) machine-translation of SVCs is hard

so the expected output from the machine will not

be good enough In our example ldquocancelrdquo can be

either a verb or noun
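The fuzzy match arithmetic quoted above (1 substitution over 10 words giving roughly 90%) can be reproduced with a word-level edit distance. The sketch below is a minimal illustration of this scoring scheme; the exact formulas used by commercial CAT tools are proprietary and may differ.

```python
def fuzzy_match(source, candidate):
    """Word-level fuzzy match: 1 - edit_distance(words) / max(word counts)."""
    s, c = source.split(), candidate.split()
    # Standard Levenshtein dynamic-programming table over word tokens.
    d = [[0] * (len(c) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(c) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(c) + 1):
            cost = 0 if s[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return 1 - d[len(s)][len(c)] / max(len(s), len(c))

# One substitution over ten words yields a 90% match.
print(fuzzy_match("a b c d e f g h i j", "a b c d e f g h i X"))
```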

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
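The entry layout in Figure 1 can be read mechanically: a lemma, a category, then `+`-separated properties, some of which carry `KEY=VALUE` pairs. The parser below is an illustrative sketch of that layout, not NooJ's actual loading code.

```python
def parse_entry(line):
    """Split a NooJ-style dictionary entry 'lemma,CAT+KEY=VAL+Flag'
    into (lemma, category, properties). Flags without '=' map to True."""
    lemma, rest = line.split(",", 1)
    fields = rest.split("+")
    cat = fields[0]                      # part-of-speech code, e.g. N, V, A
    props = {}
    for field in fields[1:]:
        if "=" in field:
            key, value = field.split("=", 1)
            props[key] = value           # e.g. inflectional paradigm FLX=TABLE
        else:
            props[field] = True          # e.g. semantic class +Hum
    return lemma, cat, props

print(parse_entry("artist,N+FLX=TABLE+Hum"))
```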

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, optionally followed by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
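The behaviour of the local grammar can be approximated in code. The sketch below is a simplification, not NooJ's grammar engine: the lexicons `NOMINALIZATIONS`, `SUPPORT_VERBS` and `DETERMINERS` are small illustrative stand-ins for the module's dictionary resources, and only the simple present without inflection handling is covered.

```python
# Hypothetical stand-ins for the NooJ dictionary and grammar resources.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = {"make", "makes", "give", "gives", "take", "takes", "have", "has"}
DETERMINERS = {"a", "an", "the"}

def paraphrase_svc(tokens):
    """Replace 'make a cancellation'-style SVCs with the full verb,
    leaving all other tokens untouched."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i].lower() in SUPPORT_VERBS:
            j = i + 1
            if j < len(tokens) and tokens[j].lower() in DETERMINERS:
                j += 1                              # skip optional determiner
            if j < len(tokens) and tokens[j].lower() in NOMINALIZATIONS:
                out.append(NOMINALIZATIONS[tokens[j].lower()])
                i = j + 1                           # consume the whole SVC
                continue
        out.append(tokens[i])
        i += 1
    return out

print(" ".join(paraphrase_svc("I want to make a cancellation".split())))
```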

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is two-fold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no machine errors will appear and less post-editing is required when the match is below 100%.

While there are some drops in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain either out-of-vocabulary words or syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION! You must make the istallation of the version 6 of the software
TMsl:    CAUTION! You must install the version 6 of the software
TMtg:    ATTENZIONE! Si deve installare la versione 6 del software
parTMsl: CAUTION! You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences don't share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM for all fuzzy match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future, we will continue to explore the paraphrasing of other support verbs and support for other languages as well. Last but not least, applying a paraphrase framework to the target sentences may further improve the quality of the translations.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Universidade de Coimbra, Centro Interuniversitário de Estudos Germanísticos.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa, hales, xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of the fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities               alignment
být větší   be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases or, in the best case, full segments.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
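The per-subsegment record of Figure 1 resembles a Moses-style phrase table entry. The parser below is a sketch assuming the four-field `|||`-separated layout shown there (source, target, four probabilities, alignment points); real Moses phrase tables may carry additional fields.

```python
def parse_phrase_line(line):
    """Parse 'src ||| tgt ||| p1 p2 p3 p4 ||| 0-0 1-1' into parts."""
    src, tgt, probs, align = [field.strip() for field in line.split("|||")]
    probs = [float(p) for p in probs.split()]
    # Alignment points like '0-0' become (source_index, target_index) tuples.
    align = [tuple(map(int, point.split("-"))) for point in align.split()]
    return src, tgt, probs, align

entry = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
print(parse_phrase_line(entry))
```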

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; output is J.

1. JOIN-O: joint subsegments overlap in a segment from the document; output = OJ.

2. JOIN-N: joint subsegments neighbour in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTE-O method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

1. SUBSTITUTE-O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; output is NS.
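The non-overlapping variant of SUBSTITUTE can be sketched as a token-span replacement, as in the example of Table 1. This is an illustrative simplification of the method, with the gap given explicitly as token indices.

```python
def substitute(segment, gap, subsegment):
    """SUBSTITUTE-N sketch: replace the tokens in gap = (start, end)
    (end exclusive) of a known segment with a new subsegment."""
    start, end = gap
    return segment[:start] + subsegment + segment[end:]

# Replacing 'těchto' with the known subsegment 'následujících'.
known = "lze rozdělit do těchto kategorií".split()
print(" ".join(substitute(known, (3, 4), ["následujících"])))
```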

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.²

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
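The combined n-gram fluency score can be illustrated with a toy count-based model; the paper uses a KenLM language model trained on enTenTen, so the function below is only a stand-in for that scorer, averaging log-frequencies of all n-grams up to order 5.

```python
import math
from collections import Counter

def collect_ngrams(tokens, n_max=5):
    """Count all n-grams (n = 1..n_max) in a target-language token stream."""
    return Counter(tuple(tokens[i:i + n])
                   for n in range(1, n_max + 1)
                   for i in range(len(tokens) - n + 1))

def ngram_fluency(tokens, counts, n_max=5):
    """Combined n-gram score: average add-one log-frequency of the
    candidate's n-grams (a toy stand-in for a KenLM score)."""
    score, total = 0.0, 0
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            score += math.log(counts.get(tuple(tokens[i:i + n]), 0) + 1)
            total += 1
    return score / total if total else 0.0

counts = collect_ngrams("the cat sat on the mat".split())
# A fluent word order scores higher than a scrambled one.
print(ngram_fluency("the cat sat".split(), counts),
      ngram_fluency("cat the sat".split(), counts))
```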

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
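The greedy, longest-first joining described above can be sketched over token-index spans. The code below is a simplification of the list-based algorithm (it merges pairs in place rather than maintaining the separate temporary list T), covering both the overlap (JOIN-O) and neighbour (JOIN-N) conditions.

```python
def join_spans(spans):
    """Greedy JOIN sketch over inclusive token-index spans (start, end).
    Repeatedly merges a pair of spans that overlap (JOIN-O) or are
    adjacent (JOIN-N), preferring longer spans first."""
    spans = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    merged_any = True
    while merged_any:
        merged_any = False
        for i in range(len(spans)):
            for j in range(i + 1, len(spans)):
                a, b = spans[i], spans[j]
                overlaps = a[0] <= b[1] and b[0] <= a[1]
                neighbours = a[1] + 1 == b[0] or b[1] + 1 == a[0]
                if overlaps or neighbours:
                    new = (min(a[0], b[0]), max(a[1], b[1]))
                    spans = [s for k, s in enumerate(spans) if k not in (i, j)]
                    spans.insert(0, new)  # the joined, longer span is tried first
                    merged_any = True
                    break
            if merged_any:
                break
    return spans

# Spans (0,2) and (3,5) are neighbours and join; (7,9) stays separate.
print(join_spans([(0, 2), (3, 5), (7, 9)]))
```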

² In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
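The category breakdown described above can be reproduced by bucketing each segment's best-match score into the bands used in the tables. This is an illustration of the reported categories only, not of MemoQ's internal analysis.

```python
def coverage_report(scores):
    """Bucket per-segment best-match scores (in percent) into the
    match categories used in the MemoQ-style analysis tables."""
    bands = [("100%", 100), ("95-99%", 95), ("85-94%", 85),
             ("75-84%", 75), ("50-74%", 50), ("No match", 0)]
    report = {name: 0 for name, _ in bands}
    for score in scores:
        for name, lower_bound in bands:
            if score >= lower_bound:
                report[name] += 1
                break  # each segment falls into exactly one band
    return report

print(coverage_report([100, 96, 90, 80, 60, 10]))
```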


Table 2: MemoQ analysis, TM coverage in %

Match      TM      SUB     OJ      NJ      all
100%       0.41    0.12    0.10    0.17    0.46
95-99%     0.84    0.91    0.64    0.90    1.37
85-94%     0.07    0.05    0.25    0.76    0.81
75-84%     0.80    0.91    1.71    3.78    4.40
50-74%     8.16    10.05   25.09   40.95   42.58
any        10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match      SUB     OJ      NJ      all
100%       0.08    0.07    0.11    0.28
95-99%     0.75    0.44    0.49    0.66
85-94%     0.05    0.08    0.49    0.61
75-84%     0.46    0.96    3.67    3.85
50-74%     10.24   27.77   41.90   44.47
all        11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match      SUB     OJ      NJ      all
100%       0.15    0.13    0.29    0.57
95-99%     0.98    0.59    1.24    1.45
85-94%     0.09    0.22    1.34    1.37
75-84%     1.03    2.26    6.35    7.07
50-74%     12.15   34.84   49.82   51.62
all        14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into ac-

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

Company in-house translation memory
feature    SUB     OJ      NJ      NS
prec       0.60    0.63    0.70    0.66
recall     0.67    0.74    0.74    0.71
F1         0.64    0.68    0.72    0.68
METEOR     0.31    0.37    0.38    0.38

DGT
prec       0.76    0.93    0.91    0.81
recall     0.78    0.86    0.88    0.85
F1         0.77    0.89    0.89    0.83
METEOR     0.40    0.50    0.51    0.45

Table 6: Error examples, Czech → English

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape

count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We have presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3

Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1

Saarland University, Germany2

German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.
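The retrieve-and-suggest loop just described can be sketched in a few lines. The toy TM entries, the difflib-based similarity, and the 0.5 threshold below are illustrative stand-ins, not CATaLog's matcher:

```python
import difflib

# Toy in-memory TM: (source segment, target segment) pairs.
# Entries and translations are illustrative, not from the paper's data.
TM = [
    ("how much does it cost ?", "combien ça coûte ?"),
    ("we want a table near the window", "nous voulons une table près de la fenêtre"),
]

def fuzzy_score(a, b):
    """Stand-in similarity (0..1): token-level difflib ratio."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

def suggest(segment, tm, threshold=0.5, top_n=5):
    """Return up to top_n TM entries scoring at or above threshold, best first."""
    scored = [(fuzzy_score(segment, src), src, tgt) for src, tgt in tm]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)[:top_n]

suggestions = suggest("how much does it cost", TM)
# After post-editing, the confirmed pair is added back, growing the memory:
TM.append(("how much does it cost", "combien cela coûte-t-il"))
```

A real CAT tool would segment whole documents, handle file formats, and use a proper fuzzy match score; the final append mirrors how every confirmed translation enlarges the TM.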

Although it might at first sound simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productivity studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate): how much editing is required to convert one sentence into another, with respect to the length of the first sentence. The allowable edit operations are insertion, deletion, substitution and shift. We use the TER metric (using tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0),
("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0),
("window","window",C,0), (".",".",C,0)

we   want   to      have   a   table   near   the   window   .
|    D      S       S      |   |       S      |     |        |
we   -      would   like   a   table   by     the   window   .
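The C/D/S alignment shown above can be approximated with a standard dynamic-programming backtrace. The sketch below is illustrative; it ignores TER's shift operation and is not the tercom implementation:

```python
def edit_alignment(ref, hyp):
    """Word-level Levenshtein alignment of two token lists.
    Returns (ref_token, hyp_token, op) tuples with op in {C, S, D, I}.
    A sketch of TER-style alignment without TER's shift moves."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace, preferring match/substitution, to recover the operations.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append((ref[i - 1], hyp[j - 1], "C" if ref[i - 1] == hyp[j - 1] else "S"))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append((ref[i - 1], "", "D"))  # TM word not needed in the input
            i -= 1
        else:
            ops.append(("", hyp[j - 1], "I"))  # input word missing from the TM match
            j -= 1
    return ops[::-1]

tm_match = "we want to have a table near the window".split()
test_seg = "we would like a table by the window".split()
ops = edit_alignment(tm_match, test_seg)
```

On this pair the backtrace recovers the same operations as the alignment shown above: one deletion (want), three substitutions and five matches.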

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score to rank sentences; however, in our present system we use our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
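A minimal sketch of the kind of scoring described above: a weighted edit distance whose deletion cost is much lower than the insertion and substitution costs. The weight values are illustrative; the paper does not publish its exact weights:

```python
def weighted_edit_cost(inp, tm_seg, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Cost of editing tm_seg (token list) into inp (token list), with a
    deliberately cheap deletion. The weight values are illustrative."""
    m, n = len(tm_seg), len(inp)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + w_del        # drop an unneeded TM word
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + w_ins        # insert a missing input word
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = dp[i - 1][j - 1] + (0.0 if tm_seg[i - 1] == inp[j - 1] else w_sub)
            dp[i][j] = min(sub, dp[i - 1][j] + w_del, dp[i][j - 1] + w_ins)
    return dp[m][n]

test_seg = "how much does it cost".split()
tm_seg_1 = "how much does it cost to the holiday inn by taxi".split()
tm_seg_2 = "how much".split()

cost_1 = weighted_edit_cost(test_seg, tm_seg_1)  # 6 deletions at 0.1 each -> 0.6
cost_2 = weighted_edit_cost(test_seg, tm_seg_2)  # 3 insertions at 1.0 each -> 3.0
```

With all weights equal (e.g., `w_del=1.0`) the same function returns 6.0 for segment 1 and 3.0 for segment 2, preferring segment 2, which is exactly the behavior the low deletion weight avoids.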

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
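Combining the two alignments for color coding can be sketched as follows; the positions follow the example above, but the function and data layout are illustrative assumptions, not CATaLog's code:

```python
def color_target(ter_ops, src_to_tgt, tgt_len):
    """Color each target position green when it is aligned to a TM-source
    word that TER marked as a match ('C'), red otherwise.
    ter_ops: (tm_source_position, op) pairs over the TM source sentence.
    src_to_tgt: source position -> target positions (1-based), in the
    spirit of a GIZA++ alignment. Illustrative, not CATaLog's code."""
    green = set()
    for pos, op in ter_ops:
        if op == "C":
            green.update(src_to_tgt.get(pos, []))
    return ["green" if t in green else "red" for t in range(1, tgt_len + 1)]

# TM source: we(1) want(2) to(3) have(4) a(5) table(6) near(7) the(8) window(9)
ter_ops = [(1, "C"), (2, "D"), (3, "S"), (4, "S"), (5, "C"),
           (6, "C"), (7, "S"), (8, "C"), (9, "C")]
# GIZA++-style alignment of the TM source to its six-word Bengali target
src_to_tgt = {1: [1], 2: [6], 5: [4], 6: [5], 7: [3], 9: [2]}
colors = color_target(ter_ops, src_to_tgt, 6)
```

Target words aligned to deleted or substituted source words (want, near) come out red, signalling the fragments a post-editor would have to change.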

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Color coding the TM target sentences serves two purposes: firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes; secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color but still not be good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM: as we assign a higher cost to insertion than to deletion, they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows these color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency we make use of the concept of posting lists, a de facto standard in information retrieval, using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
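A minimal sketch of the posting-list filter described above, assuming an illustrative stop word list:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of", "is"}   # illustrative stop list

def build_index(tm_sources):
    """Inverted index: lowercased surface word (no stemming) -> TM sentence ids."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(input_sentence, index):
    """Ids of TM sentences sharing at least one vocabulary word with the input."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm_sources = ["how much does it cost to the holiday inn by taxi",
              "we want to have a table near the window",
              "you are wrong"]
index = build_index(tm_sources)
cands = candidates("how much does it cost", index)
```

Only sentence 0 shares a non-stop-word with the input, so the expensive TER alignment would be computed for one candidate instead of three.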

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging technique is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment for different edit operations. We believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

The underlined segments are the corresponding segments that should be retrieved.

In Set B there are also 150 input segments for the experiments where no clause splitting is used, and the corresponding segment in the TM is a longer sentence containing a paraphrase of the input segment. For the experiments with clause splitting there are 185 input segments (as in Set A, some of the original 150 have more than one clause), and in the TM the component clauses of the original longer segment are stored; we test whether the clause that is a paraphrase of the input can be retrieved. Below is an example.

Without clause splitting:

Input:
the ministry of defense once indicated that about 20000 soldiers were missing in the korean war

Segment in TM:
the ministry of defense once indicated that about 20000 soldiers were missing in the korean war and that the ministry of defense believes there may still be some survivors

With clause splitting:

Input:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war

Segments in TM:
- the ministry of defense once indicated
- that about 20000 soldiers were missing in the korean war
- and that the ministry of defense believes
- there may still be some survivors

Table 1: Set A Example

Without clause splitting:

Input:
a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segment in TM:
a member of the rap group so solid crew threw away a loaded gun during a police chase southwark crown court was told yesterday

With clause splitting:

Input:
a member of the chart-topping collective so solid crew dumped a loaded pistol in an alleyway

Segments in TM:
- a member of the rap group so solid crew threw away a loaded gun during a police chase
- southwark crown court was told yesterday

Table 2: Set B Example
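The effect behind these examples can be sketched with a toy matcher. Here difflib's token ratio stands in for a commercial fuzzy match score, and splitting on the conjunction "and" stands in for a real clause splitter; both are assumptions, not the tools used in the experiments:

```python
import difflib

def fuzzy(a, b):
    """Stand-in fuzzy match score (0-100); real TM tools use their own scorers."""
    return 100 * difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

def split_clauses(segment):
    # Naive stand-in clause splitter: split on the conjunction "and".
    return [clause.strip() for clause in segment.split(" and ")]

tm_segment = ("the ministry of defense once indicated that about 20000 "
              "soldiers were missing in the korean war and that the ministry "
              "of defense believes there may still be some survivors")
input_segment = ("the ministry of defense once indicated that about 20000 "
                 "soldiers were missing in the korean war")

whole_score = fuzzy(input_segment, tm_segment)
clause_score = max(fuzzy(input_segment, c) for c in split_clauses(tm_segment))
```

The full TM segment scores around 71, below a typical default threshold, while the stored clause is an exact 100 match and would be retrieved.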

5 Results

WORDFAST                         W/o clause splitting    W/ clause splitting
Retrieved (Default threshold)          23.33                   90.00
Retrieved (Minimum threshold)          38.00                   92.22

TRADOS                           W/o clause splitting    W/ clause splitting
Retrieved (Default threshold)          14.00                   88.89
Retrieved (Minimum threshold)          14.00                   96.67

Table 3: Percentage of correctly retrieved segments in Set A

Table 3 shows the results of the experiments in Set A. It is clear that clause splitting considerably increases the number of matches in instances where the input segment can be found in a longer segment stored in the TM. When the corresponding segments that could not be retrieved even with the minimum threshold were analysed, we found that in Set A all instances were due to errors in clause splitting, more specifically the fact that the clause splitter failed to split a sentence containing more than one clause.

Table 4 summarises the percentage of correctly retrieved segments in Set B. In this set it was observed that although the percentage of retrieved matches is generally lower than the percentages in Set A, there is still a noticeable increase in the percentage of matches retrieved.

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as a segment not being split at all when it has more than one clause, or being incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and take this as one case, in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.

SET A (mean match scores)

                     WORDFAST                               TRADOS
                     w/o clause   w/ clause      p-value    w/o clause   w/ clause      p-value
                     splitting    splitting                 splitting    splitting
Default threshold    19.23        85.30888891    0.000      11.78        83.87777781    0.000
Minimum threshold    29.28        86.48888891    0.000      11.78        88.07111113    0.000

SET B (mean match scores)

                     WORDFAST                               TRADOS
Default threshold    2.23         17.21111111    0.000      2.71         23.083         0.000
Minimum threshold    6.49         30.62111112    0.000      16.83        46.36777779    0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching

                                  WORDFAST                   TRADOS
                                  w/o clause   w/ clause     w/o clause   w/ clause
                                  splitting    splitting     splitting    splitting
Retrieved (default threshold)     2.67%        17.84%        3.33%        25.95%
Retrieved (minimum threshold)     10.00%       41.08%        36.00%       70.67%

Table 4: Percentage of correctly retrieved segments in Set B

performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practising translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof of concept of third-generation TM systems, in which NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method can only retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4), 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarbrücken: Saarland University [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24-30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzik@itl.auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by means of the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage as well in some cases. However, given the variability of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses the past related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques have an effect only on improving the raw MT output and not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible human translations for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words) and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
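A common formulation of the word-based fuzzy match score divides the Levenshtein distance by the length of the longer segment. The sketch below is our own illustration of this idea, not the exact (undisclosed) formula of any commercial tool:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over two token sequences.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n]

def fuzzy_match(s1, s2):
    # Word-based fuzzy match: 1 - edit distance / length of the longer segment.
    w1, w2 = s1.split(), s2.split()
    return 1 - levenshtein(w1, w2) / max(len(w1), len(w2))
```

Passing `list(s1)` and `list(s2)` to `levenshtein` instead gives the character-based variant; tools differ in details such as whitespace handling and which length they normalize by.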

In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then it would have to be post-edited; otherwise, the translator would have to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear with a direct object without a determiner (e.g. to make attention) or with a determiner (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics) and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipeline steps.

Figure 2: Pipeline of the paraphrase framework

The first step includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed by a determiner, optionally an adjective or adverb, a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, following the same structure, created for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.

The last step comprises the de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units have the same proprietary tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
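The three steps above can be sketched as follows. This is a toy illustration under our own assumptions: a hand-written rule table and a simple regular expression stand in for NooJ's dictionaries and local grammars, and all names are ours:

```python
import re

# Toy SVC rule table: (support verb, predicate noun) -> full verb.
SVC_RULES = {
    ("make", "cancellation"): "cancel",
    ("make", "installation"): "install",
    ("take", "decision"): "decide",
}

# Support verb + determiner + predicate noun, optionally followed by "of".
SVC_RE = re.compile(r"\b(make|take)s?\s+(?:the|a|an)\s+(\w+)(?:\s+of)?",
                    re.IGNORECASE)

def paraphrase_svc(segment):
    """Replace a recognised support-verb construction with its full verb."""
    def repl(m):
        full = SVC_RULES.get((m.group(1).lower(), m.group(2).lower()))
        return full if full else m.group(0)
    return SVC_RE.sub(repl, segment)

def expand_tm(units):
    """Append a paraphrased copy of each (source, target) unit whose source
    contained an SVC; target sides are left untouched."""
    out = list(units)
    for src, tgt in units:
        para = paraphrase_svc(src)
        if para != src:
            out.append((para, tgt))
    return out
```

The expanded unit list can then be written back in the original TM exchange format, so any CAT tool imports it unchanged.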

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in cases where the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, such that it at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84% and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No match                 62      35
Total                   200     200

Table 1: Statistics for experimental data
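The per-category gains quoted in the text can be recomputed from Table 1 as percentage-point changes in the share of the 200 test segments (a small sketch; the dictionary names are ours):

```python
# Segment counts per fuzzy match category, from Table 1.
plTM = {"100": 14, "95-99": 23, "85-94": 18, "75-84": 51, "50-74": 32, "No match": 62}
parTM = {"100": 48, "95-99": 32, "85-94": 29, "75-84": 38, "50-74": 18, "No match": 35}
TOTAL = 200

def improvement(category):
    # Change in the share of segments falling into a category, in percentage points.
    return 100 * (parTM[category] - plTM[category]) / TOTAL

# improvement("100") gives 17.0, matching the 17% gain reported for exact matches.
```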

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required in cases where the match is not equal to 100%.

While there are some drops in terms of fuzzy match improvement, our system presents few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences don't share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs and support for other languages as well. Last but not least, a paraphrasing framework applied to the target sentence may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, D. S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford: Oxford University Press.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21 pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331-339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CATsystem with provided or self-built translationmemories (TM) The translation memories areusually in-house costly and manually created re-sources of varying sizes of thousands to millionsof translation pairs

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ          EN           probabilities               alignment
být větší   be greater   0.538 0.053 0.538 0.136    0-0 1-1
být větší   be larger    0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: the authors use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
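Records of this kind can be handled mechanically. The sketch below parses one entry and checks for the three alignment phenomena just listed; the ||| field layout follows the Moses phrase-table convention, and the helper names are ours, not the paper's code.

```python
# Parse a subsegment record of the (Moses-style) form:
#   source ||| target ||| p1 p2 p3 p4 ||| 0-0 1-1
# where the four scores are the inverse/direct phrase and lexical weights.
def parse_subsegment(line):
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    return {"source": src,
            "target": tgt,
            "probs": [float(p) for p in probs.split()],
            "alignment": [tuple(map(int, pt.split("-")))
                          for pt in align.split()]}

def alignment_phenomena(points, src_len, tgt_len):
    """Detect the three phenomena mentioned in the text:
    empty alignment, one-to-many alignment, opposite orientation."""
    covered_src = {s for s, _ in points}
    covered_tgt = {t for _, t in points}
    empty = len(covered_src) < src_len or len(covered_tgt) < tgt_len
    one_to_many = (len(points) > len(covered_src)
                   or len(points) > len(covered_tgt))
    # opposite orientation: a pair of alignment links that cross
    reversed_order = any(s1 < s2 and t1 > t2
                         for s1, t1 in points for s2, t2 in points)
    return empty, one_to_many, reversed_order

entry = parse_subsegment(
    "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1")
```

When a subsegment has several recorded translations, the four probabilities parsed here are what the method uses to pick the best one.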

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1 to 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
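The joining loop can be sketched as follows, with subsegments represented as (start, end) token spans and the LM fluency filter abstracted into a can_join predicate; both the representation and the helper names are our assumptions, not the paper's code.

```python
def join_subsegments(spans, can_join):
    """Greedy JOIN over subsegment spans (start, end) in a tokenized segment.

    Sketch of the procedure described in the text: spans are processed
    longest-first; the current span is joined with every compatible other
    span (JOINO if they overlap, JOINN if they are adjacent), and newly
    created spans are prepended so that longer material keeps priority.
    `can_join` stands in for the n-gram fluency filter."""
    pending = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    seen = set(pending)          # avoid re-creating the same span forever
    results = []
    while pending:
        a = pending.pop(0)
        created = []
        for b in pending:
            overlap = a[0] < b[1] and b[0] < a[1]      # JOINO condition
            adjacent = a[1] == b[0] or b[1] == a[0]    # JOINN condition
            joined = (min(a[0], b[0]), max(a[1], b[1]))
            if (overlap or adjacent) and joined not in seen and can_join(a, b):
                created.append(joined)
                seen.add(joined)
        results.extend(created)
        pending = created + pending   # new, longer spans get priority
    return results

new_spans = join_subsegments([(0, 2), (2, 4), (3, 6)], lambda a, b: True)
```

With the trivially permissive predicate above, the three input spans yield the joined spans (2, 6), (0, 6) and (0, 4); a real fluency filter would discard the combinations whose target-side concatenation scores poorly under the language model.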

2 In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16   10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. They include linguistically motivated techniques for filtering out phrases which produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept. 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
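This retrieval step can be sketched as follows. CATaLog scores candidates with TER, but to keep the sketch self-contained we substitute difflib's token-level similarity ratio as a stand-in metric; the function name and the toy TM are ours.

```python
import difflib

def rank_tm_segments(input_segment, tm_segments, k=5):
    """Return the k TM source segments most similar to the input.

    CATaLog ranks by TER (lower is better); as a self-contained stand-in
    this sketch uses difflib's token-level similarity ratio (higher is
    better).  The retrieval logic is the same: score every TM segment,
    sort, keep the top k."""
    def similarity(a, b):
        return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()
    return sorted(tm_segments,
                  key=lambda seg: similarity(input_segment, seg),
                  reverse=True)[:k]

tm = ["you gave me the wrong change i paid eighty dollars",
      "i think you 've got the wrong number",
      "you are wrong",
      "you pay me",
      "you 're overcharging me",
      "where is the station"]
top = rank_tm_segments("you gave me wrong number", tm, k=5)
```

On this toy memory the completely unrelated segment is the one left out of the top 5, mirroring how the ranked suggestion list in Section 3.2's example is produced.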

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-0.7.255) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-0.7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0]
["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0]
["window","window",C,0] [".",".",C,0]

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
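Leaving aside TER's block-shift operation, the edit rate reduces to a word-level Levenshtein distance normalised by the reference length. A minimal sketch (a simplification of TER, not the tercom implementation):

```python
def edit_rate(hyp, ref):
    """Levenshtein edit rate between two token lists, normalised by the
    reference length.  A simplification of TER: block shifts are omitted,
    and all edit operations cost 1."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n] / n

# The example pair above: 1 deletion + 3 substitutions = 4 edits.
score = edit_rate("we would like a table by the window".split(),
                  "we want to have a table near the window".split())
```

For the example pair this gives 4 edits over a 9-token reference, i.e. about 0.44, matching the D/S/S/S operations in the alignment shown above.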

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We can directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score, TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
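This re-weighting can be sketched directly over a TER-style edit-operation sequence; the concrete weight values below are illustrative assumptions, not the ones used in CATaLog.

```python
# Re-score a TER-style edit sequence with operation-specific weights.
# Deletions are made almost free, following the argument in the text;
# the actual weight values here are illustrative assumptions.
WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_edit_score(edit_ops, test_len):
    """Lower is better; normalised by the test-segment length."""
    return sum(WEIGHTS[op] for op in edit_ops) / test_len

test_len = 5  # "how much does it cost" -> 5 tokens
# TM segment 1: 5 matches plus 6 deletions ("to the holiday inn by taxi")
seg1 = weighted_edit_score(["C"] * 5 + ["D"] * 6, test_len)
# TM segment 2: 2 matches plus 3 insertions ("does it cost")
seg2 = weighted_edit_score(["C"] * 2 + ["I"] * 3, test_len)
# With cheap deletions, segment 1 now outranks segment 2
assert seg1 < seg2
```

Under uniform weights the same two sequences would score 6/5 and 3/5, reversing the preference, which is exactly the behaviour the example above argues against.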

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
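Combining the two alignment sets can be sketched as follows: TER marks which TM source words match the input, and the word alignment projects that status onto the target words. The function, the conservative handling of unaligned words, and the 0-indexed reading of the example alignment are our assumptions, not CATaLog's code.

```python
def color_target(matched_src, src_tgt_alignment, tgt_len):
    """Project source-side match/mismatch onto the target sentence.

    matched_src: indices of TM source words that TER marked as matches (C)
    against the input sentence.
    src_tgt_alignment: (src_idx, tgt_idx) pairs from the word aligner.
    A target word is green only if every source word aligned to it matched;
    unaligned target words are conservatively marked red."""
    aligned = {}
    for s, t in src_tgt_alignment:
        aligned.setdefault(t, []).append(s)
    return ["green" if t in aligned and all(s in matched_src
                                            for s in aligned[t]) else "red"
            for t in range(tgt_len)]

# The example pair above, re-read with 0-based indices (our assumption):
# TER C operations cover "we", "a", "table", "the", "window" and "."
matched = {0, 4, 5, 7, 8, 9}
alignment = [(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1), (9, 6)]
colors = color_target(matched, alignment, 7)
```

The two red target positions are the ones aligned to "near" (substituted) and "want" (deleted), which is what the interface would highlight for post-editing.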

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me

Target Matches

1. আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপনি ভুল (Gloss: apni vul)

4. আপনি আমাকে টাকা দিন (Gloss: apni amake taka din)

5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without using posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
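The posting-list filtering described above can be sketched as follows (an illustrative re-implementation with a toy stop-word list, not the tool's actual code):

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "to", "of", "and"}  # toy stop-word list

def build_posting_lists(tm_sources):
    # Index each lowercased surface-form word (no stemming) to the set of
    # TM sentence ids that contain it.
    index = defaultdict(set)
    for sent_id, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOPWORDS:
                index[word].add(sent_id)
    return index

def candidate_sentences(index, input_sentence):
    # Only TM sentences sharing at least one vocabulary word with the
    # input sentence are passed on to the (expensive) TER scoring step.
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids
```

The index is built offline once; at lookup time only the union of a few posting lists is computed, which is why the output is unchanged while the running time drops sharply.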

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

still a noticeable increase in the percentage of matches retrieved.

Table 4: Percentage of correctly retrieved segments in Set B

Upon examination of the segments that could not be retrieved even with the default threshold, we found that in both Wordfast and Trados around 24% had clause splitting errors, such as when a segment is not split at all when it has more than one clause, or when the segment is incorrectly split. As for the rest of the unretrieved segments, we presume that they are so heavily paraphrased that even when clause splitting is performed correctly, the TM tools are still unable to retrieve them.

For each experiment we conduct a paired t-test using the match scores produced by the TM tools when retrieving each segment (Table 5). When there are no matches or the correct match is not retrieved, the match score is 0. In instances where the original input segment has more than one clause and is thus split by the clause splitter, we take the average match score of the clauses and take this as one case, in order to make the results comparable. In all experiments the difference is significant at the 0.0001 level when computed with SPSS. We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the results.
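The paired t-statistic used here is standard; a pure-Python sketch (the experiments used SPSS, so this is only illustrative) computes it from the per-segment score differences:

```python
import math

def paired_t(xs, ys):
    # t statistic for a paired t-test over per-segment match scores;
    # xs and ys are scores for the same segments under two conditions
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

The p-value then comes from the t distribution with n - 1 degrees of freedom, which a statistics package such as SPSS (or scipy) supplies.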

SET A

WORDFAST             Without clause splitting (mean)   With clause splitting (mean)   p-value
Default threshold    19.23                             85.30888891                    0.000
Minimum threshold    29.28                             86.48888891                    0.000

TRADOS               Without clause splitting (mean)   With clause splitting (mean)   p-value
Default threshold    11.78                             83.87777781                    0.000
Minimum threshold    11.78                             88.07111113                    0.000

SET B

WORDFAST             Without clause splitting (mean)   With clause splitting (mean)   p-value
Default threshold    2.23                              17.21111111                    0.000
Minimum threshold    6.49                              30.62111112                    0.000

TRADOS               Without clause splitting (mean)   With clause splitting (mean)   p-value
Default threshold    2.71                              23.083                         0.000
Minimum threshold    16.83                             46.36777779                    0.000

Table 5: Paired t-test on all experiments

6 Conclusion

Our results show that introducing clause splitting as a pre-processing step in TM match retrieval can significantly increase matching

WORDFAST                         Without clause splitting   With clause splitting
Retrieved (Default threshold)    2.67                       17.84
Retrieved (Minimum threshold)    10.00                      41.08

TRADOS                           Without clause splitting   With clause splitting
Retrieved (Default threshold)    3.33                       25.95
Retrieved (Minimum threshold)    36.00                      70.67


performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method would only be able to retrieve the original target segment, given that we perform clause splitting only on the source side. It would also be desirable to implement a working TM tool that incorporates clause splitting and examine to what extent this helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Cohn, T., Callison-Burch, C. and Lapata, M. (2008). Constructing Corpora for the Development and Evaluation of Paraphrase Systems. Computational Linguistics, 34(4): 597-614.

Hodász, G. and Pohl, G. (2005). MetaMorpho TM: A Linguistically Enriched Translation Memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), Borovets, Bulgaria.

Kriele, C. (2006). Vergleich der beiden Translation-Memory-Systeme TRADOS und SIMILIS. Diploma thesis, Saarland University, Saarbrücken [unpublished].

Lagoudaki, E. (2006). Translation Memories Survey 2006: Users' Perceptions around TM Use. In Proceedings of the ASLIB International Conference Translating & the Computer (Vol. 28, No. 1, pp. 1-29).

Macken, L. (2009). In Search of the Recurrent Units of Translation. In Evaluation of Translation Technology, ed. by Walter Daelemans and Véronique Hoste, 195-212. Brussels: Academic and Scientific Publishers.

Macklovitch, E. and Russell, G. (2000). What's been Forgotten in Translation Memory. In Envisioning Machine Translation in the Information Future (pp. 137-146). Springer Berlin Heidelberg.

Marsic, G. (2011). Temporal Processing of News: Annotation of Temporal Expressions, Verbal Events and Temporal Relations. PhD thesis, University of Wolverhampton.

Pekar, V. and Mitkov, R. (2007). New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters, 29-30 September 2006, Bern.

Planas, E. (2005). SIMILIS: Second-generation TM software. In Proceedings of the 27th International Conference on Translating and the Computer (TC27), London, UK.

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These kinds of programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVCs) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been human translated in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point of the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases penalizing the match percentage as well. However, given the perplexity of a natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning


with Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework; Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition to this, these techniques only have an effect on improving the raw MT output, not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no such close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based), trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis. Their result is a generalized form after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM, so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use it and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
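The word-based fuzzy match score mentioned above, in the spirit of Koehn's (2010) formula, can be sketched as follows (an illustration; commercial tools use their own undisclosed variants):

```python
def edit_distance(a, b):
    # classic Levenshtein distance over token (or character) sequences
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def fuzzy_match(source, candidate):
    # word-based score: 1 - edit_distance / length of the longer segment
    s, c = source.lower().split(), candidate.lower().split()
    return 1.0 - edit_distance(s, c) / max(len(s), len(c))
```

On sentences (2) and (4), which differ by a single word, this yields a high score, whereas the surface distance between (1) and (2) stays low even though they mean the same thing, which is exactly the gap paraphrasing is meant to close.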

In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation, given the low fuzzy match score, and then it should be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear with a direct object (e.g. to make attention) or with a prepositional object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition to this, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step: if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs.

In more detail, the local grammar checks for a support verb, followed (optionally) by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars created for all the other grammatical tenses and moods that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
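The effect of the local grammar can be pictured with a toy stand-in: a pattern that matches a support verb, an optional determiner or modifier, and a predicate noun, and replaces the whole construction with the noun's derived verb, keeping the agreement of the support verb. The support-verb list and noun-to-verb table below are illustrative assumptions, not the NooJ resources.

```python
import re

# Illustrative stand-ins for the NooJ resources: a short support-verb list
# and a noun-to-derived-verb table (the real system derives these from its
# dictionaries and local grammars).
SUPPORT_VERBS = ["make", "makes", "take", "takes", "give", "gives"]
NOUN_TO_VERB = {"decision": "decide", "installation": "install"}

SVC_PATTERN = re.compile(
    r"\b(%s)\s+(?:a|an|the)?\s*(?:\w+ly\s+)?(%s)\b"
    % ("|".join(SUPPORT_VERBS), "|".join(NOUN_TO_VERB)),
    re.IGNORECASE,
)

def paraphrase_svc(sentence):
    """Replace a simple-present SVC with its full-verb paraphrase,
    keeping the person/number agreement of the support verb."""
    def repl(match):
        verb = NOUN_TO_VERB[match.group(2).lower()]
        return verb + "s" if match.group(1).lower().endswith("s") else verb
    # A trailing preposition ("of", "on", ...) is left for a later
    # cleanup step; the NooJ grammar consumes it in the same pass.
    return SVC_PATTERN.sub(repl, sentence)
```

Under this toy rule, "They make a decision" becomes "They decide" and "She makes a decision" becomes "She decides".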

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.
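Fuzzy match scores in commercial CAT tools are computed by proprietary formulas, but the underlying idea can be sketched as a word-level Levenshtein similarity, bucketed into the match categories used in the analysis below. This is an approximation for illustration only, not the Trados algorithm.

```python
def fuzzy_match(candidate, tm_source):
    """Approximate fuzzy match score (%) between two segments, based on
    word-level Levenshtein distance normalised by the longer segment."""
    a, b = candidate.lower().split(), tm_source.lower().split()
    # Standard dynamic-programming edit distance over words.
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (wa != wb)))
        prev = curr
    return 100 * (1 - prev[-1] / max(len(a), len(b), 1))

def match_category(score):
    """Bucket a score into the bands reported in Table 1."""
    for low, label in [(100, "100"), (95, "95-99"), (85, "85-94"),
                       (75, "75-84"), (50, "50-74")]:
        if score >= low:
            return label
    return "No Match"
```

For the brake-hose example of Figure 4, the segment pair differs by two word edits out of nine words, which under this approximation lands in the 75-84% band.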

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014 [1].

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), but we achieve considerable gains in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No Match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

[1] http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required when the match is not equal to 100%.

While there are some drops in terms of fuzzy match improvement, our system presents few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible in cases where the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs and support for other source languages as well. Last but not least, a paraphrase framework applied to the target sentences may improve the quality of the translations even further.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand on enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases.

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final [1]. The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být of the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
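The three phenomena can be detected directly from the alignment-point string. The helper below is our own illustration, not a Moses utility:

```python
from collections import Counter

def analyze_alignment(align_str, src_len, tgt_len):
    """Parse Moses-style alignment points such as '0-0 1-1' and report
    the three phenomena above: unaligned (empty) words on either side,
    one-to-many alignments, and reordering (two links in opposite
    orientation)."""
    links = [tuple(map(int, p.split("-"))) for p in align_str.split()]
    src_used = {s for s, _ in links}
    tgt_used = {t for _, t in links}
    counts = Counter(s for s, _ in links)
    return {
        "empty_src": [s for s in range(src_len) if s not in src_used],
        "empty_tgt": [t for t in range(tgt_len) if t not in tgt_used],
        "one_to_many": sorted(s for s, c in counts.items() if c > 1),
        "reordered": any(s1 < s2 and t1 > t2
                         for s1, t1 in links for s2, t2 in links),
    }
```

For the monotone pair of Figure 1, analyze_alignment("0-0 1-1", 2, 2) reports no empty words, no one-to-many links and no reordering.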

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOIN-O: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOIN-N: the joined subsegments neighbour each other in a segment from the document; output = NJ.

[1] http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTE-O method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments are created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTE-O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTE-N: the second subsegment is inserted into the gap in the first segment; the output is NS.
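The overlap variant can be illustrated in a few lines. The sketch below is our simplification: it splices a subsegment into a TM segment over a shared run of trailing tokens, and it assumes the replaced material and the inserted material have the same token length, which holds for the Table 1 example.

```python
def substitute_o(tm_segment, subsegment):
    """SUBSTITUTE-O sketch: splice `subsegment` into `tm_segment` where
    the two share a run of trailing tokens (the overlap), replacing the
    displaced tokens of the TM segment (cf. the example in Table 1).

    Simplifying assumption: the replaced part of the TM segment has the
    same number of tokens as the new material in `subsegment`.
    """
    seg, sub = tm_segment.split(), subsegment.split()
    # Find the longest shared run of trailing tokens.
    for k in range(min(len(seg), len(sub)), 0, -1):
        if seg[-k:] == sub[-k:]:
            return " ".join(seg[:-len(sub)] + sub)
    return None  # no overlap: SUBSTITUTE-O does not apply
```

Applied to the (unaccented) Table 1 data, the overlap is the single token "kategorii", and the result is "lze rozdelit do nasledujicich kategorii".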

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model [2].
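The combined n-gram score can be sketched as an average log-probability over all n-grams up to order 5. The paper scores candidates with a KenLM model trained on enTenTen; the tiny probability table below is only a stand-in for it, and the out-of-vocabulary penalty is an illustrative assumption.

```python
# Toy log10 probabilities standing in for a real (KenLM) language model.
LOGPROB = {
    ("the",): -1.0, ("cat",): -3.0, ("sat",): -3.5,
    ("the", "cat"): -1.5, ("cat", "sat"): -2.0,
}
OOV_LOGPROB = -6.0  # assumed penalty for n-grams unseen by the model

def fluency_score(tokens, max_n=5):
    """Combined n-gram score for n = 1..max_n: the average log-probability
    over all n-grams of the token sequence. Values closer to zero mean
    the combination reads more fluently under the model."""
    total, count = 0.0, 0
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            total += LOGPROB.get(tuple(tokens[i:i + n]), OOV_LOGPROB)
            count += 1
    return total / count if count else OOV_LOGPROB
```

A candidate combination in a fluent order then outscores a scrambled one, which is the signal used here to accept and order the non-overlapping combinations.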

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.

[2] In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
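The JOIN procedure described above can be condensed into the following sketch. Subsegments are represented by (start, end) token positions in the input segment; the `accept` callback stands in for the language-model fluency filter and here defaults to accepting everything.

```python
def join_subsegments(spans, accept=lambda a, b: True):
    """Greedy JOIN over subsegment spans (start, end): repeatedly take
    the longest available span and try to join it with spans that
    neighbour or overlap it, preferring newly built longer spans."""
    def size(s):
        return s[1] - s[0]

    work = sorted(spans, key=size, reverse=True)
    seen = set(work)
    results = []
    while work:
        best = work.pop(0)
        new = []
        for other in work:
            a, b = sorted([best, other])
            cand = (a[0], max(a[1], b[1]))
            # Join spans that neighbour (b starts where a ends) or
            # overlap; `seen` keeps each joined span from recurring.
            if b[0] <= a[1] and cand not in seen and accept(a, b):
                seen.add(cand)
                new.append(cand)
        # New, longer subsegments are tried first in the next round.
        work = sorted(new, key=size, reverse=True) + work
        results.extend(new)
    return results
```

With spans covering tokens 0-2, 2-4 and 4-6, the sketch first builds (0, 4) and then the full (0, 6), mirroring how neighbouring subsegments are assembled into a whole segment.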

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (the lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95-99%   0.84   0.91   0.64   0.90   1.37
85-94%   0.07   0.05   0.25   0.76   0.81
75-84%   0.80   0.91   1.71   3.78   4.40
50-74%   8.16  10.05  25.09  40.95  42.58
any     10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95-99%   0.75   0.44   0.49   0.66
85-94%   0.05   0.08   0.49   0.61
75-84%   0.46   0.96   3.67   3.85
50-74%  10.24  27.77  41.90  44.47
all     11.58  29.32  46.66  49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95-99%   0.98   0.59   1.24   1.45
85-94%   0.09   0.22   1.34   1.37
75-84%   1.03   2.26   6.35   7.07
50-74%  12.15  34.84  49.82  51.62
all     14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system [3]. Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken over from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we achieve a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

[3] http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering for discarding phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek (1), Sudip Kumar Naskar (1), Santanu Pal (2), Marcos Zampieri (2,3), Mihaela Vela (2), Josef van Genabith (2,3)

(1) Jadavpur University, India
(2) Saarland University, Germany
(3) German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as free, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productivity studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence, using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
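The retrieval and ranking step can be sketched as follows. This is a simplified illustration only: it ranks TM entries by a word-level edit rate (insertions, deletions and substitutions, without TER's shift operation, which the actual tool obtains from the tercom package), and the function names are our own, not CATaLog's.

```python
def edit_distance(a, b):
    # classic word-level Levenshtein distance over token lists
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # delete
                          d[i][j - 1] + 1,      # insert
                          d[i - 1][j - 1] + sub)  # match / substitute
    return d[m][n]

def rank_tm(test_segment, tm, k=5):
    # rank (source, target) TM pairs by edit rate against the test segment;
    # TER is an error metric, so lower rate means higher similarity
    test = test_segment.lower().split()
    scored = []
    for source, target in tm:
        src = source.lower().split()
        rate = edit_distance(test, src) / max(len(src), 1)
        scored.append((rate, source, target))
    scored.sort(key=lambda t: t[0])
    return scored[:k]
```

With the paper's example, `rank_tm("we would like a table by the window", tm)` places "we want to have a table near the window" first, since only one deletion and three substitutions separate the two segments.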

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to see at a glance how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1)5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), ("","",C,0)

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |      |
we  -     would  like  a  table  by    the  window

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. In post-editing, however, deletion takes much less time than the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
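This trade-off can be sketched with a weighted word-level edit distance. The sketch below is illustrative only: the 0.1 deletion weight is our assumption (the paper does not publish CATaLog's actual weights), and `weighted_edit_cost` is a stand-in for the tool's own scoring mechanism.

```python
INS, DEL, SUB = 1.0, 0.1, 1.0  # illustrative weights: deleting TM words is cheap

def weighted_edit_cost(test_segment, tm_match):
    # cost of turning the TM match into the test segment, over word tokens
    t, s = test_segment.split(), tm_match.split()
    m, n = len(t), len(s)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i * INS          # unmatched test words must be inserted
    for j in range(n + 1):
        d[0][j] = j * DEL          # surplus TM words are simply deleted
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if t[i - 1] == s[j - 1] else SUB
            d[i][j] = min(d[i - 1][j] + INS,
                          d[i][j - 1] + DEL,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

test = "how much does it cost"
seg1 = "how much does it cost to the holiday inn by taxi"
seg2 = "how much"
```

Under these weights TM segment 1 costs 6 × 0.1 = 0.6 (six cheap deletions), while TM segment 2 costs 3 × 1.0 = 3.0 (three expensive insertions), so segment 1 is preferred, matching the paper's argument; with uniform weights the ranking would flip.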

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences; this alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green, and the unmatched fragments in red.
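The projection of matches onto the target side can be sketched as follows. We walk over the TER edit operations for the TM source tokens and propagate "green" through the word alignment. This is our simplified reading of the procedure; the function name is ours, and the actual tool may resolve divergent alignments differently.

```python
def color_code(ter_ops, alignment, n_target):
    # ter_ops: one operation per position of the TER alignment over the TM
    #   source ('C' match, 'S' substitution and 'D' deletion each consume a
    #   source token; 'I' insertion consumes none)
    # alignment: TM-source token index -> list of aligned TM-target indices
    #   (0-based, e.g. converted from a GIZA++ A3 file)
    src_colors = []
    tgt_colors = ["red"] * n_target
    src_i = 0
    for op in ter_ops:
        if op == "I":
            continue                      # no TM-source token at insertions
        color = "green" if op == "C" else "red"
        src_colors.append(color)
        if color == "green":
            for t in alignment.get(src_i, []):
                tgt_colors[t] = "green"   # project the match to the target
        src_i += 1
    return src_colors, tgt_colors
```

Using the paper's example — TER operations C D S S C C S C C C for "we want to have a table near the window ." and the GIZA++ links above converted to 0-based indices — the matched tokens "we", "a", "table", "the", "window" and the final punctuation come out green on both sides, while the tokens aligned to "want" and "near" stay red.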

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Color coding the TM target sentences serves two purposes: firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the color code ratio; secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates. Figure 1 presents a screenshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me


5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible, and in that case determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve search efficiency we make use of posting lists, a de facto standard in information retrieval using inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentencesfor similarity measurement which contain one

or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
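A minimal sketch of the posting-list filter follows. The stop-word list here is illustrative only; as described above, the tool removes stop words and very frequent tokens, lowercases everything, and deliberately skips stemming.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "to", "of", "is", "and"}  # illustrative list

def build_posting_lists(tm_sources):
    # inverted index: vocabulary word -> set of TM sentence ids containing it;
    # built offline, once, over the whole TM database
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidate_sentences(test_segment, index):
    # only TM sentences sharing at least one vocabulary word with the input
    # are passed on to the (expensive) TER similarity computation
    ids = set()
    for word in test_segment.lower().split():
        ids |= index.get(word, set())
    return ids
```

For example, with the three-sentence TM used earlier, the query "we would like a table by the window" retrieves only the sentence sharing "we", "table" and "window", so TER is computed for one candidate instead of three.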

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to opti-mize system performance

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

performance in instances where the TM contains segments of which one of the clauses corresponds to the input segment or is a paraphrase of the input segment.

It is worth mentioning that the data used in these experiments are not data imported from the translation memories of practicing translators, as these are not easily available. We nevertheless believe that the results of this study provide significant support to the proof-of-concept of third-generation TM systems, where NLP processing is expected to improve the performance of operational TM systems.

In future work we wish to incorporate alignment, so that on the target side what is retrieved is not the original target segment but the corresponding clause; in its current state our method is only able to retrieve the original target segment, given that we perform clause splitting on the source side alone. It would also be desirable to implement a working TM tool that incorporates clause splitting and to examine to what extent it helps a translator working on an actual translation project, as the final test of the usefulness of the methods employed is how much they actually increase the productivity of translators in terms of time saved.

Acknowledgements

We would like to acknowledge the members of the Research Group in Computational Linguistics at the University of Wolverhampton for their support, especially Dr Georgiana Marsic, Dr Constantin Orasan and Dr Sanja Štajner. Part of the research reported in this paper is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015.

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitl@auth.gr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. This kind of program stores previously translated source texts and their equivalent target texts in a database and retrieves related segments during the translation of new texts. However, most of them are based on string or word edit distance, not allowing the retrieval of matches that are merely similar. In this paper we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years and is forecast to continue to grow for the foreseeable future. To support this increase, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). As a concept, this technique might be more useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the necessary corrections made by a professional translator in order to make the retrieved suggestion meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases also penalizing the match percentage. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and therefore the translator is confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning


with Y1. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the sentences share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort of tackling them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. Moreover, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations for a given sentence as much as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy match scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains “as is”. To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions needed to make the two sentences identical (Hirschberg, 1997).

For instance, based on Koehn's (2010) formula for fuzzy matching, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespace. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
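As an illustration of the word-based scoring discussed above, a minimal sketch follows. It assumes a common reading of Koehn's (2010) formula (one minus the token-level Levenshtein distance divided by the length of the longer segment); commercial tools may differ, and plain whitespace tokenization gives slightly different token counts than those quoted in the text.

```python
def edit_distance(a, b):
    # token-level Levenshtein distance via dynamic programming
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def fuzzy_match(source, tm_segment):
    # 1 - normalized word-based edit distance, as in Koehn (2010)
    s, t = source.split(), tm_segment.split()
    return 1 - edit_distance(s, t) / max(len(s), len(t))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
print(fuzzy_match(s1, s2))  # 0.6 with this simple tokenization
```

With whitespace tokenization, sentence (1) has 10 tokens and the distance to sentence (2) is 4 (1 substitution, 3 deletions), hence the 60% score here.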

In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then post-edited; otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: sentence (1) contains an SVC, while sentence (2) contains its nominalization. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed; apart from some exceptional cases, they appear either with a prepositional complement (e.g. to pay attention to) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Besides the fuzzy match, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, “cancel” can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules used to paraphrase the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT=“translation equivalent”). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
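As a rough illustration of how such entries decompose (a lemma, then plus-separated category and properties), a small parser can be sketched as below. This reflects our own reading of the entry format shown in Figure 1, not code from the NooJ toolkit.

```python
def parse_nooj_entry(entry):
    """Split an entry like 'artist,N+FLX=TABLE+Hum' into its parts."""
    lemma, rest = entry.split(",", 1)
    fields = rest.split("+")
    category = fields[0]                     # e.g. N for noun
    properties = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        properties[key] = value if value else True  # flag-style props -> True
    return lemma, category, properties

lemma, category, props = parse_nooj_entry("artist,N+FLX=TABLE+Hum")
# lemma: 'artist', category: 'N', props: {'FLX': 'TABLE', 'Hum': True}
```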

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars included in the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and there will therefore be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective or adverb, a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its derived verb, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, to the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.
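The overall effect of the pipeline on a TM can be sketched schematically as follows. Here `paraphrase` stands in for the whole tokenize, NooJ grammar, de-tokenize chain, and `toy_paraphrase` is a hypothetical stand-in for the SVC grammar, so this is a schematic view rather than the actual implementation.

```python
def expand_tm(units, paraphrase):
    """units: list of (source, target) pairs. Appends a paraphrased copy
    of each unit whose source contains an SVC; targets stay untouched."""
    expanded = list(units)
    for source, target in units:
        rewritten = paraphrase(source)   # tokenize -> NooJ -> de-tokenize
        if rewritten != source:
            expanded.append((rewritten, target))  # target side stays "as is"
    return expanded

# hypothetical stand-in for the NooJ SVC grammar
def toy_paraphrase(sentence):
    return sentence.replace("make the cancellation of", "cancel")

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali")]
expanded = expand_tm(tm, toy_paraphrase)
# expanded now holds the original unit plus one paraphrased source
# sharing the same Italian target
```

Because only source sides are rewritten and each paraphrase keeps its original human-translated target, any CAT tool that loads the expanded TM behaves exactly as before, just with more retrievable units.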

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). The test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, at least meeting a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we achieve very high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0-74% (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain “as is” encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators chose to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.

Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION You must make the istallation of the version 6 of the software
TMsl:    CAUTION You must install the version 6 of the software
TMtg:    ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that our approach performs markedly better than the plain TM for all fuzzy match ranges, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust are considerably higher compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs, and support for other languages as well. Last but not least, applying a paraphrase framework to the target sentences may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1. Coimbra, Portugal: Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, volume 9, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities                    alignment
být větší   be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors have attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
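Such entries follow the familiar Moses phrase-table layout (source ||| target ||| four scores ||| alignment points), and reading them back can be sketched as below. The field layout is an assumption based on Figure 1, not a specification of the authors' exact file format.

```python
def parse_subsegment(line):
    """Parse 'src ||| tgt ||| p1 p2 p3 p4 ||| i-j ...' into its fields."""
    source, target, scores, alignment = [f.strip() for f in line.split("|||")]
    probs = [float(p) for p in scores.split()]
    points = [tuple(int(i) for i in pair.split("-"))
              for pair in alignment.split()]
    return source.split(), target.split(), probs, points

src, tgt, probs, points = parse_subsegment(
    "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1")
# points == [(0, 0), (1, 1)]: 'být' -> 'be', 'větší' -> 'greater';
# probs[2] is the direct phrase translation probability
```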

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination, with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; output is J.
  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        “lze rozdělit do těchto kategorií” (can be divided into these categories)
subsegments:     “následujících kategorií” (the following categories)
new subsegment:  “lze rozdělit do následujících kategorií”
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
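A much-simplified token-level sketch of the SUBSTITUTEO idea, using the Table 1 example, is given below: the longest run of tokens shared between the segment and the subsegment serves as the anchor, and the subsegment's remaining tokens replace the material just before it. The real method works on scored subsegments from SUB; this toy version only illustrates the splicing step.

```python
def substitute_o(segment, subsegment):
    """Token lists in, token list out (or None if there is no overlap)."""
    for n in range(min(len(segment), len(subsegment)), 0, -1):
        for i in range(len(segment) - n + 1):
            if segment[i:i + n] == subsegment[-n:]:  # anchor: shared run
                extra = len(subsegment) - n          # tokens filling the gap
                return segment[:max(i - extra, 0)] + subsegment + segment[i + n:]
    return None

seg = "lze rozdělit do těchto kategorií".split()
sub = "následujících kategorií".split()
print(" ".join(substitute_o(seg, sub)))
# lze rozdělit do následujících kategorií
```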

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.

² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
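A simplified sketch of this greedy joining loop, operating on (start, end) token index pairs rather than on the actual subsegment strings, is shown below. The overlap-or-neighbour test and the merge bookkeeping are our own condensed reading of the description, not the authors' implementation.

```python
def join_spans(spans):
    """spans: (start, end) token positions of subsegments in a segment.
    Greedily joins overlapping or neighbouring spans, longest first,
    and returns the maximal joined spans."""
    I = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    result = []
    while I:
        current, rest = I[0], I[1:]
        joined, leftover = [], []
        for span in rest:
            # overlap or direct neighbourhood in the token sequence
            if span[0] <= current[1] + 1 and current[0] <= span[1] + 1:
                joined.append((min(current[0], span[0]),
                               max(current[1], span[1])))
            else:
                leftover.append(span)
        if joined:  # prepend new (longer) spans, discard the processed one
            I = sorted(joined, key=lambda s: s[1] - s[0], reverse=True) + leftover
        else:       # nothing to join: the span is final
            result.append(current)
            I = leftover
    return result

print(join_spans([(0, 2), (3, 5), (8, 9)]))  # [(0, 5), (8, 9)]
```

Each pass either merges the current span with at least one other (shrinking the list) or retires it, so the loop always terminates.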

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system. The results are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods (see Section 4). The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For a comparison we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature    SUB     OJ      NJ      NS
prec       0.60    0.63    0.70    0.66
recall     0.67    0.74    0.74    0.71
F1         0.64    0.68    0.72    0.68
METEOR     0.31    0.37    0.38    0.38

DGT
prec       0.76    0.93    0.91    0.81
recall     0.78    0.86    0.88    0.85
F1         0.77    0.89    0.89    0.83
METEOR     0.40    0.50    0.51    0.45

Table 6: Error examples, Czech -> English

source seg       Oblast dat může mít libovolný tvar
reference        The data area may have an arbitrary shape
generated seg    Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows

it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as a freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
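The ranking step can be approximated with plain word-level edit distance. The sketch below is an illustration under the assumption that block shifts are ignored (real TER also allows shifts); the helper names are our own, not the tool's.

```python
def word_edit_rate(hyp, ref):
    """Word-level Levenshtein edits divided by reference length:
    a TER-like score, minus the shift operation, for illustration."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[len(h)][len(r)] / max(len(r), 1)

def top_matches(test_segment, tm_segments, k=5):
    """Return up to k TM source segments, most similar first."""
    return sorted(tm_segments, key=lambda s: word_edit_rate(test_segment, s))[:k]

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
best = top_matches("we would like a table by the window", tm, k=2)
```

Lower edit rate means higher similarity, so sorting ascending yields the most similar TM segments first.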

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match with the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we   want   to     have   a   table   near   the   window   .
|    D      S      S      |   |       S      |     |        |
we   -      would  like   a   table   by     the   window   .

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deletion of the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
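The effect of cheap deletions can be sketched with a tiny weighted scorer over TER operation strings. The weight values below are hypothetical, since the paper does not state the exact costs used.

```python
# Hypothetical weights: deletion is much cheaper to post-edit than
# insertion, substitution, or shift (the paper's exact values are not given).
EDIT_WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0, "H": 1.0}

def weighted_edit_cost(ops):
    """Score a TER alignment given as a string of edit operations:
    C = match, D = delete, I = insert, S = substitute, H = shift."""
    return sum(EDIT_WEIGHTS[op] for op in ops)

# test segment: "how much does it cost"
# TM segment 1: 5 matches plus 6 deletions ("to the holiday inn by taxi")
cost_tm1 = weighted_edit_cost("CCCCC" + "D" * 6)
# TM segment 2: 2 matches plus 3 insertions ("does it cost")
cost_tm2 = weighted_edit_cost("CC" + "I" * 3)
assert cost_tm1 < cost_tm2  # TM segment 1 is preferred
```

With equal weights TM segment 2 would win (3 edits vs. 6); lowering the deletion weight reverses the preference, matching the argument above.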

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. A green portion implies a matched fragment and a red portion implies a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
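Combining the two alignments into a color assignment might look like the following sketch. The function name is our own, and the 0-indexed alignment dictionary is derived from the English-Bengali example alignment above purely for illustration.

```python
def color_target_tokens(ter_ops, src_tgt_alignment, n_target_tokens):
    """Sketch: mark target tokens 'green' when their aligned source token
    matched the input (TER op 'C'), else 'red'. `ter_ops[i]` is the TER
    operation for source token i; `src_tgt_alignment` maps source token
    indexes to lists of target token indexes (GIZA++-style links)."""
    colors = ["red"] * n_target_tokens
    for src_idx, op in enumerate(ter_ops):
        if op == "C":
            for tgt_idx in src_tgt_alignment.get(src_idx, []):
                colors[tgt_idx] = "green"
    return colors

# "we want to have a table near the window ." vs the input sentence
# "we would like a table by the window ." gives ops C D S S C C S C C C
ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C", "C"]
# source->target links from the example above, converted to 0-indexing;
# a 7-token Bengali target sentence is assumed
alignment = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1], 9: [6]}
print(color_target_tokens(ops, alignment, 7))
```

Target tokens aligned to matched source tokens come out green; tokens aligned to deleted or substituted source tokens, or left unaligned, stay red.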

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match with the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me


5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). For improving the search efficiency we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form so that only if they appear in the same form as in some input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
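The posting-list filter described above can be sketched as an inverted index. This is a minimal illustration: the stop-word list and segment ids are hypothetical, not the tool's actual configuration.

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "to", "of", "is", "it"}  # illustrative list

def build_posting_lists(tm_sources):
    """Inverted index: lowercased surface form -> set of TM sentence ids.
    No stemming, mirroring the tool's exact-surface-form matching."""
    index = defaultdict(set)
    for sent_id, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sent_id)
    return index

def candidate_segments(input_sentence, index):
    """Only TM sentences sharing at least one vocabulary word are scored."""
    candidates = set()
    for word in input_sentence.lower().split():
        candidates |= index.get(word, set())
    return candidates

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
index = build_posting_lists(tm)
print(candidate_segments("we would like a table by the window", index))
```

Only the candidates returned here would be passed to the (expensive) TER alignment step, which is where the speed-up comes from.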

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as an open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

ReferencesErgun Biccedilici and Marc Dymetman 2008 Dy-

namic translation memory using statistical ma-chine translation to improve translation memoryfuzzy matches In Computational Linguisticsand Intelligent Text Processing pages 454ndash465Springer

Lynne Bowker 2005 Productivity vs Quality Apilot study on the impact of translation memorysystems Localisation Reader pages 133ndash140

Mauro Cettolo Christophe Servan NicolaBertoldi Marcello Federico Loic Barraultand Holger Schwenk 2013 Issues in in-cremental adaptation of statistical MT fromhuman post-edits In Proceedings of the Work-shop on Post-editing Technology and Practice(WPTP-2) Nice France

JP Clark 2002 System method and prod-uct for dynamically aligning translations in atranslation-memory system February 5 USPatent 6345244

Marcello Federico Alessandro Cattelan andMarco Trombetti 2012 Measuring UserProductivity in Machine Translation EnhancedComputer Assisted Translation In Proceedingsof Conference of the Association for MachineTranslation in the Americas (AMTA)

Fabrizio Gotti Philippe Langlais ElliottMacklovitch Benoit Robichaud Didier Bouri-gault and Claude Coulombe 2005 3GTM AThird-Generation Translation Memory In 3rdComputational Linguistics in the North-East(CLiNE) Workshop Gatineau Queacutebec aug

Spence Green Jeffrey Heer and Christopher DManning 2013 The efficacy of human post-editing for language translation In Proceedingsof the SIGCHI Conference on Human Factorsin Computing Systems

Spence Green 2014 Mixed-initiative natural lan-guage translation PhD thesis Stanford Uni-versity

Ana Guerberof 2009 Productivity and qualityin the post-editing of outputs from translationmemories and machine translation LocalisationFocus 7(1)133ndash140

Ana Guerberof 2012 Productivity and Qualityin the Post-Edition of Outputs from TranslationMemories and Machine Translation PhD the-sis Rovira and Virgili University Tarragona

41

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Puscasu, G. (2004). A Multilingual Method for Clause Splitting. In Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics.

Reinke, U. (2013). State of the Art in Translation Memory Technology. Translation: Computation, Corpora, Cognition, 3(1).


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus, 54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, and therefore cannot retrieve matches that are similar in meaning but worded differently. In this paper we present an innovative approach for matching sentences that have different words but the same meaning. We use NooJ to create paraphrases of the Support Verb Constructions (SVCs) of all source translation units in order to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue growing for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been produced by humans in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, by selecting the sentences whose lexical items are closest to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches, often referred to as fuzzy matches (Biçici and Dymetman, 2008). As a concept, this technique can be very useful for a translator, because all previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up, and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections that a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically smaller than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, in some cases also penalizing the match percentage. However, given the perplexity of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is left confused.

This paper presents a framework that improves the fuzzy match of similar but not identical sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the paraphrases share the same meaning but not the same lexical items. In addition, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort of tackling them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. However, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no sufficiently close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much human translation as possible for a given sentence, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). To improve the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based, or example-based) trained either on generic or on in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques for improving the fuzzy matching scores in TMs, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form obtained after syntactic, lexico-syntactic and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is a paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items and no machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e. the total number of deletions, insertions and substitutions required for the two sentences to become identical (Hirschberg, 1997).

For instance, based on Koehn's (2010) formula for fuzzy matching, the word-based fuzzy match score between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based score is 76% (14 deletions for 60 characters, not counting whitespace). This is a low score, and many translators may decide not to use the suggestion and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
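Koehn's (2010) fuzzy matching formula is, in essence, one minus the edit distance divided by the length of the longer segment. The sketch below is a minimal word-level implementation; the exact tokenization (and hence the exact counts) is an assumption, so its scores can differ by a few points from the figures quoted above.

```python
def levenshtein(a, b):
    """Edit distance: the minimum number of insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(s1, s2):
    """Word-based fuzzy match score: 1 - edit distance / longer length."""
    w1, w2 = s1.split(), s2.split()
    return 1.0 - levenshtein(w1, w2) / max(len(w1), len(w2))

score = fuzzy_match(
    "Press Cancel to make the cancellation of your personal information",
    "Press Cancel to cancel your personal information")
# one substitution (make -> cancel) and three deletions (the, cancellation, of)
```

With this definition, sentences (1) and (2) differ by four word-level edits, which illustrates why a purely string-based score stays low for semantically identical sentences.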

In this case, according to the methodologies proposed by researchers in this field, the sentence would be sent for machine translation given the low fuzzy match score, and then post-edited; otherwise, the translator would have to translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning, although they do not share the same lexical items. This is due to their syntax: sentence (1) contains an SVC, while sentence (2) contains its full-verb counterpart. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like "make a cancellation" are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs involve common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb while maintaining the same meaning. While support verbs are similar to auxiliary verbs as regards their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or from arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: apart from some cases, they appear either without a direct object (e.g. to make attention) or with a direct object (e.g. to make a reservation) (Athayde, 2001).

Our approach paraphrases the SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences that share the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units; hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given that the English sentence (2) and its Italian equivalent, sentence (3), are included in the TM. In addition, when translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% without paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected machine output will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules that paraphrases the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and to apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas together with a set of information such as the category/part of speech (e.g. V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g. how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g. +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g. distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
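Entries of this shape are easy to consume programmatically. The sketch below parses one line into its lemma, category and property set; the comma-separated layout mirrors Figure 1 and is an assumption about the resource format, not a full NooJ dictionary parser.

```python
def parse_nooj_entry(line):
    """Split a dictionary line such as 'artist,N+FLX=TABLE+Hum' into
    (lemma, category, properties).  Valueless flags map to True."""
    lemma, rest = line.split(",", 1)
    fields = rest.split("+")
    category, props = fields[0], {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        props[key] = value if value else True
    return lemma, category, props

lemma, cat, props = parse_nooj_entry("artist,N+FLX=TABLE+Hum")
# lemma == "artist", cat == "N", props == {"FLX": "TABLE", "Hum": True}
```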

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, consists of three pipeline stages.

Figure 2: Pipeline of the paraphrase framework

The first stage is the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes tokenization: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006), and the same tool is also used for de-tokenization in the last step.

All the source translation units then pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and there will therefore not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. Note that this graph recognizes and paraphrases only SVCs in the simple present indicative; however, our NooJ module contains grammars, following the same structure, for all the other grammatical tenses and moods. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
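The rule encoded by the graph (a support verb, an optional determiner, a nominalization, an optional preposition, rewritten as the derived full verb) can be approximated with a regular expression. The mini-lexicon below is a hypothetical stand-in for the OpenLogos-derived resources, and only the base present-tense form is handled; the actual NooJ grammars cover all tenses and moods.

```python
import re

# Hypothetical mini-lexicon: predicate nouns mapped to their verb derivers.
NOMINALIZATIONS = {"cancellation": "cancel", "installation": "install",
                   "reservation": "reserve"}
SUPPORT_VERBS = r"(?:make|makes|take|takes|give|gives|have|has)"
PATTERN = re.compile(
    r"\b" + SUPPORT_VERBS + r"\s+(?:(?:a|an|the)\s+)?(\w+)(?:\s+of)?\b")

def paraphrase_svc(segment):
    """Replace each recognized SVC with its full-verb paraphrase;
    anything outside the mini-lexicon is left untouched."""
    def repl(match):
        verb = NOMINALIZATIONS.get(match.group(1).lower())
        return verb if verb else match.group(0)
    return PATTERN.sub(repl, segment)

paraphrase_svc("Press Cancel to make the cancellation of your personal information")
# -> "Press Cancel to cancel your personal information"
```

Because unmatched words fall through unchanged, out-of-vocabulary nominalizations simply leave the segment as it was, which mirrors the behaviour reported in the evaluation below.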

The last stage comprises de-tokenization as well as the concatenation of the paraphrased translation units, if any, with the original TM. The paraphrased translation units carry the same properties, tags, etc. as the original units.

The resulting TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs whose source language is English. As mentioned earlier, there is no restriction on the target language, given that we apply our approach only to the source-language translation units.
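The expansion itself then amounts to appending, for every source segment the grammar changed, a new unit that pairs the paraphrased source with the untouched original target. A minimal sketch, with a one-entry lookup table standing in for the NooJ step (the table contents are illustrative, not part of the actual resources):

```python
def expand_tm(units, paraphrase):
    """Return the original TM plus one extra (paraphrased-source,
    original-target) unit for every source the paraphraser changed.
    The target side is copied verbatim and never touched."""
    expanded = list(units)
    for source, target in units:
        alternative = paraphrase(source)
        if alternative != source:
            expanded.append((alternative, target))
    return expanded

# Illustrative stand-in for the NooJ SVC paraphrasing step.
TOY_PARAPHRASES = {
    "Press Cancel to make the cancellation of your personal information":
    "Press Cancel to cancel your personal information"}

tm = [("Press Cancel to make the cancellation of your personal information",
       "Premere Cancel per cancellare i propri dati personali"),
      ("Save the file", "Salvare il file")]
par_tm = expand_tm(tm, lambda s: TOY_PARAPHRASES.get(s, s))
# par_tm holds the 2 original units plus 1 paraphrased unit
```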

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before whenever the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate it as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). The test set was selected manually so as to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, such that each unit at least meets a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014¹.

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0–74%), though we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95–99%, 6% in category 85–94%, 28% in category 75–84%, and finally 27% in category 0–74% (No match + 50–74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results, and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category    plTM    parTM
100%                      14       48
95–99%                    23       32
85–94%                    18       29
75–84%                    51       38
50–74%                    32       18
No match                  62       35
Total                    200      200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of the retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact the paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score was improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval, but also ensures that the proposed translation is a human translation, so no machine translation errors will appear, and less post-editing is required for matches below 100%.

Despite these gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased, and 2 further segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is a weakness of our approach that will be addressed in future projects.

Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION You must make the istallation of the version 6 of the software
TMsl:    CAUTION You must install the version 6 of the software
TMtg:    ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the sentences have the same meaning but do not share the same lexical items.

The strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs, and we will support other source languages as well. Last but not least, a paraphrasing framework applied to the target sentences may improve the quality of the translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1, Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra, Coimbra, Portugal.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, volume 9, ed. G. Aygen, C. Bowern and C. Quinn, pages 1–49. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Xerox Palo Alto Research Center, Palo Alto, CA, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.


Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combination approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used to measure the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CATsystem with provided or self-built translationmemories (TM) The translation memories areusually in-house costly and manually created re-sources of varying sizes of thousands to millionsof translation pairs

Only recently some TMs have been made pub-licly available DGT (Steinberger et al 2013) orMyMemory (Trombetti 2009) to mention just afew But there is still a heavy demand on enlarg-ing and improving TMs and their coverage of new

input documents to be translatedObviously the aim is to have a TM that best fits

to the content of a new document as this is crucialfor speeding up the translation process when alarger part of a document can be pre-translated bya CAT system the translation itself can be cheaperCoverage of TMs is directly translatable to savingsby translation companies and their customers

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities                  alignment
být větší   be greater  0.538  0.053  0.538  0.136    0-0 1-1
být větší   be larger   0.170  0.054  0.019  0.148    0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors have attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default

Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.
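The three cases can be read off the alignment points mechanically. The following sketch (our illustration, not the authors' code) parses Moses-style alignment strings and flags each case:

```python
# Sketch: parse Moses-style alignment points such as "0-0 1-1" and
# classify the three cases named in the text: empty alignments,
# one-to-many alignments, and opposite orientation.

def parse_alignment(points, src_len, tgt_len):
    pairs = [tuple(map(int, p.split("-"))) for p in points.split()]
    src_covered = {s for s, _ in pairs}
    tgt_covered = {t for _, t in pairs}
    return {
        "pairs": pairs,
        # some word on either side has no counterpart
        "empty": src_covered != set(range(src_len))
                 or tgt_covered != set(range(tgt_len)),
        # one word aligned to several words on the other side
        "one_to_many": len(pairs) > len(src_covered)
                       or len(pairs) > len(tgt_covered),
        # word order reversed between source and target
        "opposite": any((s1 < s2) != (t1 < t2)
                        for (s1, t1) in pairs for (s2, t2) in pairs
                        if s1 != s2 and t1 != t2),
    }

info = parse_alignment("0-0 1-1", 2, 2)  # "být větší" -> "be greater"
```

For the Figure 1 example all three flags are false, confirming a simple monotone one-to-one alignment.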

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

- JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.
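The distinction between the two JOIN variants reduces to comparing token positions. A minimal sketch (assuming subsegment occurrences are represented as half-open token spans, which the paper does not specify):

```python
# Sketch: classify how two subsegment occurrences relate inside a
# tokenized document segment. A span (start, end) is half-open.

def join_relation(span_a, span_b):
    """Return 'overlap' (JOINO), 'neighbour' (JOINN), or None."""
    (a0, a1), (b0, b1) = sorted([span_a, span_b])
    if b0 < a1:       # the spans share at least one token
        return "overlap"
    if b0 == a1:      # the second span starts right where the first ends
        return "neighbour"
    return None       # a gap remains, so these two cannot be joined
```

For example, spans (0, 3) and (2, 5) overlap on token 2, while (0, 3) and (3, 5) are neighbours.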

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories

- SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
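A terminating sketch of this greedy joining over token spans (our interpretation of the description above, not the authors' implementation; only the span bookkeeping is shown, translation lookup and fluency scoring are omitted):

```python
def greedy_join(spans, seg_len):
    """Greedy JOIN sketch: repeatedly try to join the longest subsegment
    (a half-open token span) with the remaining ones, preferring longer
    subsegments, until the whole segment (0, seg_len) is covered."""
    I = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    while I:
        head = I.pop(0)                  # biggest remaining subsegment
        T = []                           # newly created subsegments
        for other in I:
            (a0, a1), (b0, b1) = sorted([head, other])
            if b0 <= a1:                 # overlap or neighbour: join
                joined = (a0, max(a1, b1))
                if joined != head and joined not in I:
                    T.append(joined)
        if not T:
            continue                     # head joins nothing; discard it
        # prepend new (longer) subsegments, continue with the longest one
        I = sorted(set(T), key=lambda s: s[1] - s[0], reverse=True) + I
        if I[0] == (0, seg_len):         # full segment covered
            return I[0]
    return None
```

For instance, the spans (0, 3), (3, 5) and (5, 8) of an 8-token segment join step by step into the full span (0, 8), while (0, 2) and (4, 6) cannot be joined because of the gap between them.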

2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment then yield matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we have also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filtering of phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3
Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviors on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
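The word-level edit rate that TER reports can be sketched as follows (our illustration: plain Levenshtein over tokens, without TER's shift operation; tercom is the actual tool used):

```python
# Sketch of a word-level edit rate in the spirit of TER, but without
# the shift operation. The rate is the minimum number of insertions,
# deletions and substitutions, normalized by the reference length.

def edit_rate(hyp, ref):
    h, r = hyp.split(), ref.split()
    # dp[i][j] = minimum edits to turn h[:i] into r[:j]
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(h)][len(r)] / len(r)

# the example above: 1 deletion + 3 substitutions = 4 edits, 8 ref words
rate = edit_rate("we want to have a table near the window",
                 "we would like a table by the window")
```

On the example pair this gives 4/8 = 0.5, matching the alignment shown above (one D and three S operations).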

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations, and different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

- Test segment: how much does it cost

- TM segment 1: how much does it cost to the holiday inn by taxi

- TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
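The effect of operation-specific weights can be made concrete with a small sketch. The paper states only that deletions get "a very low weight"; the numeric weights below (0.1 for deletion, 1.0 for the rest) are assumed values for illustration:

```python
# Illustrative only: the exact weights are not published; 0.1 for
# deletion and 1.0 for insertion/substitution are assumed here.

WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_cost(edit_ops):
    """Cost of a TER-style edit-operation sequence under the weights."""
    return sum(WEIGHTS[op] for op in edit_ops)

# test segment: "how much does it cost"
# TM segment 1: 5 matches plus 6 deletions ("to the holiday inn by taxi")
cost1 = weighted_cost("CCCCC" + "DDDDDD")
# TM segment 2: 2 matches plus 3 insertions ("does it cost")
cost2 = weighted_cost("CC" + "III")
```

With equal unit weights segment 2 would win (3 edits versus 6), but under the low deletion weight cost1 is far smaller than cost2, so segment 1 is ranked first, as the text argues.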

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: a green portion implies a matched fragment, and a red portion implies a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

- English: we want to have a table near the window .

- Bengali: আমরা জানালার কােছ একটা েটিবল চাই

- Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
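Combining the two alignment sets amounts to projecting the TER matches through the source-target word alignment. A minimal sketch (names and data are illustrative, not from the tool):

```python
# Sketch: color target tokens by projecting TER source matches through
# the GIZA++-style source-to-target alignment. ter_matched holds the
# source-token indices TER marked as matches (C); src2tgt maps each
# source token to the target tokens it is aligned with.

def color_target(tgt_len, src2tgt, ter_matched):
    colors = ["red"] * tgt_len          # unmatched by default
    for src_idx, tgt_idxs in src2tgt.items():
        if src_idx in ter_matched:      # source token matched the input
            for t in tgt_idxs:
                colors[t] = "green"     # its translation stays intact
    return colors

# toy example: 4 target tokens; source tokens 0 and 2 matched by TER
colors = color_target(4, {0: [0], 1: [1, 2], 2: [3]}, {0, 2})
```

Here target tokens 0 and 3 turn green (their source words matched the input), while tokens 1 and 2, aligned to the unmatched source word, stay red and are flagged for post-editing.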

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get a near 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change. i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me


5. you're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above-mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). For improving the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
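The posting-list filter can be sketched as below. This is a minimal sketch, assuming simple whitespace tokenization and an illustrative stop-word list; the paper does not specify either detail.

```python
# Sketch of the posting-list filter: build an inverted index over TM source
# sentences, then score only the sentences sharing a vocabulary word with
# the input. Stop-word list and tokenization are illustrative assumptions.

STOP_WORDS = {"the", "a", "an", "to", "of", "is"}

def build_posting_lists(tm_sources):
    postings = {}
    for sent_id, sentence in enumerate(tm_sources):
        for word in set(sentence.lower().split()) - STOP_WORDS:
            postings.setdefault(word, set()).add(sent_id)
    return postings

def candidate_sentences(input_sentence, postings):
    # Only TM sentences sharing at least one vocabulary word are scored;
    # all others are skipped, shrinking the TER search space.
    candidates = set()
    for word in set(input_sentence.lower().split()) - STOP_WORDS:
        candidates |= postings.get(word, set())
    return candidates

tm = ["you gave me the wrong change",
      "you are wrong",
      "we want a table near the window"]
postings = build_posting_lists(tm)
cands = candidate_sentences("you gave me wrong number", postings)
```

The index is built once, offline; at query time, only the union of a few posting lists is touched, which is why the tool's output is identical with and without the filter while the runtime differs sharply.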

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWE) and named entities (NE) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper, as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9

Chatzitheodorou, Konstantinos, 24

Horák, Aleš, 31

Medveď, Marek, 31
Mitkov, Ruslan, 17

Naskar, Sudip Kumar, 36
Nayek, Tapas, 36

Pal, Santanu, 36
Parra Escartín, Carla, 1

Raisa Timonera, Katerina, 17

van Genabith, Josef, 36
Vela, Mihaela, 36

Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 24–30, Hissar, Bulgaria, September 2015

Improving translation memory fuzzy matching by paraphrasing

Konstantinos Chatzitheodorou
School of Italian Language and Literature
Aristotle University of Thessaloniki
University Campus
54124 Thessaloniki, Greece
chatzikitlauthgr

Abstract

Computer-assisted translation (CAT) tools have become the major language technology to support and facilitate the translation process. These programs store previously translated source texts and their equivalent target texts in a database and retrieve related segments during the translation of new texts. However, most of them are based on string or word edit distance, and so cannot retrieve matches that are similar in meaning but worded differently. In this paper, we present an innovative approach to match sentences having different words but the same meaning. We use NooJ to create paraphrases of Support Verb Constructions (SVC) of all source translation units to expand the fuzzy matching capabilities when searching in the translation memory (TM). Our first results for the EN-IT language pair show consistent and significant improvements in matching over state-of-the-art CAT systems across different text domains.

1 Introduction

The demand for professional translation services has increased over the last few years, and it is forecast to continue to grow for the foreseeable future. To support this growth, researchers have proposed and implemented new computer-based tools and methodologies that assist the translation process. The idea behind computer-assisted software is that a translator should benefit as much as possible from reusing translations that have been human translated in the past. The first thoughts can be traced back to the 1960s, when the European Coal and Steel Community proposed the development of a memory system that retrieves terms and their equivalent contexts from earlier translations stored in its memory, using the sentences whose lexical items are close to the lexical items of the sentence to be translated (Kay, 1980).

Since then, TM systems have become indispensable tools for professional translators who work mostly with content that is highly repetitive, such as technical documentation, games and software localization, etc. TM systems typically exploit not only exact matches between segments from the document to be translated and segments from previous translations, but also approximate matches (often referred to as fuzzy matches) (Biçici and Dymetman, 2008). Conceptually, this technique can be very useful for a translator because all the previous human translations become a starting point for the new translation. Furthermore, the whole process is sped up and the translation quality is more consistent and efficient.

The fuzzy match level refers to all the corrections a professional translator must make in order for the retrieved suggestion to meet all the standards of the translation process. This effort is typically less than translating the sentence from scratch. To help the translator, CAT tools suggest or highlight all the differences or similarities between the sentences, penalizing the match percentage accordingly in some cases. However, given the variability of natural language, for similar but not identical sentences the fuzzy matching level is sometimes too low, and the translator is therefore confused.

This paper presents a framework that improves the fuzzy match of similar, but not identical, sentences. The idea behind this model is that Y2, which is the translation of Y1, can be the equivalent of X1, given that X1 has the same meaning as Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human-centered evaluations.

The rest of the paper is organized as follows: Section 2 discusses the past related work, section 3 the theoretical background, and section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and section 6 the plans for further work.

2 Related work

There has been some work on improving translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. However, these techniques only improve the raw MT output, not the fuzzy matching itself.

A common methodology that gives priority to the human translations is to search first for matches in the project TM. When no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use as much as possible of the human translations for a given sentence, and the unmatched lexical items are machine translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based, or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014). In their work, they use a lattice representation of possible translations in a monolingual target language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator. Planas and Furuse (1999) propose approaches that use lemmas and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven analysis. Their result is a generalized form obtained after syntactic, lexico-syntactic, and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orasan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition to this, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.

25

3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions, and substitutions required for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters, not counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
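A word-level edit distance and a fuzzy match score in the spirit of Koehn (2010) can be sketched as below. This is an illustrative sketch only: commercial CAT tools do not disclose their exact formulas, and the normalization here (dividing by the longer sentence's word count) yields 60% for sentences (1) and (2), slightly below the 70% figure quoted above, whose word count is computed differently.

```python
# Sketch: word-level Levenshtein distance and a fuzzy match score of the form
# 1 - edit_distance / max(sentence lengths). Illustrative, not any tool's
# actual formula.

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein over word tokens.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

def fuzzy_match(s1, s2):
    a, b = s1.lower().split(), s2.lower().split()
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

s1 = "Press Cancel to make the cancellation of your personal information"
s2 = "Press Cancel to cancel your personal information"
d = edit_distance(s1.lower().split(), s2.lower().split())
# d is 4: substitute "cancellation" -> "cancel", delete "make", "the", "of".
```

The same machinery explains why paraphrasing the SVC lifts the score: once "make the cancellation of" is rewritten as "cancel", the edit distance drops to zero and the match becomes exact.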

In this case, according to methodologies proposed by researchers in this field, this sentence would be sent for machine translation given the low fuzzy match score, and then post-edited. Otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they don't share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC, while sentence (2) contains its full-verb paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality, because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear with a direct object without a determiner (e.g., to make attention) or with a determiner (e.g., to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy matching between sentences having the same meaning. It is a safe technique, because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which usually are linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition to this, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrasing (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon, along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information, such as the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
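The entry structure described above (lemma, category, then +-separated properties, some carrying =values such as the inflectional paradigm FLX) can be illustrated with a small parser. This is a sketch assuming the comma-separated entry format shown in Figure 1, not NooJ's own parsing code.

```python
# Sketch: parse a NooJ-style dictionary entry such as "artist,N+FLX=TABLE+Hum"
# into lemma, category, and a feature map. Properties without a value
# (e.g. +Hum) are treated as boolean flags.

def parse_entry(line):
    lemma, rest = line.split(",", 1)
    category, *props = rest.split("+")
    features = {}
    for prop in props:
        key, _, value = prop.partition("=")
        features[key] = value or True
    return {"lemma": lemma, "category": category, "features": features}

entry = parse_entry("artist,N+FLX=TABLE+Hum")
```

A lookup built on such parsed entries is what lets the grammar in the next section test properties like the inflectional paradigm or a semantic class on any token it encounters.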

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then, all the source translation units pass through NooJ to identify the SVCs, using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for the identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed (optionally) by a determiner, adjective, or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verbs and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, meeting a minimum threshold of 50%. To create the analysis report logs, we used Trados Studio 2014.¹ The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leverage. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%). However, we achieve extremely high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No Match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is two-fold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the matches without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval, but also ensures that the proposed translation is a human translation, so no MT errors will appear and less post-editing is required for matches below 100%.

Despite the gains in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg: Make sure that the brake hose is not twisted
TMsl: Ensure that the brake hose is not twisted
TMtg: Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg: CAUTION You must make the istallation of the version 6 of the software
TMsl: CAUTION You must install the version 6 of the software
TMtg: ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences don't share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and to support other languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1, Centro Interuniversitário de Estudos Germanísticos. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49, Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches concentrated on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand on enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and at the same time try to keep a reasonable quality of newly generated segment pairs. We work with English-Czech data but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied on them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments within CAT systems (Planas and Furuse, 2000).

CZ          EN          probabilities                  alignment
být větší   be greater  0.538 0.053 0.538 0.136       0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148       0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text-chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008), a parallel version of GIZA++ (Och and Ney, 2003), and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
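As a rough illustration of the record just described, a Moses-style subsegment line can be split into its four parts. The `|||` field layout and the helper below are our assumption for illustration only, not the authors' code.

```python
# Illustrative sketch (not the authors' code): parsing one Moses-style line of
# subsegment output: source ||| target ||| four probabilities ||| alignment.

def parse_subsegment(line):
    source, target, probs, align = [f.strip() for f in line.split("|||")]
    # The four scores: inverse phrase prob., inverse lexical weighting,
    # direct phrase prob., direct lexical weighting.
    scores = [float(p) for p in probs.split()]
    # Alignment points such as "0-0 1-1" map source word indexes to target ones.
    points = [tuple(int(i) for i in pair.split("-")) for pair in align.split()]
    return source, target, scores, points

line = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
src, tgt, scores, points = parse_subsegment(line)
```

With records parsed this way, the best of several alternative translations for a subsegment can simply be chosen by comparing the score tuples.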

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two subsegments from SUB; the output is J.
  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.
  2. JOINN: the joined subsegments neighbour in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:         "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:       "následujících kategorií" (the following categories)
new subsegment:   "lze rozdělit do následujících kategorií"
its translation:  can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.
  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.
  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and it discards one processed subsegment.

2 In the current experiments, we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
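The span arithmetic behind the JOINO and JOINN variants can be sketched as follows. Representing subsegments as (start, end) token indexes and the function below are our own illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): joining two subsegment spans
# over a tokenized input segment. Spans are (start, end) token indexes,
# end exclusive. JOINO requires an overlap, JOINN requires adjacency.

def try_join(a, b):
    """Return the merged span and join type, or None if a, b cannot join."""
    (a0, a1), (b0, b1) = sorted([a, b])
    if b0 < a1 and b1 > a1:        # proper overlap -> JOINO
        return (a0, b1), "OJ"
    if b0 == a1:                   # neighbouring spans -> JOINN
        return (a0, b1), "NJ"
    return None                    # disjoint (or contained): no join

# Spans over a 5-token segment such as "lze rozdělit do následujících kategorií":
print(try_join((0, 3), (2, 5)))   # -> ((0, 5), 'OJ')
print(try_join((0, 2), (2, 5)))   # -> ((0, 5), 'NJ')
```

Repeating such merges over the size-ordered list I, and re-queuing each merged span, mirrors the iteration described in the paragraph above.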

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95–99%    0.84    0.91    0.64    0.90    1.37
85–94%    0.07    0.05    0.25    0.76    0.81
75–84%    0.80    0.91    1.71    3.78    4.40
50–74%    8.16   10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec.     0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec.     0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error example, Czech → English.
source seg.:    Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg.: Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that would not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central-European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3
Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3
tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language-pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
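The retrieval step described above can be sketched as ranking TM segments by an edit-rate similarity and keeping the five best. The helper below is only an illustration: it substitutes a `difflib`-based ratio for the real TER computation done by tercom, and the function names are ours.

```python
# Illustrative sketch (not CATaLog's code): rank TM segments by a TER-like
# word-level dissimilarity and keep the k best candidates for post-editing.
import difflib

def ter_like(a, b):
    # Placeholder dissimilarity: 1 - SequenceMatcher ratio over word lists.
    # (The real tool uses TER; this stand-in is only for illustration.)
    return 1 - difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

def top_matches(test_segment, tm_segments, k=5):
    # Lower ter_like score means a closer match, so sort ascending.
    return sorted(tm_segments, key=lambda s: ter_like(test_segment, s))[:k]

tm = ["where is the station", "how much is this", "i would like a ticket"]
best = top_matches("where is the train station", tm, k=1)
```

In the actual tool the ranked candidates would then be color-coded and shown to the translator; here `best` simply holds the closest TM segment.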

One very important aspect of computingsimilarity is alignment Each test (input) seg-ment in the source language (SL) is alignedwith the reference SL sentences in the TMand each SL sentence in the TM is aligned toits respective translation From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to see at a glance how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25) to find the edit rate between a test sentence and the TM reference sentences.
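
The edit-rate idea can be sketched in a few lines. The following is an illustrative word-level implementation with insert, delete, and substitute operations only; the actual tercom implementation additionally handles shifts, so this is a simplified sketch rather than the tool's algorithm.

```python
# Illustrative sketch (not the tercom implementation): a word-level
# edit rate using insert/delete/substitute; real TER also allows shifts.
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def edit_rate(hyp, ref):
    """Number of edits normalised by the reference length."""
    return edit_distance(hyp, ref) / len(ref)

tm   = "we want to have a table near the window".split()
test = "we would like a table by the window".split()
print(edit_rate(tm, test))  # 4 edits (1 D, 3 S) over 8 words = 0.5
```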

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: ("we","we",C), ("want","",D), ("to","would",S), ("have","like",S), ("a","a",C), ("table","table",C), ("near","by",S), ("the","the",C), ("window","window",C), (".",".",C)

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs, or weights, can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
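
The weighted scoring idea can be sketched as follows. The operation costs below are illustrative assumptions, not the weights used in CATaLog; all that matters for the example is that deletion is much cheaper than the other operations.

```python
# Sketch of weighted edit-operation scoring with a cheap deletion.
# The cost values are illustrative, not the ones used in CATaLog.
COST = {"C": 0.0, "S": 1.0, "I": 1.0, "H": 1.0, "D": 0.1}  # H = shift

def weighted_score(edit_ops):
    """Sum the per-operation costs for one TM match (lower is better)."""
    return sum(COST[op] for op in edit_ops)

# Test segment: "how much does it cost"
# TM segment 1 matches 5 words and needs 6 deletions;
# TM segment 2 matches 2 words and needs 3 insertions.
tm1_ops = ["C"] * 5 + ["D"] * 6
tm2_ops = ["C"] * 2 + ["I"] * 3
print(round(weighted_score(tm1_ops), 1))  # 0.6 -> TM segment 1 preferred
print(weighted_score(tm2_ops))            # 3.0
```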

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
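
A minimal sketch of this coloring step, assuming the TER-matched source positions and a source-to-target alignment dictionary are already available. The dictionary below mirrors the example alignment given above, converted to 0-based indices.

```python
# Toy sketch of the color-coding step: source tokens that TER marks as
# matched are green, and their aligned target tokens inherit the color.
def color_target(matched_src_positions, src2tgt, n_tgt_tokens):
    """Return a 'green'/'red' label per target token."""
    colors = ["red"] * n_tgt_tokens
    for s in matched_src_positions:
        for t in src2tgt.get(s, []):
            colors[t] = "green"
    return colors

# TM source: "we want to have a table near the window"
# TER marks positions 0, 4, 5, 7, 8 (we, a, table, the, window) as matched.
matched = {0, 4, 5, 7, 8}
# 0-based word alignment derived from the example above (no alignment
# for "to", "have", "the").
src2tgt = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1]}
print(color_target(matched, src2tgt, 6))
# ['green', 'green', 'red', 'green', 'green', 'red']
```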

Color coding the TM source sentences makes explicit which portions of matching TM source sentences match the test sentence and which do not. Color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color but still not be good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me


5. you 're overcharging me

Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
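
The posting-list filter described above can be sketched as an inverted index from vocabulary words to the TM sentences containing them. The stop-word list and TM sentences below are illustrative.

```python
# Minimal sketch of the posting-list filter: an inverted index from
# vocabulary words to the ids of the TM sentences that contain them.
from collections import defaultdict

STOPWORDS = {"the", "a", "to", "is"}  # illustrative stop-word list

def build_index(tm_sources):
    """Map each lowercased non-stop word to the sentences containing it."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOPWORDS:
                index[word].add(sid)
    return index

def candidates(index, input_sentence):
    """TM sentences sharing at least one vocabulary word with the input."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["you gave me the wrong change",
      "i think you 've got the wrong number",
      "a table near the window"]
index = build_index(tm)
print(sorted(candidates(index, "you gave me wrong number")))  # [0, 1]
```

Only the returned candidates are then scored with TER, which is what saves time on a large TM.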

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research on and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper, as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs. quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

with Y1. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that they share the same meaning but not the same lexical items. In addition to this, we investigate the following questions: (1) is the productivity of the translators improved? (2) are SVCs widespread enough to merit the effort to tackle them? These questions are answered using human centralized evaluations.

The rest of the paper is organized as follows: Section 2 discusses past related work, Section 3 the theoretical background, and Section 4 the conceptual background as well as the architecture of the framework. Section 5 details the experimental results and Section 6 the plans for further work.

2 Related work

There has been some work to improve translation memory matching and the retrieval of translation units when working with CAT tools (Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Such works aim to improve machine translation (MT) confidence measures to better predict the human effort, in order to obtain a quality estimation that has the potential to replace the fuzzy match score in the TM. In addition, these techniques have an effect only on improving the raw MT output and not on improving fuzzy matching.

A common methodology that gives priority to the human translations is to search first for matches in the project TM; when no close match is found in the TM, the sentence is machine-translated (He et al., 2010a; 2010b). In a somewhat similar spirit, other hybrid methodologies combine techniques at a sub-sentential level. Most of them use human translations for a given sentence as much as possible, and the unmatched lexical items are machine-translated into the target language using an MT system (Smith and Clark, 2009; Koehn and Senellart, 2010; He et al., 2010a; Zhechev and van Genabith, 2010; Wang et al., 2013). Towards improving the quality of the MT output, researchers have been using different MT approaches (statistical, rule-based, or example-based) trained either on generic or in-domain corpora. Another innovative idea has been proposed by Dong et al. (2014): in their work, they use a lattice representation of possible translations in a monolingual target-language corpus to find the potential candidate translations.

On the other hand, various researchers have focused on semantic or syntactic techniques towards improving the fuzzy matching scores in TM, but the evaluations they performed were shallow and most of the time limited to subjective evaluation by the authors. This makes it hard to judge how much a semantically informed TM matching system can benefit a professional translator.

Planas and Furuse (1999) propose approaches that use the lemma and parts of speech along with surface form comparison. In addition to this syntactic annotation, Hodász and Pohl (2005) also include noun phrase (NP) detection (automatic or human) and alignment in the matching process. Pekar and Mitkov (2007) presented an approach based on syntax-driven syntactic analysis; their result is a generalized form after syntactic, lexico-syntactic, and lexical generalization.

Another interesting approach, similar to ours, has been proposed by Gupta and Orăsan (2014). In their work, they generate additional segments based on the paraphrases in a database while matching. Their approach is based on greedy approximation and dynamic programming, given that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. It is an innovative technique; however, paraphrasing lexical or phrasal units is not always safe, and in some cases it can confuse rather than help the translator. In addition, a paraphrase database is required for each language.

Even if the experimental results show significant improvements in terms of quality and productivity, the hypotheses are produced by a machine using unsupervised methods, and therefore the post-editing effort might be higher compared to human translation hypotheses. To the best of our knowledge, there is no similar work in the literature, because our approach does not use any MT techniques, given that the target side of the TM remains "as is". To improve the fuzzy matching, we paraphrase the source translation units of the TM so that a higher fuzzy match will be identified for sentences sharing the same meaning. Therefore, the professional translator is given a human-translated segment that is the paraphrase of the sentence to be translated. This ensures that no out-of-domain lexical items or machine translation errors will appear in the hypotheses, making the post-editing process trivial.


3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions, and substitutions needed for the two sentences to become identical (Hirschberg, 1997).

For instance, the word-based string edit distance between sentences (1) and (2) yields a fuzzy match score of 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance a score of 76% (14 deletions for 60 characters, without counting whitespace), based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use the match and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
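
A word-based fuzzy match in the spirit of the formula referenced above can be sketched as follows. Normalization choices vary between tools, which is why the exact percentages differ from the paper's counts; this sketch divides by the length of the longer sentence.

```python
# Illustrative word-based fuzzy match: 1 - edit_distance / length.
# Commercial CAT tools use undisclosed variants of this idea.
def levenshtein(a, b):
    """Levenshtein distance between two token lists (iterative DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (x != y)))  # substitution/match
        prev = curr
    return prev[-1]

def fuzzy_match(source, tm_source):
    s, t = source.split(), tm_source.split()
    return 1 - levenshtein(s, t) / max(len(s), len(t))

s1 = "press cancel to make the cancellation of your personal information"
s2 = "press cancel to cancel your personal information"
print(round(fuzzy_match(s1, s2), 2))  # 4 edits over 10 words -> 0.6
```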

In this case, according to the methodologies proposed by researchers in this field, this sentence would be sent for machine translation, given the low fuzzy match score, and then it would be post-edited; otherwise, the translator would translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax: in more detail, sentence (1) contains an SVC while sentence (2) contains its verbal paraphrase. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced: the support verb contributes little content to its sentence, and the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed: except for some cases, they appear without a direct object (e.g., to make attention) or with a direct object (e.g., to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy match between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units; hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Beyond fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough; in our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrasing of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g., V for verbs, A for adjectives), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
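
Entries of this shape are easy to process programmatically. The sketch below is a hypothetical parser for lines in the format shown in Figure 1 (lemma, category, then +feature or +key=value annotations); it is an illustration, not part of NooJ.

```python
# Hypothetical parser for NooJ-style dictionary lines such as
# "artist,N+FLX=TABLE+Hum": lemma, category, then +key[=value] features.
def parse_entry(line):
    lemma, rest = line.split(",", 1)
    category, *features = rest.split("+")
    props = {}
    for feat in features:
        key, _, value = feat.partition("=")
        props[key] = value or True  # bare flags become boolean features
    return {"lemma": lemma, "category": category, "features": props}

print(parse_entry("artist,N+FLX=TABLE+Hum"))
# {'lemma': 'artist', 'category': 'N', 'features': {'FLX': 'TABLE', 'Hum': True}}
```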

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process: tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will be no gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb, followed optionally by a determiner, adjective, or adverb, then a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense; however, our NooJ module contains grammars, created for all the other grammatical tenses and moods, that follow the same structure. The elements in red characterize the variables as verb and predicate noun. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC; these particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
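
The rewriting behavior of such a grammar can be imitated, for this one example, by a toy lookup-based substitution. The nominalization table and the regular expression below are illustrative assumptions; the actual system uses NooJ local grammars, inflectional paradigms, and derivational links rather than regexes.

```python
# Toy illustration of the SVC rewriting idea (NOT the NooJ grammar):
# replace "support verb + determiner + nominalization + preposition"
# with the corresponding full verb. The table below is hypothetical.
import re

NOMINALIZATIONS = {"cancellation": "cancel", "reservation": "reserve"}

# support verb "make/makes", a determiner, a noun, then "of"/"for"
SVC_RE = re.compile(r"\b(?:make|makes)\s+(?:a|an|the)\s+(\w+)\s+(?:of|for)\s+")

def paraphrase(sentence):
    def repl(match):
        noun = match.group(1)
        if noun in NOMINALIZATIONS:
            return NOMINALIZATIONS[noun] + " "
        return match.group(0)  # unknown noun: leave the SVC untouched
    return SVC_RE.sub(repl, sentence)

print(paraphrase("press cancel to make the cancellation of your personal information"))
# press cancel to cancel your personal information
```

The real grammar additionally preserves tense and person, which a lookup table of this kind cannot do.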

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units, if any, into the original TM. The paraphrased translation units have the same properties, tags, etc., as the original units.

This TM can then be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with higher fuzzy match scores than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem.
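Fuzzy match scores of this kind are conventionally derived from word-level edit distance (Levenshtein, 1966). As an illustration only, the sketch below shows a generic similarity formula of this family; it is our own simplification, not the exact scorer used by Trados Studio.

```python
# Generic word-level fuzzy match sketch (an assumption for illustration,
# not the proprietary formula of any specific CAT tool).

def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

def fuzzy_match(source, tm_segment):
    """Similarity in percent: 100 means an exact match."""
    s, t = source.split(), tm_segment.split()
    if not s and not t:
        return 100.0
    return 100.0 * (1 - edit_distance(s, t) / max(len(s), len(t)))

score = fuzzy_match("make sure that the brake hose is not twisted",
                    "ensure that the brake hose is not twisted")
```

Under such a scorer, replacing an SVC with its single-verb paraphrase (or vice versa) removes several token-level edits at once, which is why paraphrasing can move a segment into a higher match band.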

In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, with a minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%); however, we achieve very high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84% and, finally, 27% in category 0-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95%-99%                  23      32
85%-94%                  18      29
75%-84%                  51      38
50%-74%                  32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval and, secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required for matches below 100%.

While there are some drops in terms of fuzzy match improvement, our system presents few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the fuzzy match level when segments have the same meaning but do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments, with different translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, the translators' satisfaction and trust is high compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verb constructions and support for other languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of translations even further.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG n. 1, Centro Interuniversitário de Estudos Germanísticos. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, Volume 9, 1–49. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orăsan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford: Oxford University Press.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly, manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new

input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. TM coverage translates directly into savings for translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that, even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the WeBiText system for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses tool (Koehn et al., 2007) directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation, probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignments, 2) one-to-many alignments, and 3) opposite orientation.

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments of the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        • "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     • "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.

During the non-overlapping subsegment combination, the acceptability of a combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.

² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %
Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM
Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM
Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into ac-

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature    SUB     OJ      NJ      NS
prec       0.60    0.63    0.70    0.66
recall     0.67    0.74    0.74    0.71
F1         0.64    0.68    0.72    0.68
METEOR     0.31    0.37    0.38    0.38

DGT
prec       0.76    0.93    0.91    0.81
recall     0.78    0.86    0.88    0.85
F1         0.77    0.89    0.89    0.83
METEOR     0.40    0.50    0.51    0.45

Table 6: Error examples, Czech → English
source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error was when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language, see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which produce non-fluent output texts in the target language.

We are co-operating with a major Central European translation company, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²,³, Mihaela Vela², Josef van Genabith²,³

¹ Jadavpur University, India
² Saarland University, Germany
³ German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,² which is language pair independent and allows users to upload their own memories in

¹ www.matecat.com
² The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences, similar to those usually found in phrase books for tourists going abroad.

4Work in progress


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantly how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25.15) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
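The effect of edit-operation weights on ranking can be sketched as follows. This is an illustrative toy, not CATaLog's actual scoring code, and the weight values are assumptions; the paper only states that deletions get a very low weight.

```python
# Toy re-scoring of TER-style alignment operations with a reduced
# deletion weight. The weight values below are illustrative assumptions.
WEIGHTS = {"C": 0.0, "D": 0.1, "S": 1.0, "I": 1.0}

def weighted_edit_cost(ops):
    """Sum the operation weights over a TER-style operation sequence."""
    return sum(WEIGHTS[op] for op in ops)

# Test segment: "how much does it cost"
# TM segment 1 needs 6 deletions ("to the holiday inn by taxi");
# TM segment 2 needs 3 insertions ("does it cost").
tm1_ops = ["C", "C", "C", "C", "C", "D", "D", "D", "D", "D", "D"]
tm2_ops = ["C", "C", "I", "I", "I"]
```

With equal weights, TM segment 2 wins (3 edits against 6); with the reduced deletion weight, `weighted_edit_cost` prefers TM segment 1 (0.6 against 3.0), matching the argument above.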

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. Green portions indicate matched fragments and red portions indicate mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
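A minimal sketch of that projection, assuming a TER operation sequence over the TM source tokens and a source-to-target alignment dictionary; the function name and data layout are ours, not the tool's.

```python
# Mark target tokens "green" (keep) or "red" (edit) by projecting
# TER match labels through the source-target word alignment.
def color_target(ter_ops, src2tgt, tgt_len):
    """ter_ops[i] is "C" iff TM source token i matches the input;
    src2tgt maps a source token index to its aligned target indices."""
    colors = ["red"] * tgt_len              # unmatched by default
    for i, op in enumerate(ter_ops):
        if op == "C":                       # matched source token
            for j in src2tgt.get(i, []):
                colors[j] = "green"
    return colors
```

Target tokens aligned to deleted or substituted source tokens stay red, so the post-editor sees exactly which fragments of the suggested translation need work.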

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me


5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss apni amake vul khuchro diyechen ami ashi dollar diyechi )

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss amar dharona apni vul nombore phone korechen )

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amake taka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apni amar kache theke beshi nichchen )

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval based on inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not. This feature is there to compare results with and without using posting lists. In an ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
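The filtering step above can be sketched with a small inverted index; the stop list and helper names here are illustrative stand-ins, not the tool's actual resources.

```python
from collections import defaultdict

# Index TM source segments by their lowercased, stopword-filtered
# vocabulary; only segments sharing a word with the input get scored.
STOP = {"the", "a", "to", "of"}             # illustrative stop list

def build_index(segments):
    index = defaultdict(set)
    for sid, seg in enumerate(segments):
        for word in seg.lower().split():
            if word not in STOP:
                index[word].add(sid)        # posting list for this word
    return index

def candidates(index, query):
    ids = set()
    for word in query.lower().split():
        ids |= index.get(word, set())       # union of posting lists
    return ids

tm = ["how much does it cost to the holiday inn by taxi",
      "you gave me the wrong change",
      "we want to have a table near the window"]
idx = build_index(tm)
```

For instance, `candidates(idx, "you gave me wrong number")` returns only the second segment, so TER is then computed for one sentence instead of three.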

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process. The popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper, as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).

Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

3 Theoretical background

There are several implementations of the fuzzy match estimation during the translation process, and commercial products typically do not disclose the exact algorithm they use (Koehn, 2010). However, most of them are based on the word and/or character edit distance (Levenshtein distance) (Levenshtein, 1966), i.e., the total number of deletions, insertions, and substitutions needed for the two sentences to become identical (Hirschberg, 1997). For instance, the word-based string edit distance between sentences (1) and (2) is 70% (1 substitution and 3 deletions for 13 words), and the character-based string edit distance is 76% (14 deletions for 60 characters), without counting whitespaces, based on Koehn's (2010) formula for fuzzy matching. This is a low score, and many translators may decide not to use it and therefore not to gain from it.

(1) Press Cancel to make the cancellation of your personal information
(2) Press Cancel to cancel your personal information
(3) Premere Cancel per cancellare i propri dati personali
(4) Press Cancel to cancel your booking information
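The word-based computation can be sketched as one minus the word edit distance normalized by sentence length. Normalizations vary between implementations; this sketch divides by the longer sentence, so its figures differ slightly from the word counts used in the percentages above.

```python
def edit_distance(a, b):
    """Levenshtein distance between token lists a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def fuzzy_match(s1, s2):
    """Word-based fuzzy match score in [0, 1]."""
    t1, t2 = s1.split(), s2.split()
    return 1 - edit_distance(t1, t2) / max(len(t1), len(t2))
```

For sentences (1) and (2) the word edit distance is 4 (one substitution, three deletions), giving a score of 0.6 under this normalization.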

In this case, according to methodologies proposed by researchers in this field, this sentence will be sent for machine translation, given the low fuzzy match score, and then it should be post-edited. Otherwise, the translator should translate it from scratch. However, this is not always safe, given that in many cases post-editing MT output requires more time than translating from scratch.

Observing the differences between sentences (1) and (2), one can easily conclude that they share the same meaning although they do not share the same lexical items. This happens because of their syntax. In more detail, sentence (1) contains an SVC while sentence (2) contains the corresponding full verb. An EN-IT professional translator can benefit from our approach by accepting sentence (3) as the equivalent translation of sentence (1).

SVCs like make a cancellation are verb-noun complexes which occur in many languages. From a syntactic and semantic point of view, they act in the same way as multi-word units. Their meaning is mainly reflected by the predicate noun, while the support verb is often semantically reduced. The support verb contributes little content to its sentence; the main meaning resides with the predicate noun (Barreiro, 2008).

SVCs include common verbs like give, have, make, take, etc. These types of complexes can be paraphrased with a full verb maintaining the same meaning. While support verbs are similar to auxiliary verbs regarding their meaning contribution to the clauses in which they appear, support verbs fail the diagnostics that identify auxiliary verbs and are therefore distinct from auxiliaries (Butt, 2003).

SVCs challenge theories of compositionality because the lexical items that form such constructions do not together qualify as constituents, although the word combinations do qualify as catenae. Distinguishing an SVC from other complex predicates or arbitrary verb-noun combinations is not an easy task, especially because their syntax is not always fixed. Except for some cases, they appear with a direct object without a determiner (e.g., to make attention) or with a determiner (e.g., to make a reservation) (Athayde, 2001).

Our approach paraphrases SVCs found in the source translation units of a TM in order to increase the fuzzy match between sentences having the same meaning. It is a safe technique because the whole process has no effect on the target side of the TM translation units. Hence, the translators benefit only from human translation hypotheses, which are usually linguistically correct.

In our example, an EN-IT translator will receive an exact match when translating sentence (1), given the English sentence (2) and its Italian equivalent (sentence (3)) that is included in the TM. In addition, in the case of translating sentence (4), the fuzzy match score would be around 90% (1 substitution for 10 words), compared to 61% with no paraphrase (2 substitutions and 3 deletions for 13 words). Apart from fuzzy matching, according to Barreiro (2008), machine translation of SVCs is hard, so the expected output from the machine will not be good enough. In our example, "cancel" can be either a verb or a noun.

4 Conceptual background

As already discussed, paraphrasing an SVC can increase the fuzzy match level during the translation process. This section details the pipeline of modules towards the paraphrase of the TM source translation units.

4.1 NooJ

The main component of our framework is NooJ (Silberztein, 2003). NooJ is a linguistic development environment used to construct large-coverage formalized descriptions of natural languages and apply them to large corpora in real time. The module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns
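Read programmatically, an entry of this form splits into a lemma, a category, and a list of property flags. A hedged sketch follows; the comma after the lemma and the `+`-separated flags mirror the entries in Figure 1, and anything beyond that is an assumption.

```python
# Parse a NooJ-style dictionary line such as "artist,N+FLX=TABLE+Hum"
# into its lemma, category, and property flags (illustrative only).
def parse_entry(line):
    lemma, rest = line.split(",", 1)
    category, *props = rest.split("+")
    return {"lemma": lemma, "category": category, "props": props}
```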

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached to the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed by a determiner, adjective, or adverb (optionally), a nominalization, and optionally a preposition, and generates the verbal paraphrase in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars created for all the other grammatical tenses and moods that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
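The paraphrasing step itself is performed by NooJ grammars. As a toy stand-in, the same idea can be mimicked for a couple of English SVCs with a nominalization lookup and a pattern for "make/makes + determiner + noun (+ of)"; the lexicon and pattern here are illustrative, not the module's actual resources.

```python
import re

# Toy SVC paraphraser: map a predicate noun to its full verb and drop
# the support verb, determiner, and trailing "of" (illustrative only).
NOMINALIZATIONS = {"cancellation": "cancel", "reservation": "reserve"}

SVC = re.compile(r"\b(make|makes)\s+(?:the|a|an)\s+(\w+)(\s+of)?\b")

def paraphrase(sentence):
    def repl(match):
        noun = match.group(2)
        if noun not in NOMINALIZATIONS:
            return match.group(0)           # leave unknown SVCs intact
        verb = NOMINALIZATIONS[noun]
        return verb + "s" if match.group(1) == "makes" else verb
    return SVC.sub(repl, sentence)
```

Applied to sentence (1) above, `paraphrase` yields sentence (2), the verbal paraphrase that raises the fuzzy match.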

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units with the original TM, if any. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM should be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1612 translation units (1025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM, based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 20141.

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance on increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%). However, we also achieve extremely high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category   plTM   parTM
100%                     14      48
95-99%                   23      32
85-94%                   18      29
75-84%                   51      38
50-74%                   32      18
No Match                 62      35
Total                   200     200

Table 1: Statistics for experimental data.

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to assess the translation quality of those segments whose fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use them without a second thought. Figure 4 shows two cases where translators selected segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required in case the match is not equal to 100%.

While there are some drops in terms of fuzzy match improvement, our system presents few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:     Make sure that the brake hose is not twisted
TMsl:    Ensure that the brake hose is not twisted
TMtg:    Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg:     CAUTION You must make the istallation of the version 6 of the software
TMsl:    CAUTION You must install the version 6 of the software
TMtg:    ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations.

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences don't share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust are high compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs and other supported languages as well. Last but not least, applying a paraphrasing framework to the target sentence may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, n. 1. Centro Interuniversitário de Estudos Germanísticos, Universidade de Coimbra, Coimbra, Portugal.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to Improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1-49. Volume 9, Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Xerox Palo Alto Research Center, Palo Alto, CA, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707-710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433-440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331-339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pages 19-26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith, J. and Clark, S. 2009. EBMT for SMT: a new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31-35, Hissar, Bulgaria, Sept 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic
{xbaisa, hales, xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of the fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities                alignment
být větší   be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases.

within CAT systems (Planas and Furuse, 2000).

In Désilets et al. (2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study of Nevado et al. (2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

Simard and Langlais (2001) describe a method of sub-segmenting translation memories which follows the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments of already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
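These records can also be handled programmatically. The following sketch parses a Moses-style phrase-table line (the ||| separators and field order are assumed for illustration, following Figure 1 rather than the exact Moses file format) and derives the three alignment properties listed above:

```python
def parse_subsegment(line):
    # Assumed record layout: src ||| tgt ||| p1 p2 p3 p4 ||| alignment
    src, tgt, probs, align = [f.strip() for f in line.split('|||')]
    return {
        'src': src.split(),
        'tgt': tgt.split(),
        # inverse phrase prob., inverse lexical weight,
        # direct phrase prob., direct lexical weight
        'probs': [float(p) for p in probs.split()],
        'align': [tuple(map(int, p.split('-'))) for p in align.split()],
    }

def alignment_properties(entry):
    # Derive the three cases discussed above: 1) empty alignment,
    # 2) one-to-many alignment, 3) opposite (reordered) orientation.
    align = entry['align']          # assumed sorted by source index
    src_aligned = {i for i, _ in align}
    counts = {}
    for i, _ in align:
        counts[i] = counts.get(i, 0) + 1
    return {
        'empty': [i for i in range(len(entry['src'])) if i not in src_aligned],
        'one_to_many': [i for i, c in counts.items() if c > 1],
        'reordered': any(j1 > j2 for (_, j1), (_, j2) in zip(align, align[1:])),
    }
```

For the first row of Figure 1, all three properties are trivially absent: every source word is aligned to exactly one target word in the same order.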

4 Subsegment combination

The output of the subsegment generation step is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments of the input document.

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords

32

Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
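To make the overlap variant concrete, the following toy sketch splices a subsegment into a segment over a shared anchor word, reproducing the Table 1 example. The real method works on aligned source-target pairs and scores the candidates, which is omitted here; the single-word-anchor policy is a simplifying assumption:

```python
def substitute_overlap(segment, subsegment):
    # SUBSTITUTEO toy sketch: if `subsegment` ends with a word that also
    # occurs in `segment` (the overlap anchor), splice `subsegment` in,
    # replacing the words that precede the anchor in `segment`.
    s, p = segment.split(), subsegment.split()
    anchor = p[-1]
    if anchor not in s:
        return None                  # no overlap, method not applicable
    k = s.index(anchor)
    return ' '.join(s[:max(k - len(p) + 1, 0)] + p + s[k + 1:])
```

Applied to the Table 1 data, substituting "následujících kategorií" into "lze rozdělit do těchto kategorií" yields "lze rozdělit do následujících kategorií".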

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.² The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments, and it discards one processed subsegment.
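The greedy joining loop can be sketched over index spans alone (translations and fluency scoring left out); the merge policy below is a simplified reading of the algorithm, not the authors' exact implementation:

```python
def can_join(a, b):
    # Index spans (start, end) over the tokenized document segment may be
    # joined when they overlap (JOINO) or are adjacent (JOINN).
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1 + 1

def join_spans(spans):
    # Greedy sketch: take the longest span, merge it with the first
    # joinable partner, and put the new longer span back at the front,
    # so that long subsegments are preferred (as described above).
    work = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    result = []
    while work:
        cur = work.pop(0)
        for i, other in enumerate(work):
            if can_join(cur, other):
                (s1, e1), (s2, e2) = sorted([cur, other])
                work.pop(i)
                work.insert(0, (s1, max(e1, e2)))
                break
        else:
            result.append(cur)       # nothing joinable: span is final
    return result
```

For example, token spans (0, 2) and (3, 5) are adjacent and merge into (0, 5), while a span (7, 8) separated by a gap stays on its own.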

² In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
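The category breakdown reported by such an analysis can be reproduced in a few lines; the thresholds below follow the rows of Tables 2-4, though the exact MemoQ category boundaries are an assumption:

```python
from collections import Counter

def match_category(score):
    # Bucket a fuzzy match score (in percent) into the MemoQ-style
    # categories used in Tables 2-4 (boundary values assumed).
    if score >= 100: return '100'
    if score >= 95:  return '95-99'
    if score >= 85:  return '85-94'
    if score >= 75:  return '75-84'
    if score >= 50:  return '50-74'
    return 'no match'

def coverage_report(scores):
    # Percentage of document segments falling into each category.
    counts = Counter(match_category(s) for s in scores)
    return {cat: 100.0 * c / len(scores) for cat, c in counts.items()}
```

Feeding the per-segment best-match scores of a document through coverage_report gives exactly the kind of per-category percentages listed in the tables that follow.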


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed over all segments including those with empty translations into the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source- and target-language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

³ http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error example, Czech → English.

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape


We plan to include a phrase assembly technique that will analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²³, Mihaela Vela², Josef van Genabith²³
¹ Jadavpur University, India
² Saarland University, Germany
³ German Research Center for Artificial Intelligence (DFKI), Germany
tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, m.vela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also to improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or by translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), into this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting: it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,² which is language-pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

¹ www.matecat.com
² The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
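The retrieval step just described amounts to scoring every TM segment against the input and keeping the five best. A minimal sketch of this selection, where `ter_score` is a placeholder for whatever similarity scorer is plugged in (lower being better), not a function of the actual CATaLog implementation:

```python
import heapq

def top_matches(test_segment, tm_pairs, ter_score, k=5):
    """Return the k TM (source, target) pairs whose source side has the
    lowest TER-style score against the test segment (lower = more similar)."""
    return heapq.nsmallest(k, tm_pairs,
                           key=lambda pair: ter_score(test_segment, pair[0]))
```

The same sketch works for any error-style metric, since ranking only needs relative scores.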

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25 5) to find the edit rate between a test sentence and the TM reference sentences.
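The core of the edit rate can be sketched as a plain word-level Levenshtein computation; note this sketch deliberately omits TER's shift operation, so it only approximates what tercom reports:

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance: insertions, deletions and
    substitutions only (TER's block shifts are not modeled here)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[m][n]

def edit_rate(first, second):
    """Edits needed to turn `second` into `first`, normalized by the
    length of the first sentence, as in the description above."""
    a, b = first.split(), second.split()
    return word_edit_distance(a, b) / max(len(a), 1)
```

On the window-table sentence pair used as the running example in this section, this yields 4 edits (one deletion, three substitutions) and an edit rate of 0.5 over the 8-word input.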

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: [“we”, “we”, C, 0] [“want”, “”, D, 0] [“to”, “would”, S, 0] [“have”, “like”, S, 0] [“a”, “a”, C, 0] [“table”, “table”, C, 0] [“near”, “by”, S, 0] [“the”, “the”, C, 0] [“window”, “window”, C, 0] [“”, “”, C, 0]

we   want  to     have  a  table  near  the  window
|    D     S      S     |  |      S     |    |
we   -     would  like  a  table  by    the  window
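A simplified reading of such an op sequence is enough to recover which TM-side tokens matched. In the sketch below the ops walk the TM match's tokens ('C' consumes a matching token, 'S' and 'D' consume a token that needs editing, 'I' consumes none); the op string is hand-transcribed from the example above:

```python
def color_tm_match(tm_tokens, ops):
    """Attach a display color to each token of the TM match, given a
    TER-style op sequence: 'C' = match (green), 'S'/'D' = edit needed
    (red), 'I' = insertion into the input side (consumes no TM token)."""
    colored, i = [], 0
    for op in ops:
        if op == 'I':
            continue
        colored.append((tm_tokens[i], 'green' if op == 'C' else 'red'))
        i += 1
    return colored

tm = "we want to have a table near the window".split()
print(color_tm_match(tm, "CDSSCCSCC"))
```

For the example above, "want", "to", "have", and "near" come out red and the rest green, mirroring the visualization described in Section 3.2.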

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
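The preference for deletions can be reproduced with a weighted edit distance. The cost values below are illustrative, not the ones used in CATaLog, and shifts are again omitted:

```python
# Illustrative edit costs: deletions are cheap, everything else costs 1.
COSTS = {"ins": 1.0, "del": 0.1, "sub": 1.0}

def weighted_edits(tm_tokens, test_tokens):
    """Weighted cost of editing a TM segment into the test segment
    (insert/delete/substitute only; TER's shifts are not modeled)."""
    m, n = len(tm_tokens), len(test_tokens)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + COSTS["del"]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + COSTS["ins"]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm_tokens[i - 1] == test_tokens[j - 1] else COSTS["sub"]
            d[i][j] = min(d[i - 1][j] + COSTS["del"],
                          d[i][j - 1] + COSTS["ins"],
                          d[i - 1][j - 1] + sub)
    return d[m][n]

test = "how much does it cost".split()
seg1 = "how much does it cost to the holiday inn by taxi".split()
seg2 = "how much".split()
# seg1: 6 cheap deletions (0.6); seg2: 3 full-cost insertions (3.0)
```

With these costs segment 1 scores 0.6 against segment 2's 3.0, so the ranking prefers it, matching the argument above.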

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, both in the selected source sentences and in their corresponding translations.
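Projecting the source-side match information onto the target through the word alignment can be sketched as follows. The pair format loosely mirrors what one would extract from a GIZA++ alignment file, 0-indexed here for simplicity:

```python
def color_target(matched_src, alignment, n_target):
    """Color target tokens: green if every source token aligned to it was
    matched by TER, red otherwise (unaligned target tokens stay red).

    alignment: (src_idx, tgt_idx) pairs, 0-indexed, e.g. read off a
    GIZA++ alignment file; matched_src: set of TER-matched source positions."""
    src_of = {}
    for s, t in alignment:
        src_of.setdefault(t, []).append(s)
    return ['green' if t in src_of and all(s in matched_src for s in src_of[t])
            else 'red'
            for t in range(n_target)]
```

Defaulting unaligned target tokens to red is one possible policy; a tool could equally leave them uncolored for the translator to inspect.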

Color coding the TM source sentences makes explicit which portions of matching TM source sentences match the test sentence and which ones do not. Color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context, it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me


5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in some input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature exists to compare results with and without posting lists: in the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
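The posting-list filtering described in this subsection can be sketched as follows; the stop-word list is illustrative, not the one used by the tool:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "to", "of", "is"}  # illustrative list

def build_index(tm_sources):
    """Posting lists: lowercased surface word -> ids of TM segments
    containing it (no stemming, mirroring the description above)."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sid)
    return index

def candidates(index, test_sentence):
    """Only TM segments sharing at least one vocabulary word with the
    input are considered for the (expensive) TER comparison."""
    ids = set()
    for word in test_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["how much does it cost",
      "you gave me the wrong change",
      "you are wrong"]
idx = build_index(tm)
print(sorted(candidates(idx, "you gave me wrong number")))  # → [1, 2]
```

The index is built once offline; at query time only the union of a handful of posting lists is scored, which is what makes large TM databases tractable.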

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file with which the post-editors can directly work later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, the popular keystroke logging among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future, we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations. We believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper, as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs. Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

module consists of a very large lexicon along with a large set of local grammars to recognize named entities as well as unknown words, word sequences, etc. These resources have been obtained from OpenLogos, an old open-source rule-based MT system (Scott and Barreiro, 2009). In NooJ, an electronic dictionary contains the lemmas with a set of information such as the category/part-of-speech (e.g., V for verbs, A for adjectives, etc.), one or more inflectional and/or derivational paradigms (e.g., how to conjugate verbs, how to lemmatize or nominalize them, etc.), one or more syntactic properties (e.g., +transitiv for transitive verbs, or +PREPin, etc.), one or more semantic properties (e.g., distributional classes such as +Human, domain classes such as +Politics), and finally one or more equivalent translations (+IT="translation equivalent"). Figure 1 illustrates typical dictionary entries.

artist,N+FLX=TABLE+Hum
cousin,N+FLX=TABLE+Hum
pen,N+FLX=TABLE+Conc
table,N+FLX=TABLE+Conc
man,N+FLX=MAN+Hum

Figure 1: Dictionary entries in NooJ for nouns

4.2 Paraphrasing the source translation units

The generation of the TM that contains the paraphrased translation units is straightforward. The architecture of the process, which is summarized in Figure 2, is performed in three pipelines.

Figure 2: Pipeline of the paraphrase framework

The first pipeline includes the extraction of the source translation units of a given TM. The target translation units are protected so that they will not be parsed by the framework. This step also includes the tokenization process. Tokenization of the English data is done using the Berkeley Tokenizer (Petrov et al., 2006). The same tool is also used for the de-tokenization process in the last step.

Then all the source translation units pass through NooJ to identify the SVCs using the local grammar of Figure 3. To do so, NooJ first pre-processes and analyses the text based on specific dictionaries and grammars attached in the module. This is a crucial step, because if the text is not correctly analyzed, the local grammar will not identify all the potential SVCs and therefore there will not be any gain in terms of fuzzy matching. Once the text is analyzed, all the possible SVCs are identified and hence paraphrased.

Figure 3: Local grammar for identification and paraphrasing of SVCs

In more detail, the local grammar checks for a support verb followed by a determiner, adjective, or adverb (optionally), a nominalization, and optionally by a preposition, and generates the verbal paraphrases in the same tense and person as the source. We should note that this graph recognizes and paraphrases only SVCs in the simple present indicative tense. However, our NooJ module contains grammars created for all the other grammatical tenses and moods that follow the same structure. The elements in red characterize the variables as verb and predicate nouns. The elements <$V=$V+PR+1+s> and $N_PR+1+s represent lexical constraints that are displayed in the output, such as the specification of the support verb that belongs to a specific SVC. These particular elements refer to the first person singular of the simple present tense. The predicate noun is identified, mapped to its deriver, and displayed as a full verb, while the other elements of the sentence are eliminated. The final output of NooJ is a sentence that contains the paraphrase instead of the SVC, where applicable.
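For illustration, a drastically simplified, regex-based stand-in for such a local grammar is sketched below. Unlike the NooJ grammar, it handles only a hand-listed nominalization lexicon, a few support verb forms, and does not preserve tense or person:

```python
import re

# Hand-listed nominalization -> full-verb lexicon (NooJ derives these
# from its dictionary's derivational paradigms; this list is made up).
NOMINALIZATIONS = {"installation": "install", "decision": "decide",
                   "analysis": "analyze"}
SUPPORT_VERBS = r"(?:make|makes|take|takes|have|has)"

PATTERN = re.compile(
    rf"\b{SUPPORT_VERBS}\s+(?:the|a|an)\s+(\w+)(?:\s+of)?\b",
    re.IGNORECASE)

def paraphrase_svc(sentence):
    """Replace 'support verb + determiner + nominalization (+ of)'
    with the corresponding full verb; leave anything else untouched."""
    def repl(match):
        noun = match.group(1).lower()
        if noun in NOMINALIZATIONS:
            return NOMINALIZATIONS[noun]
        return match.group(0)          # not a known predicate noun
    return PATTERN.sub(repl, sentence)

print(paraphrase_svc("You must make the installation of the version 6"))
# → You must install the version 6
```

A sentence without a recognized SVC passes through unchanged, which mirrors how the framework only adds paraphrased variants where the grammar fires.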

The last pipeline contains the de-tokenization as well as the concatenation of the paraphrased translation units into the original TM, if any. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM can be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem. In this work, we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in the fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually in order to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.1

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance on increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0%-74%). However, we achieve extremely high numbers in the other categories as well. Our approach improves the scores by 17% in the 100% match category, 5% in category 95%-99%, 6% in category 85%-94%, 28% in category 75%-84%, and finally 27% in category 0%-74% (No match + 50%-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

1 http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category    plTM    parTM
100%                      14       48
95%-99%                   23       32
85%-94%                   18       29
75%-84%                   51       38
50%-74%                   32       18
No Match                  62       35
Total                    200      200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments for which the fuzzy match score is improved because of the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use them without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase the retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required when the match is not equal to 100%.

While there are some drops in terms of fuzzy match improvement, our system presents a few weaknesses. Most of them regard out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg: Make sure that the brake hose is not twisted
TMsl: Ensure that the brake hose is not twisted
TMtg: Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl: Make sure that the brake hose is not twisted

Seg: CAUTION You must make the istallation of the version 6 of the software
TMsl: CAUTION You must install the version 6 of the software
TMtg: ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar, but not identical, sentences. We use NooJ to create equivalent paraphrases of the source texts to improve as much as possible the translation fuzzy match level, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types, and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition to this, the translators' satisfaction and trust is abundant compared to MT approaches.

In the future we will continue to explore paraphrasing of other support verbs, and other supported languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even more.

References

Athayde M F 2001 Construccedilotildees com verbo-suporte

(funktionsverbgefuumlge) do portuguecircs e do alematildeo In

Cadernos do CIEG Centro Interuniver-sitaacuterio de Es-

tudos Germaniacutesticos n 1 Coimbra Portugal Uni-

versidade de Coimbra

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. In Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit, pages 331–339.

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation, Alacant, pp. 19–26.


Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of sizes varying from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs translates directly into savings for translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136   0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148   0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation: 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
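To make the three situations concrete, here is a small illustrative sketch (the function names and structure are ours, not the authors' code): it parses a Moses-style alignment-point string and flags empty, one-to-many and reordered alignments.

```python
def parse_points(points):
    """Turn an alignment string such as "0-0 1-1" into (src, tgt) index pairs."""
    return [tuple(map(int, p.split("-"))) for p in points.split()]

def classify(pairs, src_len, tgt_len):
    """Flag the three situations the paper lists for a subsegment alignment."""
    aligned_src = {s for s, _ in pairs}
    aligned_tgt = {t for _, t in pairs}
    # 1) empty alignment: some source or target word has no link at all
    empty = len(aligned_src) < src_len or len(aligned_tgt) < tgt_len
    # 2) one-to-many: more links than distinct words on one side
    one_to_many = len(pairs) > len(aligned_src) or len(pairs) > len(aligned_tgt)
    # 3) opposite orientation: two links cross each other
    reordered = any(s1 < s2 and t1 > t2
                    for i, (s1, t1) in enumerate(pairs)
                    for (s2, t2) in pairs[i + 1:])
    return {"empty": empty, "one_to_many": one_to_many, "reordered": reordered}
```

For the Figure 1 entry, `classify(parse_points("0-0 1-1"), 2, 2)` reports none of the three situations, while a string such as "0-1 1-0" would be flagged as reordered.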

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        • "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      • "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; the output is NS.
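One way the overlap-based substitution could be realised is sketched below (a reconstruction under our own assumptions, not the authors' implementation): the new subsegment is anchored where its tail overlaps the original segment, and the tokens it spans are replaced, reproducing the Table 1 example.

```python
def substitute_overlap(seg, sub):
    """Splice token list `sub` into `seg` at the longest tail overlap.
    Returns the new token list, or None if seg and sub do not overlap."""
    for k in range(min(len(seg), len(sub)), 0, -1):   # try longest overlap first
        tail = sub[-k:]
        for j in range(len(seg) - k + 1):
            if seg[j:j + k] == tail:
                start = j - (len(sub) - k)            # where sub would begin in seg
                if start >= 0:
                    return seg[:start] + sub + seg[j + k:]
    return None

seg = "lze rozdělit do těchto kategorií".split()
sub = "následujících kategorií".split()
print(" ".join(substitute_overlap(seg, sub)))
# → lze rozdělit do následujících kategorií
```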

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
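As a toy illustration of such a fluency check (the actual system queries a KenLM model; the count-based scheme below is our own simplification), a combined n-gram score for n = 1..5 can be computed from corpus counts:

```python
from collections import Counter
from math import log

def ngram_counts(corpus_tokens, max_n=5):
    """Count all n-grams, n = 1..max_n, in a tokenized corpus."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(corpus_tokens) - n + 1):
            counts[tuple(corpus_tokens[i:i + n])] += 1
    return counts

def fluency(candidate_tokens, counts, max_n=5):
    """Combined n-gram score: sum of log(1 + count) over the candidate's n-grams."""
    score = 0.0
    for n in range(1, max_n + 1):
        for i in range(len(candidate_tokens) - n + 1):
            score += log(1 + counts[tuple(candidate_tokens[i:i + n])])
    return score
```

A candidate combination whose word order matches the corpus ("the following") then outscores an unattested order ("following the"), which is the signal used to rank, or reject, combined subsegments.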

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments, and it discards one processed subsegment.
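The greedy loop described above can be sketched roughly as follows (our reconstruction, not the authors' code): subsegments are (start, end) token spans in the input segment, and a join merges two spans that overlap or are immediate neighbours, always extending the longest remaining span first.

```python
def join_spans(spans):
    """Greedily merge overlapping or neighbouring (start, end) token spans,
    longest-first, mirroring the JOIN method's preference for long subsegments."""
    pending = sorted(set(spans), key=lambda s: s[1] - s[0], reverse=True)
    result = []
    while pending:
        head = pending.pop(0)              # longest remaining subsegment
        grew = True
        while grew:                        # join it successively with the others
            grew = False
            for s in list(pending):
                # half-open spans touch or overlap when the intervals intersect
                if head[1] >= s[0] and s[1] >= head[0]:
                    head = (min(head[0], s[0]), max(head[1], s[1]))
                    pending.remove(s)
                    grew = True
        result.append(head)                # nothing joins any more: emit it
    return result
```

For example, the spans (0,3), (2,5) and (5,7) of a nine-token segment collapse into a single span (0,7), while an isolated span such as (8,9) is kept as-is.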

2 In the current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods, we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).

33

Table 2: MemoQ analysis, TM coverage in %.

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95–99%   0.84   0.91   0.64   0.90   1.37
85–94%   0.07   0.05   0.25   0.76   0.81
75–84%   0.80   0.91   1.71   3.78   4.40
50–74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95–99%   0.75   0.44   0.49   0.66
85–94%   0.05   0.08   0.49   0.61
75–84%   0.46   0.96   3.67   3.85
50–74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95–99%   0.98   0.59   1.24   1.45
85–94%   0.09   0.22   1.34   1.37
75–84%   1.03   2.26   6.35   7.07
50–74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment Assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment Assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, compared with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations into the target language. When we count just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they appear in the segment, assuming the same order in the target language; see Table 6.

Table 6: Error examples, Czech → English.

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, m.vela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005), in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors to use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
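The retrieval step can be sketched as follows (a plain word-level Levenshtein edit rate stands in for TER here, which additionally supports shifts; the function names are ours, not the tool's):

```python
def edit_rate(hyp, ref):
    """Word-level edit distance between two sentences, divided by the
    reference length (a simplified, shift-free stand-in for TER)."""
    h, r = hyp.split(), ref.split()
    # d[i][j] = edits to turn the first i words of h into the first j words of r
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(r) + 1)]
         for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1,                       # insertion
                          d[i - 1][j - 1] + (h[i - 1] != r[j - 1]))  # (no-)substitution
    return d[len(h)][len(r)] / max(len(r), 1)

def top_matches(test_seg, tm_segments, k=5):
    """Rank TM source segments by edit rate against the test segment."""
    return sorted(tm_segments, key=lambda s: edit_rate(s, test_seg))[:k]
```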

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment, and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric, as implemented in the tercom-7251 package,5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7251 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
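The weighted scheme just described can be sketched as a small scoring function over TER edit-operation sequences. The operation labels follow TER (C/S/I/D); the concrete weights below, in particular the cheap deletion, are illustrative assumptions rather than the exact values used in CATaLog.

```python
# Rank TM matches by a weighted sum over the TER edit operations.
# Weights are illustrative: deletion is deliberately much cheaper.
WEIGHTS = {"C": 0.0,  # match: no post-editing needed
           "S": 1.0,  # substitution
           "I": 1.0,  # insertion
           "D": 0.1}  # deletion

def edit_cost(ops):
    """Total post-editing cost of a TER alignment (sequence of op labels)."""
    return sum(WEIGHTS[op] for op in ops)

def rank_matches(candidates):
    """candidates: list of (tm_segment, ops); the cheapest match ranks first."""
    return sorted(candidates, key=lambda c: edit_cost(c[1]))

# Test segment: "how much does it cost".
# TM segment 1 needs 6 deletions, TM segment 2 needs 3 insertions.
candidates = [
    ("how much does it cost to the holiday inn by taxi",
     ["C"] * 5 + ["D"] * 6),
    ("how much", ["C"] * 2 + ["I"] * 3),
]
best_segment = rank_matches(candidates)[0][0]
```

With uniform weights TM segment 2 would win (3 edits vs. 6); with the cheap deletion, TM segment 1 wins (cost 0.6 vs. 3.0), mirroring the argument above.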

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the green-to-red ratio. Secondly, it guides translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation that has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me

5. you 're overcharging me

Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows these color coded top 5 TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) required for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval based on an inverted index.

We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to switch these posting lists on or off; this feature is there to compare results obtained with and without posting lists. Ideally the TM output should be the same in both cases, though the time taken to produce the output will differ significantly.
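The posting-list filter can be sketched as a toy inverted index. The stop-word list and whitespace tokenization below are simplified stand-ins for whatever preprocessing the tool actually uses.

```python
# Inverted index from (lowercased, stop-word-filtered, unstemmed) words
# to the TM source sentences containing them. Only sentences sharing at
# least one vocabulary word with the input are later scored with TER.
from collections import defaultdict

STOP = {"the", "a", "an", "to", "of", "is", "it"}  # illustrative stop list

def tokenize(sentence):
    return [w.lower() for w in sentence.split() if w.lower() not in STOP]

def build_index(tm_sources):
    index = defaultdict(set)
    for i, src in enumerate(tm_sources):
        for w in tokenize(src):
            index[w].add(i)
    return index

def candidates(index, input_sentence):
    """Union of the posting lists for the input's vocabulary words."""
    ids = set()
    for w in tokenize(input_sentence):
        ids |= index.get(w, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you gave me the wrong change"]
idx = build_index(tm)
```

Only the returned candidate set is then scored with TER, which keeps the expensive alignment computation off most of the TM.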

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can later work with directly in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the coming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry".

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

translation units in the original TM, if any. The paraphrased translation units have the same properties, tags, etc. as the original units.

This TM should be imported and used in the same way as before in all CAT tools. As of now, our approach can be applied only to TMs that have English as the source language. As mentioned earlier, there is no limit on the target language, given that we apply our approach only to the source-language translation units.

5 Experimental results

The aim of this research is to provide translators with fuzzy match scores higher than before in case the TM contains a translation unit which has the same meaning as the sentence to be translated. Given that there is no automatic evaluation for this purpose, we formulate this as a ranking problem.

In this work we analyze a set of 100 sentences from the automotive domain and 100 from the IT domain to measure the difference in fuzzy match scores between our approach (parTM) and the conventional translation process where a plain TM is used (plTM). This test set was selected manually so as to ensure that each sentence contains at least one SVC.

Our method has been applied to a TM which contained 1,025 EN-IT translation units. Our module recognized 587 SVCs, so the generated TM (parTM) contained 1,612 translation units (1,025 original + 587 paraphrases). The TM contains translations that have been taken from a larger TM based on the degree of fuzzy match, which at least meets the minimum threshold of 50%. To create the analysis report logs we used Trados Studio 2014.¹

The results of both analyses are given in Table 1.

Our paraphrased TM attains state-of-the-art performance in increasing the fuzzy match leveraging. It is interesting to note that the highest gains are achieved in the low fuzzy categories (0-74%), although we also achieve very high numbers in the other categories. Our approach improves the scores by 17% in the 100% match category, 5% in category 95-99%, 6% in category 85-94%, 28% in category 75-84%, and finally 27% in category 0-74% (No match + 50-74%). This is a clear indication that paraphrasing of SVCs significantly improves the retrieval results and hence the productivity.

¹ http://www.sdl.com/cxc/language/translation-productivity/trados-studio

Fuzzy match category | plTM | parTM
100%                 |   14 |    48
95-99%               |   23 |    32
85-94%               |   18 |    29
75-84%               |   51 |    38
50-74%               |   32 |    18
No Match             |   62 |    35
Total                |  200 |   200

Table 1: Statistics for experimental data

To check the quality of the retrieved segments, human judgment was carried out by professional translators. The test set consists of retrieved segments with a fuzzy match score >= 85% (108 segments). The motivation for this evaluation is twofold: firstly, to show how much impact paraphrasing of SVCs has in terms of retrieval, and secondly, to see the translation quality of those segments whose fuzzy match score is improved by the paraphrasing process.

According to the translators, paraphrasing helps and speeds up the translation process. Moreover, the fact that the target segments remain "as is" encourages them to use the suggestions without a second thought. Figure 4 shows two cases where translators selected to use segments from the parTM. We can see that paraphrasing not only helps to increase retrieval but also ensures that the proposed translation is a human translation, so no errors will appear and less post-editing is required in case the match is not equal to 100%.

Despite the gains in fuzzy match improvement, our system presents a few weaknesses. Most of them concern out-of-vocabulary words during the analysis process by NooJ. Although our NooJ module contains a very large lexicon along with a very large set of local grammars to recognize and paraphrase SVCs, a few translation units (6 segments) were not paraphrased. In addition to this, 2 segments were paraphrased incorrectly. This happens because they contain out-of-vocabulary words or because of their syntactic complexity. This is one of our approach's weaknesses that will be addressed in future projects.

Seg:      Make sure that the brake hose is not twisted
TMsl:     Ensure that the brake hose is not twisted
TMtg:     Assicurarsi che il tubo flessibile freni non sia attorcigliato
parTMsl:  Make sure that the brake hose is not twisted

Seg:      CAUTION You must make the istallation of the version 6 of the software
TMsl:     CAUTION You must install the version 6 of the software
TMtg:     ATTENZIONE Si deve installare la versione 6 del software
parTMsl:  CAUTION You must make the istallation of the version 6 of the software

Figure 4: Accepted translations

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated in different experiments with translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for the low fuzzy match categories. In addition, translators' satisfaction and trust is considerably higher compared to MT approaches.

In the future we will continue to explore the paraphrasing of other support verbs and to support other languages as well. Last but not least, applying a paraphrase framework to the target sentence may improve the quality of translations even more.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern matching algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International workshop: modern approaches in translation technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center, October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th anniversary congress of the Swiss association of translators, terminologists and interpreters.

Petrov, S., Barrett, L., Thibaux, R. and Klein, D. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331–339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the first international workshop on free/open-source rule-based machine translation, Alacant, pp. 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and their coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The benefit of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand on enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ         | EN         | probabilities               | alignment
být větší  | be greater | 0.538 0.053 0.538 0.136     | 0-0 1-1
být větší  | be larger  | 0.170 0.054 0.019 0.148     | 0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases)

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted SUB. The output of subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
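The three cases can be read off the alignment points mechanically. A minimal sketch follows; the string format mirrors the "0-0 1-1" notation of Figure 1, and the helper names are our own:

```python
# Classify alignment points of a subsegment pair: empty alignments,
# one-to-many links, and reordered (opposite-orientation) link pairs.
from collections import Counter

def parse_points(s):
    """Parse alignment points like '0-0 1-1' into (src, tgt) index pairs."""
    return [tuple(map(int, p.split("-"))) for p in s.split()]

def unaligned_source(points, src_len):
    """Source positions with an empty alignment."""
    aligned = {s for s, _ in points}
    return [i for i in range(src_len) if i not in aligned]

def one_to_many(points):
    """Source positions aligned to more than one target word."""
    counts = Counter(s for s, _ in points)
    return [s for s, c in counts.items() if c > 1]

def reversed_pairs(points):
    """Link pairs whose relative order flips between the two languages."""
    return [(p, q) for p in points for q in points
            if p[0] < q[0] and p[1] > q[1]]

pts = parse_points("0-0 1-1")  # the 'být větší' / 'be greater' example
```

For the Figure 1 example all three lists are empty, i.e., a monotone one-to-one alignment.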

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

  1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

  2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the subsegment non-overlapping combina-tion the acceptability of the combination is de-cided (and ordered) by measuring a language flu-ency score obtained by a combined n-gram score(for n = 〈15〉) from a target language model2

The quality of the subsegment translations can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and tries to join it successively with the other subsegments. If it succeeds, each new subsegment is appended to a temporary list T; after all other subsegments are processed, T is prepended to I and the algorithm restarts with the new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
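One possible reading of this greedy procedure, with subsegments represented as (start, end) token spans — a sketch under our reading, not the authors' implementation:

```python
def try_join(a, b):
    """Join two (start, end) token spans if they overlap or neighbour."""
    (s1, e1), (s2, e2) = sorted([a, b])
    if s2 <= e1 + 1:    # s2 <= e1: overlap (JOIN-O); s2 == e1 + 1: adjacency (JOIN-N)
        return (s1, max(e1, e2))
    return None

def join_subsegments(spans):
    """Greedy combination preferring longer spans -- a sketch of the JOIN
    procedure as we read it, not the authors' exact implementation."""
    queue = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    results = set(queue)
    while queue:
        biggest, new = queue[0], []
        for other in queue[1:]:
            joined = try_join(biggest, other)
            if joined and joined not in results:
                new.append(joined)
                results.add(joined)
        # on success, newly built (longer) spans are prepended and tried first;
        # otherwise the processed span is discarded
        queue = new + queue if new else queue[1:]
    return results

# Spans (0,2) and (2,3) overlap; (2,3) and (3,4) are neighbours:
joined = join_subsegments([(0, 2), (3, 4), (2, 3)])
assert (0, 4) in joined
```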

2 In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences of the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the company's complete TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the input document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match     TM      SUB     OJ      NJ      all
100%      0.41    0.12    0.10    0.17    0.46
95-99%    0.84    0.91    0.64    0.90    1.37
85-94%    0.07    0.05    0.25    0.76    0.81
75-84%    0.80    0.91    1.71    3.78    4.40
50-74%    8.16    10.05   25.09   40.95   42.58
any       10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95-99%    0.75    0.44    0.49    0.66
85-94%    0.05    0.08    0.49    0.61
75-84%    0.46    0.96    3.67    3.85
50-74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95-99%    0.98    0.59    1.24    1.45
85-94%    0.09    0.22    1.34    1.37
75-84%    1.03    2.26    6.35    7.07
50-74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments including those with empty translations to the target language.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; in this case, however, we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model; results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT-Clarin project LM2010013 and by the Grant Agency of the Czech Republic within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building Large Heterogeneous Translation Memories from Parallel Web Content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel Implementations of Word Alignment Tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen Corpus Family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation Memories Enrichment by Statistical Bilingual Segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing Translation Memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level Similar Segment Matching Algorithm for Translation Memories and Example-Based Machine Translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential Exploitation of Translation Memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A Freely Available Translation Memory in 22 Languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the World's Largest Translation Memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek (1), Sudip Kumar Naskar (1), Santanu Pal (2), Marcos Zampieri (2,3), Mihaela Vela (2), Josef van Genabith (2,3)

(1) Jadavpur University, India
(2) Saarland University, Germany
(3) German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge, such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015), as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

1 www.matecat.com
2 The tool will be released as free, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments against the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-0.7.25)5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-0.7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0], ["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0], ["window","window",C,0], [".",".",C,0]

we want to    have a table near the window .
|  D    S     S    | |     S    |   |      |
we -    would like a table by   the window .

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
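This weighting can be reproduced with a word-level weighted edit distance. Shifts, the fourth TER operation, are omitted in this sketch, and the weight values are illustrative rather than the tool's actual settings:

```python
def weighted_edit_cost(tm, test, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Word-level cost of editing a TM segment into the test segment.
    Shifts (the fourth TER operation) are omitted in this sketch; the
    cheap delete mirrors the weighting described above."""
    m, n = len(tm), len(test)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_del
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm[i - 1] == test[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,      # delete a TM word
                          d[i][j - 1] + w_ins,      # insert a test word
                          d[i - 1][j - 1] + sub)    # match or substitute
    return d[m][n]

test_seg = "how much does it cost ?".split()
tm1 = "how much does it cost to the holiday inn by taxi ?".split()
tm2 = "how much ?".split()
# With equal weights TM segment 2 looks better; with a cheap delete,
# TM segment 1 wins, as argued above:
assert weighted_edit_cost(tm1, test_seg, w_del=1.0) > weighted_edit_cost(tm2, test_seg, w_del=1.0)
assert weighted_edit_cost(tm1, test_seg) < weighted_edit_cost(tm2, test_seg)
```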

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments, and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই 

• Alignment: NULL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
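Such an alignment line can be parsed with a short helper. This is a sketch for the single-line format shown above; real GIZA++ output spreads source and target over several lines, and the function name is ours:

```python
import re

def parse_giza_line(line):
    """Parse an alignment line of the form "word ( j1 j2 ... )" into
    (source_word, [1-based target positions]) pairs. A sketch for the
    format shown above; it assumes words contain no parentheses."""
    return [(word, [int(i) for i in idx.split()])
            for word, idx in re.findall(r"(\S+)\s+\(\s*([\d\s]*)\)", line)]

line = ("NULL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) "
        "table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )")
aligned = dict(parse_giza_line(line))
assert aligned["we"] == [1] and aligned["to"] == [] and aligned["."] == [7]
```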

GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.

Color coding the TM source sentences makes explicit which portions of matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
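Projecting the match/mismatch information from the source side onto the target side can be sketched as follows; the policy for unaligned target words is our assumption, not necessarily the tool's exact rule:

```python
def color_target(src_flags, alignment, tgt_len):
    """Project source-side TER match flags onto the target side.

    src_flags[i] is True when source word i matched the input segment;
    alignment maps a source index to a list of 1-based target positions,
    as in the word alignment shown earlier. A target word is green only
    if every source word aligned to it matched; unaligned target words
    default to red -- one possible policy, not necessarily the tool's.
    """
    support = [[] for _ in range(tgt_len)]
    for s, targets in alignment.items():
        for t in targets:
            support[t - 1].append(src_flags[s])
    return ["green" if flags and all(flags) else "red" for flags in support]

# Source words 0 and 2 matched, word 1 did not:
assert color_target([True, False, True], {0: [1], 1: [2], 2: [3]}, 3) == \
    ["green", "red", "green"]
```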

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change . i paid eighty dollars

2. i think you 've got the wrong number

3 you are wrong

4 you pay me


5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows these five color-coded topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in an input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
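A minimal posting-list index of the kind described above; the stop-word list and function names are ours, for illustration only:

```python
from collections import defaultdict

# A hypothetical stop-word list; the paper filters frequent, uninformative tokens.
STOP = {"the", "a", "to", "is"}

def build_postings(tm_sources):
    """Inverted index: lowercase surface form -> set of TM sentence ids.
    No stemming, matching the surface-form policy described above."""
    postings = defaultdict(set)
    for sid, sent in enumerate(tm_sources):
        for w in sent.lower().split():
            if w not in STOP:
                postings[w].add(sid)
    return postings

def candidates(postings, query):
    """Only TM sentences sharing at least one vocabulary word are scored."""
    ids = set()
    for w in query.lower().split():
        ids |= postings.get(w, set())
    return ids

tm = ["you gave me the wrong change", "you are wrong", "we want a table"]
p = build_postings(tm)
print(candidates(p, "you gave me wrong number"))  # -> {0, 1}
```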

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with the color coding scheme.

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Figure 4: Accepted translations.
Seg: CAUTION You must make the istallation of the version 6 of the software
TMsl: CAUTION You must install the version 6 of the software
TMtg: ATTENZIONE Si deve installare la versione 6 del software
parTMsl: CAUTION You must make the istallation of the version 6 of the software
parTMsl: Make sure that the brake hose is not twisted

6 Conclusion

In this paper we have presented a method that improves the fuzzy match of similar but not identical sentences. We use NooJ to create equivalent paraphrases of the source texts to improve the translation fuzzy match level as much as possible, given that the meaning is the same but the sentences do not share the same lexical items.

The hybridization strategy implemented has already been evaluated with different experiments, translators, text types and language pairs, which showed that it is very effective. The results show that for all fuzzy-match ranges our approach performs markedly better than the plain TM, especially for low fuzzy match categories. In addition to this, the translators' satisfaction and trust is considerably higher compared to MT approaches.

In the future we will continue to explore ways of paraphrasing other support verbs, and other supported languages as well. Last but not least, a paraphrase framework applied to the target sentence may improve the quality of translations even further.

References

Athayde, M. F. 2001. Construções com verbo-suporte (Funktionsverbgefüge) do português e do alemão. In Cadernos do CIEG, Centro Interuniversitário de Estudos Germanísticos, n. 1. Coimbra, Portugal: Universidade de Coimbra.

Barreiro, A. 2008. Make it simple with paraphrases: Automated paraphrasing for authoring aids and machine translation. Ph.D. thesis, Universidade do Porto.

Biçici, E. and Dymetman, M. 2008. Dynamic Translation Memory: Using Statistical Machine Translation to improve Translation Memory Fuzzy Matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2008), LNCS, Haifa, Israel, February 2008.

Butt, M. 2003. The Light Verb Jungle. In Harvard Working Papers in Linguistics, ed. G. Aygen, C. Bowern and C. Quinn, 1–49. Volume 9. Papers from the GSAS/Dudley House workshop on light verbs.

Dong, M., Cheng, Y., Liu, Y., Xu, J., Sun, M., Izuha, T. and Hao, J. 2014. Query lattice for translation retrieval. In COLING.

Gupta, R. and Orasan, C. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation.

He, Y., Ma, Y., van Genabith, J. and Way, A. 2010a. Bridging SMT and TM with translation recommendation. In ACL.

He, Y., Ma, Y., Way, A. and van Genabith, J. 2010b. Integrating n-best SMT outputs into a TM system. In COLING.

Hirschberg, Daniel S. 1997. Serial computations of Levenshtein distances. Pattern Matching Algorithms. Oxford University Press, Oxford.

Hodász, G. and Pohl, G. 2005. MetaMorpho TM: a linguistically enriched translation memory. In International Workshop: Modern Approaches in Translation Technologies.

Kay, M. 1980. The proper place of men and machines in language translation. Palo Alto, CA: Xerox Palo Alto Research Center. October 1980. 21pp.

Koehn, P. and Senellart, J. 2010. Convergence of translation memory and statistical machine translation. In AMTA.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, pages 707–710.

Pekar, V. and Mitkov, R. 2007. New Generation Translation Memory: Content-Sensitive Matching. In Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters.

Petrov, S., Leon, B., Romain, T. and Dan, K. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the COLING/ACL, pages 433–440.

Planas, E. and Furuse, O. 1999. Formalizing Translation Memories. In Proceedings of the 7th Machine Translation Summit (pp. 331–339).

Scott, B. and Barreiro, A. 2009. OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation. Alacant, pp. 19–26.

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, Sept 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large publicly available DGT translation memory published by the European Commission. The asset of the TM expansion methods was evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TM). The translation memories are usually in-house, costly and manually created resources of varying sizes of thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document, and at the same time try to keep a reasonable quality of newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ         | EN         | probabilities           | alignment
být větší  | be greater | 0.538 0.053 0.538 0.136 | 0-0 1-1
být větší  | be larger  | 0.170 0.054 0.019 0.148 | 0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e. that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results. The subsegments are used in segment combination methods to obtain new larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e. 0-0 1-1 means that the first word "být" from the source language is translated to the first word in the translation, "be", and the second word "větší" to the second, "greater". These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
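The subsegment records described above can be handled mechanically. The sketch below parses one Moses-style phrase-table line (fields separated by `|||`) into the four probabilities and the alignment points, using the Figure 1 entry as sample data; the parsing function is our illustration, not part of the authors' pipeline.

```python
# Sketch: read a Moses-style phrase-table entry into source/target phrases,
# the four probabilities, and the alignment points described above.
def parse_phrase_entry(line):
    src, tgt, probs, align = [f.strip() for f in line.split("|||")]
    return {
        "source": src.split(),
        "target": tgt.split(),
        # inverse phrase prob, inverse lexical weight,
        # direct phrase prob, direct lexical weight
        "probs": [float(x) for x in probs.split()],
        # alignment points: (source index, target index) pairs
        "alignment": [tuple(map(int, pt.split("-"))) for pt in align.split()],
    }

line = "být větší ||| be greater ||| 0.538 0.053 0.538 0.136 ||| 0-0 1-1"
entry = parse_phrase_entry(line)
print(entry["probs"][2])   # direct phrase translation probability: 0.538
print(entry["alignment"])  # [(0, 0), (1, 1)]
```

An empty alignment would show up as a source index with no pair, a one-to-many alignment as one source index paired with several target indices, and opposite orientation as decreasing target indices.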

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; output is J.

1. JOINO: joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: joined subsegments neighbour in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = ⟨1, 5⟩) from a target language model.2

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
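To make the fluency check concrete, the toy sketch below scores a candidate segment by a combined n-gram score for n = 1..5 against a target-language model. The paper queries a KenLM model; here a simple count-based coverage score stands in, and all names and training sentences are invented for illustration.

```python
# Toy stand-in for the combined n-gram fluency score: average, over n = 1..5,
# of the fraction of the candidate's n-grams that were seen in training data.
from collections import Counter

def train_counts(sentences, max_n=5):
    counts = Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    return counts

def fluency_score(segment, counts, max_n=5):
    toks = ["<s>"] + segment.split() + ["</s>"]
    per_n = []
    for n in range(1, max_n + 1):
        ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        if ngrams:
            per_n.append(sum(g in counts for g in ngrams) / len(ngrams))
    return sum(per_n) / len(per_n)

counts = train_counts(["the following categories",
                       "divided into the following categories"])
# a fluent word order outscores a scrambled one
print(fluency_score("into the following categories", counts) >
      fluency_score("categories following the into", counts))  # True
```

A real setup would replace the coverage score with log-probabilities from a smoothed language model, but the ranking role in the combination step is the same: candidates below a fluency threshold are discarded.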

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and it discards one processed subsegment.

2 In current experiments, we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).
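A condensed sketch of the JOIN idea: subsegments are represented as (start, end) token spans within one segment, processed longest first, and two spans are merged when they neighbour (JOINN) or overlap (JOINO). This greedy merge simplifies the algorithm described above (it omits the temporary-list bookkeeping and fluency filtering), and all names are ours.

```python
# Greedy merging of subsegment index spans, longest first, as a simplified
# model of the JOIN method; spans are half-open (start, end) token ranges.
def can_join(a, b, overlap):
    a, b = sorted([a, b])
    return a[1] > b[0] if overlap else a[1] == b[0]  # JOINO vs JOINN

def join_spans(spans, overlap=False):
    spans = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    changed = True
    while changed:
        changed = False
        for i in range(len(spans)):
            for j in range(i + 1, len(spans)):
                if can_join(spans[i], spans[j], overlap):
                    a, b = sorted([spans[i], spans[j]])
                    merged = (a[0], max(a[1], b[1]))
                    spans = [s for k, s in enumerate(spans) if k not in (i, j)]
                    spans.append(merged)
                    # newly built longer spans are preferred in the next pass
                    spans.sort(key=lambda s: s[1] - s[0], reverse=True)
                    changed = True
                    break
            if changed:
                break
    return spans

# tokens 0-2 and 2-5 are neighbours (JOINN) and merge into span 0-5
print(join_spans([(0, 2), (2, 5), (7, 9)]))  # [(0, 5), (7, 9)]
```

In the full method, each merged span also carries its translation, built by concatenating the subsegment translations and ranked by the fluency score.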

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs filtered from the complete company's TM by the same topic as the tested documents. For a comparison, we have run and evaluated the methods also on the publicly available DGT translation memory (Steinberger et al., 2013) with a size of over 300,000 translation pairs.

For measuring coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent coverage of segments from the input document versus segments from expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).
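The bucketing step of such an analysis can be sketched directly: each input segment gets the fuzzy-match score of its best TM hit, and scores are counted per coverage band. The bands follow Tables 2–4; the banding function and sample scores are our illustration, and MemoQ's internal scoring is more involved.

```python
# Sketch: count document segments per fuzzy-match coverage band,
# mirroring the category rows of the MemoQ pre-translation analysis.
from collections import Counter

def match_band(percent):
    """Map a best-match percentage to its analysis category."""
    if percent == 100: return "100%"
    if percent >= 95: return "95-99%"
    if percent >= 85: return "85-94%"
    if percent >= 75: return "75-84%"
    if percent >= 50: return "50-74%"
    return "no match"

# invented best-match scores for six document segments
scores = [100, 97, 88, 76, 60, 42]
analysis = Counter(match_band(s) for s in scores)
print(analysis["100%"], analysis["50-74%"])  # 1 1
```

Dividing each band's count by the total number of segments yields the percentage figures reported in the tables.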


Table 2: MemoQ analysis, TM coverage in %

Match   | TM    | SUB   | OJ    | NJ    | all
100%    | 0.41  | 0.12  | 0.10  | 0.17  | 0.46
95-99%  | 0.84  | 0.91  | 0.64  | 0.90  | 1.37
85-94%  | 0.07  | 0.05  | 0.25  | 0.76  | 0.81
75-84%  | 0.80  | 0.91  | 1.71  | 3.78  | 4.40
50-74%  | 8.16  | 10.05 | 25.09 | 40.95 | 42.58
any     | 10.28 | 12.04 | 27.79 | 46.56 | 49.62

Table 3: MemoQ analysis, DGT-TM

Match   | SUB   | OJ    | NJ    | all
100%    | 0.08  | 0.07  | 0.11  | 0.28
95-99%  | 0.75  | 0.44  | 0.49  | 0.66
85-94%  | 0.05  | 0.08  | 0.49  | 0.61
75-84%  | 0.46  | 0.96  | 3.67  | 3.85
50-74%  | 10.24 | 27.77 | 41.90 | 44.47
all     | 11.58 | 29.32 | 46.66 | 49.87

For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match   | SUB   | OJ    | NJ    | all
100%    | 0.15  | 0.13  | 0.29  | 0.57
95-99%  | 0.98  | 0.59  | 1.24  | 1.45
85-94%  | 0.09  | 0.22  | 1.34  | 1.37
75-84%  | 1.03  | 2.26  | 6.35  | 7.07
50-74%  | 12.15 | 34.84 | 49.82 | 51.62
all     | 14.40 | 38.04 | 59.04 | 61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy), we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments including those with empty translations to the target language. When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language, see Table 6.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

Company in-house translation memory:
feature | SUB  | OJ   | NJ   | NS
prec    | 0.60 | 0.63 | 0.70 | 0.66
recall  | 0.67 | 0.74 | 0.74 | 0.71
F1      | 0.64 | 0.68 | 0.72 | 0.68
METEOR  | 0.31 | 0.37 | 0.38 | 0.38

DGT:
prec    | 0.76 | 0.93 | 0.91 | 0.81
recall  | 0.78 | 0.86 | 0.88 | 0.85
F1      | 0.77 | 0.89 | 0.89 | 0.83
METEOR  | 0.40 | 0.50 | 0.51 | 0.45

Table 6: Error examples, Czech → English
source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filters for removing phrases which provide non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India
2 Saarland University, Germany
3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de,
{santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful to future translations.
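The lookup step just described can be illustrated in a few lines: for each input segment, find the most similar stored source segment and return its stored translation as the suggestion. The sketch uses `difflib`'s similarity ratio as a stand-in for a CAT tool's fuzzy-match score, and the single TM entry (with an illustrative Italian translation) is invented sample data.

```python
# Minimal illustration of TM retrieval: pick the stored segment most similar
# to the input and return its translation when similarity clears a threshold.
import difflib

tm = {
    "Make sure that the brake hose is not twisted":
        "Assicurarsi che il tubo del freno non sia attorcigliato",
}

def suggest(segment, tm, threshold=0.7):
    best, best_score = None, 0.0
    for src, tgt in tm.items():
        score = difflib.SequenceMatcher(None, segment.lower(),
                                        src.lower()).ratio()
        if score > best_score:
            best, best_score = (src, tgt), score
    return (best, best_score) if best_score >= threshold else (None, best_score)

match, score = suggest("Make sure the brake hose is not twisted", tm)
print(round(score, 2), "->", match[1])  # similarity ≈ 0.94
```

Below the threshold, the tool offers no suggestion and the translator works from scratch; production systems use more elaborate fuzzy-match scoring and indexing, but the retrieve-and-threshold shape is the same.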

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as free, open-source software. For more information see http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend that post-editors use either TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, which should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). These user studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work will investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input (test) segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to see instantly how similar the five suggested segments are to the input segment and which one of them requires the least post-editing effort.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we   want to    have a table near the window
|    D    S     S    |   |    S    |   |
we   -    would like a table by   the window
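The alignment above can be turned into per-token match flags, which is what drives the green/red display described in Section 3.2. The following is a minimal sketch, assuming the TER edit steps are available as (TM-token, input-token, operation) tuples; this tuple form is an illustrative simplification, not tercom's actual output file format.

```python
# Sketch: derive match/mismatch flags for the TM-side tokens from a
# TER alignment. C = correct (matched, green in CATaLog), S =
# substitution, D = deletion, I = insertion (all red / to be edited).

def match_flags(alignment):
    """Return TM-side tokens, each with a matched (True) / unmatched flag."""
    flags = []
    for tm_tok, inp_tok, op in alignment:
        if op == "I":           # input token absent from the TM match:
            continue            # nothing to colour on the TM side
        # D: TM token must be deleted; S: substituted; C: kept as-is
        flags.append((tm_tok, op == "C"))
    return flags

alignment = [
    ("we", "we", "C"), ("want", "", "D"), ("to", "would", "S"),
    ("have", "like", "S"), ("a", "a", "C"), ("table", "table", "C"),
    ("near", "by", "S"), ("the", "the", "C"), ("window", "window", "C"),
]
print(match_flags(alignment))
```

With this example, "we", "a", "table", "the" and "window" come out matched and everything else is flagged for editing.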

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we use our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score, TM segment


2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes TM segment 1 preferable to TM segment 2 for the test segment shown above.
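The ranking idea can be sketched as a TER-style edit distance in which deletions are almost free, so TM matches that only need deletions outrank ones that need insertions or substitutions. The weights below are illustrative assumptions; the paper does not publish its exact values, and shifts are omitted for brevity.

```python
# Sketch: weighted edit distance with a cheap deletion cost, mirroring
# the scoring mechanism of Section 3.1 (weights are assumptions).

COST = {"match": 0.0, "delete": 0.1, "insert": 1.0, "substitute": 1.0}

def weighted_edit_cost(tm_seg, test_seg):
    """Dynamic-programming edit distance with per-operation weights."""
    m, n = len(tm_seg), len(test_seg)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + COST["delete"]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + COST["insert"]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = (COST["match"] if tm_seg[i - 1] == test_seg[j - 1]
                   else COST["substitute"])
            d[i][j] = min(d[i - 1][j] + COST["delete"],   # drop TM word
                          d[i][j - 1] + COST["insert"],   # add new word
                          d[i - 1][j - 1] + sub)          # keep / replace
    return d[m][n]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
# tm1 needs 6 cheap deletions; tm2 needs 3 full-price insertions,
# so tm1 now scores better, as argued in the text.
print(weighted_edit_cost(tm1, test), weighted_edit_cost(tm2, test))
```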

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color-code the matched and unmatched parts of each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and of their corresponding translations in green, and the unmatched fragments in red.
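The projection step can be sketched as follows: parse an alignment line in the "word ( target-positions )" form shown above (1-based positions), then mark every target position that is aligned to a TER-matched source word as green. The regex and the sample matched-word set are illustrative assumptions of this sketch.

```python
import re

# Sketch: project TER-matched source words onto the target sentence
# through a GIZA++-style alignment line, producing the green/red flags
# used for colour coding the TM target side.

def parse_alignment(line):
    """Map each source word to the list of target positions it aligns to."""
    pairs = re.findall(r"(\S+)\s+\(\s*([\d\s]*)\)", line)
    return [(w, [int(i) for i in idx.split()]) for w, idx in pairs]

def target_colours(line, matched_source, n_target):
    """True (green) for target positions aligned to matched source words."""
    colours = [False] * n_target
    for word, positions in parse_alignment(line):
        if word in matched_source:
            for p in positions:
                colours[p - 1] = True
    return colours

line = ("NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) "
        "table ( 5 ) near ( 3 ) the ( ) window ( 2 )")
# Suppose TER marked these source words as matched:
matched = {"we", "a", "table", "the", "window"}
print(target_colours(line, matched, 7))
```

Note that "the" is matched but unaligned (empty parentheses), so it colours nothing on the target side, exactly the language-divergence situation the next paragraph discusses.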

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green color yet still not be good candidates for post-editing, since post-editors might have to insert translations for many non-matched words in the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as the top candidates. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me


5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )

For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., the source TM segments) will take a long time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list from the TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists; this feature exists to compare results with and without posting lists. In the ideal scenario, the TM output should be the same in both cases, though the time taken to produce it will be significantly different.
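The posting-list filter can be sketched as a small inverted index from (lowercased, unstemmed) content words to the TM segments that contain them; only indexed segments sharing at least one vocabulary word with the input are then scored with TER. The stop-word list below is a toy assumption.

```python
from collections import defaultdict

# Sketch of the Section 3.3 filter: inverted index over TM source
# segments, no stemming, lowercasing only (stop-word list is a toy).

STOP = {"the", "a", "to", "is", "do", "you"}

def build_index(tm_segments):
    index = defaultdict(set)
    for sid, seg in enumerate(tm_segments):
        for tok in seg.lower().split():
            if tok not in STOP:
                index[tok].add(sid)
    return index

def candidates(index, input_seg):
    """Ids of TM segments sharing at least one vocabulary word."""
    ids = set()
    for tok in input_seg.lower().split():
        ids |= index.get(tok, set())
    return ids

tm = ["you gave me the wrong change", "i think you 've got the wrong number",
      "how much does it cost", "is breakfast included"]
idx = build_index(tm)
print(sorted(candidates(idx, "you gave me wrong number")))
```

Only the returned candidates would be passed to the (much more expensive) TER alignment step, which is where the speed-up comes from.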

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method and product for dynamically aligning translations in a translation-memory system. US Patent 6,345,244, February 5.

Marcello Federico, Alessandro Cattelan and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces

Silberztein, M. 2003. NooJ manual. Available at http://www.nooj4nlp.net.

Smith, J. and Clark, S. 2009. EBMT for SMT: A new EBMT-SMT hybrid. In EBMT.

Wang, K., Zong, C. and Su, K.-Y. 2014. Dynamically integrating cross-domain translation memory into phrase-based machine translation during decoding. In COLING.

Zhechev, V. and van Genabith, J. 2010. Seeding statistical machine translation with translation memory output through tree-based structural alignment. In SSST.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving the high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). Translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of a new document is well covered by the memory, only very rarely does the memory include exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities                alignment
být větší   be greater  0.538 0.053 0.538 0.136     0-0 1-1
být větší   be larger   0.170 0.054 0.019 0.148     0-0 1-1

Figure 1: An example of generated subsegments (consistent phrases)

within CAT systems (Planas and Furuse, 2000). In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts with a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases, the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which draws on the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.1 The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, the inverse lexical weighting, the direct phrase translation probability and the direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.

The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment and 3) opposite orientation.
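Selecting among alternative translations by their scores can be sketched as follows. The combination of the four Moses scores by a simple product is an assumption of this sketch; the paper does not specify the exact selection formula.

```python
from math import prod

# Sketch: pick the best translation for a subsegment using the four
# Moses scores (inverse phrase prob., inverse lexical weight, direct
# phrase prob., direct lexical weight), taken from Figure 1.

subsegments = {
    ("být", "větší"): [
        ("be greater", (0.538, 0.053, 0.538, 0.136)),
        ("be larger",  (0.170, 0.054, 0.019, 0.148)),
    ],
}

def best_translation(source):
    """Return the candidate with the highest product of the four scores."""
    options = subsegments[source]
    return max(options, key=lambda cand: prod(cand[1]))[0]

print(best_translation(("být", "větší")))
```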

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document.

• JOIN: new segments are built by concatenating two subsegments from SUB; output is J.

1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original: • "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments: • "následujících kategorií" (the following categories)

new subsegment: "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model.2 The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.
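The combined n-gram fluency score can be illustrated with a toy stand-in: average the log counts of a candidate's n-grams (n = 1..5) over a target-language corpus. The real system uses a KenLM model with proper smoothing; the add-one smoothing and the tiny corpus below are assumptions of this sketch.

```python
from collections import Counter
from math import log

# Toy sketch of a combined n-gram fluency score over n = 1..5.

corpus = ["can be divided into the following categories",
          "can be divided into two groups"]

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

counts = Counter()
for sent in corpus:
    toks = sent.split()
    for n in range(1, 6):
        counts.update(ngrams(toks, n))

def fluency(candidate):
    """Average add-one-smoothed log count over all n-grams, n = 1..5."""
    toks = candidate.split()
    score, total = 0.0, 0
    for n in range(1, 6):
        for g in ngrams(toks, n):
            score += log(counts[g] + 1)
            total += 1
    return score / max(total, 1)

a = fluency("can be divided into the following categories")
b = fluency("categories following the into divided be can")
print(a > b)
```

A fluent combination shares long n-grams with the corpus and scores high; a scrambled word order keeps the same unigrams but loses all higher-order n-grams, so it scores lower.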

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T; after all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments: in each iteration it generates new (longer) subsegments and discards one processed subsegment.
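The longest-first joining strategy can be condensed into a greedy sketch over (start, end) token spans, where two spans join when they are adjacent (JOINN) or overlap (JOINO). This compresses the list-based iteration scheme described above into a single loop and is an illustrative simplification, not the paper's exact algorithm.

```python
# Sketch: greedy longest-first combination of subsegment spans.

def can_join(a, b):
    """Spans join if they overlap or are immediate neighbours."""
    (s1, e1), (s2, e2) = sorted([a, b])
    return s2 <= e1 + 1

def join(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

def combine(spans):
    """Greedily merge subsegment spans, preferring longer ones first."""
    work = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    merged = True
    while merged and len(work) > 1:
        merged = False
        base = work[0]
        for other in work[1:]:
            if can_join(base, other):
                work.remove(other)
                work[0] = join(base, other)   # grown span replaces base
                merged = True
                break
    return work[0]

# Spans covering "how much | does it | cost" (token positions 0-4):
print(combine([(0, 1), (2, 3), (4, 4)]))
```

If the three spans cover the whole segment, the result is the full-segment span, i.e. the best case mentioned in Section 2.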

2 In the current experiments we have trained a language model using the KenLM tool (Heafield, 2011) on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to variants of the JOIN method; see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95–99%   0.84    0.91    0.64    0.90    1.37
85–94%   0.07    0.05    0.25    0.76    0.81
75–84%   0.80    0.91    1.71    3.78    4.40
50–74%   8.16    10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95–99%   0.75    0.44    0.49    0.66
85–94%   0.05    0.08    0.49    0.61
75–84%   0.46    0.96    3.67    3.85
50–74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95–99%   0.98    0.59    1.24    1.45
85–94%   0.09    0.22    1.34    1.37
75–84%   1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). In the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into ac-

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape

count just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as free, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.


of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25.1)5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25.1 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions - insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment:
("we","we",C,0), ("want","",D,0), ("to","would",S,0), ("have","like",S,0), ("a","a",C,0), ("table","table",C,0), ("near","by",S,0), ("the","the",C,0), ("window","window",C,0), (".",".",C,0)

we want to have  a table near the window .
|  D    S  S     |   |    S    |    |    |
we -    would like a table by the window .

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations. Different costs for different edit operations should therefore yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment


2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weights to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
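The effect of cheap deletions can be illustrated with a standard dynamic-programming edit distance in which only the deletion weight is lowered. The concrete weights (0.1 vs. 1.0) and the function name are illustrative assumptions, and the shift operation is omitted for brevity:

```python
def weighted_edit_cost(tm_tokens, test_tokens, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Edit cost for turning a TM segment into the test segment;
    deletions from the TM side are cheap, mirroring the observation
    that post-editors delete faster than they insert or substitute."""
    m, n = len(tm_tokens), len(test_tokens)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w_del
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm_tokens[i - 1] == test_tokens[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,       # delete TM word
                          d[i][j - 1] + w_ins,       # insert test word
                          d[i - 1][j - 1] + sub)     # match / substitute
    return d[m][n]
```

Run on the example above, TM segment 1 costs six cheap deletions (0.6) while TM segment 2 costs three full-price insertions (3.0), so the weighted score prefers TM segment 1, exactly as argued in the text.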

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: a green portion implies a matched fragment and a red portion implies a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, in both the selected source sentences and their corresponding translations.
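Projecting the TER match information onto the target side through the word alignment can be sketched as follows; the flag representation and data structures are our own simplification of the two alignment files described above:

```python
def color_target(ter_flags, src2tgt, n_target):
    """Color target-side tokens from per-source-token TER flags.

    ter_flags[i] is 'C' when TM source token i matches the test
    sentence, anything else ('S', 'D', ...) marks a mismatch.
    src2tgt maps a source token index to the list of target token
    indexes it is word-aligned to (as produced by an aligner such
    as GIZA++). Unaligned target tokens default to red."""
    colors = ["red"] * n_target
    for i, flag in enumerate(ter_flags):
        for j in src2tgt.get(i, []):
            colors[j] = "green" if flag == "C" else "red"
    return colors
```

For instance, a three-token TM segment whose first and third tokens match the input (flags C, S, C) and which is aligned crosswise to a three-token translation yields green for the two targets aligned to matching sources and red for the rest.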

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars
2. i think you 've got the wrong number
3. you are wrong
4. you pay me
5. you 're overcharging me

Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)
3. আপিন ভল (Gloss: apni vul)
4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)
5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments would take a lot of time for all the reference sentences (i.e. the source TM segments). For improving the search efficiency we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider them a match. For each word in the vocabulary we maintain a posting list of sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature is there to compare results with and without using posting lists. In an ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWE) and named entities (NE) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, Aug.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

41

Rohit Gupta and Constantin Orăsan 2014 In-corporating Paraphrasing in Translation Mem-ory Matching and Retrieval In Proceedings ofEAMT

Rohit Gupta Constantin Orăsan MarcosZampieri Mihaela Vela and Josef van Gen-abith 2015 Can Translation Memories affordnot to use paraphrasing In Proceedings ofEAMT

Yifan He Yanjun Ma Josef van Genabith andAndy Way 2010 Bridging SMT and TM withtranslation recommendation In Proceedings ofACL pages 622ndash630

Panagiotis Kanavos and Dimitrios Kartsaklis2010 Integrating Machine Translation withTranslation Memory A Practical Approach InProceedings of the Second Joint EM+CNGLWorkshop Bringing MT to the User Researchon Integrating MT in the Translation Industry

Philipp Koehn and Jean Senellart 2010 Con-vergence of translation memory and statisticalmachine translation In Proceedings of AMTAWorkshop on MT Research and the TranslationIndustry pages 21ndash31

Elina Lagoudaki 2008 The value of machinetranslation for the professional translator InProceedings of the 8th Conference of the Associ-ation for Machine Translation in the Americaspages 262ndash269 Waikiki Hawaii

Percy Liang Ben Taskar and Dan Klein 2006Alignment by Agreement In Proceedings of theMain Conference on Human Language Technol-ogy Conference of the North American Chapterof the Association of Computational Linguisticspages 104ndash111

Sharon OrsquoBrien 2006 Eye-tracking and transla-tion memory matches Perspectives Studies inTranslatology 14185ndash204

Franz Josef Och and Hermann Ney 2003A systematic comparison of various statisticalalignment models Computational linguistics29(1)19ndash51

Mirko Plitt and Franccedilois Masselot 2010 A pro-ductivity test of statistical machine translationpost-editing in a typical localisation contextThe Prague Bulletin of Mathematical Linguis-tics 937ndash16

Michel Simard and Atsushi Fujita 2012 APoor Manrsquos Translation Memory Using MachineTranslation Evaluation Metrics In Proceedingsof the 10th Biennial Conference of the Associ-ation for Machine Translation in the Americas(AMTA) San Diego California USA

Michel Simard and Pierre Isabelle 2009 Phrase-based machine translation in a computer-assisted translation environment Proceedings ofthe Twelfth Machine Translation Summit (MTSummit XII) pages 120ndash127

Matthew Snover Nitin Madnani Bonnie Dorr andRichard Schwartz 2009 Fluency Adequacyor HTER Exploring Different Human Judg-ments with a Tunable MT Metric In Proceed-ings of the Fourth Workshop on Statistical Ma-chine Translation EACL 2009

Lucia Specia 2011 Exploiting objective annota-tions for measuring translation post-editing ef-fort In Proceedings of the 15th Conference ofthe European Association for Machine Transla-tion pages 73ndash80 Leuven Belgium

Masao Utiyama Graham Neubig Takashi Onishiand Eiichiro Sumita 2011 Searching Transla-tion Memories for Paraphrases pages 325ndash331

Marcos Zampieri and Mihaela Vela 2014 Quanti-fying the Influence of MT Output in the Trans-latorsrsquo Performance A Case Study in Techni-cal Translation In Proceedings of the Workshopon Humans and Computer-assisted Translation(HaCaT 2014)


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodoroou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 31–35, Hissar, Bulgaria, September 2015.

Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods

Vít Baisa, Aleš Horák, Marek Medveď
Natural Language Processing Centre, Masaryk University, Brno, Czech Republic

{xbaisa,hales,xmedved1}@fi.muni.cz

Abstract

Translation memories (TMs) used in computer-aided translation (CAT) systems are the highest-quality source of parallel texts, since they consist of segment translation pairs approved by professional human translators. The obvious problem is their size and coverage of new document segments when compared with other parallel data.

In this paper we describe several methods for expanding translation memories using linguistically motivated segment combining approaches, concentrating on preserving high translational quality. The evaluation of the methods was done on a medium-size real-world translation memory and documents provided by a Czech translation company, as well as on the large, publicly available DGT translation memory published by the European Commission. The assets of the TM expansion methods were evaluated by the pre-translation analysis of the widely used MemoQ CAT system, and the METEOR metric was used for measuring the quality of fully expanded new translation segments.

1 Introduction

Most professional translators use a specific CAT system with provided or self-built translation memories (TMs). The translation memories are usually in-house, costly and manually created resources of varying sizes, from thousands to millions of translation pairs.

Only recently have some TMs been made publicly available: DGT (Steinberger et al., 2013) or MyMemory (Trombetti, 2009), to mention just a few. But there is still a heavy demand for enlarging and improving TMs and their coverage of new input documents to be translated.

Obviously, the aim is to have a TM that best fits the content of a new document, as this is crucial for speeding up the translation process: when a larger part of a document can be pre-translated by a CAT system, the translation itself can be cheaper. Coverage of TMs is directly translatable to savings by translation companies and their customers.

We present two methods for expanding TMs: subsegment generation and subsegment combination. The idea behind these methods is based on the fact that even if the topic of the new document is well covered by the memory, only very rarely does the memory include the exact sentences (segments) as they appear in the document. The differences between known and new segments often consist of substitutions or combinations of particular known subsegments.

The presented methods concentrate on increasing the coverage of the content of an existing TM with regard to a new document and, at the same time, try to keep a reasonable quality of the newly generated segment pairs. We work with English-Czech data, but the procedures are mostly language independent.

Evaluation was done on several documents and a medium-size in-house translation memory provided by a large Czech translation company. For comparison, we have also tested the methods on the DGT translation memory.

2 Related work

Translation memories are generally understudied within the field of NLP. Machine translation techniques, especially example-based machine translation (EBMT), employ translation memories in an approach similar to CAT systems (Planas and Furuse, 1999), but NLP approaches have not been applied to them extensively.

TM-related papers mainly focus on algorithms for searching, matching and suggesting segments


CZ          EN          probabilities                alignment
být větší   be greater  0.538  0.053  0.538  0.136   0-0 1-1
být větší   be larger   0.170  0.054  0.019  0.148   0-0 1-1

Figure 1: An example of generated subsegments – consistent phrases.

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008) the authors attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004) the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results: the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which deals with the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: they use a text chunker to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default Moses heuristic grow-diag-final.¹ The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output of the subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM; typically, short subsegments have many translations.
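As an illustration of how the four scores can drive the choice among multiple candidate translations, here is a small sketch (our own, not the authors' code); ranking by the product of the four scores is one plausible criterion:

```python
# Sketch: picking the best translation for a subsegment when the phrase
# table lists several candidates.  Each candidate carries the four Moses
# scores named in the text: inverse phrase probability, inverse lexical
# weighting, direct phrase probability and direct lexical weighting.
from math import prod

def best_translation(candidates):
    """candidates: list of (translation, (p_inv, lex_inv, p_dir, lex_dir))."""
    # Rank by the product of the four scores; a weighted log-linear
    # combination would be an equally reasonable choice here.
    return max(candidates, key=lambda c: prod(c[1]))[0]

table = {
    "byt vetsi": [  # diacritics omitted
        ("be greater", (0.538, 0.053, 0.538, 0.136)),
        ("be larger",  (0.170, 0.054, 0.019, 0.148)),
    ],
}
print(best_translation(table["byt vetsi"]))  # be greater
```

The data mirrors the Figure 1 example; the function name is ours.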

The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word, být, from the source language is translated to the first word in the translation, be, and the second word, větší, to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
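The three alignment configurations listed above can be detected mechanically from the alignment points. A minimal sketch (our own naming, not the paper's tooling):

```python
# Sketch: interpreting Moses alignment points such as "0-0 1-1".
# Classifies source tokens as the text describes: unaligned (empty
# alignment), aligned one-to-many, or part of a reordered (opposite
# orientation) pair.
def parse_alignment(points, src_len):
    links = [tuple(map(int, p.split("-"))) for p in points.split()]
    targets = {i: sorted(t for s, t in links if s == i) for i in range(src_len)}
    unaligned = [i for i, t in targets.items() if not t]
    one_to_many = [i for i, t in targets.items() if len(t) > 1]
    # Reordering: a later source word aligned before an earlier one.
    first = [t[0] for t in targets.values() if t]
    reordered = any(a > b for a, b in zip(first, first[1:]))
    return unaligned, one_to_many, reordered

print(parse_alignment("0-0 1-1", 2))   # ([], [], False)
print(parse_alignment("0-1 1-0", 2))   # ([], [], True)  -- opposite orientation
```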

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; the output is J.

1. JOINO: the joined subsegments overlap in a segment from the document; output = OJ.

2. JOINN: the joined subsegments neighbour each other in a segment from the document; output = NJ.

¹ http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English.

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegment:      "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment (see the example in Table 1); output is OS.

2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS.
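The two combination families can be sketched on token-index spans; the function names and the span representation are ours, and the SUBSTITUTE example mirrors Table 1 (diacritics omitted):

```python
# Sketch (names ours, not the paper's): JOIN concatenates overlapping
# (JOINO) or adjacent (JOINN) subsegments; SUBSTITUTE fills the unknown
# part of a known segment with another subsegment.
def join_type(a, b):
    """a, b: half-open (start, end) token spans inside one segment."""
    first, second = sorted([a, b])
    if second[0] < first[1]:
        return "JOINO"    # spans share at least one token
    if second[0] == first[1]:
        return "JOINN"    # spans touch with no gap
    return None           # a gap remains: not joinable by either variant

def substitute(segment, gap, filler):
    """SUBSTITUTE at string level: replace the unknown part `gap`."""
    assert gap in segment
    return segment.replace(gap, filler, 1)

print(join_type((0, 3), (2, 5)))   # JOINO
print(join_type((0, 3), (3, 5)))   # JOINN
# Table 1 example (diacritics omitted):
print(substitute("lze rozdelit do techto kategorii",
                 "techto kategorii", "nasledujicich kategorii"))
```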

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model.²

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with the other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and it discards one processed subsegment.
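The described loop can be sketched as follows; spans are half-open token-index pairs, and the duplicate checks stand in for the paper's bookkeeping so that the process terminates:

```python
# Sketch of the greedy joining loop described above, over half-open
# token spans.  Newly joined spans are prepended so longer candidates
# are tried first; each iteration discards one processed span, and only
# genuinely new spans are added, so the loop terminates.
def greedy_join(spans):
    I = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    results = []
    while I:
        cur, I = I[0], I[1:]
        T = []
        for s in I:
            (a1, b1), (a2, b2) = sorted([cur, s])
            if a2 <= b1:                       # overlap or adjacency
                joined = (a1, max(b1, b2))
                if (joined != cur and joined not in I
                        and joined not in T and joined not in results):
                    T.append(joined)
        I = sorted(T, key=lambda s: s[1] - s[0], reverse=True) + I
        results.append(cur)
    return results

print(greedy_join([(0, 3), (3, 5), (2, 4)]))  # the full join (0, 5) is found early
```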

² In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company's TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software; the same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for the subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document by segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM; translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %.

Match    TM      SUB     OJ      NJ      all
100%     0.41    0.12    0.10    0.17    0.46
95–99%   0.84    0.91    0.64    0.90    1.37
85–94%   0.07    0.05    0.25    0.76    0.81
75–84%   0.80    0.91    1.71    3.78    4.40
50–74%   8.16    10.05   25.09   40.95   42.58
any      10.28   12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM.

Match    SUB     OJ      NJ      all
100%     0.08    0.07    0.11    0.28
95–99%   0.75    0.44    0.49    0.66
85–94%   0.05    0.08    0.49    0.61
75–84%   0.46    0.96    3.67    3.85
50–74%   10.24   27.77   41.90   44.47
all      11.58   29.32   46.66   49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match    SUB     OJ      NJ      all
100%     0.15    0.13    0.29    0.57
95–99%   0.98    0.59    1.24    1.45
85–94%   0.09    0.22    1.34    1.37
75–84%   1.03    2.26    6.35    7.07
50–74%   12.15   34.84   49.82   51.62
all      14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.³ Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with a score of 0.03 for the MemoQ CAT system, when computed for all segments, including those with empty translations to the target language. When we take into account just the segments that are pre-translated by the MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of the particular methods with regard to translation quality; in this case, however, we have measured just the full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

Table 5: METEOR, 100% matches.

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error example, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape

³ http://kilgray.com/products/memoq


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering for discarding phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek¹, Sudip Kumar Naskar¹, Santanu Pal², Marcos Zampieri²,³, Mihaela Vela², Josef van Genabith²,³

¹ Jadavpur University, India
² Saarland University, Germany
³ German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvelamx@uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing, both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TMs). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat¹ have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog², which is language pair independent and allows users to upload their own memories in

¹ www.matecat.com
² The tool will be released as freeware open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productive studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC³ (Basic Travel Expression Corpus) corpus and their Bengali translations.⁴ Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence, using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.
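The retrieval step can be approximated with a plain word-level edit rate (TER without the shift operation); this is a simplified sketch of ours, not the tercom-based implementation used by the tool:

```python
# Sketch: score every TM source segment against the input with a
# TER-like edit rate (word-level edit distance divided by segment
# length; shifts omitted for brevity) and keep the top-n matches.
def edit_rate(hyp, ref):
    h, r = hyp.split(), ref.split()
    d = list(range(len(r) + 1))          # rolling edit-distance row
    for i, hw in enumerate(h, 1):
        prev, d[0] = d[0], i
        for j, rw in enumerate(r, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (hw != rw))
            prev, d[j] = d[j], cur
    return d[-1] / max(len(h), 1)

def top_matches(query, tm, n=5):
    return sorted(tm, key=lambda seg: edit_rate(seg, query))[:n]

tm = ["we want to have a table near the window",
      "where is the station",
      "we would like a room"]
print(top_matches("we would like a table by the window", tm, n=2))
# closest match first
```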

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

³ The BTEC corpus contains tourism-related sentences, similar to those that are usually found in phrase books for tourists going abroad.

⁴ Work in progress.


of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.
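The green/red rendering can be sketched with ANSI escape codes standing in for the graphical interface (our simplification):

```python
# Sketch of the color-coding idea using ANSI escapes instead of a GUI:
# tokens of the TM suggestion that survive in the input (per the
# alignment) are shown green, the rest red.
GREEN, RED, RESET = "\033[32m", "\033[31m", "\033[0m"

def colorize(tm_tokens, ops):
    """ops: one TER-style edit label per TM token ('C' = match)."""
    return " ".join((GREEN if op == "C" else RED) + tok + RESET
                    for tok, op in zip(tm_tokens, ops))

tm = "we want to have a table near the window".split()
ops = ["C", "D", "S", "S", "C", "C", "S", "C", "C"]
print(colorize(tm, ops))
```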

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25⁵) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight, and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: ["we","we",C,0] ["want","",D,0] ["to","would",S,0] ["have","like",S,0] ["a","a",C,0] ["table","table",C,0] ["near","by",S,0] ["the","the",C,0] ["window","window",C,0] [".",".",C,0]

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment


2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
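The effect of a reduced deletion weight can be reproduced with a weighted edit distance. In the sketch below the weights (0.1 for deletion, 1.0 otherwise) are illustrative: the paper does not publish its exact values.

```python
def weighted_edit_cost(tm, test, w_del=0.1, w_ins=1.0, w_sub=1.0):
    """Dynamic-programming cost of turning TM segment `tm` into `test`,
    with a reduced deletion weight (weights are illustrative)."""
    m, n = len(tm), len(test)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del              # drop leftover TM words cheaply
    for j in range(1, n + 1):
        d[0][j] = j * w_ins              # add missing test words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if tm[i - 1] == test[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,
                          d[i][j - 1] + w_ins,
                          d[i - 1][j - 1] + sub)
    return d[m][n]
```

With these weights, TM segment 1's six cheap deletions (cost 0.6) beat TM segment 2's three insertions (cost 3.0); with uniform weights the preference flips, matching the argument above.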

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation. Green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
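One possible way to project the TER match information onto the target side through the word alignment is sketched below. The tool's exact projection rules are not published, so this composition of the two alignment sets is an assumption.

```python
def color_target(tgt_len, src_tgt_align, matched_src):
    """Project TER source-side matches onto the target side.
    src_tgt_align: set of (src_idx, tgt_idx) word-alignment points
    matched_src:   indices of TM source tokens TER marked as matched
    Unaligned target tokens stay red; a target token aligned to several
    source tokens takes the colour of the last point visited
    (a simplification of whatever rule the tool actually applies)."""
    colors = ["red"] * tgt_len
    for s, t in src_tgt_align:
        colors[t] = "green" if s in matched_src else "red"
    return colors
```

For instance, with alignment points {(0,0), (1,1), (2,2)} and matched source tokens {0, 2}, the first and third target words come out green and the second red.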

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form, so that only if they appear in the same form in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of sentences which contain that word.

We only consider those TM source sentencesfor similarity measurement which contain one

or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not. This feature exists to compare results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
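A minimal sketch of such an inverted index with posting lists follows; the stop-word list and whitespace tokenisation here are illustrative, not the tool's.

```python
from collections import defaultdict

STOPWORDS = {"we", "you", "a", "the", "to", "is"}   # illustrative stop list

def build_index(tm_sources):
    """Inverted index: lowercased surface word -> posting list (set of
    TM sentence ids).  No stemming, mirroring the exact-surface-form
    matching described above; stop words are dropped at build time."""
    index = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for token in sentence.lower().split():
            if token not in STOPWORDS:
                index[token].add(sid)
    return index

def candidates(index, input_sentence):
    """TM sentences sharing at least one vocabulary word with the input;
    only these are later scored with TER, shrinking the search space."""
    ids = set()
    for token in input_sentence.lower().split():
        ids |= index.get(token, set())
    return ids
```

Only the sentences returned by `candidates` need TER alignment, which is where the speed-up comes from.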

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with the color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper, as well as new ideas for future work.

English to Indian Language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J. P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

CZ         EN          probabilities              alignment
být větší  be greater  0.538 0.053 0.538 0.136    0-0 1-1
být větší  be larger   0.170 0.054 0.019 0.148    0-0 1-1

Figure 1: An example of generated subsegments - consistent phrases

within CAT systems (Planas and Furuse, 2000).

In (Désilets et al., 2008), the authors have

attempted to build translation memories from the Web, since they found that human translators in Canada use Google search results even more often than specialized translation memories. That is why they developed the system WeBiText for extracting possible segments and their translations from bilingual web pages.

In the study (Nevado et al., 2004), the authors exploited two methods of segmentation of translation memories. Their approach starts from a similar assumption as our subsegment combination methods presented below, i.e., that TM coverage can be increased by splitting the TM segments into smaller parts (subsegments). In both cases the subsegments are generated via the phrase-based machine translation (PBMT) technique (Koehn et al., 2003). However, our methods do not present the subsegments as the results; the subsegments are used in segment combination methods to obtain new, larger translational phrases, or full segments in the best case.

(Simard and Langlais, 2001) describes a method of sub-segmenting translation memories which follows the principles of EBMT. The authors of this study created an on-line system, TransSearch (Macklovitch et al., 2000), for searching possible translation candidates within all subsegments in already translated texts. These subsegments are linguistically motivated: a text chunker is used to extract phrases from the Hansard corpus.

3 Subsegment generation

In the first step, the proposed TM expansion methods process the available translation memory and generate all consistent phrases from it as subsegments. Subsegments and the corresponding translations are generated using the Moses (Koehn et al., 2007) tool directly from the TM; no additional data is used.

The word alignment is based on MGIZA++ (Gao and Vogel, 2008) (a parallel version of GIZA++ (Och and Ney, 2003)) and the default

Moses heuristic grow-diag-final1. The next steps are phrase extraction and scoring (Koehn et al., 2007). The corresponding extended TM is denoted as SUB. The output from subsegment generation contains, for each subsegment, its translation probabilities and alignment points; see Figure 1 for an example.

The four probabilities are the inverse phrase translation probability, inverse lexical weighting, direct phrase translation probability, and direct lexical weighting, respectively. They are obtained directly from the Moses procedures. These probabilities are used to select the best translations in case there are multiple translations for a subsegment. Alternative translations for a subsegment are combined from different aligned pairs in the TM. Typically, short subsegments have many translations.
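For illustration, one simple way to pick among multiple translations of a subsegment is to rank them by the product of the four Moses scores. The paper does not state its exact selection rule, so this combination is an assumption of the sketch.

```python
from math import prod

def best_translation(options):
    """Pick one translation for a subsegment from its alternatives.
    Each option carries the four Moses phrase-table scores; ranking by
    their product is an illustrative choice, not the paper's stated rule."""
    return max(options, key=lambda option: prod(option["scores"]))
```

Applied to the two rows of Figure 1, "be greater" scores higher than "be larger" and would be selected.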

The alignment points determine the word alignment between a subsegment and its translation, i.e., 0-0 1-1 means that the first word být from the source language is translated to the first word in the translation, be, and the second word větší to the second, greater. These points give us important information about the translation: 1) empty alignment, 2) one-to-many alignment, and 3) opposite orientation.
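The Moses-style alignment points can be parsed and resolved to word pairs as follows (a small helper sketch; the function names are ours).

```python
def parse_alignment(align_str):
    """Parse Moses-style alignment points, e.g. '0-0 1-1', into
    (source_index, target_index) pairs."""
    return [tuple(map(int, point.split("-"))) for point in align_str.split()]

def aligned_pairs(src_tokens, tgt_tokens, align_str):
    """Resolve alignment points to word pairs.  Words with an empty
    alignment simply never appear in the result."""
    return [(src_tokens[i], tgt_tokens[j])
            for i, j in parse_alignment(align_str)]
```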

4 Subsegment combination

The output of the subsegment generation is denoted as a special translation memory named SUB. The obtained subsegments are then filtered and used by the following methods for subsegment combination with regard to the segments from the input document:

• JOIN: new segments are built by concatenating two segments from SUB; output is J

  1. JOINO: joined subsegments overlap in a segment from the document; output = OJ

  2. JOINN: joined subsegments neighbour each other in a segment from the document; output = NJ

1 http://www.statmt.org/moses/?n=FactoredTraining.AlignWords


Table 1: An example of the SUBSTITUTEO method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     "následujících kategorií" (the following categories)
new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; output is S

  1. SUBSTITUTEO: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; output is OS

  2. SUBSTITUTEN: the second subsegment is inserted into the gap in the first segment; output is NS

During the non-overlapping subsegment combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score, obtained as a combined n-gram score (for n = 1, ..., 5) from a target language model2.

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by subsegment size (in number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts again with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
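The greedy longest-first behaviour of the JOIN method can be sketched over token spans as below. This is a simplification of the loop just described: it only tries to merge the current longest span and does not keep alternative join orders.

```python
def join_subsegments(spans, mode="neighbour"):
    """Greedy JOIN sketch over (start, end) token spans (end exclusive)
    of subsegments found in one input segment.  Longest-first: repeatedly
    try to merge the current longest span with another that neighbours
    (JOINN) or overlaps (JOINO) it."""
    spans = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    merged = True
    while merged and len(spans) > 1:
        merged = False
        base = spans[0]                       # current longest span
        for other in spans[1:]:
            lo, hi = min(base, other), max(base, other)
            if mode == "neighbour":
                joinable = lo[1] == hi[0]     # spans are adjacent
            else:                             # "overlap"
                joinable = hi[0] < lo[1] < hi[1]
            if joinable:
                spans.remove(base)
                spans.remove(other)
                spans.insert(0, (lo[0], max(lo[1], hi[1])))  # joined span
                merged = True
                break
    return spans
```

For example, the neighbouring spans (0, 3) and (3, 5) join into (0, 5), and the overlapping spans (0, 4) and (2, 6) into (0, 6).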

2 In the current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5,000 segments with their reference translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs, we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM, obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document against segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM: a 100% match corresponds to the situation when a whole segment from the document can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75-85%. The presented results show an analysis of the expanded TM for documents with 4,563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match    TM     SUB    OJ     NJ     all
100%     0.41   0.12   0.10   0.17   0.46
95-99%   0.84   0.91   0.64   0.90   1.37
85-94%   0.07   0.05   0.25   0.76   0.81
75-84%   0.80   0.91   1.71   3.78   4.40
50-74%   8.16   10.05  25.09  40.95  42.58
any      10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM

Match    SUB    OJ     NJ     all
100%     0.08   0.07   0.11   0.28
95-99%   0.75   0.44   0.49   0.66
85-94%   0.05   0.08   0.49   0.61
75-84%   0.46   0.96   3.67   3.85
50-74%   10.24  27.77  41.90  44.47
all      11.58  29.32  46.66  49.87

For a comparison, we have also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match    SUB    OJ     NJ     all
100%     0.15   0.13   0.29   0.57
95-99%   0.98   0.59   1.24   1.45
85-94%   0.09   0.22   1.34   1.37
75-84%   1.03   2.26   6.35   7.07
50-74%   12.15  34.84  49.82  51.62
all      14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment Assembly (Teixeira, 2014) that is present in the MemoQ CAT system3. Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system's score of 0.03, when computed for all segments, including those with empty translations to the target language. When we take into ac-

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English

source seg:    Oblast dat může mít libovolný tvar
reference:     The data area may have an arbitrary shape
generated seg: Area data may have any shape

count just the segments that are pre-translated by MemoQ Fragment Assembly as well as by our methods (871 segments), we have achieved a score of 0.36, compared to 0.27 for MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment Assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e., in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis: Regarding the precision, we have analysed some problematic cases. The most common error occurred when subsegments were combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoining segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are co-operating with a major Central European translation company, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S. C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, Sept 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3

Mihaela Vela2, Josef van Genabith2,3

1Jadavpur University, India

2Saarland University, Germany

3German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it more useful for future translations.

Although it might at first sound very simple, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in

1 www.matecat.com
2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that there have been many studies on MT post-editing published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productivity studies were also carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.


of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert a sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.
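The ranking idea can be sketched in a few lines. The snippet below is our own illustration, not CATaLog code: it uses a plain word-level edit distance (insertion, deletion, substitution) normalized by reference length, i.e., an approximation of TER that omits the shift operation, and the TM segment list is invented for the example.

```python
def edit_rate(hyp, ref):
    """Word-level edit distance (insert/delete/substitute) divided by the
    reference length -- a TER-like score without the shift operation."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[len(h)][len(r)] / max(len(r), 1)

# Rank TM segments by ascending edit rate (lower score = more similar).
tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
query = "we would like a table by the window"
ranked = sorted(tm, key=lambda s: edit_rate(query, s))
```

Being an error rate, the score is inverted for ranking: the segment with the lowest rate is offered first.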

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want   to   have  a  table  near  the  window
 |   D     S     S    |    |     S     |     |
we   -   would  like  a  table   by   the  window

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way. TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We can directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of the 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
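This weighted scoring can be illustrated with a small sketch. The weight values below are hypothetical (the exact weights used in CATaLog are not published); they only reproduce the preference described above: deletion is cheap, the other operations are expensive.

```python
# Hypothetical per-operation weights: deletions are cheap because removing
# already-highlighted words takes little post-editing effort.
WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_cost(ops):
    """Score a TER-style edit-operation string such as 'CDSSCCSCC'."""
    return sum(WEIGHTS[op] for op in ops)

# Test segment: "how much does it cost" (5 words).
# TM segment 1 matches all 5 words but needs 6 deletions;
# TM segment 2 matches 2 words but needs 3 insertions.
cost_tm1 = weighted_cost("CCCCC" + "D" * 6)
cost_tm2 = weighted_cost("CC" + "I" * 3)
```

With these weights, TM segment 1 (cost 0.6) is preferred over TM segment 2 (cost 3.0), matching the argument in the text.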

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments in green and the unmatched fragments in red, in the selected source sentences and their corresponding translations.
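Composing the two alignment sets can be sketched as follows. The helper function is our own illustration, not CATaLog code; the source-to-target index mapping follows the GIZA++ example above, converted to 0-based positions.

```python
def colour_target(matched_src_positions, src2tgt):
    """matched_src_positions: 0-based TM-source positions that matched the
    input sentence (from the TER alignment).  src2tgt: mapping from each
    TM-source position to its aligned target positions (from GIZA++).
    Returns the set of target positions to paint green."""
    green = set()
    for s in matched_src_positions:
        green.update(src2tgt.get(s, ()))
    return green

# "we want to have a table near the window": positions 0, 4, 5, 7, 8
# ("we", "a", "table", "the", "window") matched the input in the running
# example; the mapping mirrors the GIZA++ alignment shown above.
src2tgt = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1]}
green_tgt = colour_target({0, 4, 5, 7, 8}, src2tgt)
```

Unaligned matched positions (here position 7, "the") simply contribute no target colour, which is what the example alignment's empty parentheses encode.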

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me

5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if they appear in the same form as in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
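A minimal sketch of the posting-list construction and candidate filtering described above; the stop-word list and TM segments are illustrative, not those used in CATaLog.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "to", "of", "is"}  # illustrative stop list

def build_postings(tm_sources):
    """Map each lowercased, non-stop-word surface form to the ids of the
    TM source segments containing it (no stemming, as in the paper)."""
    postings = defaultdict(set)
    for sid, segment in enumerate(tm_sources):
        for word in segment.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidates(query, postings):
    """Only segments sharing at least one vocabulary word get scored."""
    ids = set()
    for word in query.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
postings = build_postings(tm)
cand = candidates("we would like a table by the window", postings)
```

Only the surviving candidate set is then passed to the (much more expensive) TER alignment step, which is where the speed-up comes from.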

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using the comma and semicolon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are of course language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us with valuable feedback to improve this paper, as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa Vít 31
Barbu Eduard 9

Chatzitheodorou Konstantinos 24

Horák Aleš 31

Medveď Marek 31
Mitkov Ruslan 17

Naskar Sudip Kumar 36
Nayek Tapas 36

Pal Santanu 36
Parra Escartín Carla 1

Raisa Timonera Katerina 17

van Genabith Josef 36
Vela Mihaela 36

Zampieri Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Table 1: An example of the SUBSTITUTE_O method, Czech → English

original:        "lze rozdělit do těchto kategorií" (can be divided into these categories)
subsegments:     "následujících kategorií" (the following categories)

new subsegment:  "lze rozdělit do následujících kategorií"
its translation: can be divided into the following categories

• SUBSTITUTE: new segments can be created by replacing a part of one segment with another subsegment from SUB; the output is S.

1. SUBSTITUTE_O: the gap in the first segment is covered with an overlap with the second subsegment, see the example in Table 1; the output is OS.

2. SUBSTITUTE_N: the second subsegment is inserted into the gap in the first segment; the output is NS.

During the subsegment non-overlapping combination, the acceptability of the combination is decided (and ordered) by measuring a language fluency score obtained by a combined n-gram score (for n = 1, ..., 5) from a target language model2.
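The combined n-gram scoring can be illustrated with a toy stand-in: the paper uses a KenLM model trained on enTenTen, while the sketch below uses a tiny counter-based model and log-frequencies, so the corpus, class name, and scoring formula are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

class ToyLM:
    """Illustrative stand-in for a KenLM model: fluency is a combined
    score over n-gram orders n = 1..5."""
    def __init__(self, corpus, max_n=5):
        self.max_n = max_n
        self.counts = {n: Counter() for n in range(1, max_n + 1)}
        for sent in corpus:
            toks = sent.split()
            for n in range(1, max_n + 1):
                self.counts[n].update(ngrams(toks, n))

    def fluency(self, sentence):
        """Sum of per-order average log-frequencies; higher = more fluent."""
        toks = sentence.split()
        score = 0.0
        for n in range(1, self.max_n + 1):
            grams = ngrams(toks, n)
            if not grams:
                continue
            hits = sum(self.counts[n][g] for g in grams)
            score += math.log(1 + hits) / len(grams)
        return score

corpus = ["can be divided into the following categories",
          "the following categories are listed below"]
lm = ToyLM(corpus)
good = lm.fluency("divided into the following categories")
bad = lm.fluency("categories following the into divided")
```

The higher-order n-grams are what penalize the scrambled word order: the unigram counts are identical for both candidates, but only the fluent one hits any bigrams or trigrams.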

The quality of the subsegment translation can be increased by filtering the used subsegments on noun phrase boundaries.

The algorithm for the JOIN method actually works with indexes which represent the subsegment positions in the tokenized segment. The available subsegments are processed as a list I, ordered by the subsegment size (in the number of tokens, in descending order). The process starts with the biggest subsegment in the segment and then tries to join it successively with other subsegments. If it succeeds, the new subsegment is appended to a temporary list T. After all other subsegments are processed, the temporary list T of new subsegments is prepended to I, and the algorithm starts with a new subsegment created from the two longest subsegments. If it does not succeed, the next subsegment in the order is processed. The algorithm thus prefers to join longer subsegments. In each iteration it generates new (longer) subsegments and discards one processed subsegment.
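A simplified sketch of this joining step: subsegments are modeled as inclusive token spans and joined greedily, longest first. The merge condition (overlap or adjacency) and the omission of the temporary-list bookkeeping are our simplifications of the full algorithm.

```python
def try_join(a, b):
    """Return the merged span if spans a and b overlap or are adjacent
    (inclusive (start, end) token positions), else None."""
    lo, hi = min(a, b), max(a, b)
    if hi[0] <= lo[1] + 1:          # overlapping or adjacent
        return (lo[0], max(lo[1], hi[1]))
    return None

def join_all(spans):
    """Greedily join spans, longest first, until no join applies."""
    spans = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    changed = True
    while changed:
        changed = False
        for i in range(len(spans)):
            for j in range(i + 1, len(spans)):
                merged = try_join(spans[i], spans[j])
                if merged:
                    del spans[j]          # discard the consumed subsegment
                    spans[i] = merged     # keep the new, longer one
                    changed = True
                    break
            if changed:
                break
        spans.sort(key=lambda s: s[1] - s[0], reverse=True)
    return spans

# Spans over tokens 0-3 and 3-5 overlap and merge; 7-8 stays isolated.
result = join_all([(0, 3), (3, 5), (7, 8)])
```

As in the paper's algorithm, longer subsegments are preferred: the list is kept sorted by length, so each pass tries to grow the biggest span first.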

2 In current experiments we have trained a language model using the KenLM (Heafield, 2011) tool on the first 50 million sentences from the enTenTen corpus (Jakubíček et al., 2013).

5 Evaluation

For the evaluation of the proposed methods we have used a medium-size in-house translation memory provided by a Czech translation company and two real-world documents of nearly 5000 segments with their referential translations. The TM contains 144,311 Czech-English translation pairs, filtered from the complete company TM by the same topic as the tested documents. For comparison, we have also run and evaluated the methods on the publicly available DGT translation memory (Steinberger et al., 2013), with a size of over 300,000 translation pairs.

For measuring the coverage of the expanded TMs we have used the document and TM analysis tool included in the MemoQ software. The same evaluation is used by translation companies for an assessment of the actual translation costs. The results have been obtained directly from the pre-translation analysis of the MemoQ system and are presented in Table 2. The TM column contains the results for the original, non-expanded translation memory. The column SUB displays the analysis for subsegments (consistent phrases) derived from the original TM. The other columns correspond to the JOIN methods, see Section 4. The final column "all" is the resulting expanded TM obtained as a combination of all tested methods. All numbers represent the coverage of segments from the input document versus segments from the expanded TMs. The analysis divides all matched segments into categories (lines in the tables). Each category denotes how many words from the segment were found in the analysed TM; a 100% match corresponds to the situation when a whole segment from D can be translated using a segment from the respective TM. Translations of shorter parts of the segment are then matches lower than 100%. The most valuable matches for translation companies and translators are those over 75–85%. The presented results show an analysis of the expanded TM for documents with 4563 segments (35,142 words and 211,407 characters).


Table 2: MemoQ analysis, TM coverage in %

Match     TM     SUB     OJ      NJ      all
100%      0.41   0.12    0.10    0.17    0.46
95–99%    0.84   0.91    0.64    0.90    1.37
85–94%    0.07   0.05    0.25    0.76    0.81
75–84%    0.80   0.91    1.71    3.78    4.40
50–74%    8.16   10.05   25.09   40.95   42.58
any       10.28  12.04   27.79   46.56   49.62

Table 3: MemoQ analysis, DGT-TM

Match     SUB     OJ      NJ      all
100%      0.08    0.07    0.11    0.28
95–99%    0.75    0.44    0.49    0.66
85–94%    0.05    0.08    0.49    0.61
75–84%    0.46    0.96    3.67    3.85
50–74%    10.24   27.77   41.90   44.47
all       11.58   29.32   46.66   49.87

For a comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013). We have used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM

Match     SUB     OJ      NJ      all
100%      0.15    0.13    0.29    0.57
95–99%    0.98    0.59    1.24    1.45
85–94%    0.09    0.22    1.34    1.37
75–84%    1.03    2.26    6.35    7.07
50–74%    12.15   34.84   49.82   51.62
all       14.40   38.04   59.04   61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.3 Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits). Unknown subsegments are taken from the source language in the tested setup. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We have achieved a score of 0.29 with our data, in comparison with the MemoQ CAT system with a score of 0.03, when computed for all segments including those with empty translations to the target language.

3 http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches

company in-house translation memory
feature   SUB    OJ     NJ     NS
prec      0.60   0.63   0.70   0.66
recall    0.67   0.74   0.74   0.71
F1        0.64   0.68   0.72   0.68
METEOR    0.31   0.37   0.38   0.38

DGT
prec      0.76   0.93   0.91   0.81
recall    0.78   0.86   0.88   0.85
F1        0.77   0.89   0.89   0.83
METEOR    0.40   0.50   0.51   0.45

Table 6: Error examples, Czech -> English

source seg     Oblast dat může mít libovolný tvar
reference      The data area may have an arbitrary shape
generated seg  Area data may have any shape

When we take into account just the segments that are pre-translated by MemoQ Fragment assembly as well as by our methods (871 segments), we have achieved a score of 0.36 compared to 0.27 of MemoQ. As the METEOR evaluation metric has been proposed to evaluate MT systems, it assumes that we have fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as it is done in the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the asset of particular methods with regard to the translation quality; however, in this case we have measured just full 100% matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness. It is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding the precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the segment, assuming the same order in the target language; see Table 6.
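The "mixed" translation used for the METEOR comparison above can be sketched as follows; the helper function and the example phrases are illustrative, not the actual implementation:

```python
# Sketch of "mixed" segment assembly: subsegments with a known TM
# translation are replaced by the target phrase, unknown subsegments
# stay in the source language "as is".

def assemble_mixed(subsegments, translations):
    """subsegments: source-language phrases in original order;
    translations: target phrases for the matched subsegments only."""
    return " ".join(translations.get(p, p) for p in subsegments)

# Czech -> English example; only the middle subsegment is matched,
# so the output mixes source and target language words
mixed = assemble_mixed(["oblast dat", "může mít", "libovolný tvar"],
                       {"může mít": "may have"})
print(mixed)  # -> oblast dat may have libovolný tvar
```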


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of a language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in the process of translating a set of documents and measure the time needed for the translation.
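This planned fluency filter can be sketched as follows; the toy unigram scorer, vocabulary and threshold are purely illustrative stand-ins (a real setup would use a proper language model such as KenLM):

```python
# Toy "language model" fluency filter: candidate translations whose
# average word log-probability falls below a threshold are discarded
# before the segment combination step. All numbers are illustrative.

LOGPROB = {"the": -1.0, "data": -2.0, "area": -2.2, "may": -1.5,
           "have": -1.4, "any": -1.8, "shape": -2.5}
OOV = -8.0          # penalty for out-of-vocabulary words

def fluency(sentence):
    words = sentence.lower().split()
    return sum(LOGPROB.get(w, OOV) for w in words) / len(words)

def keep_fluent(candidates, threshold=-3.0):
    """Keep only candidates whose average log-prob passes the threshold."""
    return [c for c in candidates if fluency(c) >= threshold]

print(keep_fluent(["the data area may have any shape",
                   "area data shape xyzzy may"]))
# -> ['the data area may have any shape']
```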

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. They include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27-28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49-57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187-197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48-54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A Free Translation Memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331-339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621-627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335-339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36-42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

Jadavpur University, India1
Saarland University, Germany2
German Research Center for Artificial Intelligence (DFKI), Germany3

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English - Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators rely on on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog,2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English - Bengali data.

1 www.matecat.com
2 The tool will be released as a freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that there were many studies on MT post-editing published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g. Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English - Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations.4 Unseen input or test segments are provided to the post-editing tool and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched parts and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
4 Work in progress.

3.1 Finding Similarity

For finding out the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric and it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert a sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25)5 to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviors on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment:
("we", "we", C, 0), ("want", "", D, 0), ("to", "would", S, 0), ("have", "like", S, 0), ("a", "a", C, 0), ("table", "table", C, 0), ("near", "by", S, 0), ("the", "the", C, 0), ("window", "window", C, 0), (".", ".", C, 0)

we   want   to      have   a   table   near   the   window
|    D      S       S      |   |       S      |     |
we   -      would   like   a   table   by     the   window
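Full TER also permits block shifts, but the C/D/S/I alignment above can be approximated with a plain word-level edit-distance computation. A minimal sketch (not the tercom implementation; shifts are ignored), using the same sentence pair:

```python
# Word-level edit-distance alignment: C = match, S/I/D = substitution,
# insertion, deletion. Real TER additionally allows phrase shifts;
# this sketch mimics only the alignment part.

def edit_ops(hyp, ref):
    """Return (ops, edit_rate) aligning token list hyp to token list ref."""
    n, m = len(hyp), len(ref)
    # dp[i][j] = minimal edits converting hyp[:i] into ref[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # backtrace to recover the operation sequence
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append("C" if hyp[i - 1] == ref[j - 1] else "S")
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append("D")          # word present only in hyp
            i -= 1
        else:
            ops.append("I")          # word present only in ref
            j -= 1
    ops.reverse()
    edits = sum(op != "C" for op in ops)
    return ops, edits / max(len(ref), 1)

tm = "we want to have a table near the window".split()
inp = "we would like a table by the window".split()
ops, rate = edit_ops(tm, inp)
print("".join(ops), round(rate, 2))
```

On this pair the minimal alignment needs one deletion and three substitutions, matching the TER alignment shown above.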

Since we want to rank reference sentences based on their similarity with the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e. the lower the TER score, the higher the similarity. We could directly use the TER score for the ranking of sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e. deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning minimal weights to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
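This preference can be reproduced with a small weighted re-scoring sketch; the concrete weight values and the normalisation by test-segment length are illustrative (the described setup only requires the deletion weight to be very low):

```python
# Re-score TER-style operation sequences with per-operation weights.
# Deletions are much cheaper than insertions/substitutions, so TM
# matches that only need deletions rank higher. Weights are illustrative.

WEIGHTS = {"C": 0.0, "D": 0.1, "I": 1.0, "S": 1.0}

def weighted_score(ops, test_len):
    return sum(WEIGHTS[op] for op in ops) / max(test_len, 1)

# test segment: "how much does it cost" (5 words)
seg1_ops = ["C"] * 5 + ["D"] * 6   # TM 1: delete "to the holiday inn by taxi"
seg2_ops = ["C"] * 2 + ["I"] * 3   # TM 2: insert "does it cost"

ranked = sorted([("TM1", seg1_ops), ("TM2", seg2_ops)],
                key=lambda x: weighted_score(x[1], 5))
print([name for name, _ in ranked])  # -> ['TM1', 'TM2']
```

With equal weights (plain TER) the order would be reversed, since TM segment 2 needs only 3 edits against 6 for TM segment 1.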

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation to do the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions imply matched fragments and red portions imply a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g. the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কাছে একটা টেবিল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
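The composition of the two alignment sets can be sketched as follows, using the running example; the helper is illustrative, and the example's 1-based target indices are converted to 0-based positions:

```python
# Mark TM source tokens as matched/unmatched via the TER alignment,
# then project the marking onto the target tokens through the word
# alignment. Target tokens linked to an unmatched (or no) source token
# stay red.

def color_target(src_matched, alignment, tgt_len):
    """src_matched: set of matched source positions;
    alignment: iterable of (src_pos, tgt_pos) pairs, 0-based."""
    tgt_colors = ["red"] * tgt_len
    for s, t in alignment:
        if s in src_matched:
            tgt_colors[t] = "green"
    return tgt_colors

# source: "we want to have a table near the window"
# TER marks "we", "a", "table", "the", "window" as matched (positions)
src_matched = {0, 4, 5, 7, 8}
# source->target word alignment from the example above, 0-based
alignment = [(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1)]
print(color_target(src_matched, alignment, 6))
# -> ['green', 'green', 'red', 'green', 'green', 'red']
```

The Bengali words aligned to the unmatched "want" and "near" come out red, everything else green.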

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the color code ratio. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get a near 100% green color, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars
2. i think you've got the wrong number
3. you are wrong
4. you pay me
5. you're overcharging me

Target Matches:

1. আপনি আমাকে ভুল খুচরো দিয়েছেন আমি আশি ডলার দিয়েছি (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)
2. আমার ধারণা আপনি ভুল নম্বরে ফোন করেছেন (Gloss: amar dharona apni vul nombore phon korechen)
3. আপনি ভুল (Gloss: apni vul)
4. আপনি আমাকে টাকা দিন (Gloss: apni amake taka din)
5. আপনি আমার কাছে থেকে বেশি নিচ্ছেন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above mentioned color coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments for all the reference sentences (i.e. source TM segments) will take a lot of time. To improve the search efficiency, we make use of the concept of posting lists, which is a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form as in some input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary word(s) of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
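The posting-list filter described above can be sketched as follows; the stop-word list and example segments are illustrative:

```python
# Minimal inverted index over TM source segments: map each lowercased,
# non-stopword word to the set of segment ids containing it, then run
# the expensive TER comparison only on candidate segments sharing at
# least one vocabulary word with the input sentence.

from collections import defaultdict

STOP = {"the", "a", "to", "of"}          # illustrative stop-word list

def build_index(segments):
    index = defaultdict(set)
    for sid, seg in enumerate(segments):
        for word in seg.lower().split():
            if word not in STOP:          # frequent words carry little signal
                index[word].add(sid)
    return index

def candidates(index, sentence):
    """Union of the posting lists of the input words."""
    ids = set()
    for word in sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you pay me"]
index = build_index(tm)
print(sorted(candidates(index, "a table by the window")))  # -> [0]
```

Only segment 0 shares a vocabulary word with the query, so segments 1 and 2 are never compared with TER at all.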

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e. without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as an open-source free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWE) and named entities (NE) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454-465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133-140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133-140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622-630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21-31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262-269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104-111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185-204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7-16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120-127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73-80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325-331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).

42

Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Table 2: MemoQ analysis, TM coverage in %.

Match (%)   TM     SUB    OJ     NJ     all
100         0.41   0.12   0.10   0.17   0.46
95–99       0.84   0.91   0.64   0.90   1.37
85–94       0.07   0.05   0.25   0.76   0.81
75–84       0.80   0.91   1.71   3.78   4.40
50–74       8.16   10.05  25.09  40.95  42.58
any         10.28  12.04  27.79  46.56  49.62

Table 3: MemoQ analysis, DGT-TM.

Match (%)   SUB    OJ     NJ     all
100         0.08   0.07   0.11   0.28
95–99       0.75   0.44   0.49   0.66
85–94       0.05   0.08   0.49   0.61
75–84       0.46   0.96   3.67   3.85
50–74       10.24  27.77  41.90  44.47
all         11.58  29.32  46.66  49.87

For comparison, we also tested the methods on the DGT translation memory (Steinberger et al., 2013); we used 330,626 pairs from the 2014 release. See Table 3 for the results of DGT alone and Table 4 for the combination of the TM and DGT.

Table 4: MemoQ analysis, TM + DGT-TM.

Match (%)   SUB    OJ     NJ     all
100         0.15   0.13   0.29   0.57
95–99       0.98   0.59   1.24   1.45
85–94       0.09   0.22   1.34   1.37
75–84       1.03   2.26   6.35   7.07
50–74       12.15  34.84  49.82  51.62
all         14.40  38.04  59.04  61.51

We have also compared the results with the output of a function called Fragment assembly (Teixeira, 2014) that is present in the MemoQ CAT system.(3) Fragment assembly suggests new segments based on several dictionary and non-word elements (term base, non-translatable hits, numbers, auto-translatable hits); in the tested setup, unknown subsegments are taken from the source language. For measuring the quality of translation (accuracy) we have used the METEOR metric (Denkowski and Lavie, 2014). We achieved a score of 0.29 with our data, compared with a score of 0.03 for the MemoQ CAT system, when computed over all segments, including those with empty translations in the target language. When we count just the segments that are pre-translated both by MemoQ Fragment assembly and by our methods (871 segments), we achieved a score of 0.36 compared to 0.27 for MemoQ. As the METEOR evaluation metric was proposed to evaluate MT systems, it assumes fully translated segments (pairs). We have thus provided a "mixed" translation in the same way as is done by the MemoQ Fragment assembly technique: non-translated phrases (subsegments) appear in the output segment "as is", i.e. in the source language. The resulting segment can thus be a combination of source and target language words, which is correspondingly taken into account by the METEOR metric. We have also measured the contribution of the particular methods with regard to translation quality; in this case, however, we measured just full (100%) matched segments. The results are presented in Table 5. Nevertheless, this evaluation was done for the sake of completeness: it is well known that automatic evaluation metrics for assessing machine translation quality are not fully reliable and that a human evaluation is always needed.

Error analysis. Regarding precision, we have analysed some problematic cases. The most common error occurs when subsegments are combined in the order in which they occur in the source segment, assuming the same order in the target language; see Table 6.

(3) http://kilgray.com/products/memoq

Table 5: METEOR, 100% matches.

company in-house translation memory
feature    SUB    OJ     NJ     NS
prec       0.60   0.63   0.70   0.66
recall     0.67   0.74   0.74   0.71
F1         0.64   0.68   0.72   0.68
METEOR     0.31   0.37   0.38   0.38

DGT
prec       0.76   0.93   0.91   0.81
recall     0.78   0.86   0.88   0.85
F1         0.77   0.89   0.89   0.83
METEOR     0.40   0.50   0.51   0.45

Table 6: Error examples, Czech → English.

source seg:     Oblast dat může mít libovolný tvar
reference:      The data area may have an arbitrary shape
generated seg:  Area data may have any shape
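As a sketch of the "mixed" translation idea described above: known subsegments are replaced by their translations, while unknown subsegments pass through in the source language. The greedy longest-match assembly and the toy English-German phrase table below are invented for illustration; they are not the MemoQ or the authors' implementation.

```python
# Sketch of "mixed" output: translated subsegments are substituted,
# untranslated subsegments stay in the source language "as is".
# The phrase table is a toy example, not real data.

def assemble_mixed(segment, phrase_table):
    tokens = segment.split()
    out, i = [], 0
    while i < len(tokens):
        # greedily try the longest known subsegment starting at position i
        for j in range(len(tokens), i, -1):
            phrase = ' '.join(tokens[i:j])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i = j
                break
        else:
            out.append(tokens[i])  # untranslated: copy the source word
            i += 1
    return ' '.join(out)

phrase_table = {'we would like': 'wir möchten', 'a table': 'einen Tisch'}
mixed = assemble_mixed('we would like a table by the window', phrase_table)
print(mixed)  # wir möchten einen Tisch by the window
```

The resulting mixed segment is what gets scored against the fully translated reference, so untranslated source-language words are penalized by the metric.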


We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of a language model. Results that do not pass a threshold would not take part in the final segment combination method. The best evaluation would be extrinsic: using the generated TMs in the process of translating a set of documents and measuring the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated techniques for filtering out phrases which would produce non-fluent output texts in the target language.

We are co-operating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in their translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within project 15-13277S.

References

Michael Denkowski and Alon Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Alain Désilets, Benoit Farley, M. Stojanovic, and G. Patenaude. 2008. WeBiText: Building large heterogeneous translation memories from parallel web content. Proc. of Translating and the Computer, 30:27–28.

Qin Gao and Stephan Vogel. 2008. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pages 49–57. Association for Computational Linguistics.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vít Suchomel, et al. 2013. The TenTen corpus family. In Proc. Int. Conf. on Corpus Linguistics.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 48–54. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Elliott Macklovitch, Michel Simard, and Philippe Langlais. 2000. TransSearch: A free translation memory on the World Wide Web. In Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000.

Francisco Nevado, Francisco Casacuberta, and Josu Landa. 2004. Translation memories enrichment by statistical bilingual segmentation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Emmanuel Planas and Osamu Furuse. 1999. Formalizing translation memories. In Machine Translation Summit VII, pages 331–339.

Emmanuel Planas and Osamu Furuse. 2000. Multi-level similar segment matching algorithm for translation memories and example-based machine translation. In Proceedings of the 18th Conference on Computational Linguistics, Volume 2, pages 621–627. Association for Computational Linguistics.

Michel Simard and Philippe Langlais. 2001. Sub-sentential exploitation of translation memories. In Machine Translation Summit VIII, pages 335–339.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2013. DGT-TM: A freely available translation memory in 22 languages. arXiv preprint arXiv:1309.5226.

Carlos S.C. Teixeira. 2014. The Handling of Translation Metadata in Translation Tools, page 109. Cambridge Scholars Publishing.

Marco Trombetti. 2009. Creating the world's largest translation memory. In MT Summit. http://mymemory.translated.net.


Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek (1), Sudip Kumar Naskar (1), Santanu Pal (2), Marcos Zampieri (2,3), Mihaela Vela (2), Josef van Genabith (2,3)

(1) Jadavpur University, India
(2) Saarland University, Germany
(3) German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme helps translators to identify which parts of the sentence are most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and to improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario translators work as post-editors, correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat(1) have been integrating MT output alongside TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with the retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog(2), which is language-pair independent and allows users to upload their own memories into the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

(1) www.matecat.com
(2) The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing have been published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014), and, as mentioned in the previous section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing overlaps significantly with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend to post-editors either the TM output or the MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy, where the TM is used to retrieve matching source segments and the mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years several productivity studies were carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC(3) (Basic Travel Expression Corpus) corpus and their Bengali translations.(4) Unseen input or test segments are provided to the post-editing tool, and the tool matches each input segment against the most similar segments contained in the TM. The TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e. which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which of them requires the least post-editing effort.

(3) The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.
(4) Work in progress.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric which gives an edit ratio (often referred to as edit rate or error rate): how much editing is required to convert one sentence into another, relative to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.25(5)) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behaviour with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

(5) http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM match: we want to have a table near the window .

TER alignment:
("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want  to     have  a  table  near  the  window  .
|   D     S      S     |  |      S     |    |       |
we  -     would  like  a  table  by    the  window  .
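An alignment like the one above can be reproduced with a standard dynamic-programming edit-distance traceback. The sketch below is illustrative only: it omits TER's shift operation and is not the tercom implementation the tool actually uses.

```python
# Illustrative edit-distance alignment in the spirit of TER
# (TER additionally allows block shifts; tercom is the real tool).

def align(tm_tokens, input_tokens):
    """Return (edit_count, ops) turning tm_tokens into input_tokens.
    Ops: 'C' (match), 'S' (substitute), 'D' (delete), 'I' (insert)."""
    n, m = len(tm_tokens), len(input_tokens)
    # dist[i][j] = edits to turn tm_tokens[:i] into input_tokens[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if tm_tokens[i - 1] == input_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # match / substitute
                             dist[i - 1][j] + 1,        # delete from TM side
                             dist[i][j - 1] + 1)        # insert from input
    # Trace back to recover the operation sequence.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (
                0 if tm_tokens[i - 1] == input_tokens[j - 1] else 1):
            ops.append('C' if tm_tokens[i - 1] == input_tokens[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return dist[n][m], list(reversed(ops))

edits, ops = align("we want to have a table near the window".split(),
                   "we would like a table by the window".split())
print(edits, ''.join(ops))  # 4 edits: 1 deletion and 3 substitutions
```

Dividing the edit count by the reference length gives the edit rate used for ranking; a lower rate means a closer TM match.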

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences; however, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. In post-editing, however, deletion takes much less time than the other edit operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 requires inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
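A weighted edit rate with cheap deletions can be sketched as follows. The weights here are illustrative; the paper does not publish the exact values it assigns.

```python
# Sketch of a weighted edit distance where deletions are cheap.
# Weight values are illustrative, not the system's actual settings.

def weighted_edits(tm_tokens, input_tokens, w_del=0.1, w_ins=1.0, w_sub=1.0):
    n, m = len(tm_tokens), len(input_tokens)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + w_del   # delete all TM tokens
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + w_ins   # insert all input tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if tm_tokens[i - 1] == input_tokens[j - 1] else w_sub
            dist[i][j] = min(dist[i - 1][j - 1] + sub,
                             dist[i - 1][j] + w_del,
                             dist[i][j - 1] + w_ins)
    return dist[n][m]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()
# tm1 needs 6 cheap deletions; tm2 needs 3 expensive insertions.
print(round(weighted_edits(tm1, test), 2))  # 0.6
print(weighted_edits(tm2, test))            # 3.0
```

With uniform weights TM segment 2 would score 3 against segment 1's 6; lowering the deletion weight reverses the ranking, matching the reasoning above.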

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions are matched fragments and red portions are mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may appear nearly 100% green yet still not be good candidates for post-editing, since post-editors might have to insert translations for many non-matched words of the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence they will not be shown as top candidates. Figure 1 presents a snapshot of CATaLog.
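The combination of the two alignment sets can be sketched as follows: the TER alignment tells us which TM-source tokens match the input, and the word alignment maps those source tokens to target tokens, which are then colored green; everything else stays red. This is a simplified illustration of the idea, not the tool's code, and the alignment dictionary below is a hypothetical toy example.

```python
# Simplified sketch: project source-side match/mismatch information onto
# the target through a word alignment (the tool uses TER + GIZA++).

def color_target(matched_src_positions, src2tgt, tgt_len):
    """matched_src_positions: TM-source token indices that match the
    input sentence (from the TER alignment).
    src2tgt: source token index -> list of target token indices
    (from a GIZA++-style word alignment).
    Returns one 'green'/'red' label per target token."""
    colors = ['red'] * tgt_len
    for s in matched_src_positions:
        for t in src2tgt.get(s, []):
            colors[t] = 'green'
    return colors

# TM source: "we want to have a table near the window" (indices 0-8);
# TER says tokens 0, 4, 5, 7, 8 match the input "we would like a table
# by the window".  Hypothetical 5-token target with a toy alignment:
src2tgt = {0: [0], 1: [4], 4: [2], 5: [3], 6: [1], 8: [1]}
labels = color_target({0, 4, 5, 7, 8}, src2tgt, 5)
print(labels)  # ['green', 'green', 'green', 'green', 'red']
```

Unaligned or mismatch-aligned target tokens default to red, which is conservative: anything the alignment cannot vouch for is flagged for the post-editor's attention.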

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change . i paid eighty dollars

2 i think you 've got the wrong number

3 you are wrong

4 you pay me


5 you 're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen. ami ashi dollar diyechi.)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3 আপিন ভল (Gloss: apni vul.)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as estimated by the TM) needed to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency we make use of the concept of posting lists, a de facto standard in information retrieval using an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce the output will be significantly different.
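The posting-list filter described above can be sketched as follows. The stop-word list is an invented toy example, and this is an illustration of the general technique rather than CATaLog's implementation.

```python
# Sketch of an inverted index (posting lists) over TM source sentences:
# stop words removed, text lowercased, no stemming.

from collections import defaultdict

STOP_WORDS = {'a', 'an', 'the', 'to', 'of', 'is', 'we'}  # illustrative list

def build_index(tm_sentences):
    index = defaultdict(set)
    for sent_id, sentence in enumerate(tm_sentences):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(sent_id)   # posting list for this word
    return index

def candidates(index, input_sentence):
    """TM sentences sharing at least one vocabulary word with the input;
    only these are then scored with TER, shrinking the search space."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
index = build_index(tm)
print(sorted(candidates(index, "we would like a table by the window")))  # [0]
```

Building the index is a one-off offline pass over the TM; at query time each input sentence touches only the posting lists of its own words, so most of the TM is never compared against at all.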

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate the TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file which the post-editors can work with directly later, in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process, popular keystroke logging among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to the different edit operations.


Figure 1: Screenshot with color coding scheme.

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs. quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring user productivity in machine translation enhanced computer assisted translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A third-generation translation memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. Ph.D. thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. Ph.D. thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating paraphrasing in translation memory matching and retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can translation memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating machine translation with translation memory: A practical approach. In Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A poor man's translation memory using machine translation evaluation metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching translation memories for paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the influence of MT output in the translators' performance: A case study in technical translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).

42

Author Index

Baisa Viacutet 31Barbu Eduard 9

Chatzitheodoroou Konstantinos 24

Horak Ales 31

Medvedrsquo Marek 31Mitkov Ruslan 17

Naskar Sudip Kumar 36Nayek Tapas 36

Pal Santanu 36Parra Escartiacuten Carla 1

Raisa Timonera Katerina 17

van Genabith Josef 36Vela Mihaela 36

Zampieri Marcos 36

43

  • Program
  • Creation of new TM segments Fulfilling translators wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces
Page 42: Proceedings of the Workshop on Natural Language … Language Processing for Translation Memories ... Proceedings of the Workshop on Natural Language Processing for Translation

We plan to include a phrase assembly technique that would analyse the input noun phrases and test the fluency of their translation by means of the language model. Results that do not pass a threshold will not take part in the final segment combination method. The best evaluation would be extrinsic: to use the generated TMs in a process of translation of a set of documents and measure the time needed for the translation.

6 Conclusion

We presented two methods, JOIN and SUBSTITUTE, which generate new segment pairs for any translation memory and input document. Both methods have variants with overlapping and adjoint segments. The techniques include linguistically motivated filtering of phrases which would produce non-fluent output texts in the target language.

We are cooperating with one of the major Central European translation companies, which provided us with the testing data, and we plan to deploy the methods in its translation process within a future project.

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013 and by the Grant Agency of CR within the project 15-13277S.



Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India

2 Saarland University, Germany

3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudipnaskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanupal|marcoszampieri|josefvangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English–Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools is the translation memory (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows

it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2, which is language pair independent and allows users to upload their own memories in

1 www.matecat.com

2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog


the tool. Examples showing the basic functionalities of CATaLog are presented using English–Bengali data.

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that many studies on MT post-editing were published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were carried out measuring different aspects of the translation process such as cognitive load, effort, time, and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English–Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4Work in progress


of alignments we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute, and shift. We use the TER metric (using tercom-7.25 5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor, and TER, and studied their behavior on TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D, and S represent the three post-editing actions: insertion, deletion, and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window

TM Match: we want to have a table near the window

TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) ("","",C,0)

we want  to   have  a table near the window
|   D    S     S    |   |    S    |     |
we   -  would like  a table  by  the window
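The alignment above can be reproduced with a standard word-level edit-distance computation. The sketch below is a simplified stand-in for the tercom package the paper uses (real TER additionally allows block shifts, omitted here): dynamic programming plus a backtrace recovers a C/S/D/I label per position.

```python
def edit_alignment(tm, inp):
    """Align TM-match tokens to input tokens; return (op, tm_tok, inp_tok) triples
    where op is C (match), S (substitution), D (deletion) or I (insertion)."""
    n, m = len(tm), len(inp)
    # cost[i][j] = minimum number of edits turning tm[:i] into inp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (tm[i - 1] != inp[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # backtrace from the corner, preferring the diagonal (match/substitute)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i and j and cost[i][j] == cost[i - 1][j - 1] + (tm[i - 1] != inp[j - 1]):
            ops.append(("C" if tm[i - 1] == inp[j - 1] else "S", tm[i - 1], inp[j - 1]))
            i, j = i - 1, j - 1
        elif i and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("D", tm[i - 1], None))   # TM token absent from the input
            i -= 1
        else:
            ops.append(("I", None, inp[j - 1]))  # input token absent from the TM match
            j -= 1
    return ops[::-1]

ops = edit_alignment("we want to have a table near the window".split(),
                     "we would like a table by the window".split())
print(" ".join(op for op, _, _ in ops))  # → C D S S C C S C C
```

On the example pair from the paper this reproduces the published operation sequence exactly (one deletion, three substitutions, five matches).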

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution, and shift. However, in post-editing, deletion takes much less time compared to the other editing operations. Different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost") as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting translations of the 3 non-matching words into the translation of TM segment 2 in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
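The effect of a cheap deletion weight can be seen in a small weighted edit-distance sketch. The 0.1 deletion weight below is an illustrative assumption, not the value used in CATaLog:

```python
def weighted_edit_cost(tm, test, w_ins=1.0, w_del=0.1, w_sub=1.0):
    """Minimal weighted edit cost of turning TM tokens into test tokens."""
    n, m = len(tm), len(test)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * w_del          # drop leftover TM tokens
    for j in range(1, m + 1):
        cost[0][j] = j * w_ins          # add missing test tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (0.0 if tm[i - 1] == test[j - 1] else w_sub)
            cost[i][j] = min(sub, cost[i - 1][j] + w_del, cost[i][j - 1] + w_ins)
    return cost[n][m]

test = "how much does it cost".split()
tm1 = "how much does it cost to the holiday inn by taxi".split()
tm2 = "how much".split()

# Uniform costs prefer the shorter TM segment 2 (3 insertions vs 6 deletions)...
assert weighted_edit_cost(tm1, test, w_del=1.0) > weighted_edit_cost(tm2, test, w_del=1.0)
# ...while a cheap deletion flips the ranking in favour of TM segment 1.
assert weighted_edit_cost(tm1, test) < weighted_edit_cost(tm2, test)
```

With uniform weights TM segment 1 costs 6 edits against segment 2's 3, while discounted deletions bring segment 1 down to 0.6 against 3.0, matching the argument above.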

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. First, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Second, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may get nearly 100% green color but still not be good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence; in this context it is to be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
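The two-step colouring can be sketched as follows. Everything here is illustrative: the operation string and alignment links are transcribed from the "table near the window" example above, the Bengali target is romanised for readability, and the helper name is invented — the tool itself works from tercom and GIZA++ output files.

```python
def colour_tm_pair(src_tokens, tgt_tokens, ter_ops, word_align):
    """ter_ops: one op per TM source token ('C' = match, anything else = edit);
    word_align: dict mapping a source index to its aligned target indices."""
    src_col = ["green" if op == "C" else "red" for op in ter_ops]
    tgt_col = ["red"] * len(tgt_tokens)          # unaligned target tokens stay red
    for s, targets in word_align.items():
        for t in targets:
            tgt_col[t] = src_col[s]              # project the source colour
    return list(zip(src_tokens, src_col)), list(zip(tgt_tokens, tgt_col))

src = "we want to have a table near the window".split()
tgt = "amra janalar kachhe ekta tebil chai".split()       # romanised Bengali
ops = list("CDSSCCSCC")                                   # TER ops from the example
align = {0: [0], 1: [5], 4: [3], 5: [4], 6: [2], 8: [1]}  # GIZA++-style links, 0-indexed
src_marked, tgt_marked = colour_tm_pair(src, tgt, ops, align)
```

Here "want" (deleted) colours its aligned target token "chai" red, while matched tokens such as "we", "a", "table", and "window" project green onto their translations.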

Input: you gave me wrong number

Source Matches

1 you gave me the wrong change i paid eighty dollars

2 i think you've got the wrong number

3 you are wrong

4 you pay me


5 you're overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3 আপিন ভল (Gloss: apni vul)

4 আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) for producing the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible. In that case, determining the TER alignments will take a lot of time for all the reference sentences (i.e., source TM segments). To improve search efficiency we make use of the concept of posting lists, which is a de facto standard in information retrieval using inverted indexes.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature exists to compare results with and without posting lists. In the ideal scenario the TM output for both should be the same, though the time taken to produce the output will be significantly different.
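The posting-list filter described above can be sketched as follows (the stop-word set and TM contents are invented for illustration): only TM segments sharing at least one content word with the input are passed on to the expensive TER comparison.

```python
from collections import defaultdict

# Hypothetical stop-word list; the tool's actual list is not published here.
STOP = {"a", "the", "to", "of", "we", "you", "is", "it", "me"}

def build_postings(tm_sources):
    """Map each lowercased surface-form content word to the TM segments containing it."""
    postings = defaultdict(set)
    for idx, sent in enumerate(tm_sources):
        for tok in sent.lower().split():
            if tok not in STOP:          # frequent low-content tokens are skipped
                postings[tok].add(idx)   # surface forms only: no stemming
    return postings

def candidates(postings, query):
    """Indices of TM segments sharing at least one vocabulary word with the query."""
    hits = set()
    for tok in query.lower().split():
        hits |= postings.get(tok, set())  # stop words simply miss the index
    return hits

tm = ["you gave me the wrong change i paid eighty dollars",
      "i think you 've got the wrong number",
      "you are wrong",
      "we want to have a table near the window"]
idx = build_postings(tm)
print(sorted(candidates(idx, "you gave me wrong number")))  # → [0, 1, 2]
```

The fourth TM segment shares no content word with the query, so it is never handed to the TER scorer; with or without the filter the surviving candidates are ranked identically, as the paper notes.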

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case the TM output is saved in a log file, which the post-editors can later work with directly in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as open-source, free software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic, and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns weight assignment to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loic Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).

42

Author Index

Baisa Viacutet 31Barbu Eduard 9

Chatzitheodoroou Konstantinos 24

Horak Ales 31

Medvedrsquo Marek 31Mitkov Ruslan 17

Naskar Sudip Kumar 36Nayek Tapas 36

Pal Santanu 36Parra Escartiacuten Carla 1

Raisa Timonera Katerina 17

van Genabith Josef 36Vela Mihaela 36

Zampieri Marcos 36

43

  • Program
  • Creation of new TM segments Fulfilling translators wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces
Page 43: Proceedings of the Workshop on Natural Language … Language Processing for Translation Memories ... Proceedings of the Workshop on Natural Language Processing for Translation

Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), pages 36–42, Hissar, Bulgaria, September 2015.

CATaLog: New Approaches to TM and Post Editing Interfaces

Tapas Nayek1, Sudip Kumar Naskar1, Santanu Pal2, Marcos Zampieri2,3, Mihaela Vela2, Josef van Genabith2,3

1 Jadavpur University, India

2 Saarland University, Germany

3 German Research Center for Artificial Intelligence (DFKI), Germany

tnk0205@gmail.com, sudip.naskar@cse.jdvu.ac.in, mvela@mx.uni-saarland.de, {santanu.pal|marcos.zampieri|josef.vangenabith}@uni-saarland.de

Abstract

This paper explores a new TM-based CAT tool entitled CATaLog. New features have been integrated into the tool which aim to improve post-editing both in terms of performance and productivity. One of the new features of CATaLog is a color coding scheme that is based on the similarity between a particular input sentence and the segments retrieved from the TM. This color coding scheme will help translators to identify which part of the sentence is most likely to require post-editing, thus demanding minimal effort and increasing productivity. We demonstrate the tool's functionalities using an English-Bengali dataset.

1 Introduction

The use of translation and text processing software is an important part of the translation workflow. Terminology and computer-aided translation (CAT) tools are among the most widely used software that professional translators use on a regular basis to increase their productivity and also improve consistency in translation.

The core component of the vast majority of CAT tools are translation memories (TM). TMs work under the assumption that previously translated segments can serve as good models for new translations, especially when translating technical or domain-specific texts. Translators input new texts into the CAT tool and these texts are divided into shorter segments. The TM engine then checks whether there are segments in the memory which are as similar as possible to those from the input text. Every time the software finds a similar segment in the memory, the tool shows it to the translator as a suitable suggestion, usually through a graphical interface. In this scenario, translators work as post-editors by correcting retrieved segments suggested by the CAT tool or translating new segments from scratch. This process is done iteratively, and every new translation increases the size of the translation memory, making it both more useful and more helpful for future translations.

Although at first it might sound very simplistic, the process of matching source and target segments and retrieving translated segments from the TM is far from trivial. To improve the retrieval engines, researchers have been working on different ways of incorporating semantic knowledge such as paraphrasing (Utiyama et al., 2011; Gupta and Orăsan, 2014; Gupta et al., 2015) as well as syntax (Clark, 2002; Gotti et al., 2005) in this process. Another recent direction that research in CAT tools is taking is the integration of both TM and machine translation (MT) output (He et al., 2010; Kanavos and Kartsaklis, 2010). With the improvement of state-of-the-art MT systems, MT output is no longer used just for gisting; it is now being used in real-world translation projects. Taking advantage of these improvements, CAT tools such as MateCat1 have been integrating MT output along with TMs in the list of suitable suggestions (Cettolo et al., 2013).

In this paper we are concerned both with retrieval and with the post-editing interface of TMs. We present a new CAT tool called CATaLog2 which is language pair independent and allows users to upload their own memories in the tool. Examples showing the basic functionalities of CATaLog are presented using English-Bengali data.

1 www.matecat.com

2 The tool will be released as freeware, open-source software. For more information use the following URL: http://ttg.uni-saarland.de/software/catalog

2 Related Work

CAT tools have become very popular in the translation and localization industries in the last two decades. They are used by many language service providers and freelance translators to improve translation quality and to increase translators' productivity (Lagoudaki, 2008). Although the work presented in this paper focuses on TM, it should also be noted that there have been many studies on MT post-editing published in the last few years (Specia, 2011; Green et al., 2013; Green, 2014) and, as mentioned in the last section, one of the recent trends is the development of hybrid systems that are able to combine MT with TM output. Therefore, work on MT post-editing presents significant overlap with state-of-the-art CAT tools and with what we propose in this paper.

Substantial work has also been carried out on improving translation recommendation systems, which recommend post-editors either to use TM output or MT output (He et al., 2010). To achieve good performance with this kind of system, researchers typically train a binary classifier (e.g., Support Vector Machines) to decide which output (TM or MT) is most suitable to use for post-editing. Work on integrating MT with TM has also been done to make TM output more suitable for post-editing, diminishing translators' effort (Kanavos and Kartsaklis, 2010). Another study presented a Dynamic Translation Memory, which identifies the longest common subsequence in the closest matching source segment, identifies the corresponding subsequence in its translation, and dynamically adds this source-target phrase pair to the phrase table of a phrase-based statistical MT (PB-SMT) system (Biçici and Dymetman, 2008).

Simard and Isabelle (2009) reported work on the integration of PB-SMT with TM technology in a CAT environment, in which the PB-SMT system exploits the most similar matches by making use of TM-based feature functions. Koehn and Senellart (2010) reported another MT-TM integration strategy where the TM is used to retrieve matching source segments and mismatched portions are translated by an SMT system to fill in the gaps.

Even though this paper describes work in progress, our aim is to develop a tool that is as intuitive as possible for end users, and this should have a direct impact on translators' performance and productivity. In recent years, several productivity studies were also carried out measuring different aspects of the translation process, such as cognitive load, effort, time and quality, as well as other criteria (Bowker, 2005; O'Brien, 2006; Guerberof, 2009; Plitt and Masselot, 2010; Federico et al., 2012; Guerberof, 2012; Zampieri and Vela, 2014). User studies were taken into account when developing CATaLog, as our main motivation is to improve the translation workflow. In this paper, however, we do not yet explore the impact of our tool on the translation process, because the functionalities required for this kind of study are currently under development in CATaLog. Future work aims to investigate the impact of the new features we are proposing on the translator's work.

3 System Description

We demonstrate the functionalities and features of CATaLog in an English-Bengali translation task. The TM database consists of English sentences taken from the BTEC3 (Basic Travel Expression Corpus) corpus and their Bengali translations4. Unseen input or test segments are provided to the post-editing tool, and the tool matches each of the input segments to the most similar segments contained in the TM. TM segments are then ranked according to their similarity to the test sentence using the popular Translation Error Rate (TER) metric (Snover et al., 2009). The top 5 most similar segments are chosen and presented to the translator, ordered by their similarity.

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts are displayed in red. The colors help translators to visualize instantaneously how similar the five suggested segments are to the input segment and which one of them requires the least effort to be post-edited.

3 The BTEC corpus contains tourism-related sentences similar to those that are usually found in phrase books for tourists going abroad.

4 Work in progress.

3.1 Finding Similarity

For finding the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, with respect to the length of the first sentence. Allowable edit operations include insert, delete, substitute and shift. We use the TER metric (using tercom-7.255) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics the human post-editing effort. Moreover, the tercom-7.25 package also produces the alignments between the sentence pair, from which it is very easy to identify which portions in the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ("we","we",C,0) ("want","",D,0) ("to","would",S,0) ("have","like",S,0) ("a","a",C,0) ("table","table",C,0) ("near","by",S,0) ("the","the",C,0) ("window","window",C,0) (".",".",C,0)

we  want  to     have  a  table  near  the  window
|   D     S      S     |  |      S     |    |
we  -     would  like  a  table  by    the  window
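The edit operations behind such an alignment can be recovered with standard dynamic programming. The following sketch is ours, not the tercom implementation the tool actually uses: it models insertion, deletion and substitution with unit cost but omits TER's shift operation.

```python
def edit_alignment(tm_words, input_words):
    """Word-level edit alignment between a TM segment and an input
    segment: C = match, S = substitution, D = deletion (word only in
    the TM segment), I = insertion (word only in the input). Unlike
    full TER, shifts are not modeled."""
    m, n = len(tm_words), len(input_words)
    # d[i][j] = minimum edits to turn tm_words[:i] into input_words[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (tm_words[i - 1] != input_words[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace to recover the operation labels.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (tm_words[i - 1] != input_words[j - 1]):
            ops.append('C' if tm_words[i - 1] == input_words[j - 1] else 'S')
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append('D')
            i -= 1
        else:
            ops.append('I')
            j -= 1
    return d[m][n], ops[::-1]

tm = "we want to have a table near the window".split()
inp = "we would like a table by the window".split()
cost, ops = edit_alignment(tm, inp)
# cost == 4; ops == ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C']
```

On the example pair above (punctuation omitted), the recovered labels reproduce the C/D/S sequence of the TER alignment.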

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be a better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words in this case. This justifies assigning a minimal weight to the deletion operation, which prefers TM segment 1 over TM segment 2 for the test segment shown above.
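This preference can be reproduced with a weighted edit distance. The sketch below is illustrative: the paper does not give CATaLog's actual weights, so a hypothetical deletion cost of 0.1 against 1.0 for the other operations is assumed.

```python
def weighted_edit_cost(tm_seg, test_seg, del_cost=0.1, ins_cost=1.0, sub_cost=1.0):
    """Cost of converting a TM segment into the test segment.

    Deleting a TM word is cheap (post-editors remove highlighted words
    quickly); inserting or substituting a translation is expensive.
    The 0.1 / 1.0 weights are illustrative, not the tool's real values."""
    m, n = len(tm_seg), len(test_seg)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost          # delete leftover TM words
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost          # insert missing test words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (sub_cost if tm_seg[i - 1] != test_seg[j - 1] else 0.0)
            d[i][j] = min(sub, d[i - 1][j] + del_cost, d[i][j - 1] + ins_cost)
    return d[m][n]

test = "how much does it cost".split()
seg1 = "how much does it cost to the holiday inn by taxi".split()
seg2 = "how much".split()
# seg1 needs 6 cheap deletions (cost 0.6); seg2 needs 3 insertions
# (cost 3.0), so seg1 is ranked first, as argued above.
ranked = sorted([seg1, seg2], key=lambda s: weighted_edit_cost(s, test))
```

With equal weights the ranking would flip, which is exactly the contrast the example is meant to show.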

3.2 Color Coding

From among the top 5 choices, the post-editor selects one reference translation on which to perform the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments, and red portions indicate mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, are used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.
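The projection from source-side TER labels to target-side colors can be sketched as follows. The function name and data layout are ours: we assume the TER aligner yields one label per TM source word and that the GIZA++ alignment is available as zero-indexed (source, target) pairs.

```python
def color_target(ter_ops, alignment, target_len):
    """Project source-side TER labels onto the target translation.

    ter_ops   : one label per TM source word ('C' = matches the input
                sentence; 'D'/'S'/'I' = mismatch), from the TER aligner.
    alignment : set of zero-indexed (source, target) word-alignment
                pairs, e.g. read from a GIZA++ alignment file.
    Returns one 'green'/'red' label per target word; unaligned target
    words stay red, i.e. they are assumed to need post-editing."""
    colors = ['red'] * target_len
    for src, tgt in alignment:
        if ter_ops[src] == 'C':
            colors[tgt] = 'green'
    return colors

# Labels for "we want to have a table near the window" against the
# input "we would like a table by the window" (cf. the TER alignment
# above), and an alignment mirroring the paper's English-Bengali
# example, converted to zero-based indices with punctuation omitted.
ops = ['C', 'D', 'S', 'S', 'C', 'C', 'S', 'C', 'C']
alignment = {(0, 0), (1, 5), (4, 3), (5, 4), (6, 2), (8, 1)}
colors = color_target(ops, alignment, 6)  # 6 Bengali tokens
```

Only target words aligned to TER-matched source words turn green; everything else is flagged for the post-editor.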

Color coding the TM source sentences makes explicit which portions of matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes smaller sentences may be colored nearly 100% green, but they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words in the input sentence. In this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change i paid eighty dollars

2. i think you've got the wrong number

3. you are wrong

4. you pay me

5. you're overcharging me

Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen)

3. আপিন ভল (Gloss: apni vul)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen)

For the input sentence shown above, the TM system shows the above color coded 5 topmost TM matches, in order of their relevance with respect to the post-editing effort (as deemed by the TM) required to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval based on an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words, as we want to store the words in their surface form: only if a word appears in exactly the same form in an input sentence do we consider it a match. For each word in the vocabulary we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists. This feature exists to compare results with and without posting lists; in the ideal scenario, the TM output should be the same in both cases, though the time taken to produce the output will differ significantly.
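A minimal sketch of this candidate filtering, with an illustrative stop-word list (the paper does not specify the one actually used):

```python
from collections import defaultdict

# Illustrative stop-word list; not the tool's actual list.
STOP_WORDS = {"the", "a", "an", "to", "of", "is", "are", "do", "does", "it"}

def build_postings(tm_sources):
    """Map each lowercased, non-stop-word surface form to the ids of
    the TM source sentences containing it. No stemming is applied:
    only an exact surface-form match should count as a match."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidate_segments(input_sentence, postings):
    """Union of the posting lists of the input words: only these TM
    sentences share a vocabulary word with the input and need to be
    scored with TER, shrinking the search space."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["how much does it cost to the holiday inn by taxi",
      "we want to have a table near the window",
      "you pay me"]
postings = build_postings(tm)
cands = candidate_segments("how much does it cost", postings)
```

Here only the first TM sentence shares a vocabulary word with the input, so the TER scoring step is run on one candidate instead of three.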

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using the comma and semi-colon as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the weights assigned to the different edit operations. We believe these weights can be used to optimize system performance.

Figure 1: Screenshot with color coding scheme

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us with valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9
Chatzitheodorou, Konstantinos 24
Horák, Aleš 31
Medveď, Marek 31
Mitkov, Ruslan 17
Naskar, Sudip Kumar 36
Nayek, Tapas 36
Pal, Santanu 36
Parra Escartín, Carla 1
Raisa Timonera, Katerina 17
van Genabith, Josef 36
Vela, Mihaela 36
Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces
Page 44: Proceedings of the Workshop on Natural Language … Language Processing for Translation Memories ... Proceedings of the Workshop on Natural Language Processing for Translation

the tool Examples showing the basic func-tionalities of CATaLog are presented using En-glish - Bengali data

2 Related Work

CAT tools have become very popular in thetranslation and localization industries in thelast two decades They are used by many lan-guage service providers freelance translatorsto improve translation quality and to increasetranslatorrsquos productivity (Lagoudaki 2008)Although the work presented in this paper fo-cuses on TM it should also be noted thatthere were many studies on MT post-editingpublished in the last few years (Specia 2011Green et al 2013 Green 2014) and as men-tioned in the last section one of the recenttrends is the development of hybrid systemsthat are able to combine MT with TM outputTherefore work on MT post-editing presentssignificant overlap with state-of-the-art CATtools and to what we propose in this paper

Substantial work have also been carried outon improving translation recommendation sys-tems which recommends post-editors either touse TM output or MT output (He et al 2010)To achieve good performance with this kind ofsystems researchers typically train a binaryclassifier (eg Support Vector Machines) todecide which output (TM or MT) is most suit-able to use for post-editing Work on integrat-ing MT with TM has also been done to makeTM output more suitable for post-editingdiminishing translatorsrsquo effort (Kanavos andKartsaklis 2010) Another study presenteda Dynamic Translation Memory which iden-tifies the longest common subsequence in thethe closest matching source segment identifiesthe corresponding subsequence in its transla-tion and dynamically adds this source-targetphrase pair to the phrase table of a phrase-based ststistical MT (PB-SMT) system (Biccediliciand Dymetman 2008)

Simard and Isabelle (2009) reported a workon integration of PB-SMT with TM technol-ogy in a CAT environment in which the PB-SMT system exploits the most similar matchesby making use of TM-based feature func-tions Koehn and Senellart (2010) reportedanother MT-TM integration strategy whereTM is used to retrieve matching source seg-

ments and mismatched portions are translatedby an SMT system to fill in the gaps

Even though this paper describes work inprogress our aim is to develop a tool thatis as intuitive as possible for end users andthis should have direct impact on transla-torsrsquo performance and productivity In therecent years several productive studies werealso carried out measuring different aspectsof the translation process such as cognitiveload effort time quality as well as other crite-ria (Bowker 2005 OrsquoBrien 2006 Guerberof2009 Plitt and Masselot 2010 Federico etal 2012 Guerberof 2012 Zampieri and Vela2014) User studies were taken into accountwhen developing CATaLog as our main moti-vation is to improve the translation workflowIn this paper however we do not yet explorethe impact of our tool in the translation pro-cess because the functionalities required forthis kind of study are currently under devel-opment in CATaLog Future work aims to in-vestigate the impact of the new features we areproposing on the translatorrsquos work

3 System DescriptionWe demonstrate the functionalities and fea-tures of CATaLog in an English - Bengalitranslation task The TM database consistsof English sentences taken from BTEC3 (Ba-sic Travel Expression Corpus) corpus and theirBengali translations4 Unseen input or testsegments are provided to the post-editing tooland the tool matches each of the input seg-ments to the most similar segments containedin the TM TM segments are then ranked ac-cording their the similarity to the test sen-tence using the popular Translation ErrorRate (TER) metric (Snover et al 2009) Thetop 5 most similar segments are chosen andpresented to the translator ordered by theirsimilarity

One very important aspect of computing similarity is alignment. Each test (input) segment in the source language (SL) is aligned with the reference SL sentences in the TM, and each SL sentence in the TM is aligned to its respective translation. From these two sets of alignments, we apply a method to find out which parts of the translation are relevant with respect to the test sentence and which are not, i.e., which parts of the TM translation should remain intact after post-editing and which portions should be edited. After this process, matched and unmatched parts are color-coded for better visualization: matched parts are displayed in green and unmatched parts in red. The colors help translators to see at a glance how similar the five suggested segments are to the input segment and which of them requires the least post-editing effort.

3 The BTEC corpus contains tourism-related sentences similar to those usually found in phrase books for tourists going abroad.

4 Work in progress.

3.1 Finding Similarity

To find the similar and dissimilar parts between the test segment and a matching TM segment, we use TER alignments. TER is an error metric: it gives an edit ratio (often referred to as edit rate or error rate) in terms of how much editing is required to convert one sentence into another, relative to the length of the first sentence. Allowable edit operations include insertion, deletion, substitution, and shift. We use the TER metric (using tercom-7.2.5) to find the edit rate between a test sentence and the TM reference sentences.

Simard and Fujita (2012) first proposed the use of MT evaluation metrics as similarity functions in implementing TM functionality. They experimented with several MT evaluation metrics, viz. BLEU, NIST, Meteor and TER, and studied their behavior with respect to TM performance. In the TM tool presented here, we use TER as the similarity metric, as it is very fast and lightweight and it directly mimics human post-editing effort. Moreover, the tercom-7.2.5 package also produces the alignments between the sentence pair, from which it is easy to identify which portions of the matching segment match the input sentence and which portions need to be worked on. Given below are an input sentence, a TM match, and the TER alignment between them, where C represents a match (shown as the vertical bar '|') and I, D and S represent the three post-editing actions: insertion, deletion and substitution, respectively.

5 http://www.cs.umd.edu/~snover/tercom/

Input: we would like a table by the window .

TM Match: we want to have a table near the window .

TER alignment: ["we","we",C,0], ["want","",D,0], ["to","would",S,0], ["have","like",S,0], ["a","a",C,0], ["table","table",C,0], ["near","by",S,0], ["the","the",C,0], ["window","window",C,0], [".",".",C,0]

we want to    have a table near the window .
|  D    S     S    |   |     S   |    |    |
we -    would like a table by   the window .
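The word-level edit rate used for ranking can be sketched as follows. This is a simplified, hypothetical version: real TER, as implemented in tercom, also allows block shifts, which are omitted here, so its scores can differ.

```python
# Simplified word-level edit rate in the spirit of TER (illustrative only).
# Real TER, as computed by tercom, also allows block shifts; this sketch
# covers only insertion, deletion and substitution.
def edit_rate(test, match):
    a, b = test.split(), match.split()
    m, n = len(a), len(b)
    # d[i][j] = minimum edits to turn a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    # normalize by the length of the test sentence, as described above
    return d[m][n] / max(m, 1)

test = "we would like a table by the window ."
candidates = ["we want to have a table near the window .",
              "you are wrong ."]
# rank TM candidates by ascending edit rate (lower = more similar)
ranked = sorted(candidates, key=lambda c: edit_rate(test, c))
```

On the example above, the first candidate needs 3 substitutions and 1 insertion over 9 test tokens, so it ranks ahead of the unrelated sentence.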

Since we want to rank reference sentences based on their similarity to the test sentence, we use the TER score in an inverse way: TER being an error metric, the TER score is proportional to how dissimilar two sentences are, i.e., the lower the TER score, the higher the similarity. We could directly use the TER score for ranking sentences. However, in our present system we have used our own scoring mechanism based on the alignments provided by TER. TER gives equal weight to each edit operation, i.e., deletion, insertion, substitution and shift. However, in post-editing, deletion takes much less time than the other editing operations, so different costs for different edit operations should yield better results. These edit costs or weights can be adjusted to get better output from the TM. In the present system, we assigned a very low weight to delete operations and equal weights to the other three edit operations. To illustrate why different editing costs matter, let us consider the example below.

• Test segment: how much does it cost

• TM segment 1: how much does it cost to the holiday inn by taxi

• TM segment 2: how much

If each edit operation is assigned an equal weight, then according to the TER score TM segment 2 would be the better match for the test segment, as TM segment 2 involves inserting translations for 3 non-matching words of the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of 3 non-matching words into the translation of TM segment 2. This justifies assigning a minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation for the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate a mismatch.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work. However, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window .

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })
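Reading such an alignment line into a usable structure can be sketched as follows. This is a hypothetical helper, assuming GIZA++'s standard '({ ... })' A3 notation; the example line is re-typed from the bullet above.

```python
import re

# Hypothetical parser for one source line of GIZA++'s A3 alignment output,
# e.g. 'NULL ({ }) we ({ 1 }) want ({ 6 }) ...'; each '({ ... })' group
# holds the 1-based target positions aligned to that source word.
def parse_giza_line(line):
    pairs = []
    for word, positions in re.findall(r'(\S+)\s+\(\{([\d\s]*)\}\)', line):
        pairs.append((word, [int(p) for p in positions.split()]))
    return pairs

line = ("NULL ({ }) we ({ 1 }) want ({ 6 }) to ({ }) have ({ }) a ({ 4 }) "
        "table ({ 5 }) near ({ 3 }) the ({ }) window ({ 2 }) . ({ 7 })")
align = dict(parse_giza_line(line))
```

Words aligned to nothing (such as "to" and "have" above) come out with an empty position list, which the coloring step can treat as having no target-side counterpart.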

GIZA++ generates the alignment between the TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason for color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.
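The two-step coloring described above can be sketched as follows. This is an illustrative toy with made-up data structures, reusing the "table near the window" example: TER decides which TM-source positions match the input, and the word alignment projects that decision onto the target side.

```python
def color_target(ter_match, word_alignment, target_len):
    # ter_match[i] is True if TER marked TM-source position i+1 as a
    # match (a 'C' operation); word_alignment holds 1-based
    # (source_pos, [target_pos, ...]) pairs from the word aligner.
    colors = ["red"] * target_len
    for src_pos, tgt_positions in word_alignment:
        if ter_match[src_pos - 1]:
            for t in tgt_positions:
                colors[t - 1] = "green"
    return colors

# TM source: we want to have a table near the window .
# TER marks positions 1, 5, 6, 8, 9 and 10 as matches.
ter_match = [True, False, False, False, True, True, False, True, True, True]
# GIZA++-style source->target alignment for the Bengali translation.
word_alignment = [(1, [1]), (2, [6]), (3, []), (4, []), (5, [4]),
                  (6, [5]), (7, [3]), (8, []), (9, [2]), (10, [7])]
colors = color_target(ter_match, word_alignment, 7)
```

Target positions reachable only from unmatched source words stay red, which is exactly the signal the post-editor uses to find the fragments to rework.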

Input: you gave me wrong number

Source Matches:

1. you gave me the wrong change . i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me

5. you 're overcharging me

Target Matches:

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলারিদেয়িছ (Gloss: apni amake vul khuchro diyechen . ami ashi dollar diyechi .)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen .)

3. আপিন ভল (Gloss: apni vul .)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din .)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen .)

For the input sentence shown above, the TM system shows these color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as deemed by the TM) required to produce the translation of the input sentence.

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., the source TM segments) will take a lot of time. To improve search efficiency, we make use of the concept of posting lists, a de facto standard in information retrieval, using an inverted index.

We create a (source) vocabulary list from the training TM data after removing stop words and other tokens which occur very frequently and have little importance in determining similarity. All the words are then lowercased. Unlike in information retrieval, we do not perform any stemming of the words: we want to store the words in their surface form, so that we consider a match only if a word appears in the same form in some input sentence. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider those TM source sentences for similarity measurement which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option whether to use these posting lists or not; this feature is there to compare results with and without posting lists. In the ideal scenario, the TM output for both should be the same, though the time taken to produce the output will be significantly different.
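The posting-list filter can be sketched as follows. This is an illustrative sketch; the stop-word list and TM contents are made up, and the real tool's index format is not specified in the paper.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the", "to"}  # illustrative stop list, not the tool's

def build_index(tm_sources):
    # word (lowercased surface form, no stemming) -> ids of TM segments
    # containing that word; built offline, once, over the TM database
    index = defaultdict(set)
    for seg_id, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                index[word].add(seg_id)
    return index

def candidate_ids(index, test_sentence):
    # union of the posting lists of the input words: only these TM
    # segments are then scored with TER, shrinking the search space
    ids = set()
    for word in test_sentence.lower().split():
        ids |= index.get(word, set())
    return ids

tm = ["we want to have a table near the window",
      "how much does it cost",
      "you are wrong"]
idx = build_index(tm)
cand = candidate_ids(idx, "we would like a table by the window")
```

For the example input, only the first TM segment shares a content word with it, so the expensive TER alignment runs against one candidate instead of three.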

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Post-editors can generate TM output for an entire input file at a time using this option. In this case, the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process; the popular keystroke logging is among them. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semi-colons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the calculation of similarity scores to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information and Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system. February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.

Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases. Pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).

Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Timonera, Katerina Raisa, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36

  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces

Franz Josef Och and Hermann Ney 2003A systematic comparison of various statisticalalignment models Computational linguistics29(1)19ndash51

Mirko Plitt and Franccedilois Masselot 2010 A pro-ductivity test of statistical machine translationpost-editing in a typical localisation contextThe Prague Bulletin of Mathematical Linguis-tics 937ndash16

Michel Simard and Atsushi Fujita 2012 APoor Manrsquos Translation Memory Using MachineTranslation Evaluation Metrics In Proceedingsof the 10th Biennial Conference of the Associ-ation for Machine Translation in the Americas(AMTA) San Diego California USA

Michel Simard and Pierre Isabelle 2009 Phrase-based machine translation in a computer-assisted translation environment Proceedings ofthe Twelfth Machine Translation Summit (MTSummit XII) pages 120ndash127

Matthew Snover Nitin Madnani Bonnie Dorr andRichard Schwartz 2009 Fluency Adequacyor HTER Exploring Different Human Judg-ments with a Tunable MT Metric In Proceed-ings of the Fourth Workshop on Statistical Ma-chine Translation EACL 2009

Lucia Specia 2011 Exploiting objective annota-tions for measuring translation post-editing ef-fort In Proceedings of the 15th Conference ofthe European Association for Machine Transla-tion pages 73ndash80 Leuven Belgium

Masao Utiyama Graham Neubig Takashi Onishiand Eiichiro Sumita 2011 Searching Transla-tion Memories for Paraphrases pages 325ndash331

Marcos Zampieri and Mihaela Vela 2014 Quanti-fying the Influence of MT Output in the Trans-latorsrsquo Performance A Case Study in Techni-cal Translation In Proceedings of the Workshopon Humans and Computer-assisted Translation(HaCaT 2014)

42

Author Index

Baisa Viacutet 31Barbu Eduard 9

Chatzitheodoroou Konstantinos 24

Horak Ales 31

Medvedrsquo Marek 31Mitkov Ruslan 17

Naskar Sudip Kumar 36Nayek Tapas 36

Pal Santanu 36Parra Escartiacuten Carla 1

Raisa Timonera Katerina 17

van Genabith Josef 36Vela Mihaela 36

Zampieri Marcos 36

43

  • Program
  • Creation of new TM segments Fulfilling translators wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces
Page 46: Proceedings of the Workshop on Natural Language … Language Processing for Translation Memories ... Proceedings of the Workshop on Natural Language Processing for Translation

TM segment 2 would be a better match with respect to the test segment, as TM segment 2 involves inserting translations for 3 non-matching words in the test segment ("does it cost"), as opposed to deleting translations for 6 non-matching words ("to the holiday inn by taxi") in the case of TM segment 1. However, deleting the translations of the 6 non-matching words from the translation of TM segment 1, which are already highlighted in red by the TM, takes much less cognitive effort and time than inserting the translations of the 3 non-matching words into the translation of TM segment 2. This justifies assigning minimal weight to the deletion operation, which makes the TM prefer segment 1 over segment 2 for the test segment shown above.
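This asymmetry between edit operations can be sketched as a word-level edit distance in which deletion is weighted far below insertion and substitution. The weights and the three segments below are illustrative reconstructions of the discussion above, not CATaLog's actual values or TM entries:

```python
def weighted_edit_distance(tm_seg, test_seg, w_ins=1.0, w_del=0.3, w_sub=1.0):
    """Cost of turning a TM source segment into the test segment.

    Deletion is deliberately cheap (w_del < w_ins): removing words the TM
    has already highlighted in red takes less post-editing effort than
    typing in translations for missing words.
    """
    n, m = len(tm_seg), len(test_seg)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + w_del              # delete leftover TM words
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + w_ins              # insert missing test words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if tm_seg[i - 1] == test_seg[j - 1] else w_sub
            d[i][j] = min(d[i - 1][j] + w_del,     # deletion
                          d[i][j - 1] + w_ins,     # insertion
                          d[i - 1][j - 1] + sub)   # match / substitution
    return d[n][m]

# Hypothetical segments mirroring the example above.
test = "how much does it cost".split()
tm_seg1 = "how much does it cost to the holiday inn by taxi".split()
tm_seg2 = "how much".split()
```

With these weights, tm_seg1 costs 6 × 0.3 = 1.8 (six deletions) while tm_seg2 costs 3 × 1.0 = 3.0 (three insertions), so the cheap-deletion scheme ranks segment 1 first, as described above.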

3.2 Color Coding

Among the top 5 choices, the post-editor selects one reference translation on which to carry out the post-editing task. To make that decision process easy, we color code the matched and unmatched parts in each reference translation: green portions indicate matched fragments and red portions indicate mismatches.

The alignments between the TM source sentences and their corresponding translations are generated using GIZA++ (Och and Ney, 2003) in the present work; however, any other word aligner, e.g., the Berkeley Aligner (Liang et al., 2006), could be used to produce this alignment. The alignment between the matched source segment and the corresponding translation, together with the TER alignment between the input sentence and the matched source segment, is used to generate the aforementioned color coding between selected source and target sentences. The GIZA++ alignment file is directly fed into the present TM tool. Given below is an example TM sentence pair along with the corresponding word alignment input to the TM.

• English: we want to have a table near the window

• Bengali: আমরা জানালার কােছ একটা েটিবল চাই

• Alignment: NUL ( ) we ( 1 ) want ( 6 ) to ( ) have ( ) a ( 4 ) table ( 5 ) near ( 3 ) the ( ) window ( 2 ) . ( 7 )
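The alignment notation in the bullet above (each source word followed by the 1-based target positions it links to, with NUL collecting unaligned words) can be read back into a usable structure with a few lines of code. This parser is a sketch for that simplified notation, not for GIZA++'s full output format:

```python
import re

def parse_alignment(line):
    """Map each (position, word) on the source side to the list of 1-based
    target positions it is aligned to; an empty list means unaligned."""
    links = {}
    for pos, (word, idx) in enumerate(re.findall(r"(\S+)\s*\(\s*([\d\s]*)\)", line)):
        links[(pos, word)] = [int(k) for k in idx.split()]
    return links
```

For instance, parsing the alignment line above yields `[1]` for "we" and `[6]` for "want", while "to" and "have" come out unaligned.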

GIZA++ generates the alignment between TM source sentences and target sentences. This alignment file is generated offline, only once, on the TM database. TER gives us the alignments between a test sentence and the corresponding top 5 matching sentences. Using these two sets of alignments, we color the matched fragments of the selected source sentences and their corresponding translations in green and the unmatched fragments in red.

Color coding the TM source sentences makes explicit which portions of the matching TM source sentences match the test sentence and which ones do not. Similarly, color coding the TM target sentences serves two purposes. Firstly, it makes the decision process easier for the translators as to which TM match to choose and work on, depending on the ratio of the color codes. Secondly, it guides the translators as to which fragments to post-edit. The reason behind color coding both the TM source and target segments is that a longer (matched or non-matched) source fragment might correspond to a shorter target fragment, or vice versa, due to language divergence. A reference translation which has more green fragments than red fragments will be a good candidate for post-editing. Sometimes shorter sentences may get nearly 100% green color, yet they are not good candidates for post-editing, since post-editors might have to insert translations for more non-matched words of the input sentence; in this context it should be noted that insertion and substitution are the most costly operations in post-editing. However, such sentences will not be preferred by the TM, as we assign a higher cost to insertion than to deletion, and hence such sentences will not be shown as the top candidates by the TM. Figure 1 presents a snapshot of CATaLog.

Input: you gave me wrong number

Source Matches

1. you gave me the wrong change i paid eighty dollars

2. i think you 've got the wrong number

3. you are wrong

4. you pay me

5. you 're overcharging me

Target Matches

1. আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলার িদেয়িছ (Gloss: apni amake vul khuchro diyechen ami ashi dollar diyechi.)

2. আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Gloss: amar dharona apni vul nombore phon korechen.)

3. আপিন ভল (Gloss: apni vul.)

4. আপিন আমােক টাকা িদন (Gloss: apni amake taka din.)

5. আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss: apni amar kache theke beshi nichchen.)

For the input sentence shown above, the TM system shows the above-mentioned color-coded 5 topmost TM matches in order of their relevance with respect to the post-editing effort (as estimated by the TM) required to produce the translation of the input sentence.
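The coloring step itself needs only the two alignment sets described above: target words linked (via the source–target word alignment) to source positions that the TER alignment marked as matched turn green, and everything else turns red. The sketch below uses a made-up English–French pair with hand-written alignment links purely for illustration (the tool's actual TM is English–Bengali, and these are not real GIZA++ links):

```python
def color_target(tm_tgt, src_tgt_links, matched_src_idx):
    """Color each target word of a TM match.

    tm_tgt          : target-side tokens of the TM segment
    src_tgt_links   : (source_idx, target_idx) word-alignment pairs
    matched_src_idx : source positions the TER alignment marked as matched
    """
    green = {t for s, t in src_tgt_links if s in matched_src_idx}
    return [(w, "green" if i in green else "red") for i, w in enumerate(tm_tgt)]

# TM pair: "you gave me the wrong change" -> hypothetical French translation.
tm_tgt = "vous m' avez donné la mauvaise monnaie".split()
links = [(0, 0), (1, 3), (1, 2), (2, 1), (3, 4), (4, 5), (5, 6)]
matched = {0, 1, 2, 4}   # "you", "gave", "me", "wrong" matched the input
```

For the input "you gave me wrong number", everything except "la" (aligned to the unmatched "the") and "monnaie" (aligned to the unmatched "change") comes out green.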

3.3 Improving Search Efficiency

Comparing every input sentence against all the TM source segments makes the TM very slow. In a practical scenario, in order to get good results from a TM, the TM database should be as large as possible; in that case, determining the TER alignments for all the reference sentences (i.e., source TM segments) will take a lot of time. To improve search efficiency, we make use of posting lists, a de facto standard in information retrieval based on an inverted index.

We create a (source) vocabulary list on the training TM data after removing stop words and other tokens which occur very frequently and carry little weight in determining similarity. All words are then lowercased. Unlike in information retrieval, we do not perform any stemming, as we want to store the words in their surface form: only if a word appears in exactly the same form in an input sentence do we consider it a match. For each word in the vocabulary, we maintain a posting list of the sentences which contain that word.

We only consider for similarity measurement those TM source sentences which contain one or more vocabulary words of the input sentence. This reduces the search space and the time taken to produce the TM output. The CATaLog tool provides an option for whether or not to use these posting lists, which makes it possible to compare results with and without them. In the ideal scenario the TM output should be the same in both cases, though the time taken to produce it will be significantly different.
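A minimal version of this candidate filtering can be sketched as follows; the stop-word list and TM sentences are illustrative stand-ins, not the tool's actual data:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "to", "of", "is"}   # illustrative list

def build_postings(tm_sources):
    """Inverted index: lowercased surface form -> set of TM sentence ids.
    Stop words are dropped; no stemming, so only exact surface forms match."""
    postings = defaultdict(set)
    for sid, sentence in enumerate(tm_sources):
        for word in sentence.lower().split():
            if word not in STOP_WORDS:
                postings[word].add(sid)
    return postings

def candidate_sentences(postings, input_sentence):
    """Union of posting lists: only TM sentences sharing at least one
    vocabulary word with the input are scored, shrinking the search space."""
    ids = set()
    for word in input_sentence.lower().split():
        ids |= postings.get(word, set())
    return ids

tm = ["you gave me the wrong change",
      "we want a table near the window",
      "i think you 've got the wrong number"]
```

Here `candidate_sentences(build_postings(tm), "you gave me wrong number")` returns only sentences 0 and 2; sentence 1 shares no content word with the input and is never compared.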

3.4 Batch Translation

The tool also provides an option for translating sentences in bulk mode. Using this option, post-editors can generate TM output for an entire input file at a time. In this case the TM output is saved in a log file which the post-editors can directly work with later in offline mode, i.e., without using the TM tool.

4 Conclusions and Future Work

This paper presents ongoing research and development of a new TM-based CAT tool and post-editing interface entitled CATaLog. Even though it describes work in progress, we believe some interesting new insights are discussed and presented in this paper. The tool will be made available in the upcoming months as free, open-source software.

We are currently working on different features to measure time and cognitive load in the post-editing process, among them the popular technique of keystroke logging. We would like to investigate the impact of the innovations presented here in real-world experimental settings with human translators.

We are integrating and refining a couple of features in the tool, for example sentence and clause segmentation, using commas and semicolons as good indicators. It should also be noted that in this paper we considered only word alignments. In the future we would also like to explore how multi-word expressions (MWEs) and named entities (NEs) can help in TM retrieval and post-editing.

We are also exploring contextual, syntactic and semantic features which can be included in the similarity score calculation to retrieve more appropriate translations. Another improvement we are currently working on concerns the assignment of weights to the different edit operations.


Figure 1: Screenshot with color coding scheme

We believe these weights can be used to optimize system performance.

Finally, another feature that we are investigating is named entity tagging. Named entity lists and gazetteers can be used to identify and to translate named entities in the input text. This will help reduce the translation time and effort for post-editors. The last two improvements we mentioned are, of course, language dependent.

Acknowledgements

We would like to thank the anonymous NLP4TM reviewers, who provided us valuable feedback to improve this paper as well as new ideas for future work.

English to Indian language Machine Translation (EILMT) is a project funded by the Department of Information Technology (DIT), Government of India.

Santanu Pal is supported by the People Programme (Marie Curie Actions) of the European Union's Framework Programme (FP7/2007-2013) under REA grant agreement no. 317471.

References

Ergun Biçici and Marc Dymetman. 2008. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Computational Linguistics and Intelligent Text Processing, pages 454–465. Springer.

Lynne Bowker. 2005. Productivity vs Quality? A pilot study on the impact of translation memory systems. Localisation Reader, pages 133–140.

Mauro Cettolo, Christophe Servan, Nicola Bertoldi, Marcello Federico, Loïc Barrault, and Holger Schwenk. 2013. Issues in incremental adaptation of statistical MT from human post-edits. In Proceedings of the Workshop on Post-editing Technology and Practice (WPTP-2), Nice, France.

J.P. Clark. 2002. System, method, and product for dynamically aligning translations in a translation-memory system, February 5. US Patent 6,345,244.

Marcello Federico, Alessandro Cattelan, and Marco Trombetti. 2012. Measuring User Productivity in Machine Translation Enhanced Computer Assisted Translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA).

Fabrizio Gotti, Philippe Langlais, Elliott Macklovitch, Benoit Robichaud, Didier Bourigault, and Claude Coulombe. 2005. 3GTM: A Third-Generation Translation Memory. In 3rd Computational Linguistics in the North-East (CLiNE) Workshop, Gatineau, Québec, August.

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít, 31
Barbu, Eduard, 9
Chatzitheodorou, Konstantinos, 24
Horák, Aleš, 31
Medveď, Marek, 31
Mitkov, Ruslan, 17
Naskar, Sudip Kumar, 36
Nayek, Tapas, 36
Pal, Santanu, 36
Parra Escartín, Carla, 1
Raisa Timonera, Katerina, 17
van Genabith, Josef, 36
Vela, Mihaela, 36
Zampieri, Marcos, 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces
Page 47: Proceedings of the Workshop on Natural Language … Language Processing for Translation Memories ... Proceedings of the Workshop on Natural Language Processing for Translation

5 you rsquore overcharging me

Target Matches

1 আপিন আমােক ভল খচেরা িদেয়েছন আিম আিশ ডলারিদেয়িছ (Gloss apni amake vul khuchrodiyechen ami ashi dollar diyechi )

2 আমার ধারণা আপিন ভল ন ের েফান কেরেছন (Glossamar dharona apni vul nombore phon ko-rechen )

3 আপিন ভল (Gloss apni vul )

4 আপিন আমােক টাকা িদন (Gloss apni amaketaka din )

5 আপিন আমার কােছ েথেক েবিশ িনে ন (Gloss apniamar kache theke beshi nichchen )

For the input sentence shown above theTM system shows the above mentioned colorcoded 5 topmost TM matches in order of theirrelevance with respect to the post-editing ef-fort (as deemed by the TM) for producing thetranslation for the input sentence

33 Improving Search EfficiencyComparing every input sentence against allthe TM source segments makes the TM veryslow In practical scenario in order to getgood results from a TM the TM databaseshould be as large as possible In that case de-termining the TER alignments will take a lottime for all the reference sentences (ie sourceTM segments) For improving the search effi-ciency we make use of the concept of postinglists which is a de facto standard in informa-tion retrieval using inverted index

We create a (source) vocabulary list on thetraining TM data after removing stop wordsand other tokens which occur very frequentlyand have less importance in determining simi-larity All the words are then lowercased Un-like in information retrieval we do not performany stemming of the words as we want to storethe words in their surface form so that if theyappear in the same form as in some input sen-tence only then we will consider it as a matchFor each word in the vocabulary we maintaina posting list of sentences which contain thatword

We only consider those TM source sentencesfor similarity measurement which contain one

or more vocabulary word(s) of the input sen-tence This reduces the search space and thetime taken to produce the TM output TheCATaLog tool provides an option whether touse these postings lists or not This featureis there to compare results using and withoutusing postings lists In ideal scenario TM out-put for both should be the same though timetaken to produce the output will be signifi-cantly different

34 Batch TranslationThe tool also provides an option for translat-ing sentences in bulk mode Post-editors cangenerate TM output for an entire input file ata time using this option In this case the TMoutput is saved in a log file which the post-editors can directly work with later in offlinemode ie without using the TM tool

4 Conclusions and Future Work

This paper presents ongoing research and de-velopment of a new TM-based CAT tool andpost-editing interface entitled CATaLog Eventhough it describes work in progress we be-lieve some interesting new insights are dis-cussed and presented in this paper The toolwill be made available in the upcoming monthsas an open-source free software

We are currently working on different fea-tures to measure time and cognitive load in thepost-editing process The popular keystrokelogging is among them We would like to inves-tigate the impact of the innovations presentedhere in real world experimental settings withhuman translators

We are integrating and refining a coupleof features in the tool as for example sen-tence and clause segmentation using commaand semi-colon as good indicators It shouldalso be noted that in this paper we consideredonly word alignments In the future we wouldalso like to explore how multi-word expressions(MWE) and named entities (NE) can help inTM retrieval and post editing

We are also exploring contextual syntacticand semantic features which can be includedin similarity scores calculation to retrieve moreappropriate translations Another improve-ment we are currently working on concernsweight assignment to different edit operations

40

Figure 1 Screenshot with color cording scheme

We believe these weights can be used to opti-mize system performance

Finally another feature that we are invest-ing is named entity tagging Named entitylists and gazetteers can be used to identify andto translate named entities in the input textThis will help reduce the translation time andeffort for post-editors The last two improve-ments we mentioned are of course languagedependent

AcknowledgementsWe would like to thank the anonymousNLP4TM reviewers who provided us valuablefeedback to improve this paper as well as newideas for future work

English to Indian language Machine Trans-lation (EILMT) is a project funded by theDepartment of Information and Technology(DIT) Government of India

Santanu Pal is supported by the Peo-ple Programme (Marie Curie Actions) ofthe European Unionrsquos Framework Programme(FP72007-2013) under REA grant agreementno 317471

ReferencesErgun Biccedilici and Marc Dymetman 2008 Dy-

namic translation memory using statistical ma-chine translation to improve translation memoryfuzzy matches In Computational Linguisticsand Intelligent Text Processing pages 454ndash465Springer

Lynne Bowker 2005 Productivity vs Quality Apilot study on the impact of translation memorysystems Localisation Reader pages 133ndash140

Mauro Cettolo Christophe Servan NicolaBertoldi Marcello Federico Loic Barraultand Holger Schwenk 2013 Issues in in-cremental adaptation of statistical MT fromhuman post-edits In Proceedings of the Work-shop on Post-editing Technology and Practice(WPTP-2) Nice France

JP Clark 2002 System method and prod-uct for dynamically aligning translations in atranslation-memory system February 5 USPatent 6345244

Marcello Federico Alessandro Cattelan andMarco Trombetti 2012 Measuring UserProductivity in Machine Translation EnhancedComputer Assisted Translation In Proceedingsof Conference of the Association for MachineTranslation in the Americas (AMTA)

Fabrizio Gotti Philippe Langlais ElliottMacklovitch Benoit Robichaud Didier Bouri-gault and Claude Coulombe 2005 3GTM AThird-Generation Translation Memory In 3rdComputational Linguistics in the North-East(CLiNE) Workshop Gatineau Queacutebec aug

Spence Green Jeffrey Heer and Christopher DManning 2013 The efficacy of human post-editing for language translation In Proceedingsof the SIGCHI Conference on Human Factorsin Computing Systems

Spence Green 2014 Mixed-initiative natural lan-guage translation PhD thesis Stanford Uni-versity

Ana Guerberof 2009 Productivity and qualityin the post-editing of outputs from translationmemories and machine translation LocalisationFocus 7(1)133ndash140

Ana Guerberof 2012 Productivity and Qualityin the Post-Edition of Outputs from TranslationMemories and Machine Translation PhD the-sis Rovira and Virgili University Tarragona

41

Rohit Gupta and Constantin Orăsan 2014 In-corporating Paraphrasing in Translation Mem-ory Matching and Retrieval In Proceedings ofEAMT

Rohit Gupta Constantin Orăsan MarcosZampieri Mihaela Vela and Josef van Gen-abith 2015 Can Translation Memories affordnot to use paraphrasing In Proceedings ofEAMT

Yifan He Yanjun Ma Josef van Genabith andAndy Way 2010 Bridging SMT and TM withtranslation recommendation In Proceedings ofACL pages 622ndash630

Panagiotis Kanavos and Dimitrios Kartsaklis2010 Integrating Machine Translation withTranslation Memory A Practical Approach InProceedings of the Second Joint EM+CNGLWorkshop Bringing MT to the User Researchon Integrating MT in the Translation Industry

Philipp Koehn and Jean Senellart 2010 Con-vergence of translation memory and statisticalmachine translation In Proceedings of AMTAWorkshop on MT Research and the TranslationIndustry pages 21ndash31

Elina Lagoudaki 2008 The value of machinetranslation for the professional translator InProceedings of the 8th Conference of the Associ-ation for Machine Translation in the Americaspages 262ndash269 Waikiki Hawaii

Percy Liang Ben Taskar and Dan Klein 2006Alignment by Agreement In Proceedings of theMain Conference on Human Language Technol-ogy Conference of the North American Chapterof the Association of Computational Linguisticspages 104ndash111

Sharon OrsquoBrien 2006 Eye-tracking and transla-tion memory matches Perspectives Studies inTranslatology 14185ndash204

Franz Josef Och and Hermann Ney 2003A systematic comparison of various statisticalalignment models Computational linguistics29(1)19ndash51

Mirko Plitt and Franccedilois Masselot 2010 A pro-ductivity test of statistical machine translationpost-editing in a typical localisation contextThe Prague Bulletin of Mathematical Linguis-tics 937ndash16

Michel Simard and Atsushi Fujita 2012 APoor Manrsquos Translation Memory Using MachineTranslation Evaluation Metrics In Proceedingsof the 10th Biennial Conference of the Associ-ation for Machine Translation in the Americas(AMTA) San Diego California USA

Michel Simard and Pierre Isabelle 2009 Phrase-based machine translation in a computer-assisted translation environment Proceedings ofthe Twelfth Machine Translation Summit (MTSummit XII) pages 120ndash127

Matthew Snover Nitin Madnani Bonnie Dorr andRichard Schwartz 2009 Fluency Adequacyor HTER Exploring Different Human Judg-ments with a Tunable MT Metric In Proceed-ings of the Fourth Workshop on Statistical Ma-chine Translation EACL 2009

Lucia Specia 2011 Exploiting objective annota-tions for measuring translation post-editing ef-fort In Proceedings of the 15th Conference ofthe European Association for Machine Transla-tion pages 73ndash80 Leuven Belgium

Masao Utiyama Graham Neubig Takashi Onishiand Eiichiro Sumita 2011 Searching Transla-tion Memories for Paraphrases pages 325ndash331

Marcos Zampieri and Mihaela Vela 2014 Quanti-fying the Influence of MT Output in the Trans-latorsrsquo Performance A Case Study in Techni-cal Translation In Proceedings of the Workshopon Humans and Computer-assisted Translation(HaCaT 2014)

42

Author Index

Baisa Viacutet 31Barbu Eduard 9

Chatzitheodoroou Konstantinos 24

Horak Ales 31

Medvedrsquo Marek 31Mitkov Ruslan 17

Naskar Sudip Kumar 36Nayek Tapas 36

Pal Santanu 36Parra Escartiacuten Carla 1

Raisa Timonera Katerina 17

van Genabith Josef 36Vela Mihaela 36

Zampieri Marcos 36

43

  • Program
  • Creation of new TM segments Fulfilling translators wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog New Approaches to TM and Post Editing Interfaces
Page 48: Proceedings of the Workshop on Natural Language … Language Processing for Translation Memories ... Proceedings of the Workshop on Natural Language Processing for Translation

Figure 1 Screenshot with color cording scheme

We believe these weights can be used to opti-mize system performance

Finally another feature that we are invest-ing is named entity tagging Named entitylists and gazetteers can be used to identify andto translate named entities in the input textThis will help reduce the translation time andeffort for post-editors The last two improve-ments we mentioned are of course languagedependent

AcknowledgementsWe would like to thank the anonymousNLP4TM reviewers who provided us valuablefeedback to improve this paper as well as newideas for future work

English to Indian language Machine Trans-lation (EILMT) is a project funded by theDepartment of Information and Technology(DIT) Government of India

Santanu Pal is supported by the Peo-ple Programme (Marie Curie Actions) ofthe European Unionrsquos Framework Programme(FP72007-2013) under REA grant agreementno 317471

ReferencesErgun Biccedilici and Marc Dymetman 2008 Dy-

namic translation memory using statistical ma-chine translation to improve translation memoryfuzzy matches In Computational Linguisticsand Intelligent Text Processing pages 454ndash465Springer

Lynne Bowker 2005 Productivity vs Quality Apilot study on the impact of translation memorysystems Localisation Reader pages 133ndash140

Mauro Cettolo Christophe Servan NicolaBertoldi Marcello Federico Loic Barraultand Holger Schwenk 2013 Issues in in-cremental adaptation of statistical MT fromhuman post-edits In Proceedings of the Work-shop on Post-editing Technology and Practice(WPTP-2) Nice France

JP Clark 2002 System method and prod-uct for dynamically aligning translations in atranslation-memory system February 5 USPatent 6345244

Marcello Federico Alessandro Cattelan andMarco Trombetti 2012 Measuring UserProductivity in Machine Translation EnhancedComputer Assisted Translation In Proceedingsof Conference of the Association for MachineTranslation in the Americas (AMTA)

Fabrizio Gotti Philippe Langlais ElliottMacklovitch Benoit Robichaud Didier Bouri-gault and Claude Coulombe 2005 3GTM AThird-Generation Translation Memory In 3rdComputational Linguistics in the North-East(CLiNE) Workshop Gatineau Queacutebec aug

Spence Green, Jeffrey Heer, and Christopher D. Manning. 2013. The efficacy of human post-editing for language translation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Spence Green. 2014. Mixed-initiative natural language translation. PhD thesis, Stanford University.

Ana Guerberof. 2009. Productivity and quality in the post-editing of outputs from translation memories and machine translation. Localisation Focus, 7(1):133–140.

Ana Guerberof. 2012. Productivity and Quality in the Post-Edition of Outputs from Translation Memories and Machine Translation. PhD thesis, Rovira i Virgili University, Tarragona.


Rohit Gupta and Constantin Orăsan. 2014. Incorporating Paraphrasing in Translation Memory Matching and Retrieval. In Proceedings of EAMT.

Rohit Gupta, Constantin Orăsan, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2015. Can Translation Memories afford not to use paraphrasing? In Proceedings of EAMT.

Yifan He, Yanjun Ma, Josef van Genabith, and Andy Way. 2010. Bridging SMT and TM with translation recommendation. In Proceedings of ACL, pages 622–630.

Panagiotis Kanavos and Dimitrios Kartsaklis. 2010. Integrating Machine Translation with Translation Memory: A Practical Approach. In Proceedings of the Second Joint EM+CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry.

Philipp Koehn and Jean Senellart. 2010. Convergence of translation memory and statistical machine translation. In Proceedings of the AMTA Workshop on MT Research and the Translation Industry, pages 21–31.

Elina Lagoudaki. 2008. The value of machine translation for the professional translator. In Proceedings of the 8th Conference of the Association for Machine Translation in the Americas, pages 262–269, Waikiki, Hawaii.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 104–111.

Sharon O'Brien. 2006. Eye-tracking and translation memory matches. Perspectives: Studies in Translatology, 14:185–204.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Mirko Plitt and François Masselot. 2010. A productivity test of statistical machine translation post-editing in a typical localisation context. The Prague Bulletin of Mathematical Linguistics, 93:7–16.

Michel Simard and Atsushi Fujita. 2012. A Poor Man's Translation Memory Using Machine Translation Evaluation Metrics. In Proceedings of the 10th Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.

Michel Simard and Pierre Isabelle. 2009. Phrase-based machine translation in a computer-assisted translation environment. In Proceedings of the Twelfth Machine Translation Summit (MT Summit XII), pages 120–127.

Matthew Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation, EACL 2009.

Lucia Specia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80, Leuven, Belgium.

Masao Utiyama, Graham Neubig, Takashi Onishi, and Eiichiro Sumita. 2011. Searching Translation Memories for Paraphrases, pages 325–331.

Marcos Zampieri and Mihaela Vela. 2014. Quantifying the Influence of MT Output in the Translators' Performance: A Case Study in Technical Translation. In Proceedings of the Workshop on Humans and Computer-assisted Translation (HaCaT 2014).


Author Index

Baisa, Vít 31
Barbu, Eduard 9

Chatzitheodorou, Konstantinos 24

Horák, Aleš 31

Medveď, Marek 31
Mitkov, Ruslan 17

Naskar, Sudip Kumar 36
Nayek, Tapas 36

Pal, Santanu 36
Parra Escartín, Carla 1

Raisa Timonera, Katerina 17

van Genabith, Josef 36
Vela, Mihaela 36

Zampieri, Marcos 36


  • Program
  • Creation of new TM segments: Fulfilling translators' wishes
  • Spotting false translation segments in translation memories
  • Improving Translation Memory Matching through Clause Splitting
  • Improving translation memory fuzzy matching by paraphrasing
  • Increasing Coverage of Translation Memories with Linguistically Motivated Segment Combination Methods
  • CATaLog: New Approaches to TM and Post Editing Interfaces