Depfix, a Tool for Automatic Rule-based Post-editing of SMT

This research was supported by the grants FP7-ICT-2011-7-288487 (MosesCore), FP7-ICT-2013-10-610516 (QTLeap), GAUK 1572314, and SVV 260 104. This work has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).

Rudolf Rosa

rosa@ufal.mff.cuni.cz

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

Input Treex Analysis

• tokenizer (Treex)
• lemmatizing tagger (MorphoDiTa)
• word aligner (GIZA++)
• dependency parser (MST)
• dependency relations labeller (Treex)
• named entity recognizer (Stanford)

Noun – Adjective Agreement

Rule: set adjective gender to noun gender

Node notation: word form / morphological tag / lemma

English source: half-hearted/JJ increase/NN

SMT output nodes: nárůstu/NNIS2/nárůst, polovičaté/AAFS2/polovičatý (F: gender = feminine)

Depfix output nodes: nárůstu/NNIS2/nárůst, polovičatého/AAIS2/polovičatý (I: gender = masculine inanimate)

Morphological generator: lemma polovičatý + tag AAIS2 → word form polovičatého
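The agreement rule above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Perl/Treex code of Depfix: the dictionary-based `GENERATOR`, the node dictionaries, and the function name `fix_adjective_gender` are hypothetical, and the gender is assumed to sit at the third character of tags written as on the poster (AAFS2, NNIS2).

```python
# Toy morphological generator: (lemma, tag) -> word form.
# Assumption: only the single entry shown on the poster is included.
GENERATOR = {("polovičatý", "AAIS2"): "polovičatého"}

GENDER = 2  # index of the gender character in tags such as AAFS2 / NNIS2


def fix_adjective_gender(adj, noun):
    """Return a copy of `adj` whose gender agrees with `noun`.

    Nodes are dicts with the keys 'form', 'tag' and 'lemma'.
    """
    if adj["tag"][GENDER] == noun["tag"][GENDER]:
        return dict(adj)  # already agrees, nothing to do
    # Correction on the tag: copy the noun's gender into the adjective tag.
    new_tag = adj["tag"][:GENDER] + noun["tag"][GENDER] + adj["tag"][GENDER + 1:]
    # Generation of the word form for the corrected tag.
    new_form = GENERATOR.get((adj["lemma"], new_tag))
    if new_form is None:
        return dict(adj)  # no form available: keep the SMT output
    return {"form": new_form, "tag": new_tag, "lemma": adj["lemma"]}


noun = {"form": "nárůstu", "tag": "NNIS2", "lemma": "nárůst"}
adj = {"form": "polovičaté", "tag": "AAFS2", "lemma": "polovičatý"}
fixed = fix_adjective_gender(adj, noun)
print(fixed["form"], fixed["tag"])
```

Falling back to the unchanged SMT word when the generator has no matching form is a design choice of this sketch: it keeps the correction from producing a tag that no word form supports.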

Preposition – Noun Agreement

Rule: set noun case to preposition case

English source: about/IN Mandela/NNP

SMT output preposition: o/PP--6/o

Depfix output preposition: o/PP--6/o (unchanged)
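A minimal sketch of this rule in Python, assuming the case is the fifth character of tags written as on the poster (PP--6, NNMS1). The helper names and the one-entry form table are hypothetical. The word forms come from the poster's Mandela example.

```python
CASE = 4  # index of the case digit in tags such as PP--6 / NNMS1

# Toy form table: (lemma, tag) -> word form; single entry from the poster.
FORMS = {("Mandela", "NNMS6"): "Mandelovi"}


def set_case(tag, case):
    """Return `tag` with its case digit replaced by `case`."""
    return tag[:CASE] + case + tag[CASE + 1:]


def agree_with_preposition(prep_tag, noun_lemma, noun_tag):
    """Copy the case required by the preposition into the noun tag.

    Returns (new_tag, new_form); new_form is None when the toy table
    has no entry, and the tag is returned unchanged when the
    preposition tag carries no case digit.
    """
    case = prep_tag[CASE]
    if not case.isdigit():
        return noun_tag, None
    new_tag = set_case(noun_tag, case)
    return new_tag, FORMS.get((noun_lemma, new_tag))


result = agree_with_preposition("PP--6", "Mandela", "NNMS1")
print(result)
```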

Translation of ‘of’

Rule: set noun case to genitive

English source: deficit/NN of/IN finances/NNS

SMT output nodes: schodek/NNIS1/schodek, finance/NNFP4/finance

Depfix output nodes: schodek/NNIS1/schodek, financí/NNFP2/finance

Translation of Possessive Nouns

Rule: generate corresponding possessive adjective

English source: David/NNP ’s/POS fish/NN

SMT output nodes (‘David is fish’): David/NNMS1/David, je/VBS3P/být, ryba/NNFS1/ryba

Depfix output nodes: Davidova/AUFS1M/Davidův, ryba/NNFS1/ryba

Subject Morphological Case

Rule: set subject case to nominative

English source: voters/NNS called/VBD

SMT output nodes (‘voters were called’): vyzvali/VpMPXR/vyzvat, voliče/NNMP4/volič

Depfix output nodes: vyzvali/VpMPXR/vyzvat, voliči/NNMP1/volič

Verb Tense Translation

Rule: set verb tense to equivalent of English tense

English source: generals/NNS are/VBP defending/VBG

SMT output (‘were defending’): generálové/NNMP1/generál

Depfix output nodes: brání/VBP3P/bránit, generálové/NNMP1/generál

Lost Negation Recovery

Rule: negate the verb

English source: does/VBZ not/RB deserve/VB

SMT output: ‘deserves’
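The negation rule can be sketched as follows. This is a hedged illustration, not Depfix's implementation: it assumes the affirmative/negated flag (A/N) is the seventh character of tags written as on the poster (VBS3P-A, VBS3P-N), that the children of the English verb are available as a list of word forms, and that a toy one-entry generator stands in for the real morphological generator. All function and variable names are hypothetical.

```python
NEGATION = 6  # index of the A/N flag in tags such as VBS3P-A

# Toy generator: (lemma, tag) -> word form; single entry from the poster.
GENERATOR = {("zasloužit", "VBS3P-N"): "nezaslouží"}


def recover_negation(cs_verb, en_verb_children):
    """Negate the Czech verb if the aligned English verb is negated.

    `cs_verb` is a dict with 'form', 'tag', 'lemma';
    `en_verb_children` lists the word forms depending on the English verb.
    """
    negated_in_source = any(w.lower() in ("not", "n't") for w in en_verb_children)
    tag = cs_verb["tag"]
    if not negated_in_source or len(tag) <= NEGATION or tag[NEGATION] != "A":
        return dict(cs_verb)  # nothing to recover
    new_tag = tag[:NEGATION] + "N" + tag[NEGATION + 1:]
    form = GENERATOR.get((cs_verb["lemma"], new_tag))
    if form is None:
        return dict(cs_verb)  # no form available: keep the SMT output
    return {"form": form, "tag": new_tag, "lemma": cs_verb["lemma"]}


smt_verb = {"form": "zaslouží", "tag": "VBS3P-A", "lemma": "zasloužit"}
fixed_verb = recover_negation(smt_verb, ["does", "not"])
print(fixed_verb["form"], fixed_verb["tag"])
```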

Translation of ‘by’

Rule: set noun case to instrumental

English source: foiled/JJ by/IN voluntarism/NN

SMT output verb: zmařeno/VsNSXX/zmařit

Depfix output verb: zmařeno/VsNSXX/zmařit (unchanged)

Subject Person Projection

Rule: set verb person to English subject person

English source: we/PRP are/VBP

SMT output (‘they are’): jsou/VBP3P/být

Depfix output: jsme/VBP1P/být
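A short Python sketch of the person projection, under stated assumptions: the person digit is taken to be the fourth character of tags written as on the poster (VBP3P, VBP1P), the English subject is a personal pronoun looked up in a small hand-written table, and the generator is a toy dictionary with the single form from the example. None of these names come from Depfix itself.

```python
PERSON = 3  # index of the person digit in tags such as VBP3P

# Assumed mapping from English subject pronouns to grammatical person.
EN_PERSON = {"i": "1", "we": "1", "you": "2",
             "he": "3", "she": "3", "it": "3", "they": "3"}

# Toy generator: (lemma, tag) -> word form; single entry from the poster.
GENERATOR = {("být", "VBP1P"): "jsme"}


def project_person(cs_verb, en_subject):
    """Set the person of the Czech verb to that of the English subject."""
    person = EN_PERSON.get(en_subject.lower())
    tag = cs_verb["tag"]
    if person is None or tag[PERSON] == person:
        return dict(cs_verb)  # unknown subject or already consistent
    new_tag = tag[:PERSON] + person + tag[PERSON + 1:]
    form = GENERATOR.get((cs_verb["lemma"], new_tag))
    if form is None:
        return dict(cs_verb)  # no form available: keep the SMT output
    return {"form": form, "tag": new_tag, "lemma": cs_verb["lemma"]}


smt_verb = {"form": "jsou", "tag": "VBP3P", "lemma": "být"}
fixed_verb = project_person(smt_verb, "we")
print(fixed_verb["form"], fixed_verb["tag"])
```

The sketch changes only the person, as the rule states. Grammatical number is left as produced by the SMT system.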

Depfix Correction Rules (current version: 28 rules)

Input: the source is available to all components to minimize error propagation.

CU Chimera English → Czech MT System

http://ufal.mff.cuni.cz/depfix

Get Depfix Today!

• implemented in Perl in the Treex NLP framework
• needs Linux + 4 GB RAM
• 3 seconds per sentence
• released under GNU/GPL
• source is commented

Depfix Evaluation (Δ BLEU)

System             WMT 2011   WMT 2012
CU Bojar           +0.47      +0.07
CU TectoMT         −0.10      −0.02
UEDIN              +0.64      +0.23
JHU                +0.42      +0.32
Eurotran           +0.21      +0.15
Microsoft Bing     ?          +0.37
Google Translate   +0.23      0.00

Evaluation of CU Chimera

WMT 2013           Manual   BLEU   cBLEU
CU Chimera         0.578    20.0   19.2
UEDIN              0.525    19.1   18.6
CU Bojar           0.580    20.1   18.9
Google Translate   0.562    ?      ?
CU TectoMT         0.476    14.7   14.2

WMT 2014           Manual   BLEU   cBLEU
CU Chimera         0.371    22.0   21.1
UEDIN              0.356    22.1   21.6
CU Bojar           0.333    22.1   20.9
Google Translate   0.169    ?      20.2
CU TectoMT         −0.175   15.8   15.4

Moses (Bojar)
• phrase-based SMT
• large-scale data
• morphological tags as factors for better grammatical coherence

Depfix (Rosa)
• automatic post-editing
• generates the final output

TectoMT (Popel)
• hybrid (rule-based/statistical) MT system
• transfer at the tectogrammatical (deep syntactic) layer
• our combination: get an extra phrase table for Moses from TectoMT output

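Tag values, word forms and glosses belonging to the rule examples (word form / morphological tag / lemma):

• Translation of ‘by’: voluntarismu/NNIS2/voluntarismus (2: genitive) → voluntarismem/NNIS7/voluntarismus (7: instrumental)
• Subject Morphological Case: 4: accusative → 1: nominative
• Preposition – Noun Agreement: Mandela/NNMS1/Mandela (1: nominative) → Mandelovi/NNMS6/Mandela (6: locative)
• Subject Person Projection: corrected output ‘we are’
• Lost Negation Recovery: zaslouží/VBS3P-A/zasloužit (A: affirmative) → nezaslouží/VBS3P-N/zasloužit (N: negated); corrected output ‘does not deserve’
• Verb Tense Translation: SMT output bránili/VpMPXR/bránit (R: past), corrected to P: present; corrected output ‘are defending’
• Translation of ‘of’: 4: accusative → 2: genitive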

Depfix improves most MT systems!

CU Chimera is the state-of-the-art for English → Czech MT!

conversion to tectogrammar (Treex)

source English sentence

Czech sentence from SMT

English → Czech Machine Translation (Moses/Google Translate/...)

a-tree, zone=en

word form    dependency label   tag
I            Sb                 PRP
do           AuxV               VBP
n't          Neg                RB
mean         Pred               VB
to           AuxV               TO
imply        Adv                VB
that         Atr                DT
garbage      Sb                 NN
is           AuxV               VBZ
n't          Neg                RB
sorted       Adv                VBN
in           AuxP               IN
Brno         Adv                NNP
.            AuxK               .

a-tree, zone=cs

word form    dependency label   tag
Nechci       Pred               VB-S1PA
naznačovat   Obj                Vf
,            AuxX               Z:
že           AuxC               J,
odpadky      Sb                 NNIP1
se           AuxR               P7-X4
třídí        Adv                VB-P3PA
v            AuxP               RR6
Brně         Adv                NNNS6
.            AuxK               Z:

Analytical (surface syntax) dependency trees

n-tree, zone=en: Brno (g_)

n-tree, zone=cs: Brno (g_)

Named entities

t-tree, zone=en

node          functor   formeme     surface form
#PersPron     ACT       n:subj      I
mean.enunc    PRED      v:fin       do n't mean
#Cor          ACT       n:elided
imply         PAT       v:to+inf    to imply
that          RSTR      n:attr      that
garbage       ACT       n:subj      garbage
sort          PAT       v:fin       is n't sorted
Brno          LOC       n:in+X      in Brno

t-tree, zone=cs

node               functor   formeme     surface form
#PersPron          ACT       drop
naznačovat.enunc   PRED      v:fin       Nechci naznačovat
odpadky            ACT       n:1         odpadky
třídit             PAT       v:že+fin    , že se třídí
Brno               LOC       n:v+6       v Brně

Tectogrammatical (deep syntax) trees

1. Analyzed input (English source, Czech MT output)

2. Correction on the tag

3. Generation of the corresponding word form

4. Correct output
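The four steps can be put together in a small Python sketch. It is an illustration under several assumptions, not the Treex implementation: the analyzed input is given directly as a list of node dictionaries, the subject is assumed to be already marked with the dependency label `Sb`, the only rule is the poster's "set subject case to nominative", and the generator is a toy dictionary. The names `RULES`, `depfix_sketch` and `subject_to_nominative` are hypothetical.

```python
CASE = 4  # index of the case digit in tags such as NNMP4

# Toy generator: (lemma, tag) -> word form; single entry from the poster.
GENERATOR = {("volič", "NNMP1"): "voliči"}


def subject_to_nominative(node):
    """Step 2: correct the tag of a noun subject that is not nominative."""
    tag = node["tag"]
    if node["afun"] == "Sb" and tag[0] == "N" and tag[CASE] != "1":
        return tag[:CASE] + "1" + tag[CASE + 1:]
    return tag


RULES = [subject_to_nominative]  # the real tool has 28 rules


def depfix_sketch(nodes):
    """Run steps 2-4 on analyzed nodes (step 1 is assumed done)."""
    output = []
    for node in nodes:
        corrected = dict(node)
        for rule in RULES:
            corrected["tag"] = rule(corrected)
        form = node["form"]
        if corrected["tag"] != node["tag"]:
            # Step 3: generate the word form for the corrected tag;
            # keep the original word if the generator has no entry.
            form = GENERATOR.get((node["lemma"], corrected["tag"]), node["form"])
        output.append(form)  # Step 4: collect the output words
    return output


# Node order here is arbitrary; it is not the word order of the sentence.
analyzed = [
    {"form": "vyzvali", "tag": "VpMPXR", "lemma": "vyzvat", "afun": "Pred"},
    {"form": "voliče", "tag": "NNMP4", "lemma": "volič", "afun": "Sb"},
]
corrected_forms = depfix_sketch(analyzed)
print(corrected_forms)
```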
