Machine Translation of Discontinuous Multiword Units

DISCO, NAACL, San Diego, USA, June 17th, 2016

Anabela Barreiro, INESC-ID
Fernando Batista, INESC-ID, ISCTE-IUL


Outline

• Introduction
  – Discontinuous Multiword Units (DMWU) in NLP
  – Main Current Shortcomings
  – Our Goal
• CLUE-Aligner Alignment Tool
• The Logos Model
  – Alignment of DMWU Inspired by Logos
    • bring [] to a conclusion
    • set [] in motion
    • play [] role
    • take [] interest in
    • keep [] informed about
• Preliminary Results
  – Analysis of Preliminary Results
• Advantages of the Logos Model
• Conclusions and Future Directions
• Final Remark
  – The eSPERTo Project

Introduction

• Increasing interest in multiword units (MWU) in the field of NLP
  "lexical items that: (a) can be decomposed into multiple lexemes; and (b) display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity" (Baldwin and Kim 2010)
• Compositionality property: makes automatic processing of MWU particularly challenging
  – Free combinations
    • round table = meeting
  – Opaque meanings
    • piece of cake = easy to do
    • pay a visit = visit
  – Cannot be translated word-for-word
    • raining cats and dogs
  – Allow insertions (= words that are not part of the unit)
    • to bring [INSERTION] to a conclusion
      I would urge the European Commission to bring the process of adopting the directive on additional pensions to a conclusion

Discontinuous Multiword Units in NLP

• Non-adjacent linguistic phenomena (remote dependency)
  • Common across languages
  • Difficult to recognize and process
  • Remain a problem for NLP applications
• Lack of formalization still triggers problems with the syntactic and semantic analysis of sentences containing MWU
  • Impairment of NLP systems' performance
  • Cause MT to fail in assigning the correct translation
• For SMT systems, DMWU constitute significant challenges to correct word and phrase alignment (Shen et al. 2009), and therefore to high quality MT

Main Current Shortcomings

• Linguistic knowledge is still limited in most systems
  – Some SMT methodologies rely mostly on statistics to train/evaluate MT systems, use probabilistic alignments with little or no linguistic knowledge, and disregard syntactic discontinuity.
  – Inability to identify MWU correctly results in translation deficiencies.
• Lack of publicly available manual multilingual datasets, and of linguistically motivated alignment guidelines
  – Publicly available alignments are mostly bilingual, with some exceptions (Graça et al. 2008)
  – Guidelines cover cross-linguistic phenomena superficially, excluding important alignment challenges presented by DMWU.
• Lack of more robust alignment tools
  – Limitations in assisting human annotators in the task of correctly identifying and aligning DMWU and producing rules from them.

Our Goal

• Present an experimental empirical analysis of DMWU
• Stress the relevance of correct (and non-arbitrary) alignment of DMWU
• Highlight an alignment methodology inspired by the Logos Model (Scott, 2003; Barreiro et al., 2011) and the Semtab function to deploy semantico-syntactic knowledge that allows translating DMWU with high fidelity
• Illustrate DMWU manual alignments produced with CLUE-Aligner* (Cross-Language Unit Elicitation), an interactive Web alignment tool (Barreiro, Raposo, Luís 2016)

*Even though similar in name to the "clue alignment approach" (Tiedemann, 2003; 2004; 2011), mainly devoted to word-level alignment, our approach is theoretically and methodologically different, with a focus on phrase alignment, contemplating multiwords and linguistically-relevant phrasal units.

CLUE-Aligner

• Allows the block-alignment of contiguous MWU and DMWU
• Uses a matrix visualization and coloring schemes that help distinguish between sure (S) and possible (P) alignments
• Allows storage of pairs of paraphrastic units, with indication of the place of insertions, represented by "[]"
  – I urge [] to | Exorto [] a
  – This feature is valuable in the construction of translation rules or grammars and syntactic parsers that use those paraphrastic pairs, for which precision is important
  – It is also important in ML to help learn constituents
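The stored pair "I urge [] to | Exorto [] a" can be read as a small data structure with an insertion slot. The following is a minimal sketch of that idea, not CLUE-Aligner's actual storage format, which the slides do not show. The names `ParaphrasticPair`, `match` and `fill` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

SLOT = "[]"  # insertion marker, as written on the slide


@dataclass(frozen=True)
class ParaphrasticPair:
    """A source/target unit pair; "[]" marks where an insertion may occur."""
    source: str
    target: str

    def match(self, sentence: str) -> Optional[Tuple[str, ...]]:
        """Return the inserted word sequences if the source unit occurs, else None."""
        # Each "[]" becomes a non-greedy capture group; other tokens are literals.
        parts = [r"(.+?)" if tok == SLOT else re.escape(tok)
                 for tok in self.source.split()]
        pattern = re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)
        found = pattern.search(sentence)
        return found.groups() if found else None

    def fill(self, insertions: Sequence[str]) -> str:
        """Place (already translated) insertions into the target unit's slots."""
        pieces = self.target.split(SLOT)
        if len(pieces) - 1 != len(insertions):
            raise ValueError("number of insertions must equal number of slots")
        out = pieces[0]
        for insertion, piece in zip(insertions, pieces[1:]):
            out += insertion + piece
        return out


pair = ParaphrasticPair("I urge [] to", "Exorto [] a")
print(pair.match("I urge the Commission to act"))
print(pair.fill(["a Comissão"]))
```

Keeping the slot explicit is what lets a rule or grammar built from the pair treat the unit as atomic while the insertion is handled separately.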

CLUE-Aligner Interface

[Figure: CLUE-Aligner alignment matrix showing single word alignments, block alignments, DMWs and insertions, with two insertions marked, pre-processing of contracted forms, and the example pair still | ainda]

• Black cells represent full/optimal semantic correspondence
• Grey cells represent approximate semantic correspondence
• Light orange cell groups represent unaligned P-insertions
• Dark orange cell groups represent unaligned S-insertions
• Light green cell/cell groups represent aligned P-insertions
• Dark green cell/cell groups represent aligned S-insertions
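A matrix of this kind can be approximated in code as a grid of cells, each labelled sure (S) or possible (P). The sketch below is an assumption about how such a grid might be represented, not the tool's implementation. The class names and the toy sentence pair are hypothetical, and the colour legend is reduced to the S/P distinction.

```python
from enum import Enum


class Confidence(Enum):
    SURE = "S"
    POSSIBLE = "P"


class AlignmentMatrix:
    """Alignment grid: rows are source tokens, columns are target tokens."""

    def __init__(self, source_tokens, target_tokens):
        self.source = list(source_tokens)
        self.target = list(target_tokens)
        self.cells = {}  # (source index, target index) -> Confidence

    def align_block(self, source_idx, target_idx, confidence):
        """Align every listed source token with every listed target token.

        The indices need not be adjacent, so a discontinuous unit
        can be aligned as a single block."""
        for i in source_idx:
            for j in target_idx:
                self.cells[(i, j)] = confidence

    def render(self):
        """Plain-text view: 'S' sure, 'P' possible, '.' unaligned."""
        lines = []
        for i, token in enumerate(self.source):
            marks = "".join(
                self.cells[(i, j)].value if (i, j) in self.cells else "."
                for j in range(len(self.target))
            )
            lines.append(f"{token:<10}{marks}")
        return "\n".join(lines)


# Toy example (not taken from the slide screenshots): the discontinuous unit
# "set [] in motion" aligned as one block with the single verb "empreender".
matrix = AlignmentMatrix(["set", "reforms", "in", "motion"],
                         ["empreender", "reformas"])
matrix.align_block([0, 2, 3], [0], Confidence.SURE)
matrix.align_block([1], [1], Confidence.SURE)
print(matrix.render())
```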

The Logos Model

• Integrates semantic and contextual knowledge and applies it to the translation process
• Precision is associated with the application of Semtab semantic and contextual data-driven pattern-rules, which are deep structure patterns that match on (apply to) a great variety of surface structures, including DMWU
  – deal (VI) with N (questions) = s'occuper de N
• Alignments that mirror Semtab semantic nuances can help create new MT systems and improve existing ones
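The rule "deal (VI) with N (questions) = s'occuper de N" suggests a pattern that checks a verb, a preposition and the semantic class of the following noun, while tolerating words in between. The sketch below is a toy approximation of that idea only. Semtab's real rule formalism is not shown in the slides, and the lexicon, the `PatternRule` class and the `max_gap` parameter are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical semantic lexicon (noun -> semantic class), for illustration only.
SEMANTIC_CLASS = {
    "question": "questions",
    "issue": "questions",
    "matter": "questions",
    "cards": "objects",
}


@dataclass(frozen=True)
class PatternRule:
    verb_forms: frozenset   # surface forms of the source verb
    preposition: str        # preposition selected by the verb
    noun_class: str         # semantic class required of the noun N
    target: str             # target pattern, with N as a placeholder
    max_gap: int = 3        # words tolerated between verb and preposition

    def apply(self, tokens):
        """Return (target pattern, matched noun) if the rule fires, else None."""
        for i, token in enumerate(tokens):
            if token.lower() not in self.verb_forms:
                continue
            # Allow up to max_gap inserted words before the preposition.
            for j in range(i + 1, min(i + 2 + self.max_gap, len(tokens))):
                if tokens[j].lower() != self.preposition:
                    continue
                # Accept the first following noun of the required class.
                for noun in tokens[j + 1:]:
                    if SEMANTIC_CLASS.get(noun.lower()) == self.noun_class:
                        return self.target, noun
        return None


DEAL_WITH = PatternRule(
    verb_forms=frozenset({"deal", "deals", "dealt", "dealing"}),
    preposition="with",
    noun_class="questions",
    target="s'occuper de N",
)

print(DEAL_WITH.apply("We must deal quickly with this question".split()))
print(DEAL_WITH.apply("They deal with cards".split()))
```

The second call returns `None` because the noun is outside the required semantic class. This is the sense in which one pattern can cover many surface structures while still constraining which translation applies.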

Alignment of DMWU Inspired by Logos

• Europarl corpus (Koehn 2005): contains a large number of occurrences of DMWU (subset with 47.4 million words)
• 5 cases of support verb constructions (SVC) illustrate "bad" translation errors
  – Search was performed on all forms of each verb
  – Learning automatic models to deal with DMWU may not be straightforward

bring [] to a conclusion

Contains a 9-word insertion

Alignment of the EN discontinuous SVC with the PT equivalent stylistic variant, the compound verb "apressar-se a apresentar"

set [] in motion

Alignment of the EN discontinuous SVC with the equivalent PT single verb "empreender" ("undertake")

Insertion of the direct object

play [] role

Alignment of the EN SVC with the equivalent PT non-elementary SVC "desempenham um papel"

Insertion of an adjective modifier

take [] interest in

Insertion of an adjective

Alignment of an EN discontinuous prepositional verb with its equivalent FR reflexive prepositional verb

FR translation also contains insertions

keep [] informed about

Insertion of an adverb

Alignment of an EN discontinuous SVC with a prepositional adjective with its ES equivalent prepositional SVC

EN translation also contains an adverbial insertion

Preliminary Results

• First 20 sentences from the subset corpus representing each of the 5 DMWU cases
• Translated each sentence with Google Translate to verify translation quality
• Performed an empirical evaluation of the achieved translations

[Figure: chart of the number of correct vs. incorrect, inadequate or non-optimal (literal, unnatural) translations for each DMWU: bring [ ] to a conclusion, set [ ] in motion, play [ ] role, take [ ] interest in, keep [ ] informed about; axis from 0 to 25]

Analysis of Preliminary Results

DMWU (support verb construction) | Google Translate | Correct translation
to bring [this dossier] to a conclusion | trazer a uma conclusão | concluir / terminar [este dossier]
set […] in motion | estabeleceu […] em movimento | iniciou / pôs em marcha […]
play [the] role | jogar [o] papel | desempenhar [o] papel
take [a lukewarm] interest in | *ter um interesse [*morna] em | manifeste / demonstre um interesse [morno/fraco/ténue]
keep [us] informed about | *tem [nos] *manteve informados sobre | nos tem mantido informados sobre / nos tem informado sobre

EN – It is unacceptable for the Commission only to take a lukewarm interest in a country.
PT-GT – É inaceitável que a Comissão só a *ter um interesse morna em um país.

Lexical errors related to DMWU + Structural errors
• Lack of agreement (para nos manter regular e estreitamente *informado sobre; que o Parlamento *ser bem *informados sobre)
• Incorrect word order (se conseguirmos *a adoptar e defini-lo em movimento)
• Etc.

Advantages of the Logos Model

• Consistent and efficient solution to process DMWU, which are not consistently processed in former word or phrase alignment techniques
• Ability to relate constituents that are apart (even very far apart) in the sentence
• Consistent way to analyze and translate words in context
• Ability to generalize between alternative forms of the same MWU, phrase or expression (take a walk = walk)
• Semtab has a robust solution for the problem of open class items or less frequent MWU and phrases that cannot be learnt quickly and translated correctly by an SMT system, but annoyingly can be observed in MT translations (also used in non-native speakerisms)
  – make a visit or pay a visit?
• MWU are not processed on a word-for-word basis; they represent atomic semantico-syntactic and translation units

Conclusions and Future Directions

• Standard MT systems can benefit from a correct processing of DMWU
  • currently not being explored efficiently
  • processing, recognition and translation of DMWU is challenging
• Some methodologies are inefficient
  • they violate the intrinsic property of the unit as an atomic group of elements
  • elements of the unit cannot be separated or aligned individually
  • unit boundaries need to be respected
• Post-editing efforts can be minimized by improving alignment quality
• Even though we analyzed just a few cases of SVC, our findings point to a general lack of quality in the translation of DMWU (and discontinuous phrasal expressions)

Conclusions and Future Directions

• Validation
  • Broader quantification of phenomena needed to validate exploratory results
• Evaluation
  • Evaluation of the performance in hierarchical phrase and syntax-based MT and neural network translation models (with theoretical capacity to learn DMWU)
• Annotation
  • Manual multilingual alignments (gold sets)
• Alignment Guidelines
  • Improved and enlarged sets of linguistically-based/motivated alignment guidelines (gold standards)
• Cross-Linguistic Analysis
  • Deep analysis of challenging cross-linguistic phenomena, including DMWU
• Rule/Grammar Construction
  • Translation rules extracted from quality manually-annotated corpora
• Tool Enhancement and Automation
  • Feed CLUE-Aligner with manual training data and enhance the tool for automatic alignment and extraction of large amounts of translation pairs for MT case studies
• Translation Applications
  • Increase precision and recall in MT systems
• Paraphrases
  • Methodology and resources: a valuable asset for applications requiring paraphrases

Final Remark

• Extreme importance of paraphrases for translation (human and MT)
• Paraphrastic knowledge allows choosing the best/optimal translations from a set of possible translations

EN – It is time to bring this issue to a conclusion.
EN – We must bring this episode to a conclusion.

PT – Está na hora de resolver esta questão.
PT – Chegou a hora de concluir este assunto.
PT – Punhamos um ponto final neste tema.
PT – Temos de concluir este episódio.

Suggest your own paraphrase!

The eSPERTo Project

Paraphrases 4 Translation (Human + MT)

the man who is American
the man from America
the man with American nationality
…

The American man

https://esperto.l2f.inesc-id.pt/esperto/esperto/demo.pl

Thank you!

Acknowledgements

This research work was supported by Fundação para a Ciência e a Tecnologia (FCT), under project eSPERTo EXPL/MHC-LIN/2260/2013, UID/CEC/50021/2013, and post-doctoral grant SFRH/BPD/91446/2012.