Machine Translation of Discontinuous Multiword Units
DISCO, NAACL - San Diego, USA, June 17th 2016
Anabela Barreiro, INESC-ID
Fernando Batista, INESC-ID, ISCTE-IUL
Outline
• Introduction
– Discontinuous Multiword Units (DMWU) in NLP
– Main Current Shortcomings
– Our Goal
• CLUE-Aligner Alignment Tool
• The Logos Model
– Alignment of DMWU Inspired by Logos
  • bring [] to a conclusion
  • set [] in motion
  • play [] role
  • take [] interest in
  • keep [] informed about
• Preliminary Results
– Analysis of Preliminary Results
• Advantages of the Logos Model
• Conclusions and Future Directions
• Final Remark
– The eSPERTo Project
Introduction
• Increasing interest in multiword units (MWU) in the field of NLP
“lexical items that: (a) can be decomposed into multiple lexemes; and (b) display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity” (Baldwin and Kim 2010)
• Compositionality property: makes automatic processing of MWU particularly challenging
– Free combinations
  • round table = meeting
– Opaque meanings
  • piece of cake = easy to do
  • pay a visit = visit
– Cannot be translated word-for-word
  • raining cats and dogs
– Allow insertions (= words that are not part of the unit)
  • to bring [INSERTION] to a conclusion
  I would urge the European Commission to bring the process of adopting the directive on additional pensions to a conclusion
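The idea of a unit that allows insertions can be sketched in a few lines of Python. This is only an illustration, not part of the presented work: the function name `find_dmwu`, the `"[]"` slot syntax borrowed from the slides, and the `max_gap` limit are assumptions, and the sketch matches the base form of the verb only.

```python
import re

def find_dmwu(sentence, pattern, max_gap=12):
    """Return the inserted words if `pattern` (one "[]" slot) matches, else None.

    Hypothetical helper: `pattern` looks like "bring [] to a conclusion".
    """
    head, tail = [part.split() for part in pattern.split("[]")]
    tokens = re.findall(r"\w+", sentence.lower())
    for i in range(len(tokens) - len(head) + 1):
        if tokens[i:i + len(head)] != head:
            continue
        start = i + len(head)
        # Try every insertion length from 0 up to max_gap words.
        for gap in range(max_gap + 1):
            j = start + gap
            if tokens[j:j + len(tail)] == tail:
                return tokens[start:j]
    return None

sentence = ("I would urge the European Commission to bring the process of "
            "adopting the directive on additional pensions to a conclusion")
insertion = find_dmwu(sentence, "bring [] to a conclusion")
print(len(insertion), insertion)
```

On the example sentence above, the words between "bring" and "to a conclusion" are returned as the insertion; a sentence without the unit yields `None`.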
Discontinuous Multiword Units in NLP
• Non-adjacent linguistic phenomena: remote dependency
• Common across languages
• Difficult to recognize and process
• Remain a problem for NLP applications
• Lack of formalization still triggers problems with the syntactic and semantic analysis of sentences containing MWU
• Impairment of NLP systems' performance
• Cause MT to fail in assigning the correct translation
• For SMT systems, DMWU constitute significant challenges to correct word and phrase alignment (Shen et al. 2009), and therefore, to high quality MT
Main Current Shortcomings
• Linguistic knowledge is still limited in most systems
– Some SMT methodologies rely mostly on statistics to train/evaluate MT systems, use probabilistic alignments with little or no linguistic knowledge, and disregard syntactic discontinuity.
– Inability to identify MWU correctly results in translation deficiencies.
• Lack of publicly available manual multilingual datasets, and of linguistically motivated alignment guidelines
– Publicly available alignments are mostly bilingual, with some exceptions (Graça et al. 2008)
– Guidelines cover cross-linguistic phenomena superficially, excluding important alignment challenges presented by DMWU.
• Lack of more robust alignment tools
– Limitations in assisting human annotators in the task of correctly identifying and aligning DMWU and producing rules from them.
Our goal
• Present an experimental empirical analysis of DMWU
• Stress the relevance of correct (and non-arbitrary) alignment of DMWU
• Highlight an alignment methodology inspired by the Logos Model (Scott, 2003; Barreiro et al., 2011) and the Semtab function to deploy semantico-syntactic knowledge that allows DMWU to be translated with high fidelity
• Illustrate DMWU manual alignments produced with CLUE-Aligner (Cross-Language Unit Elicitation), an interactive Web alignment tool (Barreiro, Raposo, Luís 2016)
*Even though similar in name to the "clue alignment approach" (Tiedemann, 2003; 2004; 2011), mainly devoted to word-level alignment, our approach is theoretically and methodologically different, with a focus on phrase alignment, contemplating multiwords and linguistically-relevant phrasal units.
CLUE-Aligner
• Allows the block-alignment of contiguous and discontinuous MWU
• Uses a matrix visualization and coloring schemes that help distinguish between sure (S) and possible (P) alignments
• Allows storage of pairs of paraphrastic units, with indication of the place of insertions, represented by "[]"
– I urge [] to | Exorto [] a
– This feature is valuable in the construction of translation rules or grammars and syntactic parsers that use those paraphrastic pairs, for which precision is important
– It is also important in ML to help learn constituents
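A stored paraphrastic pair with "[]" insertion slots can be pictured as a small data structure. The sketch below is a minimal illustration and not CLUE-Aligner's actual storage format: the class `ParaphrasticPair`, its methods, and the filler phrases in the usage lines are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParaphrasticPair:
    """Hypothetical container for one 'source | target' pair with "[]" slots."""
    source: str  # e.g. "I urge [] to"
    target: str  # e.g. "Exorto [] a"

    @classmethod
    def parse(cls, line):
        source, target = (side.strip() for side in line.split("|"))
        if source.count("[]") != target.count("[]"):
            raise ValueError("both sides must mark the same number of insertions")
        return cls(source, target)

    def fill(self, side, insertion):
        """Put `insertion` into the slot of the chosen side ("source" or "target")."""
        template = self.source if side == "source" else self.target
        # Re-join on single spaces so an empty insertion leaves no double space.
        return " ".join(template.replace("[]", insertion).split())

pair = ParaphrasticPair.parse("I urge [] to | Exorto [] a")
print(pair.fill("source", "the Commission"))  # illustrative filler, not from the slides
print(pair.fill("target", "a Comissão"))      # illustrative filler, not from the slides
```

Keeping the slot position explicit on both sides is what lets a rule or grammar built from the pair reinsert the intervening material in the right place.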
CLUE-Aligner Interface
[Figure: CLUE-Aligner alignment matrix with two insertions marked; annotations: pre-processing of contracted forms; still / ainda]
• Black cells represent full/optimal semantic correspondence
• Grey cells represent approximate semantic correspondence
• Light orange cell groups represent unaligned P-insertions
• Dark orange cell groups represent unaligned S-insertions
Single Word Alignments and Block Alignments, DMWs and Insertions
• Light green cell/cell groups represent aligned P-insertions
• Dark green cell/cell groups represent aligned S-insertions
The Logos Model
• Integrates semantic and contextual knowledge and applies it to the translation process
• Precision is associated with the application of Semtab semantic and contextual data-driven pattern-rules, which are deep structure patterns that match on (apply to) a great variety of surface structures, including DMWU
– deal(VI) with N(questions) = s'occuper de N
• Alignments that mirror Semtab semantic nuances can help create new MT systems and improve existing ones
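The rule "deal(VI) with N(questions) = s'occuper de N" can be pictured as a pattern over analysed input, not over surface strings. The sketch below is a toy representation only and does not reproduce Semtab's real rule format or matching engine. Reading "N(questions)" as a semantic restriction on the noun is an assumption, and `PatternRule` and `apply_rule` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatternRule:
    """Toy stand-in for a semantico-syntactic pattern-rule (not Semtab's format)."""
    verb_lemma: str
    preposition: str
    noun_class: str       # assumed semantic restriction on the noun
    target_template: str  # "N" marks where the translated noun goes

RULE = PatternRule("deal", "with", "questions", "s'occuper de N")

def apply_rule(rule, verb_lemma, preposition, noun_class, noun_translation):
    """Return the target phrase if the analysed input satisfies the rule, else None."""
    analysed = (verb_lemma, preposition, noun_class)
    if analysed != (rule.verb_lemma, rule.preposition, rule.noun_class):
        return None
    return rule.target_template.replace("N", noun_translation)

# "deals with", "dealt with" and "dealing with" all reduce to the lemma "deal",
# so one rule covers many surface structures.
print(apply_rule(RULE, "deal", "with", "questions", "questions"))
```

Because the match is on lemma, preposition and noun class, intervening words between the verb and the preposition do not need to be listed in the rule.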
Alignment of DMWU Inspired by Logos
• Europarl corpus (Koehn 2005): contains a large number of occurrences of DMWU (subset with 47.4 million words)
• 5 cases of support verb constructions (SVC) illustrate "bad" translation errors
– Search was performed on all forms of each verb
– Learning automatic models to deal with DMWU may not be straightforward
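A search over all forms of a verb, with room for an insertion, could look like the sketch below. It is not the authors' search procedure: the hand-written `VERB_FORMS` table, the function names, and the `max_gap` limit are assumptions, and the second example sentence is invented for illustration.

```python
import re

# Hypothetical, hand-listed inflected forms; a real search would use a lexicon.
VERB_FORMS = {"bring": ["bring", "brings", "bringing", "brought"]}

def build_pattern(lemma, tail, max_gap=12):
    """Compile a regex: any form of `lemma`, up to `max_gap` inserted words, then `tail`."""
    forms = "|".join(map(re.escape, VERB_FORMS[lemma]))
    tail_re = r"\s+".join(map(re.escape, tail.split()))
    return re.compile(
        r"\b(?P<verb>%s)(?P<insertion>(?:\s+\w+){0,%d}?)\s+%s\b" % (forms, max_gap, tail_re),
        re.IGNORECASE,
    )

def search(sentences, lemma, tail):
    """Return (verb form, inserted words) for every sentence containing the unit."""
    pattern = build_pattern(lemma, tail)
    hits = []
    for sentence in sentences:
        match = pattern.search(sentence)
        if match:
            hits.append((match.group("verb").lower(), match.group("insertion").split()))
    return hits

corpus = [
    "We must bring this episode to a conclusion.",
    "The Council brought the long debate to a conclusion.",  # invented example
    "Nothing relevant here.",
]
for verb, insertion in search(corpus, "bring", "to a conclusion"):
    print(verb, insertion)
```

The insertion group only accepts plain word tokens, so insertions containing commas or hyphens would be missed by this simple pattern.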
bring [] to a conclusion
• Contains a 9-word insertion
• Alignment of the EN discontinuous SVC with the PT equivalent stylistic variant, the compound verb “apressar-se a apresentar”
set [] in motion
• Alignment of the EN discontinuous SVC with the equivalent PT single verb “empreender” (“undertake”)
• Insertion of the direct object
play [] role
• Alignment of the EN SVC with the equivalent PT non-elementary SVC “desempenham um papel”
• Insertion of an adjective modifier
take [] interest in
• Insertion of an adjective
• Alignment of an EN discontinuous prepositional verb with its equivalent FR reflexive prepositional verb
• FR translation also contains insertions
keep [] informed about
• Insertion of an adverb
• Alignment of an EN discontinuous SVC with a prepositional adjective with its ES equivalent prepositional SVC
• EN translation also contains an adverbial insertion
Preliminary Results
• First 20 sentences from the subset corpus representing each of the 5 DMWU cases
• Translated each sentence with Google Translate to verify translation quality
• Performed an empirical evaluation of the achieved translations
[Figure: bar chart of the number of translations (axis 0 to 25) for bring [ ] to a conclusion, set [ ] in motion, play [ ] role, take [ ] interest in, keep [ ] informed about; series: correct vs. incorrect, inadequate or non-optimal (literal, unnatural)]
Analysis of Preliminary Results
DMWU (support verb construction) | Google Translate | Correct translation
to bring [this dossier] to a conclusion | trazer a uma conclusão | concluir / terminar [este dossier]
set […] in motion | estabeleceu […] em movimento | iniciou / pôs em marcha […]
play [the] role | jogar [o] papel | desempenhar [o] papel
take [a lukewarm] interest in | *ter um interesse [*morna] em | manifeste / demonstre um interesse [morno/fraco/ténue]
keep [us] informed about | *tem [nos] *manteve informados sobre | nos tem mantido informados sobre / nos tem informado sobre
EN – It is unacceptable for the Commission only to take a lukewarm interest in a country.
PT-GT – É inaceitável que a Comissão só a *ter um interesse morna em um país.
Lexical errors related to DMWU + Structural errors
• Lack of agreement (para nos manter regular e estreitamente *informado sobre; que o Parlamento *ser bem *informados sobre)
• Incorrect word order (se conseguirmos *a adoptar e defini-lo em movimento)
• Etc.
Advantages of the Logos Model
• Consistent and efficient solution to process DMWU, which are not consistently processed by earlier word or phrase alignment techniques
• Ability to relate constituents that are apart (even very far apart) in the sentence
• Consistent way to analyze and translate words in context
• Ability to generalize between alternative forms of the same MWU, phrase or expression (take a walk = walk)
• Semtab has a robust solution for the problem of open class items or less frequent MWU and phrases that cannot be learnt quickly and translated correctly by an SMT system, but annoyingly can be observed in MT translations (also used in non-native speakerisms)
– make a visit or pay a visit?
• MWU are not processed on a word-for-word basis; they represent atomic semantico-syntactic and translation units
Conclusions and Future Directions
• Standard MT systems can benefit from a correct processing of DMWU
– currently not being explored efficiently
– processing, recognition and translation of DMWU is challenging
• Some methodologies are inefficient
– they violate the intrinsic property of the unit as an atomic group of elements
– elements of the unit cannot be separated or aligned individually
– unit boundaries need to be respected
• Post-editing efforts can be minimized by improving alignment quality
• Even though we analyzed just a few cases of SVC, our findings point to a general lack of quality in the translation of DMWU (and discontinuous phrasal expressions)
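The requirement that the elements of a unit are not aligned individually can be expressed as a simple check on alignment links. This is one possible reading, sketched under assumptions: links are (source index, target index) pairs, a unit counts as respected when all of its source tokens link to the same non-empty set of target tokens, and `respects_unit` is a hypothetical name.

```python
def respects_unit(links, unit_src):
    """True if every source token of the unit links to the same non-empty target set."""
    targets = [frozenset(t for s, t in links if s == i) for i in unit_src]
    return bool(targets) and bool(targets[0]) and all(t == targets[0] for t in targets)

# EN: play(0) the(1) role(2)    PT: desempenhar(0) o(1) papel(2)
unit = [0, 2]  # "play [] role", with "the" as the insertion

block_links = {(0, 0), (0, 2), (2, 0), (2, 2), (1, 1)}  # unit aligned as one block
word_links = {(0, 0), (1, 1), (2, 2)}                   # unit split word by word

print(respects_unit(block_links, unit))
print(respects_unit(word_links, unit))
```

Under this reading, the block alignment keeps "play [] role" and "desempenhar [] papel" together as a single translation unit, while the word-for-word alignment breaks the unit boundary.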
Conclusions and Future Directions (continued)
• Validation
– Broader quantification of phenomena needed to validate exploratory results
• Evaluation
– Evaluation of the performance in hierarchical phrase and syntax-based MT and neural network translation models (with theoretical capacity to learn DMWU)
• Annotation
– Manual multilingual alignments (gold sets)
• Alignment Guidelines
– Improved and enlarged sets of linguistically-based/motivated alignment guidelines (gold standards)
• Cross-Linguistic Analysis
– Deep analysis of challenging cross-linguistic phenomena, including DMWU
• Rule/Grammar Construction
– Translation rules extracted from quality manually-annotated corpora
• Tool Enhancement and Automation
– Feed CLUE-Aligner with manual training data and enhance the tool for automatic alignment and extraction of large amounts of translation pairs for MT case studies
• Translation Applications
– Increase precision and recall in MT systems
• Paraphrases
– Methodology and resources: a valuable asset for applications requiring paraphrases
Final Remark
• Extreme importance of paraphrases for translation (human and MT)
• Paraphrastic knowledge allows choosing the best/optimal translations from a set of possible translations
EN – It is time to bring this issue to a conclusion.
EN – We must bring this episode to a conclusion.
PT – Está na hora de resolver esta questão.
PT – Chegou a hora de concluir este assunto.
PT – Punhamos um ponto final neste tema.
PT – Temos de concluir este episódio.
Suggest your own paraphrase!
The eSPERTo Project
Paraphrases 4 Translation (Human + MT)
The American man
• the man who is American
• the man from America
• the man with American nationality
• …
https://esperto.l2f.inesc-id.pt/esperto/esperto/demo.pl