Machine Translation Overview
April 23, 2020
Junjie Hu
Materials largely borrowed from Austin Matthews
One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’
Warren Weaver to Norbert Wiener, March 1947
Parallel corpus
• We are given a corpus of sentence pairs in two languages to train our machine translation models.
• The source language is also called the foreign language, denoted f.
• Conventionally, the target language is English, denoted e.
[Figure: a parallel text, the same passage in Greek and Egyptian]
Noisy Channel MT
• We want a model of p(e | f): given a confusing foreign sentence f, find a possible English translation e.
• Noisy channel view: an "English" sentence e is drawn from p(e), then passed through a noisy channel that garbles it into a "Foreign" sentence f with probability p(f | e); translation is decoding:

  ê = argmax_e p(e | f) = argmax_e p(e) · p(f | e)

• p(e) is the "Language Model"; p(f | e) is the "Translation Model".
Noisy Channel Division of Labor
• Language model, p(e):
  • Is the translation fluent, grammatical, and idiomatic?
  • Use any model of p(e), typically an n-gram model
• Translation model, p(f | e):
  • "Reverse" translation probability
  • Ensures adequacy of the translation
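To make the division of labor concrete, here is a minimal decoding sketch in Python, assuming we already have a candidate list and two trained scorers (lm_logprob and tm_logprob are hypothetical stand-ins, not part of the original slides):

```python
def decode(f, candidates, lm_logprob, tm_logprob):
    """Noisy-channel decoding: pick the English candidate e that maximizes
    log p(e) + log p(f | e), i.e. fluency plus adequacy."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))
```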
Language Model Failure

My legal name is Alexander Perchov. But all of my many friends dub me Alex, because that is a more flaccid-to-utter version of my legal name. Mother dubs me Alexi-stop-spleening-me!, because I am always spleening her. If you want to know why I am always spleening her, it is because I am always elsewhere with friends, and disseminating so much currency, and performing so many things that can spleen a mother.
Translation Model
• p(f | e) gives the channel probability: the probability of translating an English sentence into a foreign sentence.
• f = je voudrais un peu de fromage
  • e1 = I would like some cheese: p(f | e1) = 0.4
  • e2 = I would like a little of cheese: p(f | e2) = 0.5
  • e3 = There is no train to Barcelona: p(f | e3) < 0.00001
Translation Model
• How do we parameterize p(f | e)?
• There are a huge (close to infinite) number of possible sentences:
  • We can only count the sentences in our training data
  • This won't generalize to new inputs
Lexical Translation
• How do we translate a word? Look it up in a dictionary!
  Haus: house, home, shell, household
• Multiple translations:
  • Different word senses, different registers, different inflections
  • house and home are common
  • shell is specialized (the Haus of a snail is its shell)
How common is each translation?

Translation   Count
house          5000
home           2000
shell           100
household        80

Maximum Likelihood Estimation (MLE): p(e | Haus) = count(Haus → e) / Σ_e' count(Haus → e')
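A quick sketch of the MLE estimate computed from the counts in the table above:

```python
counts = {"house": 5000, "home": 2000, "shell": 100, "household": 80}
total = sum(counts.values())  # 7180 observed translations of "Haus"

# MLE: each translation's probability is its relative frequency
p_mle = {e: c / total for e, c in counts.items()}
print(p_mle["house"])  # ~0.696 = p(house | Haus)
```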
Lexical Translation
• Goal: a model p(e | f, m), where e and f are complete English and foreign sentences
• Lexical translation makes the following assumptions:
  • Each word e_i in e is generated from exactly one word in f
  • Thus, we have a latent alignment a_i that indicates which word e_i "came from." Specifically, it came from f_{a_i}
  • Given the alignments a, translation decisions are conditionally independent of each other and depend only on the aligned source word f_{a_i}
Lexical Translation
• Putting our assumptions together, we have:

  p(e, a | f, m) = ∏_{i=1}^{m} p(a_i) · p(e_i | f_{a_i})

  where the first factor is p(Alignment) and the second is p(Translation | Alignment), and a is an m-dimensional latent vector with each element a_i in the range [0, n].
Word Alignment
• Most of the research in the first 10 years of SMT was here. Word translations weren't the problem. Word order was hard.
Word Alignment
• Alignments can be visualized by drawing links between the two sentences, and they are represented as vectors of positions, e.g., a = (1, 2, 3, 4) aligns each of four target words to the source word at the same position.
Reordering
• Words may be reordered during translation

Word Dropping
• A source word may not be translated at all

Word Insertion
• Words may be inserted during translation
• E.g. English just does not have an equivalent
• But these words must be explained: we typically assume every source sentence contains a NULL token

One-to-many Translation
• A source word may translate into more than one target word
Many-to-one Translation
• More than one source word may translate as a unit, which lexical translation cannot model (each target word aligns to exactly one source word)
IBM Model 1
• Simplest possible lexical translation model
• Additional assumptions:
  • The m alignment decisions are independent
  • The alignment distribution for each a_i is uniform over all source words and NULL
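Under these assumptions the model takes a simple closed form (a standard derivation, not spelled out in the extracted slides; n + 1 appears because NULL adds one source position):

```latex
p(e, a \mid f, m) = \prod_{i=1}^{m} \frac{1}{n+1}\, t(e_i \mid f_{a_i})
                  = \frac{1}{(n+1)^m} \prod_{i=1}^{m} t(e_i \mid f_{a_i})
```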
Translating with Model 1
[Example slides: Model 1 proposes word-by-word candidate translations of the same source sentence; the language model says ☺ to the fluent candidate and ☹ to the disfluent one.]
Learning Lexical Translation Models
• How do we learn the parameters p(e | f) on a training corpus of (f, e) sentence pairs?
• "Chicken and egg" problem:
  • If we had the alignments, we could estimate the translation probabilities (MLE estimation)
  • If we had the translation probabilities, we could find the most likely alignments (greedily)
Expectation-Maximization (EM) Algorithm
• Pick some random (or uniform) starting parameters
• Repeat until bored (~5 iterations for lexical translation models):
  • Using the current parameters, compute "expected" alignments p(a_i | e, f) for every target word token in the training data
  • Keep track of the expected number of times f translates into e throughout the whole corpus
  • Keep track of the number of times f is used as the source of any translation
  • Use these frequency estimates in the standard MLE equation to get a better set of parameters
EM for IBM Model 1
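The original slides step through this procedure numerically; here is a compact Python sketch instead (the corpus format and variable names are my own, and NULL is the inserted empty source token):

```python
from collections import defaultdict

NULL = "<NULL>"

def train_model1(corpus, iterations=5):
    """EM training of IBM Model 1 probabilities t(e | f).
    corpus: list of (f_words, e_words) pairs, each a list of tokens."""
    e_vocab = {e for _, e_words in corpus for e in e_words}
    t = defaultdict(lambda: 1.0 / len(e_vocab))  # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)  # expected times f translates into e
        total = defaultdict(float)  # expected times f is used as a source
        for f_words, e_words in corpus:
            f_all = f_words + [NULL]  # NULL explains inserted target words
            for e in e_words:
                # E-step: posterior over which source word e came from;
                # Model 1's uniform alignment prior cancels in this ratio.
                z = sum(t[(e, f)] for f in f_all)
                for f in f_all:
                    p = t[(e, f)] / z
                    count[(e, f)] += p
                    total[f] += p
        # M-step: the standard MLE equation with expected counts
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

toy = [(["das", "Haus"], ["the", "house"]),
       (["das", "Buch"], ["the", "book"]),
       (["ein", "Buch"], ["a", "book"])]
t = train_model1(toy)
# After a few iterations, t[("house", "Haus")] clearly dominates t[("the", "Haus")]
```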
Convergence
Extensions: Lexical to Phrase Translation
• Phrase-based MT:
  • Allow multiple words to translate as chunks (including many-to-one)
  • Introduce another latent variable: the source segmentation
Extensions: Alignment Heuristics
• Alignment priors:
  • Instead of assuming the alignment decisions are uniform, impose (or learn) a prior over alignment grids
Chahuneau et al. (2013)
Extensions: Hierarchical Phrase-based MT
• Syntactic structure
• Rules of the form:
  • X 之一 → one of the X
Chiang (2005), Galley et al. (2006)
MT Evaluation
• How do we evaluate translation systems' output?
• Central idea: "The closer a machine translation is to a professional human translation, the better it is."
• The most commonly used metric is called BLEU: the geometric mean of the n-gram precisions against the human translations, combined with a brevity penalty for short outputs.
BLEU: An Example

Candidate 1: It is a guide to action which ensures that the military always obey the commands of the party.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed directions of the party.

Unigram Precision: 17/18

Adapted from slides by Arthur Chan
Issue of N-gram Precision
• What if some words are over-generated? E.g. "the"
• An extreme example:

Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.

• N-gram precision: 7/7
• Solution: a reference word should be exhausted after it is matched ("clipping").

Adapted from slides by Arthur Chan
Issue of N-gram Precision
• What if some words are just dropped?
• Another extreme example:

Candidate: the.
Reference 1: My mom likes the blue flowers.
Reference 2: My mother prefers the blue flowers.

• N-gram precision: 1/1
• Solution: add a penalty if the candidate is too short.

Adapted from slides by Arthur Chan
BLEU
• Clipped n-gram precisions p_n for n = 1, 2, 3, 4
• Geometric average of the precisions, times a brevity penalty:

  BLEU = BP · exp( (1/4) Σ_{n=1}^{4} log p_n ),  where BP = min(1, exp(1 − r/c))

  (c = candidate length, r = reference length)
• Ranges from 0.0 to 1.0, but usually shown multiplied by 100
• An increase of +1.0 BLEU is usually a conference paper
• MT systems usually score in the 10s to 30s
• Human translators usually score in the 70s and 80s
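Putting the pieces above together, a minimal sentence-level BLEU sketch (tokenized input; real implementations such as sacrebleu handle corpus-level statistics and tokenization):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Clipped n-gram precisions, geometric mean, brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # Clip each n-gram by its maximum count in any single reference
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[gram]) for gram, c in cand.items())
        total = max(1, sum(cand.values()))
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    # Brevity penalty against the closest reference length
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c >= r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the the the the the the the .".split(),
           ["The cat is on the mat .".split()]))  # clipping crushes this score
```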
A Short Segue
• Word- and phrase-based ("symbolic") models were cutting edge for decades (up until ~2014)
• Such models are still the most widely used in commercial applications
• Since 2014 most research on MT has focused on neural models
“Neurons”
[Figures, built up over several slides: a single artificial neuron computes a weighted sum of its inputs and applies a non-linearity]

“Neural” Networks
[Figures: neurons composed into layers; the output layer is normalized with a “softmax”]

“Deep”
[Figures: stacking multiple hidden layers makes the network deep]

“Recurrent”
[Figure: a recurrent network feeds its hidden state back in across time steps]
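A minimal numerical sketch of these pieces: a tanh hidden layer followed by a softmax output (the weights below are random placeholders, not a trained network):

```python
import numpy as np

def softmax(z):
    """Normalize a score vector into a probability distribution."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # input vector
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # hidden layer parameters
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # output layer parameters

h = np.tanh(W1 @ x + b1)   # "neurons": weighted sums plus a non-linearity
y = softmax(W2 @ h + b2)   # output distribution over 3 classes
print(y, y.sum())          # probabilities that sum to 1
```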
Design Decisions
• How to represent inputs and outputs?
• Neural architecture?
  • How many layers? (Requires non-linearities to improve capacity!)
  • How many neurons?
  • Recurrent or not?
  • What kind of non-linearities?
Representing Language
• "One-hot" vectors
  • Each position in a vector corresponds to a word type
• Distributed representations
  • Vectors encode "features" of input words (character n-grams, morphological features, etc.)
dog = [Figure: a one-hot vector, all zeros except a 1 at the Dog position in the vocabulary list (Aardvark, Abalone, Abandon, Abash, …, Dog, …); the distributed representation of "dog" is a short dense vector instead]
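A tiny illustration of the one-hot scheme (the five-word vocabulary is a toy assumption; real vocabularies have tens of thousands of types):

```python
import numpy as np

vocab = ["aardvark", "abalone", "abandon", "abash", "dog"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """All zeros except a 1 at the word's vocabulary position."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("dog"))  # [0. 0. 0. 0. 1.]
```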
Training Neural Networks
• Neural networks are supervised models: you need a set of inputs paired with outputs
• Algorithm:
  • Run until bored:
    • Give input to the network, see what it predicts
    • Compute loss(y, y*)
    • Use the chain rule (aka "backpropagation") to compute the gradient with respect to the parameters
    • Update parameters (SGD, Adam, LBFGS, etc.)
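That loop, sketched in PyTorch (the tiny model and random batches are placeholders standing in for real data):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):                 # "run until bored"
    x = torch.randn(16, 10)              # placeholder inputs
    y_gold = torch.randint(0, 2, (16,))  # placeholder gold outputs
    y_pred = model(x)                    # give input, see what it predicts
    loss = loss_fn(y_pred, y_gold)       # compute loss(y, y*)
    opt.zero_grad()
    loss.backward()                      # chain rule / backpropagation
    opt.step()                           # update parameters
```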
Neural Language Models
[Figure: the Bengio et al. (2003) feed-forward language model: concatenated context word embeddings feed a tanh hidden layer, followed by a softmax over the vocabulary]
Bengio et al. (2003)
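A sketch of that architecture in PyTorch (sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Bengio-style LM: embed n context words, tanh layer, softmax output."""
    def __init__(self, vocab_size, emb_dim=64, hidden=128, context=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):               # (batch, context)
        x = self.emb(context_ids).flatten(1)      # concatenate the embeddings
        h = torch.tanh(self.hidden(x))            # tanh hidden layer
        return torch.log_softmax(self.out(h), dim=-1)  # next-word distribution
```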
Neural Features for Translation
• Turn Bengio et al. (2003) into a translation model
• Conditional model: generate the next English word conditioned on
  • The previous n English words you generated
  • The aligned source word and its m neighbors
Devlin et al. (2014)
Neural Features for Translation
[Figure: the Devlin et al. (2014) network reuses the tanh hidden layer and softmax output, with the aligned source-side context words added to the input]
Devlin et al. (2014)
Notation Simplification

RNNs Revisited
Fully Neural Translation
• Fully end-to-end RNN-based translation model
  • Encode the source sentence using one RNN
  • Generate the target sentence one word at a time using another RNN
[Figure: an encoder RNN reads "I am a student"; a decoder RNN then emits "je suis étudiant" one word at a time]
Sutskever et al. (2014)
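A bare-bones sketch of this encoder-decoder in PyTorch (single sentence, no batching or beam search; all sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim)
        self.decoder = nn.LSTM(dim, dim)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):          # token id sequences
        src = self.src_emb(src_ids).unsqueeze(1)  # (src_len, 1, dim)
        tgt = self.tgt_emb(tgt_ids).unsqueeze(1)
        _, state = self.encoder(src)       # compress the source to one state
        out, _ = self.decoder(tgt, state)  # generate conditioned on that state
        return self.out(out.squeeze(1))    # scores over the target vocabulary
```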
Attentional Model
• The encoder-decoder model struggles with long sentences
  • An RNN is trying to compress an arbitrarily long sentence into a fixed-length vector
• What if we only look at one (or a few) source words when we generate each output word?
Bahdanau et al. (2014)
The Intuition
[Figure: word-alignment links between "Our large black dog bit the poor mailman." and its Japanese translation 「うちの⼤きな⿊い⽝が可哀想な郵便屋に噛みついた。」]
Bahdanau et al. (2014)
The Attention Model
[Figure, built up over several slides: the encoder reads "I am a student"; at each step the attention model compares the current decoder state with every encoder state, a softmax turns the scores into weights, and the weighted sum forms a context vector that the decoder combines with the previously generated word to emit the next word of "je suis étudiant"]
Bahdanau et al. (2014)
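One decoder step of attention, sketched with a dot-product score (Bahdanau et al. actually use an additive score, but the flow is identical):

```python
import torch
import torch.nn.functional as F

def attention_step(dec_state, enc_states):
    """dec_state: (dim,) current decoder state;
    enc_states: (src_len, dim), one vector per source word."""
    scores = enc_states @ dec_state     # how relevant is each source word?
    weights = F.softmax(scores, dim=0)  # attention distribution over source
    context = weights @ enc_states      # weighted sum = context vector
    return context, weights
```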
Convolutional Encoder-Decoder
• CNN:
  • Encodes words within a fixed-size window
  • Parallel computation
  • Shorter paths connecting a wider range of words
• RNN:
  • Sequentially encodes a sentence from left to right
  • Hard to parallelize
Gehring et al. (2017)
The Transformer
• Idea: Instead of using an RNN to encode the source sentence and the partial target sentence, use self-attention!
Vaswani et al. (2017)
[Figure: a standard RNN encoder vs. a self-attention encoder; both map the raw word vectors of "I am a student" to word-in-context vectors]
The Transformer
[Figure: the full Transformer encoder-decoder with attention, translating "I am a student" into "je suis étudiant" via context vectors]
Vaswani et al. (2017)
Transformer
• Traditional attention:
  • Query: decoder hidden state
  • Key and Value: encoder hidden states
  • Attend to source words based on the current decoder state
• Self-attention:
  • Query, Key, and Value come from the same sequence
  • Attend to surrounding source words based on the current source word
  • Attend to preceding target words based on the current target word
Vaswani et al. (2017)
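A minimal single-head self-attention sketch (a real Transformer first maps X to separate query, key, and value matrices with learned projections; here they are all X itself):

```python
import torch
import torch.nn.functional as F

def self_attention(X):
    """X: (seq_len, dim) raw word vectors -> word-in-context vectors."""
    scores = X @ X.T / X.size(-1) ** 0.5  # every word scores every word
    weights = F.softmax(scores, dim=-1)   # one distribution per query word
    return weights @ X                    # contextualized representations
```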
Visualization of Attention Weights
• Self-attention weights can detect long-range dependencies within a sentence, e.g., linking "make … more difficult"
The Transformer
• Computation is easily parallelizable
• Shorter path from each target word to each source word → stronger gradient signals
• Empirically stronger translation performance
• Empirically trains substantially faster than more serial models
Current Research Directions in Neural MT
• Incorporating syntax into neural MT
• Handling morphologically rich languages
• Optimizing translation quality (instead of corpus probability)
• Multilingual models
• Document-level translation