Seq2Seq + Attention / Copy
Wei Xu (many slides from Greg Durrett)
This Week
‣ Sequence-to-Sequence Model
‣ Attention Mechanism
‣ Copy Mechanism
‣ Transformer Architecture
Recall: CNNs vs. LSTMs
‣ Both LSTMs and convolutional layers transform the input using context
[Figure: on "the movie was good", a CNN (input n x k, c filters of size m x k, output O(n) x c) vs. a BiLSTM with hidden size c (input n x k, output n x 2c)]
‣ LSTM: "globally" looks at the entire sentence (but local for many problems)
‣ CNN: local, depending on filter width + number of layers
LSTMs
[Figure: LSTM cell with input x_j, gates f/i/o, candidate g, hidden state h_{j-1} → h_j, cell state c_{j-1} → c_j]
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
‣ f, i, o are gates that control information flow
‣ g reflects the main computation of the cell
Hochreiter & Schmidhuber (1997)
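‣ A minimal numpy sketch of one cell update under these gating equations (shapes, gate ordering, and initialization here are illustrative assumptions, not the lecture's code):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_j, h_prev, c_prev, W, b):
    """One LSTM step: gates f, i, o control information flow; g is the candidate update."""
    z = W @ np.concatenate([x_j, h_prev]) + b      # all four pre-activations at once
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates in (0, 1)
    g = np.tanh(g)                                 # main computation of the cell
    c_j = f * c_prev + i * g                       # gated cell-state update
    h_j = o * np.tanh(c_j)                         # exposed hidden state
    return h_j, c_j

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                   # toy sizes
W = rng.normal(size=(4 * d_h, d_in + d_h))
b = np.zeros(4 * d_h)
b[:d_h] = 1.0   # initialize forget-gate bias to 1: "remember everything to start"
h, c = lstm_cell_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```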
LSTMs
‣ Gradient still diminishes, but in a controlled way and generally by less; usually initialize forget gate = 1 to remember everything to start
[Figure: the gradient flowing back through the chain of cells stays similar]
h(p://colah.github.io/posts/2015-08-Understanding-LSTMs/
Encoder-Decoder
‣ Encode a sequence into a fixed-sized vector
the movie was great
‣ Now use that vector to produce a series of tokens as output from a separate LSTM decoder
le film était bon [STOP]
Sutskever et al. (2014)
‣ Machine translation, NLG, summarization, dialog, and many other tasks (e.g., semantic parsing, syntactic parsing) can be done using this framework.
Encoder-Decoder
‣ Is this true? Sort of… we'll come back to this later
Model
‣ Generate next word conditioned on previous word as well as hidden state
[Figure: encoder reads "the movie was great"; decoder is fed <s> and produces hidden state h̄]
‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary
‣ Decoder has separate parameters from encoder, so this can learn to be a language model (produce a plausible next word given the current one)
P(y|x) = \prod_{i=1}^{n} P(y_i \mid x, y_1, \dots, y_{i-1})
P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W\bar{h})
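‣ A minimal sketch of this output distribution for one decoder step (vocabulary, hidden size, and the random parameters are illustrative assumptions):
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["le", "film", "était", "bon", "[STOP]"]
hidden_size = 8
rng = np.random.default_rng(0)

W = rng.normal(size=(len(vocab), hidden_size))   # W is |vocab| x |hidden state|
h_bar = rng.normal(size=hidden_size)             # decoder hidden state at step i

p_next = softmax(W @ h_bar)                      # P(y_i | x, y_1, ..., y_{i-1})
print(dict(zip(vocab, np.round(p_next, 3))))
```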
Inference
‣ Generate next word conditioned on previous word as well as hidden state
the movie was great
‣ During inference: need to compute the argmax over the word predictions and then feed that to the next RNN state
‣ Need to actually evaluate the computation graph up to this point to form input for the next state
‣ Decoder is advanced one state at a time until [STOP] is reached
film était bon [STOP]
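‣ A sketch of this greedy decoding loop; decoder_step here is a stand-in for one RNN cell + output layer (the toy step function is purely illustrative):
```python
import numpy as np

def greedy_decode(decoder_step, h0, start_id, stop_id, max_len=20):
    """decoder_step(prev_token_id, state) -> (probs over vocab, new state).
    Take the argmax prediction and feed it back in as the next input."""
    state, prev, output = h0, start_id, []
    for _ in range(max_len):
        probs, state = decoder_step(prev, state)
        prev = int(np.argmax(probs))       # argmax over the word predictions
        if prev == stop_id:                # advance one state at a time until [STOP]
            break
        output.append(prev)
    return output

# toy decoder that deterministically emits 1, 2, 3, 4 then STOP (id 0)
def toy_step(prev, state):
    probs = np.zeros(5)
    probs[(state + 1) % 5] = 1.0
    return probs, state + 1

print(greedy_decode(toy_step, h0=0, start_id=4, stop_id=0))   # [1, 2, 3, 4]
```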
Implementing seq2seq Models
the movie was great
‣ Encoder: consumes sequence of tokens, produces a vector. Analogous to encoders for classification/tagging tasks
‣ Decoder: separate module, single cell. Takes two inputs: hidden state (vector h or tuple (h, c)) and previous token. Outputs token + new state
[Figure: Encoder feeds Decoder cells, which emit le, film, … one token at a time]
Training
‣ Objective: maximize
the movie was great    <s> le film était bon
‣ One loss term for each target-sentence word; feed the correct word regardless of model's prediction
\sum_{(x,y)} \sum_{i=1}^{n} \log P(y_i^* \mid x, y_1^*, \dots, y_{i-1}^*)
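‣ A sketch of this objective for one (x, y) pair with teacher forcing: one loss term per gold word, and the gold word is always fed as the next input (decoder_step is a stand-in, as above):
```python
import numpy as np

def sentence_nll(decoder_step, h0, gold_ids, start_id):
    """Sum of -log P(y*_i | x, y*_1, ..., y*_{i-1}) with gold inputs at every step."""
    state, prev, nll = h0, start_id, 0.0
    for y_star in gold_ids:
        probs, state = decoder_step(prev, state)
        nll -= np.log(probs[y_star] + 1e-12)   # one loss term per target word
        prev = y_star                          # feed the correct word regardless of prediction
    return nll

# toy check: uniform predictions over a 5-word vocab give len(gold) * log(5)
uniform_step = lambda prev, state: (np.full(5, 0.2), state)
print(sentence_nll(uniform_step, h0=None, gold_ids=[1, 2, 3, 0], start_id=4))  # ~6.44
```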
Training: Scheduled Sampling
‣ Scheduled sampling: with probability p, take the gold word as input; else take the model's prediction
‣ Starting with p = 1 and decaying it works best
[Figure: encoder input "the movie was great"; at each step the decoder is fed either the gold word (le, film, était) or a sample from its own prediction (la, film, étais, bon, [STOP])]
‣ Model needs to do the right thing even with its own predictions
Bengio et al. (2015)
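‣ A sketch of the scheduled sampling loop (the loss bookkeeping and the linear decay comment are illustrative choices, not Bengio et al.'s exact schedule):
```python
import numpy as np

def scheduled_sampling_nll(decoder_step, h0, gold_ids, start_id, p, rng):
    """With probability p feed the gold word; otherwise feed a sample
    from the model's own prediction, so it learns to handle its own outputs."""
    state, prev, nll = h0, start_id, 0.0
    for y_star in gold_ids:
        probs, state = decoder_step(prev, state)
        nll -= np.log(probs[y_star] + 1e-12)
        if rng.random() < p:
            prev = y_star                                    # gold input
        else:
            prev = int(rng.choice(len(probs), p=probs))      # model's own prediction
    return nll

# start with p = 1 and decay it over training, e.g. p = max(0.0, 1.0 - epoch / num_epochs)
```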
Implementation Details
‣ Sentence lengths vary for both encoder and decoder:
‣ Typically pad everything to the right length
‣ Encoder: can be a CNN / LSTM / …
‣ Decoder: execute one step of computation at a time, so the computation graph is formulated as taking one input + hidden state. Until reaching <STOP>.
‣ Beam search: can help with lookahead. Finds the (approximate) highest-scoring sequence:
\arg\max_y \prod_{i=1}^{n} P(y_i \mid x, y_1, \dots, y_{i-1})
Implementation Details
‣ Fairseq:
Beam Search
‣ Maintain decoder state, token history in beam
[Figure: beam of size 2. From <s>, candidates la (0.4), le (0.3), les (0.1) with scores log(0.4), log(0.3), log(0.1); expanding, le → film (0.8) scores log(0.3) + log(0.8) and la → film (0.4) scores log(0.4) + log(0.4), keeping hypotheses "le film" and "la film"]
‣ Keep both film states! Hidden state vectors are different
the movie was great
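‣ A sketch of beam search over log-probabilities, keeping the token history and the decoder state for every hypothesis (beam size and decoder_step are assumptions, as above):
```python
import numpy as np

def beam_search(decoder_step, h0, start_id, stop_id, beam_size=2, max_len=20):
    """Each hypothesis is (score, tokens, state). Hidden states differ even when
    the last token is the same, so every hypothesis keeps its own state."""
    beam, finished = [(0.0, [start_id], h0)], []
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beam:
            probs, new_state = decoder_step(tokens[-1], state)
            for w in np.argsort(probs)[-beam_size:]:            # top-k expansions
                candidates.append((score + float(np.log(probs[w] + 1e-12)),
                                   tokens + [int(w)], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:                     # prune to beam size
            (finished if cand[1][-1] == stop_id else beam).append(cand)
        if not beam:
            break
    return max(finished + beam, key=lambda c: c[0])[1]          # best-scoring hypothesis
```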
Regex Prediction
‣ Can use for other semantic parsing-like tasks
‣ Predict regex from text
‣ Problem: requires a lot of data: 10,000 examples needed to get ~60% accuracy on pretty simple regexes
Locascio et al. (2016)
Semantic Parsing as Translation
Jia and Liang (2015)
‣ Write down a linearized form of the semantic parse, train seq2seq models to directly translate into this representation
‣ Might not produce well-formed logical forms, might require lots of data
"what states border Texas"
‣ No need to have an explicit grammar, simplifies algorithms
Semantic Parsing / Lambda Calculus: https://www.youtube.com/watch?v=OocGXG-BY6k&t=200s
SQL Generation
‣ Convert natural language description into a SQL query against some DB
‣ How to ensure that well-formed SQL is generated?
Zhong et al. (2017)
‣ Three components
‣ How to capture column names + constants?
‣ Pointer mechanisms
Attention
Problems with Seq2seq Models
‣ Need some notion of input coverage or what input words we've translated
‣ Encoder-decoder models like to repeat themselves:
Un garçon joue dans la neige → A boy plays in the snow boy plays boy plays
‣ Often a byproduct of training these models poorly. Input is forgotten by the LSTM so it gets stuck in a "loop" of generating the same output tokens again and again.
Problems with Seq2seq Models
‣ Bad at long sentences: 1) a fixed-size hidden representation doesn't scale; 2) LSTMs still have a hard time remembering for really long sentences
RNNenc: the model we've discussed so far
RNNsearch: uses attention
Bahdanau et al. (2014)
Problems with Seq2seq Models
‣ Unknown words:
‣ Encoding these rare words into a vector space is really hard
‣ In fact, we don't want to encode them; we want a way of directly looking back at the input and copying them (Pont-de-Buis)
Jean et al. (2015), Luong et al. (2015)
Aligned Inputs
‣ Suppose we knew the source and target would be word-by-word translated
[Figure: "the movie was great" aligned word-by-word with "le film était bon"; decoder input <s> le film était bon, output le film était bon [STOP]]
‣ Can look at the corresponding input word when translating; this could scale!
‣ Much less burden on the hidden state
‣ How can we achieve this without hardcoding it?
Attention
‣ At each decoder state, compute a distribution over source inputs based on the current decoder state
[Figure: attention distributions over "the movie was great" at decoder steps <s> and le]
‣ Use that in the output layer
Attention
‣ For each decoder state, compute a weighted sum of input states
[Figure: encoder states h_1..h_4 over "the movie was great"; decoder fed <s>, producing state h̄_1]
e_{ij} = f(\bar{h}_i, h_j)   ‣ Some function f (next slide)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}
c_i = \sum_j \alpha_{ij} h_j   ‣ Weighted sum of input hidden states (vector)
P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])
‣ No attention: P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W\bar{h}_i)
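‣ A numpy sketch of one decoder step with attention, using dot-product scoring as the choice of f (toy sizes; the output layer follows softmax(W[c_i; h̄_i])):
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d, vocab_size = 4, 8, 10
H = rng.normal(size=(n, d))        # encoder hidden states h_1..h_n ("the movie was great")
h_bar = rng.normal(size=d)         # decoder hidden state h_bar_i
W = rng.normal(size=(vocab_size, 2 * d))

e = H @ h_bar                      # e_ij = f(h_bar_i, h_j), here a dot product
alpha = softmax(e)                 # distribution over source positions
c = alpha @ H                      # c_i = sum_j alpha_ij h_j (weighted sum of inputs)
p = softmax(W @ np.concatenate([c, h_bar]))   # softmax(W [c_i; h_bar_i])
print(alpha.round(2), p.round(2))
```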
Attention
[Figure: same setup; decoder fed <s> with state h̄_1]
e_{ij} = f(\bar{h}_i, h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}
c_i = \sum_j \alpha_{ij} h_j
‣ Note that this all uses outputs of hidden layers
‣ Bahdanau et al. (2014), additive: f(\bar{h}_i, h_j) = \tanh(W[\bar{h}_i, h_j])
‣ Luong et al. (2015), dot product: f(\bar{h}_i, h_j) = \bar{h}_i \cdot h_j
‣ Luong et al. (2015), bilinear: f(\bar{h}_i, h_j) = \bar{h}_i^\top W h_j
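‣ The three scoring functions as a sketch (weight shapes are assumptions; the additive form follows the slide's simplified one-layer version):
```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
h_bar, h_j = rng.normal(size=d), rng.normal(size=d)
W_add = rng.normal(size=(1, 2 * d))   # additive scoring weights
W_bil = rng.normal(size=(d, d))       # bilinear scoring weights

score_additive = np.tanh(W_add @ np.concatenate([h_bar, h_j]))[0]  # Bahdanau et al. (2014)
score_dot = h_bar @ h_j                                             # Luong et al. (2015)
score_bilinear = h_bar @ W_bil @ h_j                                # Luong et al. (2015)
print(score_additive, score_dot, score_bilinear)
```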
What can attention do?
‣ Learning to copy: how might this work?
Luong et al. (2015)
input: 0 3 2 1 → output: 0 3 2 1
‣ LSTM can learn to count with the right weight matrix
‣ This is effectively position-based addressing
What can attention do?
‣ Learning to subsample tokens
Luong et al. (2015)
input: 0 3 2 1 → output: 3 1
‣ Need to count (for ordering) and also determine which tokens are in/out
‣ Content-based addressing
Attention
‣ Decoder hidden states are now mostly responsible for selecting what to attend to
‣ Doesn't take a complex hidden state to walk monotonically through a sentence and spit out word-by-word translations
‣ Encoder hidden states capture contextual source word identity
the movie was great
Batching Attention
Luong et al. (2015)
the movie was great
token outputs: batch size x sentence length x dimension
sentence outputs: batch size x hidden size
<s>
hidden state: batch size x hidden size
e_{ij} = f(\bar{h}_i, h_j), \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}   (attention scores: batch size x sentence length)
c_i = \sum_j \alpha_{ij} h_j   (c: batch size x hidden size)
‣ Make sure tensors are the right size!
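‣ A sketch of the batched computation with these shapes (dot-product scoring assumed; einsum indices: b = batch, j = source position, d = hidden dim):
```python
import numpy as np

batch, sent_len, hidden = 16, 20, 64
rng = np.random.default_rng(0)
H = rng.normal(size=(batch, sent_len, hidden))   # token outputs: batch x sentence length x dim
h_bar = rng.normal(size=(batch, hidden))         # decoder hidden state: batch x hidden size

e = np.einsum('bjd,bd->bj', H, h_bar)            # attention scores: batch x sentence length
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)        # softmax over source positions
c = np.einsum('bj,bjd->bd', alpha, H)            # context c: batch x hidden size

assert e.shape == (batch, sent_len) and c.shape == (batch, hidden)  # tensors are the right size
```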
Results
‣ Machine translation: BLEU score of 14.0 on English-German → 16.8 with attention, 19.0 with smarter attention (constrained to a small window)
Luong et al. (2015), Chopra et al. (2016), Jia and Liang (2016)
‣ Summarization / headline generation: bigram recall from 11% → 15%
‣ Semantic parsing: ~30% accuracy → 70+% accuracy on Geoquery
Copying Input / Pointers
Unknown Words
Jean et al. (2015), Luong et al. (2015)
‣ Want to be able to copy named entities like Pont-de-Buis
P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])   (c_i from attention, \bar{h}_i from RNN hidden state)
‣ Problem: target word has to be in the vocabulary; attention + RNN need to generate a good embedding to pick it
Pointer Network
‣ Standard decoder (Pvocab): softmax over vocabulary
P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])
[Figure: Pvocab is a distribution over target vocabulary words (Le, a, matin, …)]
‣ Pointer network (Ppointer): predict from source words instead of target vocabulary
[Figure: Ppointer is a distribution over the source words "portico in Pont-de-Buis"; decoder fed <s>]
Pointer-Generator Mixture Models
‣ Define the decoder model as a mixture model of Pvocab and Ppointer
[Figure: a distribution over the vocabulary (Le, a, matin, …) and one over the source words (portico, in, Pont-de-Buis), mixed with weights 1 − P(copy) and P(copy)]
‣ Predict P(copy) based on decoder state, input, etc.
‣ Marginalize over the copy variable during training and inference
‣ Model will be able to both generate and copy, flexibly adapting between the two
Gulcehre et al. (2016), Gu et al. (2016)
Copying
[Figure: output distribution over the normal vocabulary (Le, de, matin, …) plus the source words (Pont-de-Buis, ecotax)]
‣ Vocabulary contains "normal" vocab as well as words in the input. Normalize over both of these:
‣ Bilinear function of input representation + output hidden state
P(y_i = w \mid x, y_1, \dots, y_{i-1}) \propto \begin{cases} \exp(W_w[c_i; \bar{h}_i]) & \text{if } w \text{ in vocab} \\ \exp(h_j^\top V \bar{h}_i) & \text{if } w = x_j \end{cases}
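‣ A sketch of this combined distribution: generation scores for vocabulary words and bilinear copy scores for source tokens, normalized together (names, sizes, and parameters are illustrative assumptions):
```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Le", "de", "matin"]                  # "normal" target vocabulary
source = ["portico", "in", "Pont-de-Buis"]     # source tokens x_1..x_n
d = 8

H = rng.normal(size=(len(source), d))          # encoder states h_j
c_i, h_bar = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(len(vocab), 2 * d))       # generation parameters
V = rng.normal(size=(d, d))                    # bilinear copy-scoring parameters

gen_scores = W @ np.concatenate([c_i, h_bar])  # W_w [c_i; h_bar_i] for w in vocab
copy_scores = H @ V @ h_bar                    # h_j^T V h_bar_i for w = x_j
scores = np.concatenate([gen_scores, copy_scores])
p = np.exp(scores - scores.max())
p /= p.sum()                                   # normalize over vocab + input words
print(dict(zip(vocab + source, p.round(3))))
```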
Copying in Summarization
See et al. (2017)
Transformers
Sentence Encoders
the movie was great
‣ LSTM abstraction: maps each vector in a sentence to a new, context-aware vector
‣ CNNs do something similar with filters
‣ Attention can give us a third way to do this
Vaswani et al. (2017)
the movie was great
Self-Attention
Vaswani et al. (2017)
The ballerina is very excited that she will dance in the show.
‣ Assume we're using GloVe; what do we want our neural network to do?
‣ What words need to be contextualized here?
‣ Pronouns need to look at antecedents
‣ Ambiguous words should look at context
‣ Words should look at syntactic parents/children
‣ Problem: LSTMs and CNNs don't do this
Self-Attention
Vaswani et al. (2017)
The ballerina is very excited that she will dance in the show.
‣ Want:
‣ LSTMs/CNNs: tend to look at local context
The ballerina is very excited that she will dance in the show.
‣ To appropriately contextualize embeddings, we need to pass information over long distances dynamically for each word
Self-Attention
Vaswani et al. (2017)
the movie was great
‣ Each word forms a "query" which then computes attention over each word
‣ Multiple "heads" analogous to different convolutional filters. Use parameters Wk and Vk to get different attention values + transform vectors
\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j)   (scalar)
x_i' = \sum_{j=1}^{n} \alpha_{i,j} x_j   (vector = sum of scalar * vector)
With multiple heads:
\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j)
x_{k,i}' = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j
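‣ A sketch of these per-head equations (two heads, toy sizes; no scaling or output projection, so this is the bare form above rather than the full Transformer layer):
```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, n_heads = 4, 8, 2
X = rng.normal(size=(n, d))                     # word vectors x_1..x_n ("the movie was great")

head_outputs = []
for k in range(n_heads):
    W_k = rng.normal(size=(d, d))               # per-head attention parameters
    V_k = rng.normal(size=(d, d))               # per-head value transform
    alpha_k = softmax_rows(X @ W_k @ X.T)       # alpha_{k,i,j} = softmax_j(x_i^T W_k x_j)
    head_outputs.append(alpha_k @ (X @ V_k.T))  # x'_{k,i} = sum_j alpha_{k,i,j} V_k x_j
X_prime = np.concatenate(head_outputs, axis=-1) # concatenate heads: n x (n_heads * d)
print(X_prime.shape)
```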
What can self-attention do?
Vaswani et al. (2017)
The ballerina is very excited that she will dance in the show.
‣ Why multiple heads? Softmaxes end up being peaked; a single distribution cannot easily put weight on multiple things
0.5 0.2 0.1 0.1 0.1 0 0 0 0 0 0 0
‣ This is a demonstration; we will revisit what these models actually learn when we discuss BERT
‣ Attend nearby + to semantically related terms
0.5 0 0.4 0 0.1 0 0 0 0 0 0 0
Transformer Uses
‣ Supervised: transformer can replace LSTM as encoder, decoder, or both; will revisit this when we discuss MT
‣ Unsupervised: transformers work better than LSTMs for unsupervised pre-training of embeddings: predict word given context words
‣ BERT (Bidirectional Encoder Representations from Transformers): pretraining transformer language models similar to ELMo
‣ Stronger than similar methods, SOTA on ~11 tasks (including NER: 92.8 F1)
Takeaways
‣ Attention is very helpful for seq2seq models
‣ Used for tasks including data-to-text generation and summarization
‣ Explicitly copying input can be beneficial as well
‣ Transformers are strong models we'll come back to later, if time