Seq2Seq + Attention / Copy
Wei Xu (many slides from Greg Durrett)
This Week
‣ Sequence-to-Sequence Model
‣ Attention Mechanism
‣ Copy Mechanism
‣ Transformer Architecture
Recall: CNNs vs. LSTMs
‣ Both LSTMs and convolutional layers transform the input using context
[Figure: on "the movie was good", a CNN (input n x k, c filters of size m x k, output O(n) x c) vs. a BiLSTM with hidden size c (input n x k, output n x 2c)]
‣ LSTM: "globally" looks at the entire sentence (but local for many problems)
‣ CNN: local, depending on filter width + number of layers
LSTMs
[Figure: LSTM cell with input x_j, gates f/i/o, candidate g, hidden state h_{j-1} → h_j, cell state c_{j-1} → c_j]
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
‣ f, i, o are gates that control information flow
‣ g reflects the main computation of the cell
Hochreiter & Schmidhuber (1997)
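‣ A minimal numpy sketch of one cell update under these gating equations (shapes, gate ordering, and initialization here are illustrative assumptions, not the lecture's code):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_j, h_prev, c_prev, W, b):
    """One LSTM step: gates f, i, o control information flow; g is the candidate update."""
    z = W @ np.concatenate([x_j, h_prev]) + b      # all four pre-activations at once
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates in (0, 1)
    g = np.tanh(g)                                 # main computation of the cell
    c_j = f * c_prev + i * g                       # gated cell-state update
    h_j = o * np.tanh(c_j)                         # exposed hidden state
    return h_j, c_j

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                   # toy sizes
W = rng.normal(size=(4 * d_h, d_in + d_h))
b = np.zeros(4 * d_h)
b[:d_h] = 1.0   # initialize forget-gate bias to 1: "remember everything to start"
h, c = lstm_cell_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```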
LSTMs
‣ Gradient still diminishes, but in a controlled way and generally by less; usually initialize forget gate = 1 to remember everything to start
[Figure: the gradient flowing back through the chain of cells stays similar]
h(p://colah.github.io/posts/2015-08-Understanding-LSTMs/
Encoder-Decoder
‣ Encode a sequence into a fixed-sized vector
the movie was great
‣ Now use that vector to produce a series of tokens as output from a separate LSTM decoder
le film était bon [STOP]
Sutskever et al. (2014)
‣ Machine translation, NLG, summarization, dialog, and many other tasks (e.g., semantic parsing, syntactic parsing) can be done using this framework.
Encoder-Decoder
‣ Is this true? Sort of… we'll come back to this later
Model
‣ Generate next word conditioned on previous word as well as hidden state
[Figure: encoder reads "the movie was great"; decoder is fed <s> and produces hidden state h̄]
‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary
‣ Decoder has separate parameters from encoder, so this can learn to be a language model (produce a plausible next word given the current one)
P(y|x) = \prod_{i=1}^{n} P(y_i \mid x, y_1, \dots, y_{i-1})
P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W\bar{h})
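‣ A minimal sketch of this output distribution for one decoder step (vocabulary, hidden size, and the random parameters are illustrative assumptions):
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab = ["le", "film", "était", "bon", "[STOP]"]
hidden_size = 8
rng = np.random.default_rng(0)

W = rng.normal(size=(len(vocab), hidden_size))   # W is |vocab| x |hidden state|
h_bar = rng.normal(size=hidden_size)             # decoder hidden state at step i

p_next = softmax(W @ h_bar)                      # P(y_i | x, y_1, ..., y_{i-1})
print(dict(zip(vocab, np.round(p_next, 3))))
```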
Inference
‣ Generate next word conditioned on previous word as well as hidden state
the movie was great
‣ During inference: need to compute the argmax over the word predictions and then feed that to the next RNN state
‣ Need to actually evaluate the computation graph up to this point to form input for the next state
‣ Decoder is advanced one state at a time until [STOP] is reached
film était bon [STOP]
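‣ A sketch of this greedy decoding loop; decoder_step here is a stand-in for one RNN cell + output layer (the toy step function is purely illustrative):
```python
import numpy as np

def greedy_decode(decoder_step, h0, start_id, stop_id, max_len=20):
    """decoder_step(prev_token_id, state) -> (probs over vocab, new state).
    Take the argmax prediction and feed it back in as the next input."""
    state, prev, output = h0, start_id, []
    for _ in range(max_len):
        probs, state = decoder_step(prev, state)
        prev = int(np.argmax(probs))       # argmax over the word predictions
        if prev == stop_id:                # advance one state at a time until [STOP]
            break
        output.append(prev)
    return output

# toy decoder that deterministically emits 1, 2, 3, 4 then STOP (id 0)
def toy_step(prev, state):
    probs = np.zeros(5)
    probs[(state + 1) % 5] = 1.0
    return probs, state + 1

print(greedy_decode(toy_step, h0=0, start_id=4, stop_id=0))   # [1, 2, 3, 4]
```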
Implementing seq2seq Models
the movie was great
‣ Encoder: consumes sequence of tokens, produces a vector. Analogous to encoders for classification/tagging tasks
‣ Decoder: separate module, single cell. Takes two inputs: hidden state (vector h or tuple (h, c)) and previous token. Outputs token + new state
[Figure: Encoder feeds Decoder cells, which emit le, film, … one token at a time]
Training
‣ Objective: maximize
the movie was great    <s> le film était bon
‣ One loss term for each target-sentence word; feed the correct word regardless of model's prediction
\sum_{(x,y)} \sum_{i=1}^{n} \log P(y_i^* \mid x, y_1^*, \dots, y_{i-1}^*)
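‣ A sketch of this objective for one (x, y) pair with teacher forcing: one loss term per gold word, and the gold word is always fed as the next input (decoder_step is a stand-in, as above):
```python
import numpy as np

def sentence_nll(decoder_step, h0, gold_ids, start_id):
    """Sum of -log P(y*_i | x, y*_1, ..., y*_{i-1}) with gold inputs at every step."""
    state, prev, nll = h0, start_id, 0.0
    for y_star in gold_ids:
        probs, state = decoder_step(prev, state)
        nll -= np.log(probs[y_star] + 1e-12)   # one loss term per target word
        prev = y_star                          # feed the correct word regardless of prediction
    return nll

# toy check: uniform predictions over a 5-word vocab give len(gold) * log(5)
uniform_step = lambda prev, state: (np.full(5, 0.2), state)
print(sentence_nll(uniform_step, h0=None, gold_ids=[1, 2, 3, 0], start_id=4))  # ~6.44
```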
Training: Scheduled Sampling
‣ Scheduled sampling: with probability p, take the gold word as input; else take the model's prediction
‣ Starting with p = 1 and decaying it works best
[Figure: encoder input "the movie was great"; at each step the decoder is fed either the gold word (le, film, était) or a sample from its own prediction (la, film, étais, bon, [STOP])]
‣ Model needs to do the right thing even with its own predictions
Bengio et al. (2015)
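‣ A sketch of the scheduled sampling loop (the loss bookkeeping and the linear decay comment are illustrative choices, not Bengio et al.'s exact schedule):
```python
import numpy as np

def scheduled_sampling_nll(decoder_step, h0, gold_ids, start_id, p, rng):
    """With probability p feed the gold word; otherwise feed a sample
    from the model's own prediction, so it learns to handle its own outputs."""
    state, prev, nll = h0, start_id, 0.0
    for y_star in gold_ids:
        probs, state = decoder_step(prev, state)
        nll -= np.log(probs[y_star] + 1e-12)
        if rng.random() < p:
            prev = y_star                                    # gold input
        else:
            prev = int(rng.choice(len(probs), p=probs))      # model's own prediction
    return nll

# start with p = 1 and decay it over training, e.g. p = max(0.0, 1.0 - epoch / num_epochs)
```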
Implementation Details
‣ Sentence lengths vary for both encoder and decoder:
‣ Typically pad everything to the right length
‣ Encoder: can be a CNN / LSTM / …
‣ Decoder: execute one step of computation at a time, so the computation graph is formulated as taking one input + hidden state. Until reaching <STOP>.
‣ Beam search: can help with lookahead. Finds the (approximate) highest-scoring sequence:
\arg\max_y \prod_{i=1}^{n} P(y_i \mid x, y_1, \dots, y_{i-1})
Implementation Details
‣ Fairseq:
Beam Search
‣ Maintain decoder state, token history in beam
[Figure: beam of size 2. From <s>, candidates la (0.4), le (0.3), les (0.1) with scores log(0.4), log(0.3), log(0.1); expanding, le → film (0.8) scores log(0.3) + log(0.8) and la → film (0.4) scores log(0.4) + log(0.4), keeping hypotheses "le film" and "la film"]
‣ Keep both film states! Hidden state vectors are different
the movie was great
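‣ A sketch of beam search over log-probabilities, keeping the token history and the decoder state for every hypothesis (beam size and decoder_step are assumptions, as above):
```python
import numpy as np

def beam_search(decoder_step, h0, start_id, stop_id, beam_size=2, max_len=20):
    """Each hypothesis is (score, tokens, state). Hidden states differ even when
    the last token is the same, so every hypothesis keeps its own state."""
    beam, finished = [(0.0, [start_id], h0)], []
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beam:
            probs, new_state = decoder_step(tokens[-1], state)
            for w in np.argsort(probs)[-beam_size:]:            # top-k expansions
                candidates.append((score + float(np.log(probs[w] + 1e-12)),
                                   tokens + [int(w)], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for cand in candidates[:beam_size]:                     # prune to beam size
            (finished if cand[1][-1] == stop_id else beam).append(cand)
        if not beam:
            break
    return max(finished + beam, key=lambda c: c[0])[1]          # best-scoring hypothesis
```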
Regex Prediction
‣ Can use for other semantic parsing-like tasks
‣ Predict regex from text
‣ Problem: requires a lot of data: 10,000 examples needed to get ~60% accuracy on pretty simple regexes
Locascio et al. (2016)
Semantic Parsing as Translation
Jia and Liang (2015)
‣ Write down a linearized form of the semantic parse, train seq2seq models to directly translate into this representation
‣ Might not produce well-formed logical forms, might require lots of data
"what states border Texas"
‣ No need to have an explicit grammar, simplifies algorithms
Semantic Parsing / Lambda Calculus: https://www.youtube.com/watch?v=OocGXG-BY6k&t=200s
SQL Generation
‣ Convert natural language description into a SQL query against some DB
‣ How to ensure that well-formed SQL is generated?
Zhong et al. (2017)
‣ Three components
‣ How to capture column names + constants?
‣ Pointer mechanisms
Attention
Problems with Seq2seq Models
‣ Need some notion of input coverage or what input words we've translated
‣ Encoder-decoder models like to repeat themselves:
Un garçon joue dans la neige → A boy plays in the snow boy plays boy plays
‣ Often a byproduct of training these models poorly. Input is forgotten by the LSTM so it gets stuck in a "loop" of generating the same output tokens again and again.
Problems with Seq2seq Models
‣ Bad at long sentences: 1) a fixed-size hidden representation doesn't scale; 2) LSTMs still have a hard time remembering for really long sentences
RNNenc: the model we've discussed so far
RNNsearch: uses attention
Bahdanau et al. (2014)
Problems with Seq2seq Models
‣ Unknown words:
‣ Encoding these rare words into a vector space is really hard
‣ In fact, we don't want to encode them; we want a way of directly looking back at the input and copying them (Pont-de-Buis)
Jean et al. (2015), Luong et al. (2015)
Aligned Inputs
‣ Suppose we knew the source and target would be word-by-word translated
[Figure: "the movie was great" aligned word-by-word with "le film était bon"; decoder input <s> le film était bon, output le film était bon [STOP]]
‣ Can look at the corresponding input word when translating; this could scale!
‣ Much less burden on the hidden state
‣ How can we achieve this without hardcoding it?
Attention
‣ At each decoder state, compute a distribution over source inputs based on the current decoder state
[Figure: attention distributions over "the movie was great" at decoder steps <s> and le]
‣ Use that in the output layer
Attention
‣ For each decoder state, compute a weighted sum of input states
[Figure: encoder states h_1..h_4 over "the movie was great"; decoder fed <s>, producing state h̄_1]
e_{ij} = f(\bar{h}_i, h_j)   ‣ Some function f (next slide)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}
c_i = \sum_j \alpha_{ij} h_j   ‣ Weighted sum of input hidden states (vector)
P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])
‣ No attention: P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W\bar{h}_i)
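‣ A numpy sketch of one decoder step with attention, using dot-product scoring as the choice of f (toy sizes; the output layer follows softmax(W[c_i; h̄_i])):
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d, vocab_size = 4, 8, 10
H = rng.normal(size=(n, d))        # encoder hidden states h_1..h_n ("the movie was great")
h_bar = rng.normal(size=d)         # decoder hidden state h_bar_i
W = rng.normal(size=(vocab_size, 2 * d))

e = H @ h_bar                      # e_ij = f(h_bar_i, h_j), here a dot product
alpha = softmax(e)                 # distribution over source positions
c = alpha @ H                      # c_i = sum_j alpha_ij h_j (weighted sum of inputs)
p = softmax(W @ np.concatenate([c, h_bar]))   # softmax(W [c_i; h_bar_i])
print(alpha.round(2), p.round(2))
```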
Attention
[Figure: same setup; decoder fed <s> with state h̄_1]
e_{ij} = f(\bar{h}_i, h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}
c_i = \sum_j \alpha_{ij} h_j
‣ Note that this all uses outputs of hidden layers
‣ Bahdanau et al. (2014), additive: f(\bar{h}_i, h_j) = \tanh(W[\bar{h}_i, h_j])
‣ Luong et al. (2015), dot product: f(\bar{h}_i, h_j) = \bar{h}_i \cdot h_j
‣ Luong et al. (2015), bilinear: f(\bar{h}_i, h_j) = \bar{h}_i^\top W h_j
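‣ The three scoring functions as a sketch (weight shapes are assumptions; the additive form follows the slide's simplified one-layer version):
```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
h_bar, h_j = rng.normal(size=d), rng.normal(size=d)
W_add = rng.normal(size=(1, 2 * d))   # additive scoring weights
W_bil = rng.normal(size=(d, d))       # bilinear scoring weights

score_additive = np.tanh(W_add @ np.concatenate([h_bar, h_j]))[0]  # Bahdanau et al. (2014)
score_dot = h_bar @ h_j                                             # Luong et al. (2015)
score_bilinear = h_bar @ W_bil @ h_j                                # Luong et al. (2015)
print(score_additive, score_dot, score_bilinear)
```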
What can attention do?
‣ Learning to copy: how might this work?
Luong et al. (2015)
input: 0 3 2 1 → output: 0 3 2 1
‣ LSTM can learn to count with the right weight matrix
‣ This is effectively position-based addressing
What can attention do?
‣ Learning to subsample tokens
Luong et al. (2015)
input: 0 3 2 1 → output: 3 1
‣ Need to count (for ordering) and also determine which tokens are in/out
‣ Content-based addressing
Attention
‣ Decoder hidden states are now mostly responsible for selecting what to attend to
‣ Doesn't take a complex hidden state to walk monotonically through a sentence and spit out word-by-word translations
‣ Encoder hidden states capture contextual source word identity
the movie was great
Batching Attention
Luong et al. (2015)
the movie was great
token outputs: batch size x sentence length x dimension
sentence outputs: batch size x hidden size
<s>
hidden state: batch size x hidden size
e_{ij} = f(\bar{h}_i, h_j), \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}   (attention scores: batch size x sentence length)
c_i = \sum_j \alpha_{ij} h_j   (c: batch size x hidden size)
‣ Make sure tensors are the right size!
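‣ A sketch of the batched computation with these shapes (dot-product scoring assumed; einsum indices: b = batch, j = source position, d = hidden dim):
```python
import numpy as np

batch, sent_len, hidden = 16, 20, 64
rng = np.random.default_rng(0)
H = rng.normal(size=(batch, sent_len, hidden))   # token outputs: batch x sentence length x dim
h_bar = rng.normal(size=(batch, hidden))         # decoder hidden state: batch x hidden size

e = np.einsum('bjd,bd->bj', H, h_bar)            # attention scores: batch x sentence length
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)        # softmax over source positions
c = np.einsum('bj,bjd->bd', alpha, H)            # context c: batch x hidden size

assert e.shape == (batch, sent_len) and c.shape == (batch, hidden)  # tensors are the right size
```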
Results
‣ Machine translation: BLEU score of 14.0 on English-German → 16.8 with attention, 19.0 with smarter attention (constrained to a small window)
Luong et al. (2015), Chopra et al. (2016), Jia and Liang (2016)
‣ Summarization / headline generation: bigram recall from 11% → 15%
‣ Semantic parsing: ~30% accuracy → 70+% accuracy on Geoquery
Copying Input / Pointers
Unknown Words
Jean et al. (2015), Luong et al. (2015)
‣ Want to be able to copy named entities like Pont-de-Buis
P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])   (c_i from attention, \bar{h}_i from RNN hidden state)
‣ Problem: target word has to be in the vocabulary; attention + RNN need to generate a good embedding to pick it
Pointer Network
‣ Standard decoder (Pvocab): softmax over vocabulary
P(y_i \mid x, y_1, \dots, y_{i-1}) = \mathrm{softmax}(W[c_i; \bar{h}_i])
[Figure: Pvocab is a distribution over target vocabulary words (Le, a, matin, …)]
‣ Pointer network (Ppointer): predict from source words instead of target vocabulary
[Figure: Ppointer is a distribution over the source words "portico in Pont-de-Buis"; decoder fed <s>]
Pointer-Generator Mixture Models
‣ Define the decoder model as a mixture model of Pvocab and Ppointer
[Figure: a distribution over the vocabulary (Le, a, matin, …) and one over the source words (portico, in, Pont-de-Buis), mixed with weights 1 − P(copy) and P(copy)]
‣ Predict P(copy) based on decoder state, input, etc.
‣ Marginalize over the copy variable during training and inference
‣ Model will be able to both generate and copy, flexibly adapting between the two
Gulcehre et al. (2016), Gu et al. (2016)
Copying
[Figure: output distribution over the normal vocabulary (Le, de, matin, …) plus the source words (Pont-de-Buis, ecotax)]
‣ Vocabulary contains "normal" vocab as well as words in the input. Normalize over both of these:
‣ Bilinear function of input representation + output hidden state
P(y_i = w \mid x, y_1, \dots, y_{i-1}) \propto \begin{cases} \exp(W_w[c_i; \bar{h}_i]) & \text{if } w \text{ in vocab} \\ \exp(h_j^\top V \bar{h}_i) & \text{if } w = x_j \end{cases}
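‣ A sketch of this combined distribution: generation scores for vocabulary words and bilinear copy scores for source tokens, normalized together (names, sizes, and parameters are illustrative assumptions):
```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Le", "de", "matin"]                  # "normal" target vocabulary
source = ["portico", "in", "Pont-de-Buis"]     # source tokens x_1..x_n
d = 8

H = rng.normal(size=(len(source), d))          # encoder states h_j
c_i, h_bar = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(len(vocab), 2 * d))       # generation parameters
V = rng.normal(size=(d, d))                    # bilinear copy-scoring parameters

gen_scores = W @ np.concatenate([c_i, h_bar])  # W_w [c_i; h_bar_i] for w in vocab
copy_scores = H @ V @ h_bar                    # h_j^T V h_bar_i for w = x_j
scores = np.concatenate([gen_scores, copy_scores])
p = np.exp(scores - scores.max())
p /= p.sum()                                   # normalize over vocab + input words
print(dict(zip(vocab + source, p.round(3))))
```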
Copying in Summarization
See et al. (2017)
Transformers
Sentence Encoders
the movie was great
‣ LSTM abstraction: maps each vector in a sentence to a new, context-aware vector
‣ CNNs do something similar with filters
‣ Attention can give us a third way to do this
Vaswani et al. (2017)
the movie was great
Self-Attention
Vaswani et al. (2017)
The ballerina is very excited that she will dance in the show.
‣ Assume we're using GloVe; what do we want our neural network to do?
‣ What words need to be contextualized here?
‣ Pronouns need to look at antecedents
‣ Ambiguous words should look at context
‣ Words should look at syntactic parents/children
‣ Problem: LSTMs and CNNs don't do this
Self-Attention
Vaswani et al. (2017)
The ballerina is very excited that she will dance in the show.
‣ Want:
‣ LSTMs/CNNs: tend to look at local context
The ballerina is very excited that she will dance in the show.
‣ To appropriately contextualize embeddings, we need to pass information over long distances dynamically for each word
Self-Attention
Vaswani et al. (2017)
the movie was great
‣ Each word forms a "query" which then computes attention over each word
‣ Multiple "heads" analogous to different convolutional filters. Use parameters Wk and Vk to get different attention values + transform vectors
\alpha_{i,j} = \mathrm{softmax}(x_i^\top x_j)   (scalar)
x_i' = \sum_{j=1}^{n} \alpha_{i,j} x_j   (vector = sum of scalar * vector)
With multiple heads:
\alpha_{k,i,j} = \mathrm{softmax}(x_i^\top W_k x_j)
x_{k,i}' = \sum_{j=1}^{n} \alpha_{k,i,j} V_k x_j
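‣ A sketch of these per-head equations (two heads, toy sizes; no scaling or output projection, so this is the bare form above rather than the full Transformer layer):
```python
import numpy as np

def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d, n_heads = 4, 8, 2
X = rng.normal(size=(n, d))                     # word vectors x_1..x_n ("the movie was great")

head_outputs = []
for k in range(n_heads):
    W_k = rng.normal(size=(d, d))               # per-head attention parameters
    V_k = rng.normal(size=(d, d))               # per-head value transform
    alpha_k = softmax_rows(X @ W_k @ X.T)       # alpha_{k,i,j} = softmax_j(x_i^T W_k x_j)
    head_outputs.append(alpha_k @ (X @ V_k.T))  # x'_{k,i} = sum_j alpha_{k,i,j} V_k x_j
X_prime = np.concatenate(head_outputs, axis=-1) # concatenate heads: n x (n_heads * d)
print(X_prime.shape)
```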
What can self-attention do?
Vaswani et al. (2017)
The ballerina is very excited that she will dance in the show.
‣ Why multiple heads? Softmaxes end up being peaked; a single distribution cannot easily put weight on multiple things
0.5 0.2 0.1 0.1 0.1 0 0 0 0 0 0 0
‣ This is a demonstration; we will revisit what these models actually learn when we discuss BERT
‣ Attend nearby + to semantically related terms
0.5 0 0.4 0 0.1 0 0 0 0 0 0 0
Transformer Uses
‣ Supervised: transformer can replace LSTM as encoder, decoder, or both; will revisit this when we discuss MT
‣ Unsupervised: transformers work better than LSTMs for unsupervised pre-training of embeddings: predict word given context words
‣ BERT (Bidirectional Encoder Representations from Transformers): pretraining transformer language models similar to ELMo
‣ Stronger than similar methods, SOTA on ~11 tasks (including NER: 92.8 F1)
Takeaways
‣ Attention is very helpful for seq2seq models
‣ Used for tasks including data-to-text generation and summarization
‣ Explicitly copying input can be beneficial as well
‣ Transformers are strong models we'll come back to later, if time