Attention – DLAI – MARTA R. COSTA-JUSSÀ (SLIDES ADAPTED FROM GRAHAM NEUBIG’S LECTURES)


Page 1:

Attention – DLAI – MARTA R. COSTA-JUSSÀ

SLIDES ADAPTED FROM GRAHAM NEUBIG’S LECTURES

Page 2:

What advancements excite you most in the field? I am very excited by the recently introduced attention models, due to their simplicity and due to the fact that they work so well. Although these models are new, I have no doubt that they are here to stay, and that they will play a very important role in the future of deep learning.

ILYA SUTSKEVER, RESEARCH DIRECTOR AND CO-FOUNDER OF OPENAI

Page 3:

Outline
1. Sequence modeling & sequence-to-sequence models [WRAP-UP FROM PREVIOUS RNNs SESSION]
2. Attention-based mechanism
3. Attention varieties
4. Attention improvements
5. Applications
6. “Attention is all you need”
7. Summary

Page 4:

Sequence modeling
Model the probability of sequences of words.

From the previous lecture… we model sequences with RNNs:

[Diagram: an RNN reads “<s> I’m fine .” and outputs p(I’m), p(fine|I’m), p(.|fine), then EOS]
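The factorization the diagram illustrates can be sketched numerically. A minimal sketch, assuming made-up conditional probabilities rather than a trained RNN:

```python
# Sketch of sequence modeling: the probability of a sentence factors into
# per-step conditionals, p(w1) * p(w2|w1) * ... (chain rule).
# The conditional probabilities below are made-up illustrative values.
cond_probs = {
    ("<s>",): {"I'm": 0.20},
    ("<s>", "I'm"): {"fine": 0.30},
    ("<s>", "I'm", "fine"): {".": 0.50},
}

def sequence_probability(words):
    """Multiply the conditional probability of each word given its prefix."""
    p = 1.0
    for i, w in enumerate(words):
        prefix = ("<s>",) + tuple(words[:i])
        p *= cond_probs[prefix][w]
    return p

p = sequence_probability(["I'm", "fine", "."])
print(p)  # 0.2 * 0.3 * 0.5 = 0.03
```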

Page 5:

Sequence-to-sequence models

[Diagram: an encoder RNN reads “how are you ?” and compresses it into a single THOUGHT/CONTEXT VECTOR; a decoder RNN then generates “¿ Cómo estás ? EOS” starting from “<s>”]

Page 6:

Any problem with these models?

Page 7:

Page 8:

2. Attention-based mechanism

Page 9:

Motivation in the case of MT

Page 10:

Motivation in the case of MT

Page 11:

Attention

[Diagram: the decoder combines (+) multiple encoder states instead of relying on a single context vector]

Attention allows the model to use multiple vectors, based on the length of the input.

Page 12:

Attention Key Ideas
• Encode each word in the input and output sentence into a vector
• When decoding, perform a linear combination of these vectors, weighted by “attention weights”
• Use this combination in picking the next word

Page 13:

Attention Computation I
• Use a “query” vector (decoder state) and “key” vectors (all encoder states)
• For each query-key pair, calculate a weight
• Normalize to add to one using softmax

[Diagram: the query vector is scored against each key vector, giving a1=2.1, a2=-0.1, a3=0.3, a4=-1.0; after softmax the weights are a1=0.5, a2=0.3, a3=0.1, a4=0.1]
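The score-then-normalize step can be sketched in plain Python. Note that the slide’s post-softmax weights are illustrative; the exact softmax of these scores differs:

```python
import math

def softmax(scores):
    """Exponentiate and normalize so the weights add to one."""
    # Subtract the max for numerical stability (does not change the result).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Query-key scores from the slide: a1=2.1, a2=-0.1, a3=0.3, a4=-1.0
weights = softmax([2.1, -0.1, 0.3, -1.0])
print([round(w, 3) for w in weights])  # the weights sum to one
```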

Page 14:

Attention Computation II
• Combine together the value vectors (usually encoder states, like the key vectors) by taking the weighted sum

[Diagram: each value vector is multiplied (*) by its weight a1=0.5, a2=0.3, a3=0.1, a4=0.1 and the results are summed]
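The weighted sum can be sketched with the slide’s weights and some hypothetical 2-d value vectors:

```python
# Weighted sum of value vectors into a single context vector.
# The weights come from the slide; the 2-d value vectors are hypothetical.
weights = [0.5, 0.3, 0.1, 0.1]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

context = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
print(context)  # approximately [0.8, 0.4]
```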

Page 15:

Attention Score Functions (q is the query and k is the key)

Multi-layer Perceptron (Bahdanau et al., 2015):
a(q, k) = w2ᵀ tanh(W1[q; k])
Flexible, often very good with large data

Bilinear (Luong et al., 2015):
a(q, k) = qᵀ W k

Dot Product (Luong et al., 2015):
a(q, k) = qᵀ k
No parameters! But requires the sizes to be the same

Scaled Dot Product (Vaswani et al., 2017):
a(q, k) = qᵀ k / √|k|
Scale by the size of the vector
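A minimal sketch of the four score functions, assuming plain-list vectors and hypothetical parameter matrices W, W1 and w2:

```python
import math

def dot(q, k):
    # Inner product of two equal-length vectors.
    return sum(qi * ki for qi, ki in zip(q, k))

def dot_product_score(q, k):
    """Dot product: a(q, k) = q^T k. No parameters, but sizes must match."""
    return dot(q, k)

def scaled_dot_product_score(q, k):
    """Scaled dot product: a(q, k) = q^T k / sqrt(|k|)."""
    return dot(q, k) / math.sqrt(len(k))

def bilinear_score(q, W, k):
    """Bilinear: a(q, k) = q^T W k, with a learned matrix W."""
    Wk = [dot(row, k) for row in W]
    return dot(q, Wk)

def mlp_score(q, k, W1, w2):
    """Multi-layer perceptron: a(q, k) = w2^T tanh(W1 [q; k])."""
    qk = q + k  # concatenation [q; k]
    hidden = [math.tanh(dot(row, qk)) for row in W1]
    return dot(w2, hidden)

q, k = [1.0, 0.0], [0.5, 0.5]
print(dot_product_score(q, k))                    # 0.5
print(round(scaled_dot_product_score(q, k), 4))  # 0.5 / sqrt(2) = 0.3536
```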

Page 16:

Attention Integration

Page 17:

Attention Integration

Page 18:

3. Attention Varieties

Page 19:

Hard Attention
Instead of a soft interpolation, make a zero-one decision about where to attend (Xu et al., 2015)

Page 20:

Monotonic Attention
This approach “softly” prevents the model from assigning attention probability before where it attended at a previous time step, by taking into account the attention at the previous time step.

Page 21:

Intra-Attention / Self-Attention
Each element in the sentence attends to other elements from the SAME sentence → context-sensitive encodings!
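A minimal sketch of self-attention over toy word vectors, assuming scaled dot-product scoring; the embeddings are made up for illustration:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each position queries all positions of the SAME sentence and
    returns a context-sensitive encoding for every word."""
    d = len(vectors[0])
    encodings = []
    for q in vectors:  # every word vector takes a turn as the query
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        encodings.append([sum(w * v[j] for w, v in zip(weights, vectors))
                          for j in range(d)])
    return encodings

# Toy 2-d "embeddings" for a three-word sentence (hypothetical values).
encodings = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(encodings), len(encodings[0]))  # 3 2
```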

Page 22:

Multiple Sources
Attend to multiple sentences (Zoph et al., 2015)

Attend to a sentence and an image (Huang et al., 2016)

Page 23:

Multi-headed Attention I
Multiple attention “heads” focus on different parts of the sentence

a(q, k) = qᵀ k / √|k|

Page 24:

Multi-headed Attention II
Multiple attention “heads” focus on different parts of the sentence

E.g. multiple independently learned heads (Vaswani et al., 2017)

a(q, k) = qᵀ k / √|k|
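A minimal sketch of multiple heads, assuming each head has its own (hypothetical) projection matrix and runs scaled dot-product attention independently:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single head."""
    d = len(query)
    scores = [sum(qi * ki for qi, ki in zip(query, k)) / math.sqrt(d)
              for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def multi_head(query, keys, values, head_projections):
    """Each head applies its own projection, attends independently, and
    the per-head outputs are concatenated."""
    outputs = []
    for P in head_projections:
        def project(v, P=P):
            return [sum(p * x for p, x in zip(row, v)) for row in P]
        outputs.extend(attend(project(query),
                              [project(k) for k in keys],
                              [project(v) for v in values]))
    return outputs

# Two hypothetical heads, each projecting 2-d inputs down to one dimension,
# so each head can "focus on" a different part of the representation.
heads = [[[1.0, 0.0]], [[0.0, 1.0]]]
out = multi_head([1.0, 0.5],
                 [[1.0, 0.0], [0.0, 1.0]],
                 [[1.0, 0.0], [0.0, 1.0]],
                 heads)
print(len(out))  # 2: one concatenated output per head
```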

Page 25:

4. Improvements in Attention (IN THE CONTEXT OF MT)

Page 26:

Coverage
Problem: neural models tend to drop or repeat content.

In MT:
1. Over-translation: some words are unnecessarily translated multiple times;
2. Under-translation: some words are mistakenly left untranslated.

SRC: Señor Presidente, abre la sesión.
TRG: Mr President Mr President Mr President.

Solution: model how many times words have been covered, e.g. by maintaining a coverage vector to keep track of the attention history (Tu et al., 2016)
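The coverage-vector idea can be sketched as a running sum of attention weights per source word; a real model like that of Tu et al. (2016) also feeds this vector back into the attention computation, which is omitted here:

```python
# Sketch of the coverage idea: accumulate the attention mass each source
# word has received across decoding steps. The attention distributions
# below are made-up values illustrating the over-translation failure.
source = ["Señor", "Presidente", ",", "abre", "la", "sesión", "."]
coverage = [0.0] * len(source)

steps = [  # one hypothetical attention distribution per decoding step
    [0.1, 0.8, 0.0, 0.0, 0.0, 0.1, 0.0],
    [0.1, 0.8, 0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0, 0.0, 0.1, 0.0],
]
for attn in steps:
    coverage = [c + a for c, a in zip(coverage, attn)]

over = [w for w, c in zip(source, coverage) if c > 1.5]   # attended too much
under = [w for w, c in zip(source, coverage) if c < 0.1]  # barely attended
print(over)   # ['Presidente']
print(under)  # [',', 'abre', '.']
```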

Page 27:

Incorporating Markov Properties
Intuition: attention from the last time step tends to be correlated with attention at this time step

Approach: add information about the last attention when making the next decision

Page 28:

Bidirectional Training
- Background: it is established that, for latent-variable translation models, the alignments improve if both directional models are combined (Koehn et al., 2005)

- Approach: joint training of two directional models

Page 29:

Supervised Training
Sometimes we can get “gold standard” alignments a priori:
◦ Manual alignments
◦ Pre-trained with a strong alignment model

Train the model to match these strong alignments

Page 30:

5. Applications

Page 31:

Chatbots
A computer program that conducts a conversation.

Human: what is your job
Enc-dec: i’m a lawyer
Human: what do you do ?
Enc-dec: i’m a doctor .

[Diagram: an encoder-decoder with attention reads “what is your job” and generates “I’m a lawyer EOS”]

Page 32:

Natural Language Inference

Page 33:

Other NLP Tasks
Text summarization: the process of shortening a text document with software to create a summary with the major points of the original document.

Question answering: automatically producing an answer to a question given a corresponding document.

Semantic parsing: mapping natural language into a logical form that can be executed on a knowledge base and return an answer.

Syntactic parsing: the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar.

Page 34:

Image Captioning I

[Diagram: an encoder processes the image of a cat; the decoder generates “A cat on the mat” word by word]

Page 35:

Image Captioning II

Page 36:

Other Computer Vision Tasks with Attention
Visual question answering: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Video caption generation: attempts to generate a complete and natural sentence, enriching the single label as in video classification, to capture the most informative dynamics in videos.

Page 37:

Speech recognition / translation

Page 38:

6. “Attention is all you need”
SLIDES BASED ON https://research.googleblog.com/2017/08/transformer-novel-neural-network.html

Page 39:

Motivation
The sequential nature of RNNs → difficult to take advantage of modern computing devices such as TPUs (Tensor Processing Units)

Page 40:

Transformer

“I arrived at the bank after crossing the river”

Page 41:

Transformer I

[Diagram: the Transformer encoder and decoder stacks]

Page 42:

Transformer II

Page 43:

Transformer results

Page 44:

Attention weights

Page 45:

Attention weights

Page 46:

7. Summary

Page 47:

RNNs and Attention
RNNs are used to model sequences.

Attention is used to enhance the modeling of long sequences.

The versatility of these models allows them to be applied to a wide range of applications.

Page 48:

Implementations of Encoder-Decoder: LSTM, CNN

Page 49:

Attention-based mechanisms
Soft vs Hard: soft attention weights all pixels; hard attention crops the image and forces attention only on the kept part.

Global vs Local: a global approach always attends to all source words, while a local one only looks at a subset of source words at a time.

Intra vs External: intra attention is within the encoder’s input sentence; external attention is across sentences.

Page 50:

One large encoder-decoder
• Text, speech, image… is it all converging to a signal paradigm?

• If you know how to build a neural MT system, you may easily learn how to build a speech-to-text recognition system...

• Or you may train them together to achieve zero-shot AI.

*And other references on this research direction…

Page 51:


Research going on… [email protected]

Q&A?

Page 52:

Quiz
1. Mark all statements that are true:

A. Sequence modeling only refers to language applications

B. The attention mechanism can be applied to an encoder-decoder architecture

C. Neural machine translation systems require recurrent neural networks

D. If we want to have a fixed representation (thought vector), we cannot apply attention-based mechanisms

2. Given the query vector q=[], key vector 1 k1=[] and key vector 2 k2=[]:

A. What are attention weights 1 & 2 when computing the dot product?

B. And when computing the scaled dot product?

C. To what key vector are we giving more attention?

D. What is the advantage of computing the scaled dot product?
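Since the slide leaves q, k1 and k2 blank, here is the same computation with hypothetical example vectors, to illustrate the mechanics of questions A, B and D:

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example vectors; the slide leaves q, k1 and k2 blank.
q = [1.0, 2.0]
k1 = [1.0, 0.0]
k2 = [0.0, 1.0]

dot_weights = softmax([dot(q, k1), dot(q, k2)])
scaled_weights = softmax([dot(q, k) / math.sqrt(len(k)) for k in (k1, k2)])

print(dot_weights)     # more weight on k2, since q.k2 = 2 > q.k1 = 1
print(scaled_weights)  # same ordering, but less peaked: dividing by
                       # sqrt(|k|) keeps the softmax from saturating (D)
```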
