
Attention
DLAI – MARTA R. COSTA-JUSSÀ

SLIDES ADAPTED FROM GRAHAM NEUBIG'S LECTURES

What advancements excite you most in the field? I am very excited by the recently introduced attention models, due to their simplicity and due to the fact that they work so well. Although these models are new, I have no doubt that they are here to stay, and that they will play a very important role in the future of deep learning.

ILYA SUTSKEVER, RESEARCH DIRECTOR AND CO-FOUNDER OF OPENAI

Outline
1. Sequence modeling & sequence-to-sequence models [WRAP-UP FROM THE PREVIOUS RNNs SESSION]
2. Attention-based mechanism
3. Attention varieties
4. Attention improvements
5. Applications
6. "Attention is all you need"
7. Summary

Sequence modeling
Model the probability of sequences of words

From the previous lecture… we model sequences with RNNs

[Figure: an RNN language model predicting p(I'm), p(fine | I'm), p(. | fine), EOS from the input <s> I'm fine .]
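A minimal sketch (not from the slides) of what "modeling the probability of a sequence with an RNN" looks like: the sentence is scored as p(I'm) · p(fine | I'm) · p(. | fine) · …, with each factor produced by an RNN step. The vocabulary, sizes and random weights below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "I'm", "fine", ".", "</s>"]
V, H = len(vocab), 8                      # toy vocabulary and hidden sizes
E   = rng.normal(size=(V, H))             # word embeddings
W_h = rng.normal(size=(H, H))             # recurrent weights
W_o = rng.normal(size=(H, V))             # output projection

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

sentence = ["<s>", "I'm", "fine", ".", "</s>"]
h, log_prob = np.zeros(H), 0.0
for prev, nxt in zip(sentence[:-1], sentence[1:]):
    h = np.tanh(E[vocab.index(prev)] + W_h @ h)   # RNN state update
    p = softmax(h @ W_o)                          # distribution over the next word
    log_prob += np.log(p[vocab.index(nxt)])       # accumulate log p(next | history)
print("log p(sentence) =", log_prob)
```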

Sequence-to-sequence models

[Figure: an encoder reads "how are you ?" into a single THOUGHT/CONTEXT VECTOR, and a decoder generates "¿ Cómo estás ?" from <s> until EOS]
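A minimal sketch, under the same kind of toy assumptions, of the plain encoder-decoder above: the encoder compresses the whole source sentence into one thought/context vector, and the decoder only ever sees that single vector.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                     # toy hidden size
W_enc = rng.normal(size=(H, H))           # encoder recurrent weights
W_dec = rng.normal(size=(H, H))           # decoder recurrent weights

def encode(source_vectors):
    h = np.zeros(H)
    for x in source_vectors:              # read e.g. "how are you ?"
        h = np.tanh(x + W_enc @ h)
    return h                              # ONE fixed-size thought/context vector

def decode_step(context, h_dec):
    # the previously generated word is omitted here for brevity
    return np.tanh(context + W_dec @ h_dec)   # every step sees the same context

source  = [rng.normal(size=H) for _ in range(4)]   # stand-ins for source word vectors
context = encode(source)
h = np.zeros(H)
for _ in range(4):                        # generate e.g. "¿ Cómo estás ?" step by step
    h = decode_step(context, h)
```

However long the input is, the decoder is fed the same fixed-size context vector, which hints at the question on the next slide.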

Any problem with these models?

2. Attention-based mechanism

Motivation in the case of MT

Attention

[Figure: encoder-decoder where the encoder states are combined (+) into the vector used by the decoder]

Attention allows the model to use multiple vectors, based on the length of the input.

Attention Key Ideas
• Encode each word in the input and output sentence into a vector

• When decoding, perform a linear combination of these vectors, weighted by "attention weights"

• Use this combination in picking the next word

Attention Computation I
• Use a "query" vector (the decoder state) and "key" vectors (all encoder states)

• For each query-key pair, calculate a weight

• Normalize the weights to add to one using a softmax

[Figure: a query vector scored against four key vectors gives a1=2.1, a2=-0.1, a3=0.3, a4=-1.0; after the softmax, a1=0.5, a2=0.3, a3=0.1, a4=0.1]
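A minimal sketch of this computation with toy numbers; the dot product is used as the score function here purely as an assumption (the "Attention Score Functions" slide below lists the alternatives).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

query = np.array([1.0, 0.5, -0.2, 0.3])                # "query" vector: the decoder state
keys  = np.stack([np.array([ 1.2,  0.8, -0.1,  0.4]),  # "key" vectors: the encoder states
                  np.array([-0.3,  0.1,  0.9,  0.2]),
                  np.array([ 0.5, -0.6,  0.3,  0.1]),
                  np.array([-0.9,  0.2, -0.4,  0.7])])

scores  = keys @ query          # one unnormalized weight per query-key pair
weights = softmax(scores)       # normalize so the weights add to one
print(weights, weights.sum())   # four attention weights summing to 1.0
```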

Attention Computation II
• Combine together the value vectors (usually the encoder states, like the key vectors) by taking the weighted sum

[Figure: the four value vectors are multiplied by the weights a1=0.5, a2=0.3, a3=0.1, a4=0.1 and summed]
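Continuing the sketch, step II takes the weighted sum of the value vectors; the values below are illustrative, and the weights are the ones shown in the figure above.

```python
import numpy as np

values  = np.stack([np.array([0.5, 0.1]),   # value vectors (usually the encoder states)
                    np.array([0.3, 0.7]),
                    np.array([0.9, 0.2]),
                    np.array([0.4, 0.4])])
weights = np.array([0.5, 0.3, 0.1, 0.1])    # attention weights from step I

context = weights @ values                  # weighted sum: a single context vector
print(context)                              # used when picking the next word
```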

Attention Score Functions (q is the query and k is the key)

• Multi-layer Perceptron: a(q, k) = w₂ᵀ tanh(W₁[q; k]). Flexible, often very good with large data. (Bahdanau et al., 2015)

• Bilinear: a(q, k) = qᵀ W k. (Luong et al., 2015)

• Dot Product: a(q, k) = qᵀ k. No parameters! But requires the sizes to be the same. (Luong et al., 2015)

• Scaled Dot Product: a(q, k) = qᵀ k / √|k|. Scale by the size of the vector. (Vaswani et al., 2017)
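The four score functions, sketched in NumPy; the matrices W₁ and W, the vector w₂ and the dimensions are illustrative assumptions, and each function returns a single unnormalized weight a(q, k).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
q, k = rng.normal(size=d), rng.normal(size=d)
W1 = rng.normal(size=(d, 2 * d))     # MLP weights
w2 = rng.normal(size=d)              # MLP output projection
W  = rng.normal(size=(d, d))         # bilinear weights

def mlp_score(q, k):                 # Bahdanau et al., 2015
    return w2 @ np.tanh(W1 @ np.concatenate([q, k]))

def bilinear_score(q, k):            # Luong et al., 2015
    return q @ W @ k

def dot_score(q, k):                 # Luong et al., 2015; q and k must have the same size
    return q @ k

def scaled_dot_score(q, k):          # Vaswani et al., 2017; scale by the vector size
    return q @ k / np.sqrt(len(k))
```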

Attention Integration

3. Attention Varieties

Hard Attention
Instead of a soft interpolation, make a zero-one decision about where to attend (Xu et al., 2015)

Monotonic Attention
This approach "softly" prevents the model from assigning attention probability before where it attended at a previous time step, by taking into account the attention at the previous time step.

Intra-Attention / Self-Attention
Each element in the sentence attends to other elements from the SAME sentence → context-sensitive encodings!
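A minimal self-attention sketch, assuming random vectors as stand-ins for the word vectors of one sentence and scaled dot-product scores; every position attends to every position of the same sentence.

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 words of ONE sentence, 8-dim vectors

scores   = X @ X.T / np.sqrt(X.shape[-1])    # each word scored against all words
weights  = softmax_rows(scores)              # one attention distribution per word
contexts = weights @ X                       # context-sensitive encoding of each word
```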

Multiple Sources
Attend to multiple sentences (Zoph et al., 2015)

Attend to a sentence and an image (Huang et al., 2016)

Multi-headed Attention I
Multiple attention "heads" focus on different parts of the sentence

a(q, k) = qᵀ k / √|k|

Multi-headed Attention II
Multiple attention "heads" focus on different parts of the sentence

E.g. multiple independently learned heads (Vaswani et al., 2017)

a(q, k) = qᵀ k / √|k|
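A minimal sketch of multiple independently learned heads, loosely in the spirit of Vaswani et al. (2017); the head count, dimensions and projection matrices are illustrative assumptions, and the output projection, masking and layer normalization are omitted.

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_heads, d_model = 2, 8
d_head = d_model // n_heads
X = rng.normal(size=(5, d_model))            # 5 positions to encode

heads = []
for _ in range(n_heads):
    # each head has its own independently learned query/key/value projections
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax_rows(Q @ K.T / np.sqrt(d_head))   # scaled dot-product attention
    heads.append(A @ V)                           # this head's view of the sentence

output = np.concatenate(heads, axis=-1)      # (5, d_model): heads concatenated
```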

4. Improvements in Attention
IN THE CONTEXT OF MT

Coverage
Problem: Neural models tend to drop or repeat content.

In MT,
1. Over-translation: some words are unnecessarily translated multiple times;
2. Under-translation: some words are mistakenly left untranslated.

SRC: Señor Presidente, abre la sesión.
TRG: Mr President Mr President Mr President.

Solution: Model how many times words have been covered, e.g. by maintaining a coverage vector to keep track of the attention history (Tu et al., 2016)
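A simplified sketch of the coverage idea: keep a running sum of past attention weights and discourage re-attending to source words that are already covered. The additive penalty used here is an illustrative assumption, not the exact formulation of Tu et al. (2016).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
src_len  = 4
coverage = np.zeros(src_len)                 # attention received so far by each source word

def attend(scores, coverage, penalty=1.0):
    weights = softmax(scores - penalty * coverage)   # discourage already-covered words
    return weights, coverage + weights               # update the coverage vector

for _ in range(3):                           # three decoding steps with toy scores
    scores = rng.normal(size=src_len)
    weights, coverage = attend(scores, coverage)
# `coverage` now records how much attention each source word has received overall.
```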

Incorporating Markov Properties
Intuition: Attention from the last time step tends to be correlated with attention at this time step

Approach: Add information about the last attention when making the next decision

Bidirectional Training
- Background: It is established that for latent variable translation models the alignments improve if both directional models are combined (Koehn et al., 2005)

- Approach: joint training of two directional models

Supervised Training
Sometimes we can get "gold standard" alignments a priori:
◦ Manual alignments
◦ Pre-trained with a strong alignment model

Train the model to match these strong alignments

5. Applications

Chatbots
A computer program that conducts a conversation

Human: what is your job
Enc-dec: i'm a lawyer
Human: what do you do ?
Enc-dec: i'm a doctor .

[Figure: encoder-decoder with attention reading "what is your job" and generating "I'm a lawyer"]

Natural Language Inference

Other NLP Tasks
Text summarization: the process of shortening a text document with software to create a summary with the major points of the original document.

Question answering: automatically producing an answer to a question given a corresponding document.

Semantic parsing: mapping natural language into a logical form that can be executed on a knowledge base and return an answer.

Syntactic parsing: the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar.

Image Captioning I

[Figure: an encoder-decoder generating the caption "a cat on the mat" word by word from an image]

Image Captioning II

Other Computer Vision Tasks with Attention
Visual question answering: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.

Video caption generation: attempts to generate a complete and natural sentence, enriching the single label as in video classification, to capture the most informative dynamics in videos.

Speech recognition / translation

6. "Attention is all you need"
SLIDES BASED ON HTTPS://RESEARCH.GOOGLEBLOG.COM/2017/08/TRANSFORMER-NOVEL-NEURAL-NETWORK.HTML

Motivation
Sequential nature of RNNs → difficult to take advantage of modern computing devices such as TPUs (Tensor Processing Units)

Transformer

"I arrived at the bank after crossing the river"

Transformer I

[Figure: the Transformer encoder and decoder]

Transformer II

Transformer results

Attention weights

7. Summary

RNNs and Attention
RNNs are used to model sequences

Attention is used to enhance the modeling of long sequences

The versatility of these models allows them to be applied to a wide range of applications

Implementations of Encoder-Decoder: LSTM, CNN

Attention-based mechanisms
Soft vs Hard: soft attention weights all pixels, hard attention crops the image and forces attention only on the kept part.

Global vs Local: a global approach always attends to all source words, while a local one only looks at a subset of source words at a time.

Intra vs External: intra-attention is within the encoder's input sentence, external attention is across sentences.

One large encoder-decoder
• Text, speech, images… is everything converging to a single paradigm?

• If you know how to build a neural MT system, you may easily learn how to build a speech-to-text recognition system...

• Or you may train them together to achieve zero-shot AI.

*And other references on this research direction…

Research going on… Interested? marta.ruiz@upc.edu

Q&A?

Quiz
1. Mark all statements that are true

A. Sequence modeling only refers to language applications

B. The attention mechanism can be applied to an encoder-decoder architecture

C. Neural machine translation systems require recurrent neural networks

D. If we want to have a fixed representation (thought vector), we cannot apply attention-based mechanisms

2. Given the query vector q = [], the key vector 1 k1 = [] and the key vector 2 k2 = [].

A. What are the attention weights 1 & 2 when computing the dot product?

B. And when computing the scaled dot product?

C. To which key vector are we giving more attention?

D. What is the advantage of computing the scaled dot product?