Page 1: Attention Is All You Need (Vaswani et al. 2017)

• Popularized self-attention

• Created the general-purpose Transformer architecture for sequence modeling

• Demonstrated computational savings over other models

Page 2: Transformers: High-Level

• Sequence-to-sequence model with encoder and decoder

[Figure: encoder and decoder blocks]

Page 3: Attention as Representations

• Attention is generally used to score existing encoder representations

• Why not use them as representations?

[Figure: attention over the example sentence "This movie rocks !"]

Page 4: Self-Attention

• Every element sees itself in its context

• The attention weight corresponds to an "importance" signal

[Figure: each token of "This movie rocks !" attends to every token of the same sentence, including itself]

Page 5: Self-Attention: Formalized

• Score the energy between a query Q and key K → scalar

• Use the softmax-ed energies to take a weighted average of the values V → scalar * vector (see the sketch below)
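
In the paper this scoring function is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. A minimal NumPy sketch of that computation; shapes and helper names here are illustrative, not from the talk:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v)
        energy = Q @ K.T / np.sqrt(K.shape[-1])   # one scalar score per query/key pair
        weights = softmax(energy, axis=-1)        # each row sums to 1
        return weights @ V                        # weighted average of the value vectors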

Page 6: Self-Attention: Example

• Score “she” (Q) against “Susan” and “the” (both K and V) in “Susan dropped the plate. She is clumsy”

[Figure: "she" attends to "Susan" with weight 0.7 and to "the" with weight 0.3; its new representation is the correspondingly weighted average of their value vectors]
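
To make the weighted average concrete, here is a toy computation matching the slide's weights; the two value vectors below are made up purely for illustration:

    import numpy as np

    v_susan = np.array([1.0, 0.0])      # made-up value vector for "Susan"
    v_the   = np.array([0.0, 1.0])      # made-up value vector for "the"
    weights = np.array([0.7, 0.3])      # the softmax-ed attention weights from the slide
    she_out = weights @ np.stack([v_susan, v_the])
    print(she_out)                      # [0.7 0.3]: dominated by "Susan"'s value vector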

Page 7: Masked Self-Attention

• Modeling temporality requires enforcing causal relationships

• Mask out illegal connections in self-attention map

[Figure: self-attention matrix over "she went to the store"; connections from each token to future tokens are masked out]
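
A hedged sketch of the masking step, following the NumPy conventions of the earlier attention sketch: illegal (future) positions get a score of negative infinity before the softmax, so their attention weight becomes zero.

    import numpy as np

    def causal_self_attention(Q, K, V):
        seq_len, d_k = Q.shape[0], K.shape[-1]
        energy = Q @ K.T / np.sqrt(d_k)
        # True above the diagonal = connections to future tokens, which are illegal.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        energy = np.where(future, -np.inf, energy)
        e = np.exp(energy - energy.max(axis=-1, keepdims=True))
        weights = e / e.sum(axis=-1, keepdims=True)   # zero weight on masked positions
        return weights @ V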

Page 8: Multi-Head Self-Attention

• Problem: Self-attention is just a weighted average; how do we model complex relationships?

[Figure: a single self-attention block with one set of Q, K, and V projections over "We went to the store at 7pm"]

Page 9: Multi-Head Self-Attention

• Solution: Use multiple self-attention heads!

[Figure: several self-attention heads, each with its own Q, K, and V projections, applied in parallel to "We went to the store at 7pm"]
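
A minimal sketch of the idea: each head has its own Q/K/V projections, attends independently, and the head outputs are concatenated and projected back to the model dimension. The sizes (d_model = 8, two heads) and the random weights stand in for learned parameters:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

    def multi_head_self_attention(X, num_heads=2):
        d_model = X.shape[-1]
        d_head = d_model // num_heads
        heads = []
        for _ in range(num_heads):
            # Each head gets its own Q/K/V projections (learned in practice, random here).
            W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
            heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
        W_o = rng.normal(size=(num_heads * d_head, d_model))   # output projection
        return np.concatenate(heads, axis=-1) @ W_o

    X = rng.normal(size=(7, 8))   # 7 tokens: "We went to the store at 7pm"
    print(multi_head_self_attention(X).shape)   # (7, 8)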

Page 10: Position-wise FFNN

• A feed-forward network mixes the multi-head self-attention outputs by operating on each position independently (sketched below)

[Figure: for each position ("we", "went", ...), the outputs of Head 1, Head 2, and Head 3 are passed through the same FFNN to give a hidden vector; the result is sequence length × hidden dim]
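
A minimal sketch of the position-wise FFNN; the paper uses two linear layers with a ReLU in between, applied identically at every position. The sizes below are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff = 8, 32                           # illustrative sizes

    W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

    def position_wise_ffnn(X):
        # X: (sequence length, d_model); the same weights are shared across positions.
        return np.maximum(0.0, X @ W1 + b1) @ W2 + b2   # ReLU between the two layers

    X = rng.normal(size=(5, d_model))
    print(position_wise_ffnn(X).shape)              # (5, 8): one output vector per position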

Page 11: Positional Embeddings

• No convolutions or recurrence; use sinusoids to inject positional information into the model

• The embedding is a function of both position and dimension
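
The paper's sinusoids are PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). A small sketch, assuming an even d_model:

    import numpy as np

    def positional_encoding(seq_len, d_model):
        pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
        i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
        angles = pos / np.power(10000.0, 2 * i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                    # even dimensions
        pe[:, 1::2] = np.cos(angles)                    # odd dimensions
        return pe                                       # added to the token embeddings

    print(positional_encoding(seq_len=10, d_model=8).shape)   # (10, 8)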

Page 12: Transformer: Full Model

[Figure: the full Transformer architecture, from Vaswani et al. (2017)]

Page 13: Results: MT

[Table: machine translation results, from Vaswani et al. (2017)]

Page 14: Results: Constituency Parsing

[Table: constituency parsing results, from Vaswani et al. (2017)]

Page 15: Why Transformers?

• Self-attention is flexible

• Highly modular and extensible

• Demonstrated strong empirical performance

Page 16: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al. 2019)

• Deep bidirectional Transformer architecture for (masked) language modeling

• Advances SOTA on 11 NLP tasks including GLUE, MNLI, and SQuAD

Page 17: Background: ELMo

• Revitalized research in pretraining: creating unsupervised tasks from large unlabeled corpora (e.g., word2vec)

[Figure: ELMo over "this movie rocks !": a character CNN produces a word embedding for each token, which feeds a forward LSTM and a backward LSTM; their states are combined into a contextual embedding]

Page 18: BERT

• Deeply bidirectional, as opposed to ELMo (only a shallow concatenation of separately trained forward and backward LMs)

• Introduces two pretraining tasks:
  • Masked Language Modeling
  • Next Sentence Prediction

Page 19: Pretraining: Masked Language Modeling

• Problem: naive bidirectional language modeling is not possible, since each token would "see" itself in context

• Solution: introduce a cloze-style task where the model tries to predict the missing word ([MASK])

[Figure: the input "[MASK] went [MASK] the store" is reconstructed to the original "we went to the store"; predictions are made at the masked positions]
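
A hedged sketch of BERT-style token masking: roughly 15% of tokens are selected, and of those, 80% become [MASK], 10% are replaced by a random token, and 10% are left unchanged. The toy vocabulary and whitespace tokenization below are made up for illustration:

    import random

    random.seed(0)
    VOCAB = ["we", "went", "to", "the", "store", "movie", "rocks"]   # toy vocabulary

    def mask_tokens(tokens, mask_prob=0.15):
        inputs, labels = [], []
        for tok in tokens:
            if random.random() < mask_prob:
                labels.append(tok)                        # predict the original token here
                r = random.random()
                if r < 0.8:
                    inputs.append("[MASK]")
                elif r < 0.9:
                    inputs.append(random.choice(VOCAB))   # random replacement
                else:
                    inputs.append(tok)                    # keep the token unchanged
            else:
                inputs.append(tok)
                labels.append(None)                       # position not predicted
        return inputs, labels

    print(mask_tokens("we went to the store".split()))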

Page 20: Pretraining: Next Sentence Prediction

• To learn inter-sentential relationships, predict whether sentence B actually follows sentence A; for negative examples, sentence B is randomly sampled (see the sketch below)

[Figure: a true pair, Sentence A "I went to the store at 7pm." followed by Sentence B "The store had lots of fruit!", versus a randomly sampled pair, Sentence A "Selena Gomez is an American singer." with Sentence B "Variational autoencoders are cool."]
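
A hedged sketch of how NSP training pairs can be built: half the time sentence B is the true next sentence (IsNext), half the time a sentence sampled from a different document (NotNext). The toy corpus reuses the slide's examples; the document structure is illustrative:

    import random

    random.seed(0)
    # Toy corpus: each document is a list of consecutive sentences.
    corpus = [
        ["I went to the store at 7pm.", "The store had lots of fruit!"],
        ["Variational autoencoders are cool."],
        ["Selena Gomez is an American singer."],
    ]

    def make_nsp_example(doc_idx, sent_idx):
        a = corpus[doc_idx][sent_idx]
        has_next = sent_idx + 1 < len(corpus[doc_idx])
        if has_next and random.random() < 0.5:
            return a, corpus[doc_idx][sent_idx + 1], "IsNext"
        other = random.choice([d for i, d in enumerate(corpus) if i != doc_idx])
        return a, random.choice(other), "NotNext"

    print(make_nsp_example(0, 0))   # 50/50 IsNext vs. a randomly sampled NotNext pair
    print(make_nsp_example(2, 0))   # always NotNext: this document has no next sentence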

Page 21: BERT: Inputs

[Figure: BERT input representation, from Devlin et al. (2019)]

Page 22: BERT: Pretraining

[Figure: BERT pretraining: the input is "[CLS] Segment A [SEP] Segment B [SEP]"; the [CLS] output feeds the NSP prediction, and the outputs at masked positions feed the [MASK] predictions]

• Sentence representations are stored in [CLS]

• Bidirectional representations are used to predict [MASK]
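
A hedged sketch of packing the pretraining input: special tokens plus segment (token type) ids distinguishing segment A from segment B. Token-to-id conversion and masking are omitted; the example segments are made up:

    def pack_bert_input(segment_a, segment_b):
        # segment_a / segment_b: lists of (possibly masked) wordpiece tokens.
        tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b + ["[SEP]"]
        # Segment ids: 0 for [CLS], segment A, and its [SEP]; 1 for segment B and its [SEP].
        segment_ids = [0] * (len(segment_a) + 2) + [1] * (len(segment_b) + 1)
        return tokens, segment_ids

    tokens, seg_ids = pack_bert_input("we went to the store".split(),
                                      "the store had fruit".split())
    print(list(zip(tokens, seg_ids)))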

Page 23: BERT: Fine-Tuning

[Figure: fine-tuning for NLI: BERT encodes "[CLS] Premise [SEP] Hypothesis [SEP]"; the [CLS] features feed an MLP that outputs class probabilities, e.g. Entailment 0.8, Contradiction 0.05, Neutral 0.15]
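
A hedged sketch of the fine-tuning head: a linear layer (the slide's MLP) over the [CLS] representation, followed by a softmax over the three NLI labels. The hidden size and random weights below are illustrative and untrained:

    import numpy as np

    rng = np.random.default_rng(0)
    hidden_size = 8
    labels = ["Entailment", "Contradiction", "Neutral"]

    W = rng.normal(size=(hidden_size, len(labels)))   # learned during fine-tuning
    b = np.zeros(len(labels))

    def classify(cls_vector):
        logits = cls_vector @ W + b
        e = np.exp(logits - logits.max())
        return dict(zip(labels, e / e.sum()))          # label -> probability

    print(classify(rng.normal(size=hidden_size)))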

Page 24: Results: GLUE

[Table: GLUE results, from Devlin et al. (2019)]

Page 25: Results: SQuAD

[Table: SQuAD results, from Devlin et al. (2019)]

Page 26: Ablation: Pretraining Tasks

• No NSP: BERT trained without next sentence prediction

• LTR & No NSP: a regular left-to-right LM without next sentence prediction

Page 27: Ablation: Model Size

• Increasing model capacity consistently improves downstream performance; this is also consistent with later work (e.g., GPT-2, RoBERTa, etc.)

Page 28: 2019: The Year of Pretraining

GPT-2 XLM XLNet RoBERTa

ELECTRA ALBERT T5 BART