Better Conditional Language Modeling

Chris Dyer


Page 1:

Better Conditional Language Modeling

Chris Dyer

Page 2:

Conditional LMs

A conditional language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$, given some conditioning context, $x$.

As with unconditional models, it is again helpful to use the chain rule to decompose this probability:

$$p(w \mid x) = \prod_{t=1}^{\ell} p(w_t \mid x, w_1, w_2, \ldots, w_{t-1})$$

What is the probability of the next word, given the history of previously generated words and the conditioning context $x$?
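To make the chain-rule decomposition concrete, here is a minimal sketch (not from the slides) of scoring a sequence with any model that can return the per-step distributions; `next_word_probs` is a hypothetical stand-in for such a model, and the toy uniform model is made up.

```python
import math

def sequence_log_prob(x, w, next_word_probs):
    """Score a word sequence w under a conditional LM p(w | x).

    next_word_probs(x, history) must return a dict mapping each candidate
    next word to its probability given the context x and the history so far.
    (Hypothetical interface; any concrete model fits this shape.)
    """
    history, total = [], 0.0
    for w_t in w:
        dist = next_word_probs(x, history)   # p(. | x, w_1 .. w_{t-1})
        total += math.log(dist[w_t])         # chain rule: sum of log-factors
        history.append(w_t)
    return total

# Toy usage with a fake uniform model over a 4-word vocabulary:
vocab = ["tom", "likes", "beer", "</s>"]
uniform = lambda x, hist: {w: 1.0 / len(vocab) for w in vocab}
print(sequence_log_prob("tom mag bier", ["tom", "likes", "beer", "</s>"], uniform))
```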

Page 3:

Kalchbrenner and Blunsom 2013

Encoder (source sentence): $c = \mathrm{embed}(x)$, $s = Vc$

Recurrent decoder. Recall the unconditional RNN, $h_t = g(W[h_{t-1}; w_{t-1}] + b)$, where $w_{t-1}$ enters through its embedding and $h_{t-1}$ is the recurrent connection. The conditional version adds the source encoding $s$ as a learnt bias:

$$h_t = g(W[h_{t-1}; w_{t-1}] + s + b)$$
$$u_t = P h_t + b'$$
$$p(W_t \mid x, w_{<t}) = \mathrm{softmax}(u_t)$$
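A minimal NumPy sketch of one decoder step in this setup, using a placeholder sum-of-word-vectors encoder for embed(x); the dimensions and random parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, V = 8, 6, 10                      # source embedding size, hidden size, target vocab size (toy)

# Encoder: c = embed(x) (here simply a sum of source word vectors); s = Vc
X = rng.standard_normal((4, n))         # four source word vectors
c = X.sum(axis=0)                       # embed(x)
V_mat = rng.standard_normal((d, n))
s = V_mat @ c                           # source conditioning vector

# Decoder parameters
W = rng.standard_normal((d, 2 * d))     # acts on [h_{t-1}; w_{t-1}]
b = rng.standard_normal(d)
P = rng.standard_normal((V, d))
b_out = rng.standard_normal(V)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def decoder_step(h_prev, w_prev_emb):
    """h_t = g(W[h_{t-1}; w_{t-1}] + s + b); p(W_t | x, w_<t) = softmax(P h_t + b')."""
    h_t = np.tanh(W @ np.concatenate([h_prev, w_prev_emb]) + s + b)
    return h_t, softmax(P @ h_t + b_out)

h1, p1 = decoder_step(np.zeros(d), rng.standard_normal(d))   # h_0 and the <s> embedding
print(p1.sum())                          # the distribution over the target vocabulary sums to 1
```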

Page 4:

K&B 2013: RNN Decoder

(figure: the decoder RNN unrolled for four steps, with hidden states $h_1, \ldots, h_4$ and a softmax at each step, conditioned on the source encoding $s$; the generated words are "tom likes beer ⟨/s⟩")

$$p(\text{tom} \mid s, \langle s \rangle) \times p(\text{likes} \mid s, \langle s \rangle, \text{tom}) \times p(\text{beer} \mid s, \langle s \rangle, \text{tom}, \text{likes}) \times p(\langle /s \rangle \mid s, \langle s \rangle, \text{tom}, \text{likes}, \text{beer})$$

Page 5:

K&B 2013: Encoder

How should we define $c = \mathrm{embed}(x)$? The simplest model possible: sum the source word vectors,

$$c = \sum_i x_i$$

Page 6:

K&B 2013: Problems

• The bag of words assumption is really bad (part 1):
  "Alice saw Bob." vs. "Bob saw Alice."
  "I would like some fresh bread with aged cheese." vs. "I would like some aged bread with fresh cheese."

• We are putting a lot of information inside a single vector (part 2)

Page 7:

Sutskever et al. (2014)

LSTM encoder:
$$(c_i, h_i) = \mathrm{LSTM}(x_i, c_{i-1}, h_{i-1})$$
where $(c_0, h_0)$ are parameters. The encoding is $(c_\ell, h_\ell)$, where $\ell = |x|$.

LSTM decoder (with $w_0 = \langle s \rangle$):
$$(c_{t+\ell}, h_{t+\ell}) = \mathrm{LSTM}(w_{t-1}, c_{t+\ell-1}, h_{t+\ell-1})$$
$$u_t = P h_{t+\ell} + b$$
$$p(W_t \mid x, w_{<t}) = \mathrm{softmax}(u_t)$$
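A compact PyTorch sketch of these recurrences with greedy decoding; the toy sizes, zero-initialized encoder state, and token ids are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, src_vocab, tgt_vocab, max_len = 16, 12, 12, 10   # toy sizes

src_embed = nn.Embedding(src_vocab, d)
tgt_embed = nn.Embedding(tgt_vocab, d)
enc_cell = nn.LSTMCell(d, d)        # (c_i, h_i) = LSTM(x_i, c_{i-1}, h_{i-1})
dec_cell = nn.LSTMCell(d, d)        # the decoder continues from the final encoder state
proj = nn.Linear(d, tgt_vocab)      # u_t = P h_{t+l} + b

def translate_greedy(src_ids, bos=1, eos=2):
    h = torch.zeros(1, d)           # (c_0, h_0) are learned parameters in the paper; zeros here for brevity
    c = torch.zeros(1, d)
    for i in src_ids:               # encode the source left to right
        h, c = enc_cell(src_embed(torch.tensor([i])), (h, c))
    out, prev = [], bos             # w_0 = <s>
    for _ in range(max_len):        # decode greedily instead of sampling
        h, c = dec_cell(tgt_embed(torch.tensor([prev])), (h, c))
        prev = proj(h).argmax(dim=-1).item()   # argmax of softmax(u_t)
        if prev == eos:
            break
        out.append(prev)
    return out

print(translate_greedy([3, 4, 5, 6]))           # arbitrary output: the model is untrained
```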

Pages 8-12:

Sutskever et al. (2014)

(figure, built up over several slides: the encoder reads "Aller Anfang ist schwer" and a STOP symbol into a single vector c; the decoder then starts from START and generates "Beginnings are difficult" followed by STOP, one word at a time)

Page 13:

Sutskever et al. (2014)

• Good

• RNNs deal naturally with sequences of various lengths

• LSTMs in principle can propagate gradients a long distance

• Very simple architecture!

• Bad

• The hidden state has to remember a lot of information!

Pages 14-15:

Sutskever et al. (2014): Tricks

(figure: the same encoder-decoder example as before)

Read the input sequence "backwards": +4 BLEU

Page 16:

Sutskever et al. (2014): Tricks

Use an ensemble of J independently trained models.
Ensemble of 2 models: +3 BLEU
Ensemble of 5 models: +4.5 BLEU

Decoder:
$$(c^{(j)}_{t+\ell}, h^{(j)}_{t+\ell}) = \mathrm{LSTM}^{(j)}(w_{t-1}, c^{(j)}_{t+\ell-1}, h^{(j)}_{t+\ell-1})$$
$$u^{(j)}_t = P h^{(j)}_t + b^{(j)}$$
$$u_t = \frac{1}{J} \sum_{j'=1}^{J} u^{(j')}_t$$
$$p(W_t \mid x, w_{<t}) = \mathrm{softmax}(u_t)$$
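A small NumPy illustration of the ensembling rule: average the logits $u_t^{(j)}$ from the J decoders, then softmax. The numbers are made up.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

# Logits u_t^{(j)} from J = 3 independently trained decoders at the same time step
# (made-up values: one row per model, one column per vocabulary item).
U = np.array([[2.0, 0.5, -1.0, 0.1],
              [1.5, 0.8, -0.5, 0.0],
              [2.2, 0.2, -1.2, 0.3]])

u_t = U.mean(axis=0)          # u_t = (1/J) * sum_j u_t^{(j)}
p_t = softmax(u_t)            # p(W_t | x, w_<t) = softmax(u_t)
print(p_t)
```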

Page 17:

Sutskever et al. (2014): Tricks

Use beam search: +1 BLEU

(figure: a beam search tree for x = "Bier trinke ich" ("beer drink I"); starting from ⟨s⟩ with logprob = 0, partial hypotheses over $w_1 w_2 w_3$ built from words such as "I", "drink", "beer", "wine", and "like" are expanded and ranked by their log probabilities)
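A minimal beam search sketch over an arbitrary per-step scorer; `next_log_probs`, the toy model, and the beam size are all illustrative assumptions.

```python
import math

def beam_search(next_log_probs, beam_size=4, max_len=10, bos="<s>", eos="</s>"):
    """Keep the beam_size highest-scoring partial hypotheses at each step.

    next_log_probs(prefix) must return a dict {word: log p(word | x, prefix)}.
    """
    beam = [([bos], 0.0)]                               # (hypothesis, log probability)
    for _ in range(max_len):
        candidates = []
        for hyp, score in beam:
            if hyp[-1] == eos:                          # finished hypotheses are kept as-is
                candidates.append((hyp, score))
                continue
            for w, lp in next_log_probs(hyp).items():
                candidates.append((hyp + [w], score + lp))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(h[-1] == eos for h, _ in beam):
            break
    return beam

# Toy scorer that prefers "I drink beer </s>" at the appropriate steps.
def toy(prefix):
    prefs = {1: "I", 2: "drink", 3: "beer", 4: "</s>"}
    best = prefs.get(len(prefix), "</s>")
    return {w: math.log(0.5 if w == best else 0.1)
            for w in ["I", "drink", "beer", "wine", "like", "</s>"]}

print(beam_search(toy)[0])   # (['<s>', 'I', 'drink', 'beer', '</s>'], score)
```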

Page 18:

Sutskever et al. (2014): Tricks

Use beam search: +1 BLEU
Make the beam really big: -1 BLEU (Koehn and Knowles, 2017)

Page 19:

Conditioning with vectors

We are compressing a lot of information in a finite-sized vector.

Page 20:

Conditioning with vectors

We are compressing a lot of information in a finite-sized vector.

"You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!"

Prof. Ray Mooney

Page 21:

We are compressing a lot of information in a finite-sized vector.

Gradients have a long way to travel. Even LSTMs forget!

Conditioning with vectors

Page 22:

We are compressing a lot of information in a finite-sized vector.

Gradients have a long way to travel. Even LSTMs forget!

Conditioning with vectors

What is to be done?

Page 23:

• Represent a source sentence as a matrix

• Generate a target sentence from a matrix

• This will

• Solve the capacity problem

• Solve the gradient flow problem

Solving the vector bottleneck

Page 24:

• Problem with the fixed-size vector model

• Sentences are of different sizes but vectors are of the same size

• Solution: use matrices instead

• Fixed number of rows, but number of columns depends on the number of words

• Usually |f| = #cols

Sentences as matrices (not vectors)

Pages 25-28:

Sentences as matrices

(figure, built up over several slides: example sentences of different lengths, each drawn as a matrix with a fixed number of rows and one column per word: "Ich möchte ein Bier", "Mach's gut", "Die Wahrheiten der Menschen sind die unwiderlegbaren Irrtümer")

Question: How do we build these matrices?

Page 29:

• Each word type is represented by an n-dimensional vector

• Take all of the vectors for the sentence and concatenate them into a matrix

• Simplest possible model

• So simple, no one has bothered to publish how well/badly it works!

Sentences as matrices: with concatenation

Pages 30-32:

(figure, built up over several slides: "Ich möchte ein Bier" with its word vectors $x_1, x_2, x_3, x_4$ placed side by side as columns)

$$f_i = x_i \qquad F \in \mathbb{R}^{n \times |f|}$$
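The concatenation encoder in a few lines of NumPy, with a made-up embedding table: each column of F is just the corresponding word vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                                     # embedding dimension (toy value)
sentence = "Ich möchte ein Bier".split()
vocab = {w: rng.standard_normal(n) for w in sentence}     # made-up embedding table

F = np.stack([vocab[w] for w in sentence], axis=1)        # f_i = x_i: one column per word
print(F.shape)                                            # (n, |f|) -> (5, 4)
```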

Page 33:

• By far the most widely used matrix representation, due to Bahdanau et al (2015)

• One column per word

• Each column (word) has two halves concatenated together:

• a “forward representation”, i.e., a word and its left context

• a “reverse representation”, i.e., a word and its right context

• Implementation: bidirectional RNNs (GRUs or LSTMs) to read f from left to right and right to left, concatenate representations

Sentences as matrices: with bidirectional RNNs

Pages 34-41:

(figure, built up over several slides: "Ich möchte ein Bier" with word vectors $x_1, \ldots, x_4$; a forward RNN reads left to right producing $\overrightarrow{h}_1, \ldots, \overrightarrow{h}_4$, a backward RNN reads right to left producing $\overleftarrow{h}_1, \ldots, \overleftarrow{h}_4$, and the two states for each word are concatenated into a column)

$$f_i = [\overleftarrow{h}_i; \overrightarrow{h}_i] \qquad F \in \mathbb{R}^{2n \times |f|}$$
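A PyTorch sketch of this encoder: a bidirectional LSTM reads the source once in each direction and the per-word forward and backward states are concatenated, giving F with 2n rows and one column per word. The sizes and toy token ids are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, vocab = 8, 20                                   # per-direction hidden size, toy vocab size
embed = nn.Embedding(vocab, n)
birnn = nn.LSTM(input_size=n, hidden_size=n, bidirectional=True, batch_first=True)

src = torch.tensor([[3, 7, 1, 12]])                # "Ich möchte ein Bier" as toy ids (batch of 1)
states, _ = birnn(embed(src))                      # shape (1, |f|, 2n): forward and backward states per word
F = states.squeeze(0).T                            # F in R^{2n x |f|}: one column per source word
print(F.shape)                                     # torch.Size([16, 4])
```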

Page 42:

• There are lots of ways to construct F

• More exotic architectures coming out daily

• Increasingly common goal: get rid of O(|f|) sequential processing steps, i.e., RNNs during training

• syntactic information can help (Sennrich & Haddow, 2016; Nadejde et al., 2017), but many more integration strategies are possible

• try something with phrase types instead of word types?

Multi-word expressions are a pain in the neck.

Sentences as matrices: where are we in 2018?

Page 43:

• We have a matrix F representing the input, now we need to generate from it

• Bahdanau et al. (2015) and Luong et al. (2015) concurrently proposed using attention for translating from matrix-encoded sentences

• High-level idea

• Generate the output sentence word by word using an RNN

• At each output position t, the RNN receives two inputs (in addition to any recurrent inputs)

• a fixed-size vector embedding of the previously generated output symbol $e_{t-1}$

• a fixed-size vector encoding a “view” of the input matrix

• How do we get a fixed-size vector from a matrix that changes over time?

• Bahdanau et al.: do a weighted sum of the columns of F (i.e., words) based on how important they are at the current time step (i.e., just a matrix-vector product $F a_t$; see the short sketch after this slide)

• The weighting of the input columns at each time step ($a_t$) is called attention

Conditioning on matrices
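The "view" of the input really is just a matrix-vector product: when the attention weights $a_t$ sum to 1, $c_t = F a_t$ is a weighted average of the columns (words) of F. A toy NumPy illustration with made-up numbers:

```python
import numpy as np

F = np.array([[1.0, 0.0, 0.0, 4.0],        # 2 x |f| source matrix: one column per source word
              [0.0, 2.0, 3.0, 0.0]])
a_t = np.array([0.7, 0.1, 0.1, 0.1])       # attention at step t: weights over the 4 columns, sum to 1
c_t = F @ a_t                              # fixed-size view of the input at this step
print(c_t)                                 # dominated by the first column: [1.1, 0.5]
```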

Pages 44-49:

Recall RNNs…

(figure, built up over several slides: an RNN decoder generating output word by word; it predicts "I'd", feeds "I'd" back in as the next input, then predicts "like", and so on)

Pages 50-64:

(figure, built up over many slides: attention-based decoding of "Ich möchte ein Bier". The source matrix F sits above the decoder. At each output step t the model computes attention weights over the source words, appends $a_t^\top$ to the attention history $a_1^\top, \ldots, a_5^\top$, forms the context vector $c_t = F a_t$ as a weighted sum of the columns of F, and emits the next word, feeding it back in as input: "I'd", "like", "a", "beer", STOP)

Page 65:

• How do we know what to attend to at each time-step?

• That is, how do we compute $a_t$?

Attention

Page 66:

Computing attention

• At each time step (one time step = one output word), we want to be able to "attend" to different words in the source sentence

• We need a weight for every column: this is an |f|-length vector $a_t$

• Here is a simplified version of Bahdanau et al.'s solution:

  • Use an RNN to predict the model output; call its hidden states $s_t$ ($s_t$ has a fixed dimensionality, call it m)

  • At time t compute the query, an expected input embedding: $r_t = V s_{t-1}$ ($V$ is a learned parameter)

  • Take the dot product with every column in the source matrix to compute the attention energy $u_t = F^\top r_t$ (since F has |f| columns, $u_t$ has |f| rows; called $e_t$ in the paper)

  • Exponentiate and normalize to 1: $a_t = \mathrm{softmax}(u_t)$ (called $\alpha_t$ in the paper)

  • Finally, the input source vector for time t is $c_t = F a_t$ (sketched in code after this slide)
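One step of this simplified dot-product attention in NumPy; the dimensions and random parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, two_n, f_len = 6, 8, 4                      # decoder state size, rows of F, |f|

F = rng.standard_normal((two_n, f_len))        # source matrix from the encoder
V = rng.standard_normal((two_n, m))            # learned parameter mapping s_{t-1} to a query
s_prev = rng.standard_normal(m)                # decoder hidden state s_{t-1}

r_t = V @ s_prev                               # query ("expected input embedding")
u_t = F.T @ r_t                                # attention energies, one per source word
a_t = np.exp(u_t - u_t.max()); a_t /= a_t.sum()   # softmax: exponentiate and normalize to 1
c_t = F @ a_t                                  # input source vector for time t
print(a_t, a_t.sum())                          # |f| weights that sum to 1
```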

Pages 67-70: (the same slide, stepped through bullet by bullet)

Page 71:

Computing attention

• In the actual model, Bahdanau et al. replace the dot product between the columns of F and $r_t$ with an MLP, a nonlinear additive attention model:

  $u_t = F^\top r_t$ (simple model)
  $u_t = v^\top \tanh(WF + r_t)$ (Bahdanau et al.)

• Here, W and v are learned parameters of appropriate dimension and + "broadcasts" $r_t$ over the |f| columns of WF (a small sketch follows this slide)

• This can learn more complex interactions

• It is unclear if the added complexity is necessary for good performance
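The nonlinear additive variant in NumPy: W maps every column of F into a common space, $r_t$ is broadcast over the |f| columns, and v reduces each column to a scalar energy. Again, toy random parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
two_n, hid, f_len = 8, 5, 4                        # rows of F, MLP hidden size, |f|

F = rng.standard_normal((two_n, f_len))
r_t = rng.standard_normal(hid)                     # query for this step (same size as the rows of WF)
W = rng.standard_normal((hid, two_n))              # learned
v = rng.standard_normal(hid)                       # learned

u_t = v @ np.tanh(W @ F + r_t[:, None])            # u_t = v^T tanh(WF + r_t), broadcast over columns
a_t = np.exp(u_t - u_t.max()); a_t /= a_t.sum()
print(u_t.shape, a_t)                              # (|f|,) energies -> normalized attention weights
```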

Pages 72-73: (the same slide as above, repeated)

Page 74:

Putting it all together

$F = \mathrm{EncodeAsMatrix}(f)$   (part 1 of lecture)
$s_0 = w$   (learned initial state; Bahdanau uses $U \overleftarrow{h}_1$)
$e_0 = \langle s \rangle$
$t = 0$
while $e_t \neq \langle /s \rangle$:
    $t = t + 1$
    $r_t = V s_{t-1}$
    $u_t = v^\top \tanh(WF + r_t)$
    $a_t = \mathrm{softmax}(u_t)$
    $c_t = F a_t$   (these four lines compute attention; part 2 of lecture)
    $s_t = \mathrm{RNN}(s_{t-1}, [e_{t-1}; c_t])$   ($e_{t-1}$ here denotes a learned embedding of the previously generated symbol)
    $y_t = \mathrm{softmax}(P s_t + b)$   ($P$ and $b$ are learned parameters)
    $e_t \mid e_{<t} \sim \mathrm{Categorical}(y_t)$

Pages 75-77:

Putting it all together

The same algorithm, with one observation: the term $WF$ in $u_t = v^\top \tanh(WF + r_t)$ doesn't depend on output decisions, so it can be precomputed once as $X = WF$, and the inner loop then uses $u_t = v^\top \tanh(X + r_t)$.
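A runnable (if untrained) NumPy version of the loop above, with a plain tanh recurrence standing in for the decoder RNN cell. Every parameter is random, and several names (for example Vq for the query projection, to avoid clashing with the vocabulary size) are local choices; this is a sketch of the control flow, not a faithful reimplementation of Bahdanau et al.

```python
import numpy as np

rng = np.random.default_rng(0)
two_n, m, hid, emb, V = 8, 6, 5, 4, 10        # rows of F, state size, attention MLP size, embedding size, vocab
BOS, EOS = 0, 1

F = rng.standard_normal((two_n, 3))           # F = EncodeAsMatrix(f): pretend encoder output with |f| = 3
E = rng.standard_normal((V, emb))             # target embedding table
W = rng.standard_normal((hid, two_n)); v = rng.standard_normal(hid)
Vq = rng.standard_normal((hid, m))            # the slide's V: maps s_{t-1} to the query r_t
Wr = rng.standard_normal((m, m + emb + two_n)); br = rng.standard_normal(m)   # tanh RNN cell
P = rng.standard_normal((V, m)); b = rng.standard_normal(V)
s = rng.standard_normal(m)                    # s_0 (learned in the model; random here)

def softmax(u):
    e = np.exp(u - u.max()); return e / e.sum()

X = W @ F                                     # X = WF: does not depend on output decisions
e_t, out, t = BOS, [], 0
while e_t != EOS and t < 20:                  # while e_t != </s> (plus a length cutoff)
    t += 1
    r = Vq @ s                                # r_t = V s_{t-1}
    a = softmax(v @ np.tanh(X + r[:, None]))  # u_t = v^T tanh(X + r_t); a_t = softmax(u_t)
    c = F @ a                                 # c_t = F a_t
    s = np.tanh(Wr @ np.concatenate([s, E[e_t], c]) + br)   # s_t = RNN(s_{t-1}, [e_{t-1}; c_t])
    y = softmax(P @ s + b)                    # y_t = softmax(P s_t + b)
    e_t = int(rng.choice(V, p=y))             # e_t ~ Categorical(y_t)
    out.append(e_t)
print(out)
```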

Page 78:

Add attention to seq2seq translation: +11 BLEU

Attention in MT: Results

Page 80:

A word about gradients

Pages 81-83:

(figure, repeated over three slides: the attention-based decoding of "Ich möchte ein Bier" from earlier, with its attention history $a_1^\top, \ldots, a_5^\top$; the point being illustrated is that the attention connections $c_t = F a_t$ give gradients a short, direct path from each output position back into the source matrix F, rather than only through the long recurrent chain)
Page 84:

• Cho’s question: does a translator read and memorize the input sentence/document and then generate the output?

• Compressing the entire input sentence into a vector basically says “memorize the sentence”

• Common sense experience says translators refer back and forth to the input. (also backed up by eye-tracking studies)

• Should humans be a model for machines?

Attention and translation

Page 85:

• Attention

• provides the ability to establish information flow directly from distant positions in the input

• closely related to “pooling” operations in convnets (and other architectures)

• The traditional attention model seems to care only about "content"

• No obvious bias in favor of diagonals, short jumps, fertility, etc.

• Some work has begun to add other “structural” biases (Luong et al., 2015; Cohn et al., 2016), but there are lots more opportunities

• Factorization into keys and values (Miller et al., 2016; Ba et al., 2016; Gulcehre et al., 2016)

• Attention weights provide an interpretation you can look at

Summary

Page 86:

Questions?