Translation with recurrent and attention models
[Figure: single-layer network mapping inputs x_1 … x_5 to outputs y_1 … y_3 through a weight matrix W.]

y = x^T W
y = g(x^T W)

"Soft max": g(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})
"Neural" Networks

What about probabilities? Let each y_i be an outcome.
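The "soft max" that turns arbitrary scores into a probability distribution can be sketched in a few lines of NumPy (the max-subtraction is an added trick for numerical stability, not part of the slide's formula):

```python
import numpy as np

def softmax(u):
    """g(u)_i = exp(u_i) / sum_{i'} exp(u_{i'})"""
    e = np.exp(u - u.max())  # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
# p is a valid probability distribution: nonnegative, sums to 1,
# and preserves the ordering of the input scores.
```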
"Deep"

[Figure: two-layer network, inputs x_1 … x_5, hidden layer y_1 … y_3 (weights W), outputs z_1 … z_4 (weights V).]

y = g(x^T W)
z = g(y^T V)
z = g(h(x^T W)^T V)
z = g(V h(W x))

Note: if g(x) = h(x) = x, then z = (V W) x = U x with U = V W; without nonlinearities, stacked layers collapse to a single linear map.

"Soft max": s(u)_i = exp(u_i) / Σ_{i'} exp(u_{i'})
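The collapse of identity-activation layers can be checked numerically; a minimal sketch with random matrices (shapes chosen to match the figure):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
W = rng.standard_normal((3, 5))  # input -> hidden weights
V = rng.standard_normal((4, 3))  # hidden -> output weights

# With identity "activations" g = h = id, two layers...
z_two_layers = V @ (W @ x)
# ...equal one layer with U = VW:
U = V @ W
z_one_layer = U @ x
```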
Bengio et al. (2003)

p(e) = Π_{i=1}^{|e|} p(e_i | e_{i−n+1}, …, e_{i−1})

[Figure: feed-forward n-gram LM computing p(e_i | e_{i−n+1}, …, e_{i−1}): the context words e_{i−1}, e_{i−2}, e_{i−3} are embedded by a shared matrix C, passed through a tanh hidden layer (weights W, V), and a softmax over the vocabulary predicts e_i.]
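A toy forward pass through this architecture might look as follows (the dimensions and the helper name `ngram_lm_probs` are illustrative, not Bengio et al.'s actual hyperparameters; C, W, V follow the figure's labels):

```python
import numpy as np

V_size, d, n, h = 10, 4, 4, 8             # vocab, embedding dim, n-gram order, hidden dim
rng = np.random.default_rng(0)
C = rng.standard_normal((V_size, d))      # shared embedding matrix C
W = rng.standard_normal(((n - 1) * d, h)) # embeddings -> hidden
V = rng.standard_normal((h, V_size))      # hidden -> vocabulary scores

def ngram_lm_probs(context_ids):
    """p(e_i | e_{i-n+1}, ..., e_{i-1}) for one (n-1)-word context."""
    x = np.concatenate([C[w] for w in context_ids])  # embed and concatenate
    hid = np.tanh(x @ W)
    u = hid @ V
    u -= u.max()                                     # numerical stability
    return np.exp(u) / np.exp(u).sum()               # softmax over the vocabulary

probs = ngram_lm_probs([1, 5, 7])
```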
1. Basics: Neural Network · Computation Graph · Why Deep is Hard · 2006 Breakthrough
2. Building Blocks: RBM · Auto-Encoders · Recurrent Units · Convolution
3. Tricks: Optimization · Regularization · Infrastructure
4. New Stuff: Encoder-Decoder · Attention/Memory · Deep Reinforcement · Variational Auto-Encoder
Training Neural Nets: Back-propagation

[Figure: network with inputs x_1 … x_4, hidden units h_1 … h_3, and output y; weights w_ij connect x_i to h_j and weights w_j connect h_j to y. Forward pass predicts f(x^(m)); errors flow backward to adjust the weights.]

1. For each sample, compute
   f(x^(m)) = σ(Σ_j w_j · σ(Σ_i w_ij x_i^(m)))
2. If f(x^(m)) ≠ y^(m), back-propagate the error and adjust the weights {w_ij, w_j}.
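The two steps above can be sketched for this exact 4-3-1 network with sigmoid units (the learning rate and iteration count are arbitrary demo choices):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Toy network matching the figure: 4 inputs, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
w_ij = rng.standard_normal((4, 3)) * 0.5  # weights x_i -> h_j
w_j = rng.standard_normal(3) * 0.5        # weights h_j -> y
x, y = np.array([1.0, 0.0, 1.0, 0.0]), 1.0
lr = 0.5

f0 = sigmoid(sigmoid(x @ w_ij) @ w_j)     # prediction before training

for _ in range(500):
    h = sigmoid(x @ w_ij)                 # 1. forward pass
    f = sigmoid(h @ w_j)
    delta_out = (f - y) * f * (1 - f)     # 2. back-propagate; sigma' = sigma(1-sigma)
    delta_hid = delta_out * w_j * h * (1 - h)
    w_j -= lr * delta_out * h             #    adjust {w_j}
    w_ij -= lr * np.outer(x, delta_hid)   #    adjust {w_ij}

f1 = sigmoid(sigmoid(x @ w_ij) @ w_j)     # prediction after training
```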
Derivatives of the weights

Assume two outputs (y_1, y_2) per input x, and loss per sample:

Loss = Σ_k ½ [σ(in_k) − y_k]²

[Figure: inputs x_1 … x_4, hidden units h_1 … h_3, outputs y_1, y_2; weights w_ij connect x_i to h_j and weights w_jk connect h_j to y_k.]

∂Loss/∂w_jk = (∂Loss/∂in_k)(∂in_k/∂w_jk) = δ_k · ∂(Σ_j w_jk h_j)/∂w_jk = δ_k h_j

∂Loss/∂w_ij = (∂Loss/∂in_j)(∂in_j/∂w_ij) = δ_j · ∂(Σ_i w_ij x_i)/∂w_ij = δ_j x_i

δ_k = ∂/∂in_k ( Σ_k ½ [σ(in_k) − y_k]² ) = [σ(in_k) − y_k] σ′(in_k)

δ_j = Σ_k (∂Loss/∂in_k)(∂in_k/∂in_j) = Σ_k δ_k · ∂/∂in_j ( Σ_j w_jk σ(in_j) ) = [Σ_k δ_k w_jk] σ′(in_j)

Since the derivatives and partials of each function are known, this is fully automatable (Torch, Theano, TensorFlow, etc.)
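One way to see why this is mechanical: the closed-form gradients above can be verified against finite differences (a sketch, with δ_k and δ_j computed exactly as on the slide):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
x = rng.standard_normal(4)
y = np.array([1.0, 0.0])
w_ij = rng.standard_normal((4, 3))
w_jk = rng.standard_normal((3, 2))

def loss(w_ij, w_jk):
    h = sigmoid(x @ w_ij)       # in_j = sum_i w_ij x_i
    out = sigmoid(h @ w_jk)     # in_k = sum_j w_jk h_j
    return 0.5 * np.sum((out - y) ** 2)

# Back-propagated gradients from the slide's formulas:
h = sigmoid(x @ w_ij)
out = sigmoid(h @ w_jk)
delta_k = (out - y) * out * (1 - out)        # [sigma(in_k) - y_k] sigma'(in_k)
delta_j = (w_jk @ delta_k) * h * (1 - h)     # [sum_k delta_k w_jk] sigma'(in_j)
grad_wjk = np.outer(h, delta_k)              # dLoss/dw_jk = delta_k h_j
grad_wij = np.outer(x, delta_j)              # dLoss/dw_ij = delta_j x_i

# Finite-difference check of one entry of each weight matrix:
eps = 1e-6
dw = np.zeros_like(w_jk); dw[1, 0] = eps
fd_jk = (loss(w_ij, w_jk + dw) - loss(w_ij, w_jk - dw)) / (2 * eps)
dw2 = np.zeros_like(w_ij); dw2[2, 1] = eps
fd_ij = (loss(w_ij + dw2, w_jk) - loss(w_ij - dw2, w_jk)) / (2 * eps)
```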
LMs in MT

p(e, f) = p_LM(e) × p_TM(f | e)
        = Π_{i=1}^{|e|} p(e_i | e_{i−1}, …, e_1) × Π_{j=1}^{|f|} p(f_j | f_{j−1}, …, f_1, e)
Conditional language model
[Figure: attention decoder unrolled over time. The source sentence "Ich möchte ein Bier" is encoded; starting from state 0, the decoder emits "I'd", "like", "a", "beer", STOP, one word per step, feeding each output back in as the next input. The attention history a_1^T, …, a_5^T records the attention weights used at each step.]
p(e) = Π_{i=1}^{|e|} p(e_i | e_{i−1}, …, e_1)

e = (Economic, growth, has, slowed, down, in, recent, years, .)
[Figure: RNN encoder-decoder. Encoder: each source word w_i gets a 1-of-K coding and a continuous-space word representation s_i, which updates a recurrent state h_i. Decoder: a recurrent state z_i yields a word probability p_i, from which a word u_i is sampled.]

f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .)

(Forcada & Ñeco, 1997; Kalchbrenner & Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014)
p(f | e) = Π_{i=1}^{|f|} p(f_i | f_{i−1}, …, f_1, e)
[Figure: bidirectional encoder over "Ich möchte ein Bier" (x_1 … x_4). A forward RNN computes states →h_1 … →h_4; a backward RNN computes ←h_1 … ←h_4.]

Each source word is represented by concatenating the two states:

f_i = [←h_i; →h_i]

Stacking these as columns gives the source matrix F ∈ R^{2n×|f|}.
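A sketch of the bidirectional encoder, with a plain tanh RNN standing in for the actual recurrent unit (all dimensions are made up for illustration):

```python
import numpy as np

n, d, T = 3, 5, 4                    # state dim n, embedding dim d, |f| = 4 words
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))      # embeddings of "Ich möchte ein Bier"
Wx = rng.standard_normal((d, n))
Wh = rng.standard_normal((n, n))

def run_rnn(inputs):
    """Simple tanh RNN; returns the state after each input."""
    h, states = np.zeros(n), []
    for x in inputs:
        h = np.tanh(x @ Wx + h @ Wh)
        states.append(h)
    return states

fwd = run_rnn(X)                     # ->h_1 ... ->h_T
bwd = run_rnn(X[::-1])[::-1]         # <-h_1 ... <-h_T (re-reversed to align)
# f_i = [<-h_i ; ->h_i]; the columns form F in R^{2n x |f|}
F = np.stack([np.concatenate([b, f]) for b, f in zip(bwd, fwd)], axis=1)
```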
Computing Attention

• At each time step (one time step = one output word), we want to be able to "attend" to different words in the source sentence
• We need a weight for every word: this is an |f|-length vector a_t
• Here is a simplified version of Bahdanau et al.'s solution:
  • Use an RNN to predict the model output; call its hidden states s_t (s_t has a fixed dimensionality, call it m)
  • At time t, compute the expected input embedding r_t = V s_{t−1} (V is a learned parameter)
  • Take the dot product with every column in the source matrix to compute the attention energies u_t = F^T r_t (since F has |f| columns, u_t has |f| rows; called e_t in the paper)
  • Exponentiate and normalize to 1: a_t = softmax(u_t) (called α_t in the paper)
  • Finally, the input source vector for time t is c_t = F a_t
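Putting the steps together in NumPy (toy dimensions; the names F, V, s, r, u, a, c follow the slide):

```python
import numpy as np

m, n, f_len = 4, 3, 5                    # decoder state dim m; source vectors have dim 2n
rng = np.random.default_rng(0)
F = rng.standard_normal((2 * n, f_len))  # source matrix, one column per source word
V = rng.standard_normal((2 * n, m))      # learned parameter
s_prev = rng.standard_normal(m)          # decoder state s_{t-1}

r_t = V @ s_prev                         # expected input embedding
u_t = F.T @ r_t                          # attention energies (e_t in the paper)
a_t = np.exp(u_t - u_t.max())
a_t /= a_t.sum()                         # softmax -> attention weights (alpha_t)
c_t = F @ a_t                            # attention-weighted source vector for time t
```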
English-French
English-German
I confess that I speak neither French nor German. Sorry!
Summary

• Attention is closely related to "pooling" operations in convnets (and other architectures)
• Bahdanau's attention model seems to care only about "content"
  • No obvious bias in favor of diagonals, short jumps, fertility, etc.
  • Some work has begun to add other "structural" biases (Luong et al., 2015; Cohn et al., 2016), but there are many more opportunities
• Attention is similar to alignment, but there are important differences
  • Alignment makes stochastic but hard decisions: even if the alignment probability distribution is "flat", the model picks one word or phrase at a time
  • Attention is "soft" (you add together all the words): there is a big difference between "flat" and "peaked" attention weights
Summary II

• Nonlinear classification, automatic feature induction, and automatic differentiation are good.
• But just how good are they?