Fancy Recurrent Neural Networks Richard Socher Material from cs224d.stanford.edu


Page 1: Fancy Recurrent Neural Networks

Richard Socher

Material from cs224d.stanford.edu

Page 2: Recap of most important concepts

Word2Vec

GloVe

Nnet & Max-margin

Page 3: Recap of most important concepts

Multilayer Nnet & Backprop

The second derivative in eq. 28 for output units is simply

$$\frac{\partial a^{(n_l)}_i}{\partial W^{(n_l-1)}_{ij}} = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} z^{(n_l)}_i = \frac{\partial}{\partial W^{(n_l-1)}_{ij}} \left( W^{(n_l-1)}_{i\cdot}\, a^{(n_l-1)} \right) = a^{(n_l-1)}_j. \qquad (46)$$

We adopt standard notation and introduce the error $\delta$ related to an output unit:

$$\frac{\partial E_n}{\partial W^{(n_l-1)}_{ij}} = (y_i - t_i)\, a^{(n_l-1)}_j = \delta^{(n_l)}_i\, a^{(n_l-1)}_j. \qquad (47)$$

So far, we only computed errors for output units; now we will derive $\delta$'s for normal hidden units and show how these errors are backpropagated to compute weight derivatives of lower levels. We will start with second-to-top layer weights, from which a generalization to arbitrarily deep layers will become obvious. Similar to eq. 28, we start with the error derivative:

$$\frac{\partial E}{\partial W^{(n_l-2)}_{ij}} = \sum_n \underbrace{\frac{\partial E_n}{\partial a^{(n_l)}}}_{\delta^{(n_l)}} \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} + \lambda W^{(n_l-2)}_{ij}. \qquad (48)$$

Now,

$$\begin{aligned}
(\delta^{(n_l)})^T \frac{\partial a^{(n_l)}}{\partial W^{(n_l-2)}_{ij}}
&= (\delta^{(n_l)})^T \frac{\partial z^{(n_l)}}{\partial W^{(n_l-2)}_{ij}} && (49) \\
&= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, W^{(n_l-1)} a^{(n_l-1)} && (50) \\
&= (\delta^{(n_l)})^T \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, W^{(n_l-1)}_{\cdot i}\, a^{(n_l-1)}_i && (51) \\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, a^{(n_l-1)}_i && (52) \\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, f\!\left(z^{(n_l-1)}_i\right) && (53) \\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \frac{\partial}{\partial W^{(n_l-2)}_{ij}}\, f\!\left(W^{(n_l-2)}_{i\cdot}\, a^{(n_l-2)}\right) && (54) \\
&= (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i}\, f'\!\left(z^{(n_l-1)}_i\right) a^{(n_l-2)}_j && (55) \\
&= \left( (\delta^{(n_l)})^T W^{(n_l-1)}_{\cdot i} \right) f'\!\left(z^{(n_l-1)}_i\right) a^{(n_l-2)}_j && (56) \\
&= \underbrace{\left( \sum_{k=1}^{s_{l+1}} W^{(n_l-1)}_{ki}\, \delta^{(n_l)}_k \right) f'\!\left(z^{(n_l-1)}_i\right)}_{\delta^{(n_l-1)}_i}\, a^{(n_l-2)}_j && (57) \\
&= \delta^{(n_l-1)}_i\, a^{(n_l-2)}_j && (58)
\end{aligned}$$

where we used in the first line that the top layer is linear. This is a very detailed account of essentially just the chain rule.

So, we can write the $\delta$ errors of all layers $l$ (except the top layer) in vector format, using the Hadamard product $\circ$:

$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \circ f'(z^{(l)}), \qquad (59)$$


where the sigmoid derivative from eq. 14 gives $f'(z^{(l)}) = (1 - a^{(l)})\, a^{(l)}$. Using that definition, we get the hidden layer backprop derivatives:

$$\frac{\partial}{\partial W^{(l)}_{ij}} E_R = a^{(l)}_j\, \delta^{(l+1)}_i + \lambda W^{(l)}_{ij} \qquad (60)$$

Which in one simplified vector notation becomes:

$$\frac{\partial}{\partial W^{(l)}} E_R = \delta^{(l+1)} (a^{(l)})^T + \lambda W^{(l)}. \qquad (62)$$

In summary, the backprop procedure consists of four steps:

1. Apply an input $x_n$ and forward propagate it through the network to get the hidden and output activations using eq. 18.

2. Evaluate $\delta^{(n_l)}$ for output units using eq. 42.

3. Backpropagate the $\delta$'s to obtain a $\delta^{(l)}$ for each hidden layer in the network using eq. 59.

4. Evaluate the required derivatives with eq. 62 and update all the weights using an optimization procedure such as conjugate gradient or L-BFGS. CG seems to be faster and work better when using mini-batches of training data to estimate the derivatives.
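As a concrete companion to these four steps, here is a minimal NumPy sketch of one forward/backward pass, assuming sigmoid hidden layers, a linear output layer, and L2 regularization as in eqs. 59, 60 and 62. The function and variable names (`backprop_step`, `Ws`, `lam`) are illustrative, not taken from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, Ws, lam=0.0):
    """One forward/backward pass; returns the gradients dE_R/dW for each layer.

    x   -- input vector, t -- target vector
    Ws  -- list of weight matrices, applied bottom to top
    lam -- L2 regularization strength (lambda in eqs. 60/62)
    """
    # Step 1: forward propagate, storing every activation a^(l).
    activations = [x]
    a = x
    for l, W in enumerate(Ws):
        z = W @ a
        a = z if l == len(Ws) - 1 else sigmoid(z)   # top layer is linear
        activations.append(a)

    # Step 2: error of the output units (linear output): delta = y - t.
    delta = activations[-1] - t

    # Steps 3 and 4: backpropagate deltas (eq. 59) and collect gradients (eq. 62).
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        grads[l] = np.outer(delta, activations[l]) + lam * Ws[l]
        if l > 0:
            a_below = activations[l]                           # sigmoid activation below Ws[l]
            delta = (Ws[l].T @ delta) * a_below * (1.0 - a_below)  # f'(z) = (1 - a) a
    # the returned gradients would then be handed to SGD, CG, or L-BFGS
    return grads
```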


5 Recursive Neural Networks

Same as backprop in the previous section, but splitting error derivatives and noting that the derivatives of the same W at each node can all be added up. Lastly, the deltas from the parent node and possible deltas from a softmax classifier at each node are just added.
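A minimal sketch of that accumulation for a binary tree whose nodes share one composition matrix W. The node fields (`is_leaf`, `a`, `z`, `left`, `right`, `softmax_delta_a`) and the overall setup are illustrative assumptions, not details from the notes.

```python
import numpy as np

def backprop_tree(node, delta_top, W, dW, f_prime):
    """Accumulate gradients of a shared composition matrix W over a parse tree.

    delta_top            -- error w.r.t. this node's activation, passed down by its parent
    node.softmax_delta_a -- error from a softmax classifier sitting on this node (or 0)
    """
    # Deltas from the parent and from the node's own classifier are simply added.
    delta_a = delta_top + node.softmax_delta_a
    if node.is_leaf:
        node.word_grad = delta_a            # at a leaf, the error reaches the word vector
        return
    # Internal node: z = W [a_left; a_right], a = f(z).
    delta_z = delta_a * f_prime(node.z)
    children = np.concatenate([node.left.a, node.right.a])
    dW += np.outer(delta_z, children)       # derivatives of the same W are all added up
    d_children = W.T @ delta_z              # error flowing down to both children
    n = node.left.a.shape[0]
    backprop_tree(node.left,  d_children[:n], W, dW, f_prime)
    backprop_tree(node.right, d_children[n:], W, dW, f_prime)
```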


Page 4: Recap of most important concepts

Recurrent Neural Networks

Cross Entropy Error

Mini-batched SGD

Page 5: Deep Bidirectional RNNs by Irsoy and Cardie

Going Deep

$$\overrightarrow{h}^{\,(i)}_t = f\!\left(\overrightarrow{W}^{(i)} h^{(i-1)}_t + \overrightarrow{V}^{(i)} \overrightarrow{h}^{\,(i)}_{t-1} + \overrightarrow{b}^{(i)}\right)$$

$$\overleftarrow{h}^{\,(i)}_t = f\!\left(\overleftarrow{W}^{(i)} h^{(i-1)}_t + \overleftarrow{V}^{(i)} \overleftarrow{h}^{\,(i)}_{t+1} + \overleftarrow{b}^{(i)}\right)$$

$$y_t = g\!\left(U\left[\overrightarrow{h}^{\,(L)}_t; \overleftarrow{h}^{\,(L)}_t\right] + c\right)$$

[Figure: a stacked (deep) bidirectional RNN with input x, hidden layers h(1), h(2), h(3), and output y.]

Each memory layer passes an intermediate sequential representation to the next.
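A minimal NumPy sketch of one such layer under these equations; the function name and the convention that a layer receives a list of vectors from the layer below are illustrative choices, not details from the slide.

```python
import numpy as np

def deep_birnn_layer(inputs, Wf, Vf, bf, Wb, Vb, bb, f=np.tanh):
    """One bidirectional layer: inputs is a list of T vectors h_t^(i-1) from below."""
    T, H = len(inputs), bf.shape[0]
    h_fwd, h_bwd = [None] * T, [None] * T
    # left-to-right direction, conditioned on the previous forward state
    for t in range(T):
        prev = h_fwd[t - 1] if t > 0 else np.zeros(H)
        h_fwd[t] = f(Wf @ inputs[t] + Vf @ prev + bf)
    # right-to-left direction, conditioned on the next backward state
    for t in reversed(range(T)):
        nxt = h_bwd[t + 1] if t < T - 1 else np.zeros(H)
        h_bwd[t] = f(Wb @ inputs[t] + Vb @ nxt + bb)
    # both directions are passed up as one intermediate representation per position
    return [np.concatenate([h_fwd[t], h_bwd[t]]) for t in range(T)]

# At the top layer L, the prediction would be y_t = g(U @ top_states[t] + c).
```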

Page 6: Better Recurrent Units

• More complex hidden unit computation in recurrence!
• Gated Recurrent Units (GRU) introduced by Cho et al. 2014 (see reading list)
• Main ideas:
  • keep around memories to capture long distance dependencies
  • allow error messages to flow at different strengths depending on the inputs

Page 7: GRUs

• Standard RNN computes hidden layer at next time step directly:
• GRU first computes an update gate (another layer) based on current input word vector and hidden state
• Compute reset gate similarly but with different weights
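The formulas on this slide were images in the original deck and did not survive extraction; the following is a reconstruction in the usual cs224d-style notation (the weight names $W^{(hh)}, W^{(hx)}, W^{(z)}, U^{(z)}, W^{(r)}, U^{(r)}$ are conventional, not copied from the slide):

$$h_t = f\!\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right) \quad \text{(standard RNN)}$$

$$z_t = \sigma\!\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right) \quad \text{(update gate)}$$

$$r_t = \sigma\!\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right) \quad \text{(reset gate)}$$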

Page 8: GRUs

• Update gate
• Reset gate
• New memory content: if reset gate unit is ~0, then this ignores previous memory and only stores the new word information
• Final memory at time step combines current and previous time steps:
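The corresponding formulas, again reconstructed rather than copied from the slide, using the same notation as above:

$$\tilde{h}_t = \tanh\!\left(W x_t + r_t \circ U h_{t-1}\right) \quad \text{(new memory content)}$$

$$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t \quad \text{(final memory)}$$

With this convention an update gate close to 1 keeps the previous hidden state, matching the intuition on the following slides.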

Page 9: Attempt at a clean illustration

[Figure: GRU data flow over two time steps. At each step the input x_t feeds a reset gate r_t and an update gate z_t; the reset gate produces the (reset) memory h̃_t, and the update gate combines h̃_t with h_{t-1} into the final memory h_t.]

Page 10: GRU intuition

• If reset is close to 0, ignore previous hidden state → allows model to drop information that is irrelevant in the future
• Update gate z controls how much of past state should matter now.
  • If z close to 1, then we can copy information in that unit through many time steps! Less vanishing gradient!
• Units with short-term dependencies often have reset gates very active

Page 11: GRU intuition

• Units with long term dependencies have active update gates z
• Illustration:
• Derivative of ? → rest is same chain rule, but implement with modularization or automatic differentiation

where θ is the set of the model parameters and each (x_n, y_n) is an (input sequence, output sequence) pair from the training set. In our case, as the output of the decoder, starting from the input, is differentiable, we can use a gradient-based algorithm to estimate the model parameters.

Once the RNN Encoder–Decoder is trained, the model can be used in two ways. One way is to use the model to generate a target sequence given an input sequence. On the other hand, the model can be used to score a given pair of input and output sequences, where the score is simply a probability $p_\theta(y \mid x)$ from Eqs. (3) and (4).

2.3 Hidden Unit that Adaptively Remembers and Forgets

In addition to a novel model architecture, we also propose a new type of hidden unit (f in Eq. (1)) that has been motivated by the LSTM unit but is much simpler to compute and implement.¹ Fig. 2 shows the graphical depiction of the proposed hidden unit.

Let us describe how the activation of the j-th hidden unit is computed. First, the reset gate $r_j$ is computed by

$$r_j = \sigma\!\left([W_r x]_j + \left[U_r h_{\langle t-1\rangle}\right]_j\right), \qquad (5)$$

where $\sigma$ is the logistic sigmoid function, and $[\cdot]_j$ denotes the j-th element of a vector. $x$ and $h_{t-1}$ are the input and the previous hidden state, respectively. $W_r$ and $U_r$ are weight matrices which are learned.

Similarly, the update gate $z_j$ is computed by

$$z_j = \sigma\!\left([W_z x]_j + \left[U_z h_{\langle t-1\rangle}\right]_j\right). \qquad (6)$$

The actual activation of the proposed unit $h_j$ is then computed by

$$h^{\langle t\rangle}_j = z_j h^{\langle t-1\rangle}_j + (1 - z_j)\, \tilde{h}^{\langle t\rangle}_j, \qquad (7)$$

where

$$\tilde{h}^{\langle t\rangle}_j = \phi\!\left([W x]_j + \left[U \left(r \odot h_{\langle t-1\rangle}\right)\right]_j\right). \qquad (8)$$

In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus allowing a more compact representation.

¹ The LSTM unit, which has shown impressive results in several applications such as speech recognition, has a memory cell and four gating units that adaptively control the information flow inside the unit, compared to only two gating units in the proposed hidden unit. For details on LSTM networks, see, e.g., (Graves, 2012).

[Figure 2: An illustration of the proposed hidden activation function. The update gate z selects whether the hidden state is to be updated with a new hidden state h̃. The reset gate r decides whether the previous hidden state is ignored. See Eqs. (5)–(8) for the detailed equations of r, z, h and h̃.]

On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013).

As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active.

In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful results with an oft-used tanh unit without any gating.
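For reference, a minimal NumPy sketch of one step of this unit following Eqs. (5)–(8) above; this is not part of the excerpt, the function and parameter names are illustrative, and tanh is assumed for the activation φ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wr, Ur, Wz, Uz, W, U):
    """One step of the gated hidden unit of Eqs. (5)-(8)."""
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate, Eq. (5)
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate, Eq. (6)
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))    # candidate state, Eq. (8)
    return z * h_prev + (1.0 - z) * h_tilde        # new hidden state, Eq. (7)
```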

3 Statistical Machine Translation

In a commonly used statistical machine translation system (SMT), the goal of the system (decoder, specifically) is to find a translation $f$ given a source sentence $e$, which maximizes

$$p(f \mid e) \propto p(e \mid f)\, p(f),$$

where the first term on the right hand side is called the translation model and the latter the language model (see, e.g., (Koehn, 2005)). In practice, however, most SMT systems model $\log p(f \mid e)$ as a log-linear model with additional features and corre-

Page 12: Long short-term memories (LSTMs)

• We can make the units even more complex
• Allow each time step to modify
  • Input gate (current cell matters)
  • Forget (gate 0, forget past)
  • Output (how much cell is exposed)
  • New memory cell
• Final memory cell:
• Final hidden state:
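The gate and cell formulas on this slide were likewise images; the following is a reconstruction in the standard LSTM formulation (the weight names $W^{(i)}, U^{(i)}$, etc., are conventional, not copied from the slide):

$$i_t = \sigma\!\left(W^{(i)} x_t + U^{(i)} h_{t-1}\right) \quad \text{(input gate)}$$

$$f_t = \sigma\!\left(W^{(f)} x_t + U^{(f)} h_{t-1}\right) \quad \text{(forget gate)}$$

$$o_t = \sigma\!\left(W^{(o)} x_t + U^{(o)} h_{t-1}\right) \quad \text{(output gate)}$$

$$\tilde{c}_t = \tanh\!\left(W^{(c)} x_t + U^{(c)} h_{t-1}\right) \quad \text{(new memory cell)}$$

$$c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \quad \text{(final memory cell)}$$

$$h_t = o_t \circ \tanh(c_t) \quad \text{(final hidden state)}$$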

Page 13: Illustrations all a bit overwhelming ;)

http://people.idsia.ch/~juergen/lstm/sld017.htm

http://deeplearning.net/tutorial/lstm.html

Intuition: memory cells can keep information intact, unless inputs make them forget it or overwrite it with new input. Cell can decide to output this information or just store it.

Long Short-Term Memory by Hochreiter and Schmidhuber (1997)

[Figure (from the original paper): the LSTM memory cell, with input gate in_j, output gate out_j, cell input squashing g, output squashing h, and internal state s_{c_j} with its self-recurrent connection of weight 1.0.]

Page 14: LSTMs are currently very hip!

• En vogue default model for most sequence labeling tasks
• Very powerful, especially when stacked and made even deeper (each hidden layer is already computed by a deep internal network)
• Most useful if you have lots and lots of data

Page 15: Summary

• Recurrent Neural Networks are powerful
• A lot of ongoing work right now
• Gated Recurrent Units even better
• LSTMs maybe even better (jury still out)
• This was an advanced lecture → gain intuition, encourage exploration
• Next up: Recursive Neural Networks, simpler and also powerful :)