Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Recurrent Network Acoustic Models
Steve Renals
Automatic Speech Recognition ndash ASR Lecture 1329 February 2016
ASR Lecture 13 Recurrent Network Acoustic Models 1
Sequential Data
t-2
x1 x2 x3 x1 x2 x3
t-1
x1 x2 x3
t
output
hidden
input
2 frames of context
DNN for acoustic modelling
MFCC Inputs
CDPhone Outputs
Hidden units
(399=351)
2000
3ndash8 hidden layers
12000
Modelling sequential datawith time dependencesbetween feature vectors
Can model fixed contextwith a feed-forwardnetwork with previous timeinput vectors added to thenetwork input (in signalprocessing this is called FIRndash finite input response)
ASR Lecture 13 Recurrent Network Acoustic Models 2
Sequential Data
x1 x2 x3
t
output
recurrenthidden
input
Donrsquot use an input context win-dow ndash context is learned by therecurrent hidden (state) units
Modelling sequential datawith time dependencesbetween feature vectors
Can model fixed contextwith a feed-forwardnetwork with previous timeinput vectors added to thenetwork input (in signalprocessing this is called FIRndash finite impulse response)
Model sequential inputsusing recurrent connectionsto learn a time-dependentstate (in signal processingthis is called IIR ndash infiniteimpulse response)
ASR Lecture 13 Recurrent Network Acoustic Models 3
Recurrent networks
Can think of recurrent networks in terms of the dynamics of therecurrent hidden state
Settle to a fixed point ndash stable representation for a sequence(eg machine translation)
Regular oscillation (ldquolimit cyclerdquo) ndash learn some kind ofrepetition
Chaotic dynamics (non-repetitive) ndash theoretically interesting(ldquocomputation at the edge of chaosrdquo)
Useful behaviours of recurrent networks
Recurrent state as memory ndash remember things for(potentially) an infinite time
Recurrent state as information compression ndash compress asequence into a state representation
ASR Lecture 13 Recurrent Network Acoustic Models 4
Simplest recurrent network
yk(t) = softmax
(Hsumr=0
w(2)kr hr (t)
)
hj(t) = sigmoid
dsum
s=0
w(1)js xs(t) +
Hsumr=0
w(R)jr hr (t minus 1)︸ ︷︷ ︸
Recurrent part
Hidden (t)
Output (t)
Input (t) Hidden (t-1)
w(1)
w(2)
w(R)
ASR Lecture 13 Recurrent Network Acoustic Models 5
Recurrent network unfolded in time
Hidden (t)
Output (t)
Input (t)
Hidden (t-1)
w(1)
w(2)
w(R)
Hidden (t+1)
Input (t-1)
Output (t-1)
Input (t+1)
Output (t+1)
w(2)w(2)
w(1)w(1)
w(R)w(R) w(R)
An RNN for a sequence of T inputs can be viewed as a deepT -layer network with shared weights
We can train an RNN by doing backprop through thisunfolded network making sure we share the weightsWeight sharing
if two weights are constrained to be equal (w1 = w2) then theywill stay equal if the weight changes are equal(partEpartw1 = partEpartw2)achieve this by updating with (partEpartw1 + partEpartw2) (cf ConvNets)
ASR Lecture 13 Recurrent Network Acoustic Models 6
Back-propagation through time (BPTT)
We can train a network by unfolding and back-propagatingthrough time summing the derivatives for each weight as wego through the sequence
More efficiently run as a recurrent network
cache the unit outputs at each timestepcache the output errors at each timestepthen backprop from the final timestep to zero computing thederivatives at each stepcompute the weight updates by summing the derivatives acrosstime
Expensive ndash backprop for a 1000 item sequence equivalent toa 1000-layer feed-forward network
Truncated BPTT ndash backprop through just a few time steps(eg 20)
ASR Lecture 13 Recurrent Network Acoustic Models 7
Vanishing and exploding gradients
BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients
This can prevent effective training
Modified optimisation algorithms
RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information
Modified hidden unit transfer functionsLong short term memory (LSTM)
Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs
ReLUs
ASR Lecture 13 Recurrent Network Acoustic Models 8
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Sequential Data
t-2
x1 x2 x3 x1 x2 x3
t-1
x1 x2 x3
t
output
hidden
input
2 frames of context
DNN for acoustic modelling
MFCC Inputs
CDPhone Outputs
Hidden units
(399=351)
2000
3ndash8 hidden layers
12000
Modelling sequential datawith time dependencesbetween feature vectors
Can model fixed contextwith a feed-forwardnetwork with previous timeinput vectors added to thenetwork input (in signalprocessing this is called FIRndash finite input response)
ASR Lecture 13 Recurrent Network Acoustic Models 2
Sequential Data
x1 x2 x3
t
output
recurrenthidden
input
Donrsquot use an input context win-dow ndash context is learned by therecurrent hidden (state) units
Modelling sequential datawith time dependencesbetween feature vectors
Can model fixed contextwith a feed-forwardnetwork with previous timeinput vectors added to thenetwork input (in signalprocessing this is called FIRndash finite impulse response)
Model sequential inputsusing recurrent connectionsto learn a time-dependentstate (in signal processingthis is called IIR ndash infiniteimpulse response)
ASR Lecture 13 Recurrent Network Acoustic Models 3
Recurrent networks
Can think of recurrent networks in terms of the dynamics of therecurrent hidden state
Settle to a fixed point ndash stable representation for a sequence(eg machine translation)
Regular oscillation (ldquolimit cyclerdquo) ndash learn some kind ofrepetition
Chaotic dynamics (non-repetitive) ndash theoretically interesting(ldquocomputation at the edge of chaosrdquo)
Useful behaviours of recurrent networks
Recurrent state as memory ndash remember things for(potentially) an infinite time
Recurrent state as information compression ndash compress asequence into a state representation
ASR Lecture 13 Recurrent Network Acoustic Models 4
Simplest recurrent network
yk(t) = softmax
(Hsumr=0
w(2)kr hr (t)
)
hj(t) = sigmoid
dsum
s=0
w(1)js xs(t) +
Hsumr=0
w(R)jr hr (t minus 1)︸ ︷︷ ︸
Recurrent part
Hidden (t)
Output (t)
Input (t) Hidden (t-1)
w(1)
w(2)
w(R)
ASR Lecture 13 Recurrent Network Acoustic Models 5
Recurrent network unfolded in time
Hidden (t)
Output (t)
Input (t)
Hidden (t-1)
w(1)
w(2)
w(R)
Hidden (t+1)
Input (t-1)
Output (t-1)
Input (t+1)
Output (t+1)
w(2)w(2)
w(1)w(1)
w(R)w(R) w(R)
An RNN for a sequence of T inputs can be viewed as a deepT -layer network with shared weights
We can train an RNN by doing backprop through thisunfolded network making sure we share the weightsWeight sharing
if two weights are constrained to be equal (w1 = w2) then theywill stay equal if the weight changes are equal(partEpartw1 = partEpartw2)achieve this by updating with (partEpartw1 + partEpartw2) (cf ConvNets)
ASR Lecture 13 Recurrent Network Acoustic Models 6
Back-propagation through time (BPTT)
We can train a network by unfolding and back-propagatingthrough time summing the derivatives for each weight as wego through the sequence
More efficiently run as a recurrent network
cache the unit outputs at each timestepcache the output errors at each timestepthen backprop from the final timestep to zero computing thederivatives at each stepcompute the weight updates by summing the derivatives acrosstime
Expensive ndash backprop for a 1000 item sequence equivalent toa 1000-layer feed-forward network
Truncated BPTT ndash backprop through just a few time steps(eg 20)
ASR Lecture 13 Recurrent Network Acoustic Models 7
Vanishing and exploding gradients
BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients
This can prevent effective training
Modified optimisation algorithms
RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information
Modified hidden unit transfer functionsLong short term memory (LSTM)
Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs
ReLUs
ASR Lecture 13 Recurrent Network Acoustic Models 8
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Sequential Data
x1 x2 x3
t
output
recurrenthidden
input
Donrsquot use an input context win-dow ndash context is learned by therecurrent hidden (state) units
Modelling sequential datawith time dependencesbetween feature vectors
Can model fixed contextwith a feed-forwardnetwork with previous timeinput vectors added to thenetwork input (in signalprocessing this is called FIRndash finite impulse response)
Model sequential inputsusing recurrent connectionsto learn a time-dependentstate (in signal processingthis is called IIR ndash infiniteimpulse response)
ASR Lecture 13 Recurrent Network Acoustic Models 3
Recurrent networks
Can think of recurrent networks in terms of the dynamics of therecurrent hidden state
Settle to a fixed point ndash stable representation for a sequence(eg machine translation)
Regular oscillation (ldquolimit cyclerdquo) ndash learn some kind ofrepetition
Chaotic dynamics (non-repetitive) ndash theoretically interesting(ldquocomputation at the edge of chaosrdquo)
Useful behaviours of recurrent networks
Recurrent state as memory ndash remember things for(potentially) an infinite time
Recurrent state as information compression ndash compress asequence into a state representation
ASR Lecture 13 Recurrent Network Acoustic Models 4
Simplest recurrent network
yk(t) = softmax
(Hsumr=0
w(2)kr hr (t)
)
hj(t) = sigmoid
dsum
s=0
w(1)js xs(t) +
Hsumr=0
w(R)jr hr (t minus 1)︸ ︷︷ ︸
Recurrent part
Hidden (t)
Output (t)
Input (t) Hidden (t-1)
w(1)
w(2)
w(R)
ASR Lecture 13 Recurrent Network Acoustic Models 5
Recurrent network unfolded in time
Hidden (t)
Output (t)
Input (t)
Hidden (t-1)
w(1)
w(2)
w(R)
Hidden (t+1)
Input (t-1)
Output (t-1)
Input (t+1)
Output (t+1)
w(2)w(2)
w(1)w(1)
w(R)w(R) w(R)
An RNN for a sequence of T inputs can be viewed as a deepT -layer network with shared weights
We can train an RNN by doing backprop through thisunfolded network making sure we share the weightsWeight sharing
if two weights are constrained to be equal (w1 = w2) then theywill stay equal if the weight changes are equal(partEpartw1 = partEpartw2)achieve this by updating with (partEpartw1 + partEpartw2) (cf ConvNets)
ASR Lecture 13 Recurrent Network Acoustic Models 6
Back-propagation through time (BPTT)
We can train a network by unfolding and back-propagatingthrough time summing the derivatives for each weight as wego through the sequence
More efficiently run as a recurrent network
cache the unit outputs at each timestepcache the output errors at each timestepthen backprop from the final timestep to zero computing thederivatives at each stepcompute the weight updates by summing the derivatives acrosstime
Expensive ndash backprop for a 1000 item sequence equivalent toa 1000-layer feed-forward network
Truncated BPTT ndash backprop through just a few time steps(eg 20)
ASR Lecture 13 Recurrent Network Acoustic Models 7
Vanishing and exploding gradients
BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients
This can prevent effective training
Modified optimisation algorithms
RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information
Modified hidden unit transfer functionsLong short term memory (LSTM)
Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs
ReLUs
ASR Lecture 13 Recurrent Network Acoustic Models 8
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Recurrent networks
Can think of recurrent networks in terms of the dynamics of therecurrent hidden state
Settle to a fixed point ndash stable representation for a sequence(eg machine translation)
Regular oscillation (ldquolimit cyclerdquo) ndash learn some kind ofrepetition
Chaotic dynamics (non-repetitive) ndash theoretically interesting(ldquocomputation at the edge of chaosrdquo)
Useful behaviours of recurrent networks
Recurrent state as memory ndash remember things for(potentially) an infinite time
Recurrent state as information compression ndash compress asequence into a state representation
ASR Lecture 13 Recurrent Network Acoustic Models 4
Simplest recurrent network
yk(t) = softmax
(Hsumr=0
w(2)kr hr (t)
)
hj(t) = sigmoid
dsum
s=0
w(1)js xs(t) +
Hsumr=0
w(R)jr hr (t minus 1)︸ ︷︷ ︸
Recurrent part
Hidden (t)
Output (t)
Input (t) Hidden (t-1)
w(1)
w(2)
w(R)
ASR Lecture 13 Recurrent Network Acoustic Models 5
Recurrent network unfolded in time
Hidden (t)
Output (t)
Input (t)
Hidden (t-1)
w(1)
w(2)
w(R)
Hidden (t+1)
Input (t-1)
Output (t-1)
Input (t+1)
Output (t+1)
w(2)w(2)
w(1)w(1)
w(R)w(R) w(R)
An RNN for a sequence of T inputs can be viewed as a deepT -layer network with shared weights
We can train an RNN by doing backprop through thisunfolded network making sure we share the weightsWeight sharing
if two weights are constrained to be equal (w1 = w2) then theywill stay equal if the weight changes are equal(partEpartw1 = partEpartw2)achieve this by updating with (partEpartw1 + partEpartw2) (cf ConvNets)
ASR Lecture 13 Recurrent Network Acoustic Models 6
Back-propagation through time (BPTT)
We can train a network by unfolding and back-propagatingthrough time summing the derivatives for each weight as wego through the sequence
More efficiently run as a recurrent network
cache the unit outputs at each timestepcache the output errors at each timestepthen backprop from the final timestep to zero computing thederivatives at each stepcompute the weight updates by summing the derivatives acrosstime
Expensive ndash backprop for a 1000 item sequence equivalent toa 1000-layer feed-forward network
Truncated BPTT ndash backprop through just a few time steps(eg 20)
ASR Lecture 13 Recurrent Network Acoustic Models 7
Vanishing and exploding gradients
BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients
This can prevent effective training
Modified optimisation algorithms
RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information
Modified hidden unit transfer functionsLong short term memory (LSTM)
Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs
ReLUs
ASR Lecture 13 Recurrent Network Acoustic Models 8
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Simplest recurrent network
yk(t) = softmax
(Hsumr=0
w(2)kr hr (t)
)
hj(t) = sigmoid
dsum
s=0
w(1)js xs(t) +
Hsumr=0
w(R)jr hr (t minus 1)︸ ︷︷ ︸
Recurrent part
Hidden (t)
Output (t)
Input (t) Hidden (t-1)
w(1)
w(2)
w(R)
ASR Lecture 13 Recurrent Network Acoustic Models 5
Recurrent network unfolded in time
Hidden (t)
Output (t)
Input (t)
Hidden (t-1)
w(1)
w(2)
w(R)
Hidden (t+1)
Input (t-1)
Output (t-1)
Input (t+1)
Output (t+1)
w(2)w(2)
w(1)w(1)
w(R)w(R) w(R)
An RNN for a sequence of T inputs can be viewed as a deepT -layer network with shared weights
We can train an RNN by doing backprop through thisunfolded network making sure we share the weightsWeight sharing
if two weights are constrained to be equal (w1 = w2) then theywill stay equal if the weight changes are equal(partEpartw1 = partEpartw2)achieve this by updating with (partEpartw1 + partEpartw2) (cf ConvNets)
ASR Lecture 13 Recurrent Network Acoustic Models 6
Back-propagation through time (BPTT)
We can train a network by unfolding and back-propagatingthrough time summing the derivatives for each weight as wego through the sequence
More efficiently run as a recurrent network
cache the unit outputs at each timestepcache the output errors at each timestepthen backprop from the final timestep to zero computing thederivatives at each stepcompute the weight updates by summing the derivatives acrosstime
Expensive ndash backprop for a 1000 item sequence equivalent toa 1000-layer feed-forward network
Truncated BPTT ndash backprop through just a few time steps(eg 20)
ASR Lecture 13 Recurrent Network Acoustic Models 7
Vanishing and exploding gradients
BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients
This can prevent effective training
Modified optimisation algorithms
RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information
Modified hidden unit transfer functionsLong short term memory (LSTM)
Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs
ReLUs
ASR Lecture 13 Recurrent Network Acoustic Models 8
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Recurrent network unfolded in time
Hidden (t)
Output (t)
Input (t)
Hidden (t-1)
w(1)
w(2)
w(R)
Hidden (t+1)
Input (t-1)
Output (t-1)
Input (t+1)
Output (t+1)
w(2)w(2)
w(1)w(1)
w(R)w(R) w(R)
An RNN for a sequence of T inputs can be viewed as a deepT -layer network with shared weights
We can train an RNN by doing backprop through thisunfolded network making sure we share the weightsWeight sharing
if two weights are constrained to be equal (w1 = w2) then theywill stay equal if the weight changes are equal(partEpartw1 = partEpartw2)achieve this by updating with (partEpartw1 + partEpartw2) (cf ConvNets)
ASR Lecture 13 Recurrent Network Acoustic Models 6
Back-propagation through time (BPTT)
We can train a network by unfolding and back-propagatingthrough time summing the derivatives for each weight as wego through the sequence
More efficiently run as a recurrent network
cache the unit outputs at each timestepcache the output errors at each timestepthen backprop from the final timestep to zero computing thederivatives at each stepcompute the weight updates by summing the derivatives acrosstime
Expensive ndash backprop for a 1000 item sequence equivalent toa 1000-layer feed-forward network
Truncated BPTT ndash backprop through just a few time steps(eg 20)
ASR Lecture 13 Recurrent Network Acoustic Models 7
Vanishing and exploding gradients
BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients
This can prevent effective training
Modified optimisation algorithms
RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information
Modified hidden unit transfer functionsLong short term memory (LSTM)
Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs
ReLUs
ASR Lecture 13 Recurrent Network Acoustic Models 8
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Back-propagation through time (BPTT)
We can train a network by unfolding and back-propagatingthrough time summing the derivatives for each weight as wego through the sequence
More efficiently run as a recurrent network
cache the unit outputs at each timestepcache the output errors at each timestepthen backprop from the final timestep to zero computing thederivatives at each stepcompute the weight updates by summing the derivatives acrosstime
Expensive ndash backprop for a 1000 item sequence equivalent toa 1000-layer feed-forward network
Truncated BPTT ndash backprop through just a few time steps(eg 20)
ASR Lecture 13 Recurrent Network Acoustic Models 7
Vanishing and exploding gradients
BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients
This can prevent effective training
Modified optimisation algorithms
RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information
Modified hidden unit transfer functionsLong short term memory (LSTM)
Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs
ReLUs
ASR Lecture 13 Recurrent Network Acoustic Models 8
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Vanishing and exploding gradients
BPTT involves taking the product of many gradients (as in avery deep network) ndash this can lead to vanishing (componentgradients less than 1) or exploding (greater than 1) gradients
This can prevent effective training
Modified optimisation algorithms
RMSProp (normalise the gradient for each weight by averageof it magnitude learning rate for each weight)Hessian-free ndash an approximation to second-order approacheswhich use curvature information
Modified hidden unit transfer functionsLong short term memory (LSTM)
Linear self-recurrence for each hidden unit (long-term memory)Gates - dynamic weights which are a function of the inputs
ReLUs
ASR Lecture 13 Recurrent Network Acoustic Models 8
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition (Robinson etal)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the art foracoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Recurrent networks in speech recognition
1990s ndash Hybrid RNNHMM speech recognition(Robinson et al)
2009 onwards ndash RNN language models (Mikolov ndash new stateof the art next weekrsquos lecture)
2013 onwards ndash RNNLSTM models at state of the artfor acoustic modelling (Graves Sak et al)
2015 onwards ndash RNN sequence modelling also replaces theHMM ndash ldquoHMM-freerdquo ASR (next week also)
ASR Lecture 13 Recurrent Network Acoustic Models 9
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Speech recognition with recurrent networks (1990s)
time (ms)
freq (
Hz)
0 200 400 600 800 1000 1200 14000
2000
4000
6000
8000
RecurrentNeuralNetwork
SpeechAcoustics
Phoneme Probabilities
Robinson et al (1996)
ASR Lecture 13 Recurrent Network Acoustic Models 10
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
1990s RNN
Features
MEL+ Filter-bank outputs + voicing parameters (23featuresframe)PLP cepstral coefficients (13 features)Coefficients normalized to zero mean and unit variance
256 hidden (state) units
79 context-independent phone classes (outputs)
Output training target delayed by 5 frames
Use RNN scaled likelihoods in hybrid RNNHMM
About 100k trainable parameters
Trained using stochastic BPTT (using method similar toRpropRMSprop) on WSJ0 (3M training examples)
In 1994 training took five days on a specially designed parallelcomputer (the RAP)
ASR Lecture 13 Recurrent Network Acoustic Models 11
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Combined RNN system
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
u(t)
x(t+1)x(t)
Timedelay
y(t-4)
Mel +
PLP
Mel +
PLP
sh ow
iym
Speech waveform Preprocessor Recurrent net Markov mode
Individual systems WER of 14ndash15 on WSJ ldquospoke 6 datardquoInterpolated in log domain WER=11(Best context dependent GMMHMM system in 1995 WER=8)
ASR Lecture 13 Recurrent Network Acoustic Models 12
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
LSTM v1
I(t) H(t-1)
H(t)
BasicSigmoid
Unit
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
LSTM v1
I(t) H(t-1)
H(t)
1
LinearRecurrence
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
LSTM v1
I(t) H(t-1)
H(t)
1
OutputGate
InputGate
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780
ASR Lecture 13 Recurrent Network Acoustic Models 13
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
LSTM v2
I(t) H(t-1)
H(t)
ForgetGate
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471
ASR Lecture 13 Recurrent Network Acoustic Models 14
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
LSTM v3
I(t) H(t-1)
H(t)
C(t)
C(t-1)
PeepholeConnections
ASR Lecture 13 Recurrent Network Acoustic Models 15
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
LSTM equations
2 LSTM ARCHITECTURES
In the standard architecture of LSTM networks there are an inputlayer a recurrent LSTM layer and an output layer The input layeris connected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cell inputunits input gates output gates and forget gates The cell output unitsare connected to the output layer of the network The total numberof parameters W in a standard LSTM network with one cell in eachmemory block ignoring the biases can be calculated as follows
W = nc nc 4 + ni nc 4 + nc no + nc 3
where nc is the number of memory cells (and number of memoryblocks in this case) ni is the number of input units and no is thenumber of output units The computational complexity of learningLSTM models per weight and time step with the stochastic gradientdescent (SGD) optimization technique is O(1) Therefore the learn-ing computational complexity per time step is O(W ) The learn-ing time for a network with a relatively small number of inputs isdominated by the nc (nc + no) factor For the tasks requiring alarge number of output units and a large number of memory cells tostore temporal contextual information learning LSTM models be-come computationally expensive
As an alternative to the standard architecture we propose twonovel architectures to address the computational complexity oflearning LSTM models The two architectures are shown in thesame Figure 1 In one of them we connect the cell output units toa recurrent projection layer which connects to the cell input unitsand gates for recurrency in addition to network output units for theprediction of the outputs Hence the number of parameters in thismodel is nc nr 4+ni nc 4+nr no +nc nr +nc 3where nr is the number of units in the recurrent projection layer Inthe other one in addition to the recurrent projection layer we addanother non-recurrent projection layer which is directly connected tothe output layer This model has nc nr 4+ni nc 4+(nr +np) no + nc (nr + np) + nc 3 parameters where np is thenumber of units in the non-recurrent projection layer and it allowsus to increase the number of units in the projection layers withoutincreasing the number of parameters in the recurrent connections(nc nr 4) Note that having two projection layers with regardto output units is effectively equivalent to having a single projectionlayer with nr + np units
An LSTM network computes a mapping from an input sequencex = (x1 xT ) to an output sequence y = (y1 yT ) by cal-culating the network unit activations using the following equationsiteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wmfmt1 + Wcfct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = Wymmt + by (6)
where the W terms denote weight matrices (eg Wix is the matrixof weights from the input gate to the input) the b terms denote biasvectors (bi is the input gate bias vector) is the logistic sigmoidfunction and i f o and c are respectively the input gate forget gateoutput gate and cell activation vectors all of which are the same sizeas the cell output activation vector m is the element-wise product
inpu
t
g ct1 h
it
ft
ct
ot recu
rren
tpr
ojec
tion
outp
ut
xt
mt
pt
rt
rt1
yt
memory blocks
Fig 1 LSTM based RNN architectures with a recurrent projectionlayer and an optional non-recurrent projection layer A single mem-ory block is shown for clarity
of the vectors and g and h are the cell input and cell output activationfunctions generally tanh
With the proposed LSTM architecture with both recurrent andnon-recurrent projection layer the equations are as follows
it = (Wixxt + Wirrt1 + Wicct1 + bi) (7)ft = (Wfxxt + Wrfrt1 + Wcfct1 + bf ) (8)
ct = ft ct1 + it g(Wcxxt + Wcrrt1 + bc) (9)ot = (Woxxt + Worrt1 + Wocct + bo) (10)
mt = ot h(ct) (11)rt = Wrmmt (12)pt = Wpmmt (13)
yt = Wyrrt + Wyppt + by (14)(15)
where the r and p denote the recurrent and optional non-recurrentunit activations
21 Implementation
We choose to implement the proposed LSTM architectures on multi-core CPU on a single machine rather than on GPU The decisionwas based on CPUrsquos relatively simpler implementation complexityand ease of debugging CPU implementation also allows easier dis-tributed implementation on a large cluster of machines if the learn-ing time of large networks becomes a major bottleneck on a singlemachine [14] For matrix operations we use the Eigen matrix li-brary [15] This templated C++ library provides efficient implemen-tations for matrix operations on CPU using vectorized instructions(SIMD ndash single instruction multiple data) We implemented acti-vation functions and gradient calculations on matrices using SIMDinstructions to benefit from parallelization
We use the asynchronous stochastic gradient descent (ASGD)optimization technique The update of the parameters with the gra-dients is done asynchronously from multiple threads on a multi-coremachine Each thread operates on a batch of sequences in parallelfor computational efficiency ndash for instance we can do matrix-matrixmultiplications rather than vector-matrix multiplications ndash and formore stochasticity since model parameters can be updated from mul-tiple input sequence at the same time In addition to batching of se-quences in a single thread training with multiple threads effectively
ASR Lecture 13 Recurrent Network Acoustic Models 16
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Google LSTM experiments (2014ndashdate)
Google voice search data ndash 1900h training set
Input features 40-d log mel filterbank energies
No input context single frame 40-d frame presented eachtimestep
LSTM networks
400ndash6000 LSTM cells per layer1ndash7 layers13Mndash37M trainable parameters14727 context-dependent states
Delay output targets by 5 frames
Use RNN to estimated scaled likelihoods
Train using BPTT
ASR Lecture 13 Recurrent Network Acoustic Models 17
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
LSTM acoustic modelling architectures
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 18
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
LSTM with projection layer (LSTMP)
inpu
t
g cell h
it
ft
ct
ot
recu
rren
t
outp
ut
xt
mt
rt
rt1
yt
LSTM memory blocks
Figure 1 LSTMP RNN architecture A single memory block isshown for clarity
memory cell The output gate controls the output flow of cellactivations into the rest of the network Later the forget gatewas added to the memory block [18] This addressed a weak-ness of LSTM models preventing them from processing contin-uous input streams that are not segmented into subsequencesThe forget gate scales the internal state of the cell before addingit as input to the cell through the self-recurrent connection ofthe cell therefore adaptively forgetting or resetting the cellrsquosmemory In addition the modern LSTM architecture containspeephole connections from its internal cells to the gates in thesame cell to learn precise timing of the outputs [19]
An LSTM network computes a mapping from an inputsequence x = (x1 xT ) to an output sequence y =(y1 yT ) by calculating the network unit activations usingthe following equations iteratively from t = 1 to T
it = (Wixxt + Wimmt1 + Wicct1 + bi) (1)ft = (Wfxxt + Wfmmt1 + Wfcct1 + bf ) (2)
ct = ft ct1 + it g(Wcxxt + Wcmmt1 + bc) (3)ot = (Woxxt + Wommt1 + Wocct + bo) (4)
mt = ot h(ct) (5)yt = (Wymmt + by) (6)
where the W terms denote weight matrices (eg Wix is the ma-trix of weights from the input gate to the input) Wic Wfc Woc
are diagonal weight matrices for peephole connections the bterms denote bias vectors (bi is the input gate bias vector) isthe logistic sigmoid function and i f o and c are respectivelythe input gate forget gate output gate and cell activation vec-tors all of which are the same size as the cell output activationvector m is the element-wise product of the vectors g and hare the cell input and cell output activation functions generallyand in this paper tanh and is the network output activationfunction softmax in this paper
22 Deep LSTM
As with DNNs with deeper architectures deep LSTM RNNshave been successfully used for speech recognition [11 17 2]Deep LSTM RNNs are built by stacking multiple LSTM lay-ers Note that LSTM RNNs are already deep architectures inthe sense that they can be considered as a feed-forward neu-ral network unrolled in time where each layer shares the samemodel parameters One can see that the inputs to the modelgo through multiple non-linear layers as in DNNs however thefeatures from a given time instant are only processed by a sin-gle nonlinear layer before contributing the output for that time
input
LSTM
output
(a) LSTM
input
LSTM
LSTM
output
(b) DLSTM
input
LSTM
recurrent
output
(c) LSTMP
input
LSTM
recurrent
LSTM
recurrent
output
(d) DLSTMP
Figure 2 LSTM RNN architectures
instant Therefore the depth in deep LSTM RNNs has an ad-ditional meaning The input to the network at a given time stepgoes through multiple LSTM layers in addition to propagationthrough time and LSTM layers It has been argued that deeplayers in RNNs allow the network to learn at different timescales over the input [20] Deep LSTM RNNs offer anotherbenefit over standard LSTM RNNs They can make better useof parameters by distributing them over the space through mul-tiple layers For instance rather than increasing the memorysize of a standard model by a factor of 2 one can have 4 lay-ers with approximately the same number of parameters Thisresults in inputs going through more non-linear operations pertime step
23 LSTMP - LSTM with Recurrent Projection Layer
The standard LSTM RNN architecture has an input layer a re-current LSTM layer and an output layer The input layer is con-nected to the LSTM layer The recurrent connections in theLSTM layer are directly from the cell output units to the cellinput units input gates output gates and forget gates The celloutput units are also connected to the output layer of the net-work The total number of parameters N in a standard LSTMnetwork with one cell in each memory block ignoring the bi-ases can be calculated as N = nc nc 4 + ni nc 4 +nc no + nc 3 where nc is the number of memory cells(and number of memory blocks in this case) ni is the numberof input units and no is the number of output units The com-putational complexity of learning LSTM models per weight andtime step with the stochastic gradient descent (SGD) optimiza-tion technique is O(1) Therefore the learning computationalcomplexity per time step is O(N) The learning time for a net-work with a moderate number of inputs is dominated by thenc (4 nc + no) factor For the tasks requiring a largenumber of output units and a large number of memory cells tostore temporal contextual information learning LSTM modelsbecome computationally expensive
As an alternative to the standard architecture we proposedthe Long Short-Term Memory Projected (LSTMP) architec-ture to address the computational complexity of learning LSTMmodels [3] This architecture shown in Figure 1 has a separatelinear projection layer after the LSTM layer The recurrent con-nections now connect from this recurrent projection layer to theinput of the LSTM layer The network output units are con-nected to this recurrent layer The number of parameters in thismodel is ncnr4+ninc4+nrno+ncnr +nc3
(Sak 2014)
ASR Lecture 13 Recurrent Network Acoustic Models 19
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
LSTM ASR Results (Google Voice Search task)
Table 1 Experiments with LSTM and LSTMP RNN architec-tures showing test set WERs and frame accuracies on devel-opment and training sets L indicates the number of layersfor shallow (1L) and deep (2457L) networks C indicates thenumber of memory cells P the number of recurrent projectionunits and N the total number of parameters
C P Depth N Dev Train WER() () ()
840 - 5L 37M 677 707 109440 - 5L 13M 676 701 108600 - 2L 13M 664 685 113385 - 7L 13M 662 685 112750 - 1L 13M 633 655 124
6000 800 1L 36M 673 749 1182048 512 2L 22M 688 720 1081024 512 3L 20M 693 725 1071024 512 2L 15M 690 740 107800 512 2L 13M 690 727 107
2048 512 1L 13M 673 718 113
where nr is the number of units in the recurrent projection layerIn this case the model size and the learning computational com-plexity are dominated by the nr (4nc +no) factor Hencethis allows us to reduce the number of parameters by the rationrnc
By setting nr lt nc we can increase the model memory(nc) and still be able to control the number of parameters in therecurrent connections and output layer
With the proposed LSTMP architecture the equations forthe activations of network units change slightly the mt1 acti-vation vector is replaced with rt1 and the following is added
rt = Wrmmt (7)yt = (Wyrrt + by) (8)
where the r denote the recurrent unit activations
24 Deep LSTMP
Similar to deep LSTM we propose deep LSTMP where multi-ple LSTM layers each with a separate recurrent projection layerare stacked LSTMP allows the memory of the model to be in-creased independently from the output layer and recurrent con-nections However we noticed that increasing the memory sizemakes the model more prone to overfitting by memorizing theinput sequence data We know that DNNs generalize better tounseen examples with increasing depth The depth makes themodels harder to overfit to the training data since the inputsto the network need to go through many non-linear functionsWith this motivation we have experimented with deep LSTMParchitectures where the aim is increasing the memory size andgeneralization power of the model
3 Distributed Training Scaling up toLarge Models with Parallelization
We chose to implement the LSTM RNN architectures on multi-core CPU rather than on GPU The decision was based onCPUrsquos relatively simpler implementation complexity ease ofdebugging and the ability to use clusters made from commod-ity hardware For matrix operations we used the Eigen matrixlibrary [21] This templated C++ library provides efficient im-plementations for matrix operations on CPU using vectorized
instructions We implemented activation functions and gradi-ent calculations on matrices using SIMD instructions to benefitfrom parallelization
We use the truncated backpropagation through time (BPTT)learning algorithm [22] to compute parameter gradients on shortsubsequences of the training utterances Activations are for-ward propagated for a fixed step time Tbptt (eg 20) Crossentropy gradients are computed for this subsequence and back-propagated to its start For computational efficiency each threadoperates on subsequences of four utterances at a time so matrixmultiplies can operate in parallel on four frames at a time Weuse asynchronous stochastic gradient descent (ASGD) [23] tooptimize the network parameters updating the parameters asyn-chronously from multiple threads on a multi-core machine Thiseffectively increases the batch size and reduces the correlationof the frames in a given batch After a thread has updated theparameters it continues with the next subsequence in each utter-ance preserving the LSTM state or starts new utterances withreset state when one finishes Note that the last subsequenceof each utterance can be shorter than Tbptt but is padded to thefull length though no gradient is generated for these paddingframes
This highly parallel single machine ASGD framework de-scribed in [3] proved slow for training models of the scale wehave used for large scale ASR with DNNs (many millions ofparameters) To scale further we replicate the single-machineworkers on many (eg 500) separate machines each with threesynchronized computation threads Each worker communi-cates with a shared distributed parameter server [23] whichstores the LSTM parameters When a worker has computed theparameter gradient on a minibatch (of 34Tbptt frames) thegradient vector is partitioned and sent to the parameter servershards which each add the gradients to their parameters and re-spond with the new parameters The parameter server shardsaggregate parameter updates completely asynchronously Forinstance gradient updates from workers may arrive in differentorders at different shards of the parameter server Despite theasynchrony we observe stable convergence though the learn-ing rate must be reduced as would be expected because of theincrease in the effective batch size from the greater parallelism
4 ExperimentsWe evaluate and compare the performance of LSTM RNN ar-chitectures on a large vocabulary speech recognition task ndash theGoogle Voice Search task We use a hybrid approach [24]for acoustic modeling with LSTM RNNs wherein the neuralnetworks estimate hidden Markov model (HMM) state posteri-ors We scale the state posteriors by the state priors estimatedas the relative state frequency from the training data to obtainthe acoustic frame likelihoods We deweight the silence statecounts by a factor of 27 when estimating the state frequencies
41 Systems amp Evaluation
All the networks are trained on a 3 million utterance (about1900 hours) dataset consisting of anonymized and hand-transcribed utterances The dataset is represented with 25msframes of 40-dimensional log-filterbank energy features com-puted every 10ms The utterances are aligned with a 85 millionparameter DNN with 14247 CD states The weights in all thenetworks are initialized to the range (-002 002) with a uni-form distribution We try to set the learning rate specific to anetwork architecture and its configuration to the largest value
(Sak 2014)ASR Lecture 13 Recurrent Network Acoustic Models 20
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21
Reading
S Hochreiter and J Schmidhuber (1997) ldquoLong Short-TermMemoryrdquo Neural Computation 91735ndash1780httpwwwmitpressjournalsorgdoiabs101162neco199798
1735VtN93Mehb5k
FA Gers et al (2000) ldquoLearning to Forget Continual Predictionwith LSTMrdquo Neural Computation 122451ndash2471httpwwwmitpressjournalsorgdoiabs101162
089976600300015015VtN9ncehb5k
T Robinson et al (1996) ldquoThe use of recurrent networks incontinuous speech recognitionrdquo in Automatic Speech and SpeakerRecognition Advanced Topics Lee et al (eds) Kluwer 233ndash258http
wwwcstredacukdownloadspublications1996rnn4csr96pdf
H Sak et al (2014) ldquoLSTM recurrent neural network architecturesfor large scale acoustic modelingrdquo Interspeech-2014httpresearchgooglecompubsarchive43905pdf
ASR Lecture 13 Recurrent Network Acoustic Models 21