Page 1:

SPSC – Signal Processing & Speech Communication Lab

Professor Horst Cerjak, 19.12.2005


Advanced Signal Processing 2

Deep Neural Networks for Acoustic Modeling in Speech Recognition

Bernd Bachofner

Page 2:

Automatic speech recognition in everyday life

• Heavily used on cell (smart) phones
• Used in desktop computing for many different tasks
• Used in cars to control the sat-nav, radio, air conditioning, …

Page 3:

Consequences
We want algorithms that fulfill the following requirements

• Speaker independent: it should be possible for the user to start speech recognition immediately, without a previous training phase
• Robust: the algorithm should not be sensitive to ambient noise
• Fast: the algorithm should react within an acceptable time frame (a few ms)
• Efficient: it should be possible to run the algorithm on embedded systems with limited resources

Page 4:

Quality

How do we measure the quality of such algorithms? The performance of a speech recognition system is usually evaluated in terms of accuracy and speed.

• Word error rate (WER): accuracy is usually rated with the WER
• Real-time factor: speed is measured with the real-time factor
• Single word error rate (SWER)
• Command success rate (CSR)

Page 5:

The classic approach: Gaussian Mixture Models with Hidden Markov Models (GMM-HMM)

• The acoustic input is represented by concatenating Mel-frequency cepstral coefficients (MFCCs)

• Computed from the raw waveform

• First- and second-order temporal differences are used

Page 6:

GMM-HMM I

• First-order temporal difference

Δx_m = (x_{m+T} − x_{m−T}) / 2 (velocity)

• Second-order temporal difference

Δ²x_m = (Δx_{m+T} − Δx_{m−T}) / 2 (acceleration)

• These features discard a large amount of information that is irrelevant for discrimination

• GMMs represent the relationship between HMM states and the acoustic input
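The delta (velocity) and delta-delta (acceleration) features above can be sketched in NumPy. This is an illustrative implementation of the simple two-point difference formula from the slide, with edge padding so the output has the same number of frames as the input; the function name and the random stand-in data are my own:

```python
import numpy as np

def delta(x, T=2):
    """First-order temporal difference (velocity) of a feature sequence.

    x: array of shape (num_frames, num_coeffs); frames are padded by
    edge replication so the output has the same length as the input.
    """
    padded = np.pad(x, ((T, T), (0, 0)), mode="edge")
    # Delta x_m = (x_{m+T} - x_{m-T}) / 2
    return (padded[2 * T:] - padded[:-2 * T]) / 2.0

# Acceleration: apply the same operator to the velocity features
mfcc = np.random.randn(100, 13)           # stand-in for real MFCC frames
vel = delta(mfcc)                         # first-order differences
acc = delta(vel)                          # second-order differences
features = np.hstack([mfcc, vel, acc])    # 39-dimensional feature vectors
```

Stacking the static, velocity, and acceleration coefficients yields the classic 39-dimensional feature vector per frame.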

Page 7:

GMM-HMM II

• GMMs model the probability distributions over vectors of input features that are associated with each state of an HMM

• With enough components, GMMs can model probability distributions to any required level of accuracy, and they are ‘easy’ to fit to the data using the EM (Expectation Maximization) algorithm.

Page 8:

GMM-HMM III

Advantages of GMM-HMM
• The GMM-HMM method has long been in use => a lot of experience and many information sources are available
• All algorithms such as EM are available and ready to use (Matlab)
• A lot of research was done to improve the performance, so the algorithms are able to run on embedded devices and smart phones
• It is possible to parallelize the training phase, so multicore or cluster systems can be exploited
• GMM-HMM is so successful that it is taken as the reference for all newly developed speech recognition algorithms

Page 9:

GMM-HMM IV

Drawbacks of GMM-HMM
• Statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space: modeling the set of points that lie very close to the surface of a sphere only requires a few parameters using an appropriate model class, but requires a very large number of Gaussians
• Speech is produced by a relatively small number of parameters of a dynamical system => its true underlying structure is much lower-dimensional than a window that contains hundreds of coefficients
• GMM-HMM needs uncorrelated data => filter-bank coefficients cannot be used as the input representation

Page 10:

Deep Neural Networks (DNN)

• A DNN is a feed-forward artificial neural network that has more than one layer of hidden units between its inputs and its outputs

• Each hidden unit typically uses the logistic function to map its total input from the layer below, x_j, to the scalar state y_j that it sends to the layer above

y_j = logistic(x_j) = 1 / (1 + e^{−x_j})

x_j = b_j + Σ_i y_i w_ij
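The forward pass through one such layer can be sketched in NumPy; the function names and the random example shapes below are illustrative, not part of the original slides:

```python
import numpy as np

def logistic(x):
    """y = 1 / (1 + e^{-x}), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))

def forward_layer(y_below, W, b):
    """x_j = b_j + sum_i y_i w_ij, then y_j = logistic(x_j)."""
    x = b + y_below @ W
    return logistic(x)

rng = np.random.default_rng(0)
v = rng.standard_normal(5)       # activities y_i from the layer below
W = rng.standard_normal((5, 3))  # weights w_ij: i indexes lower, j upper units
b = np.zeros(3)                  # biases b_j
y = forward_layer(v, W, b)       # states sent to the layer above
```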

Page 11:

DNN I

• For multiclass classification, output unit j converts its total input x_j into a class probability p_j by using the softmax nonlinearity

p_j = exp(x_j) / Σ_k exp(x_k)

where k is an index over all classes

• DNNs can be discriminatively trained (DT) by backpropagating derivatives of a cost function that measures the discrepancy between the targets and the outputs. In the softmax case the cost function C is the cross entropy between the target probabilities d and the outputs of the softmax, p

C = − Σ_j d_j log p_j

where the target probabilities typically take values of one or zero

• For large training sets, it is more efficient to compute the derivatives on a small random ‘minibatch’ before updating the weights in proportion to the gradient
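The softmax output and cross-entropy cost above can be written out directly; this is a minimal sketch with made-up example values, using the standard max-shift for numerical stability:

```python
import numpy as np

def softmax(x):
    """p_j = exp(x_j) / sum_k exp(x_k), shifted for numerical stability."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(d, p):
    """C = -sum_j d_j log p_j for one-hot (or soft) targets d."""
    return -np.sum(d * np.log(p))

x = np.array([2.0, 1.0, 0.1])    # total inputs to the output units
p = softmax(x)
d = np.array([1.0, 0.0, 0.0])    # one-hot target
C = cross_entropy(d, p)
# For softmax with cross entropy, the gradient of C with respect to
# the total inputs x simplifies to p - d
grad = p - d
```

The simple form of the gradient, p − d, is what makes this output layer so convenient for backpropagation.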

Page 12:

DNN II
• Stochastic gradient descent

Δw_ij(t) = α Δw_ij(t−1) − ε ∂C/∂w_ij(t)

where α is the ‘momentum’ coefficient, 0 < α < 1, which smooths the gradient computed on minibatch t
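The momentum update rule can be sketched as follows; the quadratic toy objective and the specific values of α and ε are illustrative choices, not from the slides:

```python
import numpy as np

def sgd_momentum_step(w, velocity, grad, alpha=0.9, eps=0.01):
    """Delta w(t) = alpha * Delta w(t-1) - eps * dC/dw(t)."""
    velocity = alpha * velocity - eps * grad
    return w + velocity, velocity

# Toy example: minimize C(w) = 0.5 * ||w||^2, whose gradient is w itself
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(500):
    w, v = sgd_momentum_step(w, v, grad=w)
```

After enough steps the weights approach the minimum at the origin; the momentum term damps the oscillations of plain gradient descent.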

• Problems with this architecture
  • Overfitting: to reduce overfitting, large weights can be penalized, or the learning can simply be terminated at the point at which performance on a validation set starts getting worse

Page 13:

DNN III

• Problems with this architecture (continued)
  • Initial weights: in a DNN with full connectivity between adjacent layers, the initial weights are given small random values to prevent all hidden units in a layer from getting the same gradient
  • Very large training sets can reduce overfitting while preserving modeling power => very computationally expensive

• The consequence is that a better method of using the information in the training set is required => generative pretraining

Page 14:

Generative Pretraining using Restricted Boltzmann Machines (RBM)

• Boltzmann machines: motivation
  • Basis for deep networks
  • Can be trained on unlabeled data (unsupervised learning)
  • Training becomes tractable with RBMs
  • Stacking of RBMs leads to ‘deep belief networks’

• A Boltzmann machine (BM) is
  • A generative model
  • A neural network
  • With stochastic units
  • Recurrently connected
  • With symmetric weights

• Binary units with outputs s_i ∈ {0, 1} for i ∈ {1, …, N}

Page 15:

RBM I
• Stochastic binary units
  • Activation z_i = b_i + Σ_j s_j w_ij
  • Output of unit i is s_i ∈ {0, 1} with prob(s_i = 1) = σ(z_i) = 1 / (1 + e^{−z_i})
  • Symmetric weights: w_ij = w_ji

• Inputs/outputs
  • A set of visible units (input/output units)
  • A set of hidden units

• Given a network with N units
  • Start with some arbitrary assignment of s_1, …, s_N
  • Iterate: randomly choose one unit i and update its state s_i according to prob(s_i = 1) = σ(z_i) = 1 / (1 + e^{−z_i})
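The iterative update procedure above (a form of Gibbs sampling) can be sketched for a tiny network; the network size, random weights, and number of iterations are illustrative:

```python
import numpy as np

def update_unit(s, i, W, b, rng):
    """Set s_i = 1 with prob sigma(z_i), where z_i = b_i + sum_j s_j w_ij."""
    z = b[i] + s @ W[i]
    p = 1.0 / (1.0 + np.exp(-z))
    s[i] = 1 if rng.random() < p else 0
    return s

rng = np.random.default_rng(1)
N = 6
W = rng.standard_normal((N, N))
W = (W + W.T) / 2               # symmetric weights: w_ij = w_ji
np.fill_diagonal(W, 0)          # no self-connections
b = np.zeros(N)
s = rng.integers(0, 2, size=N)  # arbitrary initial assignment of s_1..s_N
for _ in range(1000):           # iterate toward the equilibrium distribution
    i = rng.integers(N)
    s = update_unit(s, i, W, b, rng)
```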

Page 16:

RBM II
• Computation
  • A BM represents a probability distribution over a set of visible units
  • Let v = (s_1, …, s_N) be the network state vector => after sufficiently many updates the distribution of network states reaches an equilibrium, where the probability of visible vector v is given by

P(v) = (1/Z) Σ_h e^{−E(v,h)}

with energy

E(v, h) = − Σ_{i∈visible} a_i v_i − Σ_{j∈hidden} b_j h_j − Σ_{i,j} v_i h_j w_ij

and

Z = Σ_{v,h} e^{−E(v,h)}

• Goals
  • Learn to represent the distribution of the training data: find parameters w_ij, a_i, b_j such that P(v) approximates the distribution of the training data
  • Generate instances from this distribution (sampling)
  • Classify data
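For a network small enough to enumerate all states, P(v) and the partition function Z can be computed exactly from the energy function above. This brute-force sketch is only feasible for toy sizes (the cost grows as 2^N); the random parameters are illustrative:

```python
import itertools
import numpy as np

def energy(v, h, a, b, W):
    """E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij."""
    return -(a @ v) - (b @ h) - v @ W @ h

def states(n):
    """All 2^n binary state vectors of length n."""
    return [np.array(s) for s in itertools.product([0, 1], repeat=n)]

rng = np.random.default_rng(2)
nv, nh = 3, 2
a, b = rng.standard_normal(nv), rng.standard_normal(nh)
W = rng.standard_normal((nv, nh))

# Partition function: sum over all joint (v, h) configurations
Z = sum(np.exp(-energy(v, h, a, b, W)) for v in states(nv) for h in states(nh))
# Marginal probability of each visible vector: P(v) = (1/Z) sum_h exp(-E(v,h))
P = {tuple(v): sum(np.exp(-energy(v, h, a, b, W)) for h in states(nh)) / Z
     for v in states(nv)}
```

The intractability of Z for realistic network sizes is exactly why sampling-based approximations such as contrastive divergence are needed.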

Page 17:

RBM III

• RBMs are Boltzmann machines with a restricted architecture
  • One visible layer
  • One hidden layer
  • Only connections between the visible and the hidden layer
  • The hidden units are conditionally independent given the visible units
  • We can therefore sample <v_i h_j>_data in one pass over the data
  • <v_i h_j>_model requires many iterations to reach the equilibrium distribution

• Contrastive divergence (CD)
  • Approximate <v_i h_j>_model by <v_i h_j>_reconstruction, which is computed in the following way:
    • Start with a data vector on the visible units, update all hidden units
    • Update all visible units to get a ‘reconstruction’
    • Update all hidden units again
  • Use <v_i h_j>_reconstruction in the learning rule instead of <v_i h_j>_model:

Δw_ij = ε (<v_i h_j>_data − <v_i h_j>_reconstruction)
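The CD steps above can be sketched as a single CD-1 update for a binary RBM. This is a minimal illustration, not a production trainer: the layer sizes, learning rate, and random toy data are my own, and bias updates are included alongside the weight rule from the slide:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, rng, eps=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM."""
    # Up: sample hidden states given the data vector
    h0 = (logistic(b + v0 @ W) > rng.random(b.shape)).astype(float)
    # Down: reconstruct the visible units, then recompute hidden probabilities
    v1 = logistic(a + h0 @ W.T)          # 'reconstruction' (probabilities)
    h1 = logistic(b + v1 @ W)
    # Delta w_ij = eps * (<v_i h_j>_data - <v_i h_j>_reconstruction)
    W += eps * (np.outer(v0, h0) - np.outer(v1, h1))
    a += eps * (v0 - v1)
    b += eps * (h0 - h1)
    return W, a, b

rng = np.random.default_rng(3)
nv, nh = 6, 4
W = 0.01 * rng.standard_normal((nv, nh))
a, b = np.zeros(nv), np.zeros(nh)
data = (rng.random((50, nv)) < 0.5).astype(float)   # toy binary training data
for epoch in range(10):
    for v0 in data:
        W, a, b = cd1_step(v0, W, a, b, rng)
```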

Page 18:

Modeling real-valued data

• MFCCs are more naturally modeled by linear variables with Gaussian noise • RBM energy function can be modified to accommodate such variables

• Gaussian–Bernoulli RBM (GRBM)

E(v, h) = Σ_{i∈vis} (v_i − a_i)² / (2σ_i²) − Σ_{j∈hid} b_j h_j − Σ_{i,j} (v_i/σ_i) h_j w_ij

where σ_i is the standard deviation of the Gaussian noise for visible unit i

• The two conditional distributions required for CD learning

p(h_j = 1 | v) = logistic(b_j + Σ_i (v_i/σ_i) w_ij)

p(v_i | h) = 𝒩(a_i + σ_i Σ_j h_j w_ij, σ_i²)

• The data are normalized so that each coefficient has zero mean and unit variance; the standard deviations σ_i are then set to one

• No noise is added to the reconstruction
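One up-down pass through a GRBM with these conditionals can be sketched as follows; the sizes and random parameters are illustrative, and following the slide the reconstruction uses the Gaussian mean with no added noise:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(5)
nv, nh = 4, 3
W = 0.1 * rng.standard_normal((nv, nh))
a, b = np.zeros(nv), np.zeros(nh)
sigma = np.ones(nv)              # unit std devs for normalized data

v = rng.standard_normal(nv)      # a normalized real-valued input frame
# p(h_j = 1 | v) = logistic(b_j + sum_i (v_i / sigma_i) w_ij)
p_h = logistic(b + (v / sigma) @ W)
h = (p_h > rng.random(nh)).astype(float)
# p(v_i | h) = N(a_i + sigma_i * sum_j h_j w_ij, sigma_i^2);
# no noise is added to the reconstruction, so we take the mean
v_recon = a + sigma * (h @ W.T)
```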

Page 19:

Stacking RBMs to make a Deep Belief Network I

• After training an RBM on data, the inferred states of the hidden units can be used as training data for another RBM
  • The new RBM learns to model the significant dependencies between the hidden units of the first RBM
  • This can be repeated as many times as desired
  • It produces many layers of nonlinear feature detectors that represent progressively more complex statistical structure in the data
  • The result is a deep belief network (DBN)

• A DBN is a hybrid generative model whose top two layers are undirected (RBM stack) but whose lower layers have top-down directed connections
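The greedy stacking procedure can be sketched end to end: train one RBM, pass its inferred hidden activities up as training data for the next RBM, and keep the learned weights to initialize a DNN later. Everything below (layer sizes, toy data, the tiny CD-1 trainer) is an illustrative sketch, not the experimental setup from the slides:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, eps=0.05, epochs=5):
    """Tiny CD-1 trainer; returns weights and biases of one RBM layer."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a, b = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            h0 = (logistic(b + v0 @ W) > rng.random(n_hidden)).astype(float)
            v1 = logistic(a + h0 @ W.T)
            h1 = logistic(b + v1 @ W)
            W += eps * (np.outer(v0, h0) - np.outer(v1, h1))
            a += eps * (v0 - v1)
            b += eps * (h0 - h1)
    return W, a, b

# Greedy stacking: the hidden activities of one RBM become the
# training data for the next RBM in the stack
rng = np.random.default_rng(4)
data = (rng.random((40, 8)) < 0.5).astype(float)
layer_sizes = [6, 4]                  # two stacked RBMs -> a small DBN
layers, x = [], data
for n_hidden in layer_sizes:
    W, a, b = train_rbm(x, n_hidden, rng)
    layers.append((W, b))             # kept to initialize a feed-forward DNN
    x = logistic(b + x @ W)           # inferred hidden states = new data
```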

Page 20:

Stacking RBMs to make a Deep Belief Network II

• To understand how RBMs are composed into a DBN, rewrite the probability of a visible vector v to make the dependence on the weights W explicit:

p(v; W) = Σ_h p(h; W) p(v | h; W)

• Hold p(v | h; W) fixed after training, but replace the prior over hidden vectors p(h; W) by a better prior that is closer to the aggregated posterior over hidden vectors. This posterior can be sampled by first picking a training case and then inferring a hidden vector using

p(h_j = 1 | v) = logistic(b_j + Σ_i (v_i/σ_i) w_ij)

• This aggregated posterior is what the next RBM in the stack is trained to model
• How deep should such a stack be?

Page 21:

The right stack size

• Every time a new RBM is added to the stack, the variational bound on the new and deeper DBN is better than the previous one
• Mathematical methods exist to determine the bound, but they do not answer the question whether the learned feature detectors are useful for discrimination on a task that is unknown at training time
• A DBN allows the states of the layers of hidden units to be inferred in a single forward pass
  • This is used to derive the variational bound

• All the learned weights of the RBMs are then used to initialize the feature-detecting layers of a deterministic feed-forward DNN
• A softmax layer is finally added and the whole DNN is trained discriminatively

Page 22:

Interfacing a DNN with an HMM

• The DNN outputs probabilities of the form p(HMMstate | AcousticInput)
• But for computing a Viterbi alignment the likelihood p(AcousticInput | HMMstate) is required
• This can be achieved by dividing the posterior probabilities by the frequencies of the HMM states
• All of the likelihoods produced in this way are scaled by the same unknown factor p(AcousticInput), but this has no effect on the alignment
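The posterior-to-likelihood conversion above is just a division by the state priors, usually done in the log domain. The function name and the toy posterior/prior values below are illustrative:

```python
import numpy as np

def pseudo_log_likelihoods(log_posteriors, state_priors):
    """Convert DNN outputs p(state | acoustics) into scaled likelihoods.

    Dividing by the state frequencies (priors) gives
    p(acoustics | state) up to the constant factor p(acoustics),
    which does not affect the Viterbi alignment.
    """
    return log_posteriors - np.log(state_priors)

# Toy example: 3 HMM states, DNN posteriors for one acoustic frame
posteriors = np.array([0.7, 0.2, 0.1])   # softmax outputs of the DNN
priors = np.array([0.5, 0.3, 0.2])       # relative state frequencies
scores = pseudo_log_likelihoods(np.log(posteriors), priors)
```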

Page 23:

Phonetic classification and recognition on TIMIT I

• The TIMIT data set provides a simple way of testing new approaches to speech recognition
  • Small training set => many variations can be tried in reasonable time
  • Many existing techniques have already been benchmarked => it is easy to see whether a new approach is promising by comparing it with existing techniques
  • Performance improvements on TIMIT do not necessarily translate into performance improvements on large-vocabulary tasks with less controlled recording conditions and much more training data

• A DBN-DNN acoustic model outperformed the best published recognition results on TIMIT
• The DBN-DNN hidden layers seem to do a good job of eliminating speaker differences
• The DBN-DNN configurations that worked best on the TIMIT data were used for subsequent experiments on much larger vocabulary tasks
  • For simplicity, all hidden layers had the same size
  • Even with this constraint it was not possible to train all possible combinations of number of hidden layers (1, 2, 3, 4, 5, 6, 7, 8), number of units per layer (512, 1024, 2048, 3072), and number of frames of acoustic data in the input layer (7, 11, 15, 17, 27, 37)

Page 24:

Phonetic classification and recognition on TIMIT II

• Fortunately, the performance of the networks on the TIMIT core test set was fairly insensitive to the precise details of the architecture, and the results suggest that any combination of the numbers (previous slide) has an error rate within about 2% of the very best combination
• This is important for the robustness of DBN-DNNs, since they have a lot of tunable parameters

• Conclusions:
  • Multiple hidden layers always worked better than one hidden layer
  • With multiple hidden layers, pretraining always improved the results

Page 25:

Preprocessing the waveform for DNNs

• State of the art ASR systems do not use filter-bank coefficients as the input representation because they are strongly correlated

• Modeling them requires a huge number of Gaussians

• MFCCs offer a more suitable alternative as their individual components are (roughly) independent

• Modeling them is much easier as Gaussian Mixture Models can be used

• DBN-DNNs do not require uncorrelated data
• DBN-DNNs trained with filter-bank features had a phone error rate 1.7% lower than the best-performing DBN-DNNs trained with MFCCs

Page 26:

A summary of the differences between DNNs and GMMs

• Both DNNs and GMMs are nonlinear models
• A DNN has no problem modeling multiple simultaneous events within one frame or window, because it can use different subsets of its hidden units to model different events

• A GMM assumes that each data point is generated by a single component of the mixture so it has no efficient way of modeling multiple simultaneous events

• DNNs are good at exploiting multiple frames of input coefficients whereas GMMs that use diagonal covariance matrices benefit much less from multiple frames because they require decorrelated inputs

• DNNs are learned using stochastic gradient descent, while GMMs are learned using the EM algorithm, which makes GMM learning much easier to parallelize on cluster or multicore machines

Page 27:

Comparing DBN-DNNs with GMMs for large vocabulary speech recognition

• To make DBN-DNNs work well on large-vocabulary tasks, it is important to replace the monophone HMMs used for TIMIT with triphone HMMs that have many thousands of tied states
  • Triphones supply more bits of information per frame in the labels
  • Triphones enable the use of a more powerful triphone HMM decoder

• Using context-dependent HMM states, it is possible to outperform state-of-the-art GMM-HMM systems with a two-hidden-layer neural network without using any pretraining => more hidden layers and pretraining work even better

• Some examples:
  • Bing-Voice-Search speech recognition task
  • Google-Voice-Input speech recognition task

Page 28:

Bing-Voice-Search speech recognition task

• First successful use of acoustic models based on DBN-DNNs for a large vocabulary task

• The task used 24 hours of training data with a high degree of acoustic variability caused by noise, music, side-speech, accents, sloppy pronunciation, hesitation, repetition, interruptions and mobile phone differences

• The best DNN-HMM acoustic model, trained with context-dependent states as targets, achieved a sentence accuracy of 69.6% on the test set, compared with 63.8% for a strong minimum-phone-error (MPE)-trained GMM-HMM

• The DBN-DNN used five pretrained layers of hidden units with 2048 units per layer

• It was found that using tied triphone context-dependent state targets was crucial and clearly superior to using monophone state targets

• Using pretraining on an even larger data set (48 h) improved the performance of the DBN-DNN only insignificantly, to 69.8%, whereas the same approach for DNN-HMMs achieved a performance boost to 71.7%

Page 29:

Google Voice Input speech recognition task

• Google Voice Input transcribes voice search queries, short messages, e-mails and user actions from mobile devices

• Google's model uses a GMM-HMM model composed of context-dependent crossword triphone HMMs

• This model has a total of 7969 states and uses as acoustic input PLP features that have been transformed by LDA

• This model was used to obtain 5870 h of training data for a DBN-DNN model that predicts the 7969 HMM state posteriors from the acoustic input

• The DBN-DNN had four hidden layers with 2560 fully connected units per layer and a final softmax layer with 7969 alternative states.

• Each DBN-DNN layer was pretrained for one epoch as an RBM and then the resulting DNN was fine-tuned for one epoch

• Weights with magnitudes below a threshold were then set to zero before a further epoch of training

• One third of the weights in the final network were zero
• Results:

• On a test set of anonymized utterances from the live Voice Input system, the DBN-DNN-based system achieved a WER of 12.3%, a 23% relative reduction compared to the best GMM-based system

Page 30:

Comparison of the percentage WERs using DNN-HMMs and GMM-HMMs

Page 31:

References

• G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition”

• C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 1st ed. 2006, corr. 2nd printing 2011

• R. Legenstein, “Neural Networks A: Restricted Boltzmann Machines”, lecture notes, December 2012