Deep Speech: Scaling Up End-to-End Speech Recognition
Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates,
Andrew Y. Ng
Presentation by Roee Aharoni Speech Processing and Recognition Seminar 2016
“Traditional” Speech Recognition Systems are Complex…
• Composed of multiple hand-engineered components:
  • Acoustic models
  • Speaker adaptation
  • Noise filtering
  • Hidden Markov Models
  • …
• Will the deep learning approach help? “One network to rule them all”
…Deep learning has its flaws
• Main problems:
  • Requires lots of training data
  • Training large neural networks is computationally expensive
• Possible solutions:
  • Find a good way to automatically generate synthetic data for training
  • Use custom GPU architectures to make training feasible
Talk Outline
• Background - Deep Learning and Recurrent Neural Networks
• The Connectionist Temporal Classification (CTC) model
• GPU optimizations
• Data capture and data synthesis
• Results
• Summary
Deep Learning
“A family of learning methods that use deep architectures to learn high-level feature representations”
A basic machine learning setup
• Given a dataset of training examples,
• input:
• output:
• Learn a function to predict correctly on new inputs.
• step I: pick a learning algorithm (SVM, log. reg., NN…)
• step II: optimize it w.r.t. a loss, i.e.:
Logistic regression - the “1-layer” network
• Model the classifier as:
• Learn the weight vector: using gradient-descent (next slide)
• σ is a non-linearity, e.g. the sigmoid function: σ(z) = 1 / (1 + e^(−z))
Training (log. regression) with gradient-descent
• Define the loss-function (squared error, cross entropy…):
• Derive the loss-function w.r.t. the weight vector, w:
• Perform gradient-descent:
• start with a random weight vector
• repeat until convergence:
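The gradient-descent recipe above can be sketched end-to-end for logistic regression. A minimal illustration, not the paper's code; the toy data, learning rate, and iteration count are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy (linearly separable) data: label is 1 when x0 + x1 is large.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.9, 0.9], [0.1, 0.2]])
y = np.array([0, 0, 0, 1, 1, 0])

w = np.zeros(2)   # start with a (here, zero) weight vector
b = 0.0
lr = 0.5

for _ in range(2000):
    p = sigmoid(X @ w + b)            # predictions
    grad_w = X.T @ (p - y) / len(y)   # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                  # gradient-descent step
    b -= lr * grad_b

preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print(preds)  # should match y on this separable toy set
```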
Multilayer Perceptron (MLP) - a multi-layer NN
“high level” features
• Model the classifier as:
• Can be seen as multilayer logistic regression
• a.k.a. a feed-forward NN
Training (an MLP) with Backpropagation:
• Assume two outputs per input:
• Define the loss-function per example:
• Derive the loss-function w.r.t. the last layer:
• Derive the loss function w.r.t. the first layer:
• Update the weights:
Why deeper is better?
• A deeper architecture is more expressive than a shallow one given the same number of nodes [Bishop, 1995]
  • 1-layer nets (log. regression) can only model linear hyperplanes
  • 2-layer nets can model any continuous function (given sufficient nodes)
  • ≥3-layer nets can do so with fewer nodes
Example - the XOR problem:
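The backpropagation steps from the previous slides can be sketched on the XOR problem itself, which a 1-layer net cannot solve. Layer sizes, learning rate, seed, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR truth table

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 8 units; a 1-layer (logistic) model cannot fit XOR.
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)
lr = 0.5

for _ in range(20000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: derive the loss w.r.t. the last layer, then the first
    d_out = (out - y) * out * (1 - out)   # squared-error gradient
    d_h = (d_out @ W2.T) * h * (1 - h)
    # update the weights
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print((out > 0.5).astype(int).ravel())  # should recover XOR: 0 1 1 0
```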
Recurrent Neural Networks (RNN’s)
• Enable variable-length inputs (sequences)
• Model internal structure in the input or output
• Introduce a “memory/context” component to utilize history
[Diagram: an RNN, with input and context feeding a hidden layer that produces the output]
• “Horizontally deep” architecture
• Recurrence equations:
  • Transition function: h_t = H(h_{t−1}, x_t) = tanh(W·x_t + U·h_{t−1} + b)
  • Output function: y_t = Y(h_t), usually implemented as softmax
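The recurrence equations can be sketched as a forward pass over a short sequence. Dimensions, the weight initialization, and the (here linear) output function Y are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2

W = rng.normal(0, 0.1, (n_hid, n_in))   # input-to-hidden weights
U = rng.normal(0, 0.1, (n_hid, n_hid))  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hid)
V = rng.normal(0, 0.1, (n_out, n_hid))  # hidden-to-output weights

xs = rng.normal(0, 1, (4, n_in))  # a length-4 input sequence
h = np.zeros(n_hid)               # initial context
ys = []
for x_t in xs:
    h = np.tanh(W @ x_t + U @ h + b)  # transition function H
    ys.append(V @ h)                  # output function Y
print(np.array(ys).shape)  # (4, 2): one output per time step
```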
The Softmax Function
• Enables outputting a probability distribution over k possible classes
• Can be seen as trying to minimize the cross-entropy between the predictions and the truth
• The inputs y_j usually hold log-likelihood values

p(x = i) = e^(y_i) / Σ_{j=1..k} e^(y_j)
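The softmax function is a one-liner; the max-shift below is a standard numerical-stability detail (an implementation note, not from the slide) that cancels in the ratio.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))  # shift by max(y) for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(round(p.sum(), 6))  # 1.0: a valid probability distribution
print(p.argmax())         # 2: the largest logit gets the largest probability
```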
Training (RNN’s) with Backpropagation Through Time
• As before, define a loss function (per sample, through time t = 1, 2, …, T):
  Loss = J(Θ, x) = Σ_{t=1..T} J_t(Θ, x_t)
• Derive the loss function w.r.t. the parameters Θ, starting at t = T:
  ∇Θ = ∂J_t/∂Θ
• Backpropagate through time - update and repeat for t − 1, t − 2, …, until t = 1:
  ∇Θ = ∇Θ + ∂J_t/∂Θ
• Eventually, update the weights:
  Θ ← Θ − η·∇Θ
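Backpropagation through time can be sketched on a scalar RNN h_t = tanh(w·x_t + u·h_{t−1}) with loss J = Σ_t ½(h_t − d_t)²: the gradient is accumulated backwards from t = T down to t = 1, and a finite-difference check confirms it. All values are illustrative assumptions.

```python
import numpy as np

def forward(w, u, xs):
    hs, h = [], 0.0
    for x in xs:
        h = np.tanh(w * x + u * h)
        hs.append(h)
    return hs

def loss(w, u, xs, ds):
    return sum(0.5 * (h - d) ** 2 for h, d in zip(forward(w, u, xs), ds))

def bptt_grad_u(w, u, xs, ds):
    hs = forward(w, u, xs)
    grad, dh = 0.0, 0.0                  # gradient accumulator starts at zero
    for t in reversed(range(len(xs))):   # t = T, T-1, ..., 1
        dh += hs[t] - ds[t]              # dJ_t/dh_t joins the running gradient
        da = dh * (1 - hs[t] ** 2)       # backprop through the tanh
        h_prev = hs[t - 1] if t > 0 else 0.0
        grad += da * h_prev              # accumulate dJ/du, as in the slide
        dh = da * u                      # pass the gradient on to h_{t-1}
    return grad

xs = [0.5, -1.0, 0.8]
ds = [0.1, 0.2, -0.3]
w, u, eps = 0.7, 0.3, 1e-6
analytic = bptt_grad_u(w, u, xs, ds)
numeric = (loss(w, u + eps, xs, ds) - loss(w, u - eps, xs, ds)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # True: BPTT matches finite differences
```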
Back to speech recognition: the basic idea
[Figure: spectrogram frames mapped by the network to the output characters “T”, “H”, “E”]
Connectionist Temporal Classification (CTC) (Graves et al., 2006)
= THE-CAT
_ _ _ _TH_ _ _ _EEE_ _-_ _ _C_ _ _AAA_ _ _ _ _TT_ _-
_ _TTT H_ _ _ _E_ _-_ _ CC_ _ _A_A_ _ _ TT_ T_ _ _ _ -
…
_ _THE_ _ _-_ _ _ _ _ _ _ _ _ _ CA_ T _ _ _ _ _ _ _ _ _ -_
_THE_ _ _-_ _ CA_ T _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ -
_T_H_E_-_C_A_T_
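The collapsing rule these alignments illustrate can be sketched directly: merge repeated characters, then drop the blank symbol ("_" here stands for blank, following the slide's notation). A blank between two identical characters keeps them distinct, which is how CTC can output doubled letters.

```python
from itertools import groupby

def collapse(path, blank="_"):
    merged = [ch for ch, _ in groupby(path)]            # merge repeats
    return "".join(ch for ch in merged if ch != blank)  # then drop blanks

print(collapse("_T_H_E_-_C_A_T_"))           # THE-CAT
print(collapse("__TTH___EEE_-_CC_AA__T__"))  # THE-CAT
print(collapse("AA_A"))                      # AA: blank separates the repeat
```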
The CTC Model (Graves et al., 2006)
• At each step, the network can output a “blank” label or any character in the vocabulary L
• Transform the network outputs into a conditional probability distribution over label sequences, which lets us compute the probability of each path:
• The total probability of any label sequence can be found by summing the probabilities of the different paths leading to it:
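The two probabilities just described can be demonstrated by brute force on a problem small enough to enumerate every path: p(path) is the product of the per-frame probabilities along it, and p(sequence) sums p(path) over all paths collapsing to that sequence. The 3-frame toy distribution is an illustrative assumption, not real network output.

```python
import itertools
import numpy as np

labels = ["_", "A", "B"]          # index 0 is the blank
T = 3
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(3), size=T)  # per-frame distribution over labels

def collapse(path):
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != "_":
            out.append(ch)
        prev = ch
    return "".join(out)

# Sum p(path) into buckets keyed by the collapsed label sequence.
total = {}
for idxs in itertools.product(range(3), repeat=T):
    p_path = np.prod([probs[t, i] for t, i in enumerate(idxs)])
    seq = collapse("".join(labels[i] for i in idxs))
    total[seq] = total.get(seq, 0.0) + p_path

print(abs(sum(total.values()) - 1.0) < 1e-9)  # True: a valid distribution
```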
Training CTC with Gradient Descent
• It is computationally expensive to sum over all possible paths
• The solution: efficiently sum over the possible paths by iteratively expanding prefixes and suffixes that match the desired output sequence, using the Forward-Backward (dynamic programming) algorithm:
• We get that:
• So, to compute the probability of the correct output, we can (efficiently) calculate:
• And the loss function we would like to minimize over the training set while performing SGD is:
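A sketch of the forward half of that Forward-Backward computation: the standard alpha recursion from Graves et al. (2006), which computes p(l|x) in O(T·|l|) time and is checked here against brute-force path enumeration. The toy frame-wise probabilities are illustrative assumptions.

```python
import itertools
import numpy as np

def ctc_forward(probs, label, blank=0):
    # Extend the label with blanks: l' = _, l1, _, l2, _, ...
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S, T = len(ext), len(probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]    # start in a blank...
    alpha[0, 1] = probs[0, ext[1]]   # ...or in the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # skip a blank only between two *different* non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[-1, -1] + alpha[-1, -2]  # end in final blank or final label

# Check against brute-force enumeration on a tiny problem.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=4)  # 4 frames, labels {blank, 1, 2}
label = [1, 2]

def collapse(idxs, blank=0):
    out, prev = [], None
    for i in idxs:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

brute = sum(np.prod([probs[t, i] for t, i in enumerate(p)])
            for p in itertools.product(range(3), repeat=4)
            if collapse(p) == label)
print(abs(ctc_forward(probs, label) - brute) < 1e-12)  # True
```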
GPU Parallelization
[Diagram: the network split across GPU I and GPU II]
• 5 billion connections in the network
• Data Parallelism
• Model Parallelism
• Resulting in 2x Speedup
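The data-parallelism half of this can be illustrated with a toy model: each "GPU" computes the gradient on its half of the minibatch, and averaging the two recovers the full-batch gradient. A linear least-squares model stands in for the 5-billion-connection network (an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (8, 3))
y = rng.normal(0, 1, 8)
w = rng.normal(0, 1, 3)

def grad(Xb, yb, w):
    return Xb.T @ (Xb @ w - yb) / len(yb)  # mean squared-error gradient

full = grad(X, y, w)                    # one device sees the whole batch
g1 = grad(X[:4], y[:4], w)              # "GPU I": first half of the batch
g2 = grad(X[4:], y[4:], w)              # "GPU II": second half
print(np.allclose((g1 + g2) / 2, full)) # True: averaging recovers the gradient
```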
Synthetic Data
• “Inflated” the training data with synthesized data, going from about 7,000 hours to over 100,000 hours
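One way such data is synthesized (the Deep Speech paper superimposes noise tracks onto clean utterances) can be sketched as mixing at a target signal-to-noise ratio. The signals and the 10 dB target below are illustrative stand-ins, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)  # stand-in for a clean utterance
noise = rng.normal(0, 1, sr)         # stand-in for a captured noise track

def mix_at_snr(clean, noise, snr_db):
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # scale noise so 10*log10(p_clean / p_noise_scaled) equals snr_db
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

noisy = mix_at_snr(clean, noise, snr_db=10.0)
print(noisy.shape == clean.shape)  # True: same utterance, new training example
```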
Results
• Measured in Word Error Rate (WER)
• Over the Switchboard Hub5’00 dataset
• State of the art, esp. on noisy data
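WER is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A minimal sketch:

```python
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # standard dynamic-programming edit distance, over words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# one substitution out of four reference words -> 0.25
print(wer("deep speech scales well", "deep speech scale well"))  # 0.25
```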
Conclusions
• “Traditional” speech recognition systems are complex
• End-to-end deep learning models are simpler, but require lots of training data and are computationally expensive
• By creating synthetic data and using multiple GPUs, state-of-the-art results are achieved
Questions?
References
• Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks (Graves et al., 2006)
• Deep Speech: Scaling up end-to-end speech recognition (Hannun et al., 2014)
• Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Amodei et al., 2015)
• Stanford Seminar - Awni Hannun of Baidu Research