Deep Speech: Scaling Up End-to-End Speech Recognition
Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates,
Andrew Y. Ng
Presentation by Roee Aharoni Speech Processing and Recognition Seminar 2016
“Traditional” Speech Recognition Systems are Complex…
• Composed of multiple hand-engineered components:
  • Acoustic models
  • Speaker adaptation
  • Noise filtering
  • Hidden Markov Models
  • …
• Will the deep learning approach help? “One network to rule them all”
…Deep learning has its flaws
• Main problems:
  • Requires lots of training data
  • Training large neural networks is computationally expensive
• Possible solutions:
  • Find a good way to automatically generate synthetic data for training
  • Use custom GPU architectures to make training feasible
Talk Outline
• Background - Deep Learning and Recurrent Neural Networks
• The Connectionist Temporal Classification (CTC) model
• GPU optimizations
• Data capture and data synthesis
• Results
• Summary
Deep Learning
“A family of learning methods that use deep architectures to learn high-level feature representations”
A basic machine learning setup
• Given a dataset of training examples,
• input:
• output:
• Learn a function to predict correctly on new inputs.
• step I: pick a learning algorithm (SVM, log. reg., NN…)
• step II: optimize it w.r.t. a loss, i.e.:
Logistic regression - the “1-layer” network
• Model the classifier as:
• Learn the weight vector: using gradient-descent (next slide)
• σ is a non-linearity, e.g. the sigmoid function: σ(z) = 1 / (1 + e^(−z))
Training (log. regression) with gradient-descent
• Define the loss-function (squared error, cross entropy…):
• Derive the loss-function w.r.t. the weight vector, w:
• Perform gradient-descent:
• start with a random weight vector
• repeat until convergence:
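The gradient-descent recipe above can be sketched end-to-end for logistic regression. A minimal illustration, not the paper's code; the toy data, learning rate, and iteration count are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy (linearly separable) data: label is 1 when x0 + x1 is large.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.9, 0.9], [0.1, 0.2]])
y = np.array([0, 0, 0, 1, 1, 0])

w = np.zeros(2)   # start with a (here, zero) weight vector
b = 0.0
lr = 0.5

for _ in range(2000):
    p = sigmoid(X @ w + b)            # predictions
    grad_w = X.T @ (p - y) / len(y)   # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w                  # gradient-descent step
    b -= lr * grad_b

preds = (sigmoid(X @ w + b) > 0.5).astype(int)
print(preds)  # should match y on this separable toy set
```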
Multilayer Perceptron (MLP) - a multi-layer NN
“high level” features
• Model the classifier as:
• Can be seen as multilayer logistic regression
• a.k.a. a feed-forward NN
Training (an MLP) with Backpropagation:
• Assume two outputs per input:
• Define the loss-function per example:
• Derive the loss-function w.r.t. the last layer:
• Derive the loss function w.r.t. the first layer:
• Update the weights:
Why deeper is better?
• A deeper architecture is more expressive than a shallow one given the same number of nodes [Bishop, 1995]
  • 1-layer nets (log. regression) can only model linear hyperplanes
  • 2-layer nets can model any continuous function (given sufficient nodes)
  • ≥3-layer nets can do so with fewer nodes
Example - the XOR problem:
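The backpropagation steps from the previous slides can be sketched on the XOR problem itself, which a 1-layer net cannot solve. Layer sizes, learning rate, seed, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR truth table

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 8 units; a 1-layer (logistic) model cannot fit XOR.
W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)
lr = 0.5

for _ in range(20000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: derive the loss w.r.t. the last layer, then the first
    d_out = (out - y) * out * (1 - out)   # squared-error gradient
    d_h = (d_out @ W2.T) * h * (1 - h)
    # update the weights
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print((out > 0.5).astype(int).ravel())  # should recover XOR: 0 1 1 0
```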
Recurrent Neural Networks (RNN’s)
• Enable variable-length inputs (sequences)
• Model internal structure in the input or output
• Introduce a “memory/context” component to utilize history
[Diagram: an RNN, with input and context feeding a hidden layer that produces the output]
• “Horizontally deep” architecture
• Recurrence equations:
  • Transition function: h_t = H(h_{t−1}, x_t) = tanh(W·x_t + U·h_{t−1} + b)
  • Output function: y_t = Y(h_t), usually implemented as softmax
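The recurrence equations can be sketched as a forward pass over a short sequence. Dimensions, the weight initialization, and the (here linear) output function Y are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2

W = rng.normal(0, 0.1, (n_hid, n_in))   # input-to-hidden weights
U = rng.normal(0, 0.1, (n_hid, n_hid))  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hid)
V = rng.normal(0, 0.1, (n_out, n_hid))  # hidden-to-output weights

xs = rng.normal(0, 1, (4, n_in))  # a length-4 input sequence
h = np.zeros(n_hid)               # initial context
ys = []
for x_t in xs:
    h = np.tanh(W @ x_t + U @ h + b)  # transition function H
    ys.append(V @ h)                  # output function Y
print(np.array(ys).shape)  # (4, 2): one output per time step
```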
The Softmax Function
• Enables outputting a probability distribution over k possible classes
• Can be seen as trying to minimize the cross-entropy between the predictions and the truth
• The inputs y_j usually hold log-likelihood values

p(x = i) = e^(y_i) / Σ_{j=1..k} e^(y_j)
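The softmax function is a one-liner; the max-shift below is a standard numerical-stability detail (an implementation note, not from the slide) that cancels in the ratio.

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))  # shift by max(y) for numerical stability
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
print(round(p.sum(), 6))  # 1.0: a valid probability distribution
print(p.argmax())         # 2: the largest logit gets the largest probability
```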
Training (RNN’s) with Backpropagation Through Time
• As before, define a loss function (per sample, through time t = 1, 2, …, T):
  Loss = J(Θ, x) = Σ_{t=1..T} J_t(Θ, x_t)
• Derive the loss function w.r.t. the parameters Θ, starting at t = T:
  ∇Θ = ∂J_t/∂Θ
• Backpropagate through time - update and repeat for t − 1, t − 2, …, until t = 1:
  ∇Θ = ∇Θ + ∂J_t/∂Θ
• Eventually, update the weights:
  Θ ← Θ − η·∇Θ
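Backpropagation through time can be sketched on a scalar RNN h_t = tanh(w·x_t + u·h_{t−1}) with loss J = Σ_t ½(h_t − d_t)²: the gradient is accumulated backwards from t = T down to t = 1, and a finite-difference check confirms it. All values are illustrative assumptions.

```python
import numpy as np

def forward(w, u, xs):
    hs, h = [], 0.0
    for x in xs:
        h = np.tanh(w * x + u * h)
        hs.append(h)
    return hs

def loss(w, u, xs, ds):
    return sum(0.5 * (h - d) ** 2 for h, d in zip(forward(w, u, xs), ds))

def bptt_grad_u(w, u, xs, ds):
    hs = forward(w, u, xs)
    grad, dh = 0.0, 0.0                  # gradient accumulator starts at zero
    for t in reversed(range(len(xs))):   # t = T, T-1, ..., 1
        dh += hs[t] - ds[t]              # dJ_t/dh_t joins the running gradient
        da = dh * (1 - hs[t] ** 2)       # backprop through the tanh
        h_prev = hs[t - 1] if t > 0 else 0.0
        grad += da * h_prev              # accumulate dJ/du, as in the slide
        dh = da * u                      # pass the gradient on to h_{t-1}
    return grad

xs = [0.5, -1.0, 0.8]
ds = [0.1, 0.2, -0.3]
w, u, eps = 0.7, 0.3, 1e-6
analytic = bptt_grad_u(w, u, xs, ds)
numeric = (loss(w, u + eps, xs, ds) - loss(w, u - eps, xs, ds)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # True: BPTT matches finite differences
```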
Back to speech recognition: the basic idea
[Figure: spectrogram frames mapped by the network to the output characters “T”, “H”, “E”]
Connectionist Temporal Classification (CTC) (Graves et al., 2006)
= THE-CAT
_ _ _ _TH_ _ _ _EEE_ _-_ _ _C_ _ _AAA_ _ _ _ _TT_ _-
_ _TTT H_ _ _ _E_ _-_ _ CC_ _ _A_A_ _ _ TT_ T_ _ _ _ -
…
_ _THE_ _ _-_ _ _ _ _ _ _ _ _ _ CA_ T _ _ _ _ _ _ _ _ _ -_
_THE_ _ _-_ _ CA_ T _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ -
_T_H_E_-_C_A_T_
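The collapsing rule these alignments illustrate can be sketched directly: merge repeated characters, then drop the blank symbol ("_" here stands for blank, following the slide's notation). A blank between two identical characters keeps them distinct, which is how CTC can output doubled letters.

```python
from itertools import groupby

def collapse(path, blank="_"):
    merged = [ch for ch, _ in groupby(path)]            # merge repeats
    return "".join(ch for ch in merged if ch != blank)  # then drop blanks

print(collapse("_T_H_E_-_C_A_T_"))           # THE-CAT
print(collapse("__TTH___EEE_-_CC_AA__T__"))  # THE-CAT
print(collapse("AA_A"))                      # AA: blank separates the repeat
```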
The CTC Model (Graves et al., 2006)
• At each step, the network can output a “blank” label or any character in the vocabulary L
• Transform the network outputs into a conditional probability distribution over label sequences, which lets us compute the probability of each path:
• The total probability of any label sequence can be found by summing the probabilities of the different paths leading to it:
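The two probabilities just described can be demonstrated by brute force on a problem small enough to enumerate every path: p(path) is the product of the per-frame probabilities along it, and p(sequence) sums p(path) over all paths collapsing to that sequence. The 3-frame toy distribution is an illustrative assumption, not real network output.

```python
import itertools
import numpy as np

labels = ["_", "A", "B"]          # index 0 is the blank
T = 3
rng = np.random.default_rng(1)
probs = rng.dirichlet(np.ones(3), size=T)  # per-frame distribution over labels

def collapse(path):
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != "_":
            out.append(ch)
        prev = ch
    return "".join(out)

# Sum p(path) into buckets keyed by the collapsed label sequence.
total = {}
for idxs in itertools.product(range(3), repeat=T):
    p_path = np.prod([probs[t, i] for t, i in enumerate(idxs)])
    seq = collapse("".join(labels[i] for i in idxs))
    total[seq] = total.get(seq, 0.0) + p_path

print(abs(sum(total.values()) - 1.0) < 1e-9)  # True: a valid distribution
```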
Training CTC with Gradient Descent
• It is computationally expensive to sum over all possible paths
• The solution: efficiently sum over the possible paths by iteratively expanding prefixes and suffixes that match the desired output sequence, using the Forward-Backward (dynamic programming) algorithm:
• We get that:
• So, to compute the probability of the correct output, we can (efficiently) calculate:
• And the loss function we would like to minimize over the training set while performing SGD is:
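A sketch of the forward half of that Forward-Backward computation: the standard alpha recursion from Graves et al. (2006), which computes p(l|x) in O(T·|l|) time and is checked here against brute-force path enumeration. The toy frame-wise probabilities are illustrative assumptions.

```python
import itertools
import numpy as np

def ctc_forward(probs, label, blank=0):
    # Extend the label with blanks: l' = _, l1, _, l2, _, ...
    ext = [blank]
    for c in label:
        ext += [c, blank]
    S, T = len(ext), len(probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]    # start in a blank...
    alpha[0, 1] = probs[0, ext[1]]   # ...or in the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # skip a blank only between two *different* non-blank labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[-1, -1] + alpha[-1, -2]  # end in final blank or final label

# Check against brute-force enumeration on a tiny problem.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=4)  # 4 frames, labels {blank, 1, 2}
label = [1, 2]

def collapse(idxs, blank=0):
    out, prev = [], None
    for i in idxs:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

brute = sum(np.prod([probs[t, i] for t, i in enumerate(p)])
            for p in itertools.product(range(3), repeat=4)
            if collapse(p) == label)
print(abs(ctc_forward(probs, label) - brute) < 1e-12)  # True
```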
GPU Parallelization
[Diagram: the network split across GPU I and GPU II]
• 5 billion connections in the network
• Data Parallelism
• Model Parallelism
• Resulting in 2x Speedup
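The data-parallelism half of this can be illustrated with a toy model: each "GPU" computes the gradient on its half of the minibatch, and averaging the two recovers the full-batch gradient. A linear least-squares model stands in for the 5-billion-connection network (an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (8, 3))
y = rng.normal(0, 1, 8)
w = rng.normal(0, 1, 3)

def grad(Xb, yb, w):
    return Xb.T @ (Xb @ w - yb) / len(yb)  # mean squared-error gradient

full = grad(X, y, w)                    # one device sees the whole batch
g1 = grad(X[:4], y[:4], w)              # "GPU I": first half of the batch
g2 = grad(X[4:], y[4:], w)              # "GPU II": second half
print(np.allclose((g1 + g2) / 2, full)) # True: averaging recovers the gradient
```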
Synthetic Data
• “Inflated” the training data with synthesized data, going from about 7,000 hours to over 100,000 hours
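One way such data is synthesized (the Deep Speech paper superimposes noise tracks onto clean utterances) can be sketched as mixing at a target signal-to-noise ratio. The signals and the 10 dB target below are illustrative stand-ins, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)  # stand-in for a clean utterance
noise = rng.normal(0, 1, sr)         # stand-in for a captured noise track

def mix_at_snr(clean, noise, snr_db):
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # scale noise so 10*log10(p_clean / p_noise_scaled) equals snr_db
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

noisy = mix_at_snr(clean, noise, snr_db=10.0)
print(noisy.shape == clean.shape)  # True: same utterance, new training example
```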
Results
• Measured in Word Error Rate (WER)
• Over the Switchboard Hub5’00 dataset
• State of the art, esp. on noisy data
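WER is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A minimal sketch:

```python
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # standard dynamic-programming edit distance, over words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# one substitution out of four reference words -> 0.25
print(wer("deep speech scales well", "deep speech scale well"))  # 0.25
```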
Conclusions
• “Traditional” speech recognition systems are complex
• End-to-end deep learning models are simpler, but require lots of training data and are computationally expensive
• By creating synthetic data and using multiple GPUs, state-of-the-art results are achieved
Questions?
References
• Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks (Graves et al., 2006)
• Deep Speech: Scaling up end-to-end speech recognition (Hannun et al., 2014)
• Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Amodei et al., 2015)
• Stanford Seminar - Awni Hannun of Baidu Research