Deep Speech: Scaling Up End-to-End Speech Recognition Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng Presentation by Roee Aharoni Speech Processing and Recognition Seminar 2016



Page 1

Deep Speech: Scaling Up End-to-End Speech Recognition

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates,

Andrew Y. Ng

Presentation by Roee Aharoni Speech Processing and Recognition Seminar 2016

Page 2

“Traditional” Speech Recognition Systems are Complex…

Page 3

“Traditional” Speech Recognition Systems are Complex…

Page 4

“Traditional” Speech Recognition Systems are Complex…

• Composed of multiple hand-engineered components:
  • Acoustic models
  • Speaker adaptation
  • Noise filtering
  • Hidden Markov Models
  • …

Page 5

“Traditional” Speech Recognition Systems are Complex…

• Composed of multiple hand-engineered components:
  • Acoustic models
  • Speaker adaptation
  • Noise filtering
  • Hidden Markov Models
  • …

• Will the deep learning approach help? “One network to rule them all”

Page 6

…Deep learning has its flaws

• Main problems:
  • Requires lots of training data
  • Training large neural networks is computationally expensive

• Possible solutions:
  • Find a good way to automatically generate synthetic data for training
  • Use custom GPU architectures to make training feasible

Page 7

Talk Outline

• Background - Deep Learning and Recurrent Neural Networks

• The Connectionist Temporal Classification (CTC) model

• GPU optimizations

• Data capture and data synthesis

• Results

• Summary

Page 8

Deep Learning

“A family of learning methods that use deep architectures to learn high-level feature representations”

Page 9

A basic machine learning setup

• Given a dataset of n training examples,

• input: x_i

• output: y_i

• Learn a function to predict correctly on new inputs.

• Step I: pick a learning algorithm (SVM, log. reg., NN…)

• Step II: optimize it w.r.t. a loss

Page 10

Logistic regression - the “1-layer” network

• Model the classifier as: y = σ(w · x)

• Learn the weight vector w using gradient-descent (next slide)

• σ is a non-linearity, e.g. the sigmoid function:

σ(z) = 1 / (1 + e^(−z))
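As a minimal sketch (not from the paper), the logistic unit above can be written in a few lines of Python; `w` and `x` are assumed to be NumPy vectors:

```python
import numpy as np

def sigmoid(z):
    # the sigmoid non-linearity: sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, x):
    # logistic regression: squash the linear score w . x into (0, 1)
    return sigmoid(np.dot(w, x))
```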

Page 11

Training (log. regression) with gradient-descent

• Define the loss-function (squared error, cross entropy…):

• Derive the loss-function w.r.t. the weight vector, w:

• Perform gradient-descent:

• start with a random weight vector

• repeat until convergence:
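The loop above can be sketched for logistic regression with the cross-entropy loss. This is a toy illustration, not the slide's exact setup: the bias is folded into `X` as a constant column, and the learning rate and epoch count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    # start with a weight vector (zeros here; random also works)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)              # predictions for all examples
        grad = X.T @ (p - y) / len(y)   # gradient of the cross-entropy loss
        w -= lr * grad                  # gradient-descent step
    return w
```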

Page 12

Multi-layer perceptron (MLP) - a multi-layer NN

[Figure: stacked layers computing “high level” features]

• Model the classifier as a stack of layers: y = σ(W₂ · σ(W₁ · x))

• Can be seen as multi-layer logistic regression

• a.k.a. feed-forward NN
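The layered view above suggests a simple forward pass, sketched here with sigmoid activations throughout (an assumption for illustration; real MLPs mix activation functions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, layers):
    # feed-forward pass: each layer is a (W, b) pair, h <- sigma(W h + b);
    # successive hidden layers compute increasingly "high level" features
    h = x
    for W, b in layers:
        h = sigmoid(W @ h + b)
    return h
```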

Page 13

Training (an MLP) with Backpropagation:

Page 14

Training (an MLP) with Backpropagation:

• Assume two outputs per input:

• Define the loss-function per example:

• Derive the loss-function w.r.t. the last layer:

• Derive the loss function w.r.t. the first layer:

• Update the weights:

Page 15

Why is deeper better?

• A deeper architecture is more expressive than a shallow one given the same number of nodes [Bishop, 1995]

• 1-layer nets (log. regression) can only model linear hyperplanes

• 2-layer nets can model any continuous function (given sufficient nodes)

• >3-layer nets can do so with fewer nodes

Example - the XOR problem:
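To make the XOR point concrete, here is a hand-wired 2-layer net (weights chosen by hand, hard-threshold activations) that computes XOR, something no 1-layer net can do since XOR is not linearly separable:

```python
import numpy as np

def step(z):
    # hard-threshold activation: 1 if z > 0, else 0
    return (np.asarray(z) > 0).astype(float)

def xor_net(x1, x2):
    # hidden layer: an OR unit and an AND unit
    h = step([x1 + x2 - 0.5,    # fires when x1 OR x2
              x1 + x2 - 1.5])   # fires when x1 AND x2
    # output unit: "OR but not AND" == XOR
    return float(step(h[0] - h[1] - 0.5))
```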

Page 16

Recurrent Neural Networks (RNNs)

• Enable variable-length inputs (sequences)

• Model internal structure in the input or output

• Introduce a “memory/context” component to utilize history

[Figure: RNN with input, context, hidden, and output layers]

Page 17

Recurrent Neural Networks (RNNs)

• “Horizontally deep” architecture

• Recurrence equations:

• Transition function: h_t = H(h_{t−1}, x_t) = tanh(W·x_t + U·h_{t−1} + b)

• Output function: y_t = Y(h_t), usually implemented as softmax
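The recurrence above translates directly into code; this sketch (illustrative names, NumPy assumed) unrolls the transition function over a variable-length sequence:

```python
import numpy as np

def rnn_step(W, U, b, h_prev, x_t):
    # one recurrence step: h_t = tanh(W x_t + U h_{t-1} + b)
    return np.tanh(W @ x_t + U @ h_prev + b)

def rnn_forward(W, U, b, xs, h0):
    # unroll the same step over a variable-length input sequence,
    # threading the hidden state (the "memory") through time
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(W, U, b, h, x_t)
        states.append(h)
    return states
```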

Page 18

The Softmax Function

• Outputs a probability distribution over k possible classes

• Can be seen as trying to minimize the cross-entropy between the predictions and the truth

• The input vector usually holds log-likelihood values

p(x = i) = e^{y_i} / Σ_{j=1}^{k} e^{y_j}
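The formula above, in code (with the standard max-shift trick for numerical stability, which does not change the result):

```python
import numpy as np

def softmax(y):
    # p_i = e^{y_i} / sum_j e^{y_j}; shifting by max(y) avoids overflow
    e = np.exp(y - np.max(y))
    return e / e.sum()
```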

Page 19

Training (RNNs) with Backpropagation Through Time

• As before, define a loss function (per sample, through time t = 1, 2, …, T):

Loss = J(Θ, x) = Σ_{t=1}^{T} J_t(Θ, x_t)

• Derive the loss function w.r.t. the parameters Θ, starting at t = T:

∇Θ = ∂J_t / ∂Θ

• Backpropagate through time - update and repeat for t − 1, until t = 1:

∇Θ = ∇Θ + ∂J_t / ∂Θ

• Eventually, update the weights:

Θ = Θ − η∇Θ

Page 20

Back to speech recognition: the basic idea

[Figure: a spectrogram fed frame-by-frame to a network that outputs the characters “T H E”]

Page 21

Connectionist Temporal Classification (CTC) (Graves et al., 2006)

= THE-CAT

_ _ _ _TH_ _ _ _EEE_ _-_ _ _C_ _ _AAA_ _ _ _ _TT_ _-

_ _TTT H_ _ _ _E_ _-_ _ CC_ _ _A_A_ _ _ TT_ T_ _ _ _ -

_ _THE_ _ _-_ _ _ _ _ _ _ _ _ _ CA_ T _ _ _ _ _ _ _ _ _ -_

_THE_ _ _-_ _ CA_ T _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ -

Page 22

Connectionist Temporal Classification (CTC) (Graves et al., 2006)

_ _ _ _TH_ _ _ _EEE_ _-_ _ _C_ _ _AAA_ _ _ _ _TT_ _-

_ _TTT H_ _ _ _E_ _-_ _ CC_ _ _A_A_ _ _ TT_ T_ _ _ _ -

_ _THE_ _ _-_ _ _ _ _ _ _ _ _ _ CA_ T _ _ _ _ _ _ _ _ _ -_

_THE_ _ _-_ _ CA_ T _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

= THE-CAT

_T_H_E_-_C_A_T_
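The collapsing rule the examples illustrate (merge consecutive repeats, then drop blanks) is a few lines of code; `_` stands for the blank label, as on the slide:

```python
def ctc_collapse(path, blank='_'):
    # collapse a CTC path: merge consecutive repeated labels, then drop blanks
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return ''.join(out)
```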

Page 23

The CTC Model (Graves et al., 2006)

• At each step, the network can output a “blank” label or any character in the vocabulary L

• Transform the network outputs into a conditional probability distribution over label sequences, which enables computing the probability for each path:

• The total probability of any label sequence can be found by summing the probabilities of the different paths leading to it:

_ _ _ _TH_ _ _ _EEE_ _-_ _ _C_ _ _AAA_ _ _ _ _TT_ _-

_ _TTT H_ _ _ _E_ _-_ _ CC_ _ _A_A_ _ _ TT_ T_ _ _ _ -

_ _THE_ _ _-_ _ _ _ _ _ _ _ _ _ CA_ T _ _ _ _ _ _ _ _ _ -_

_THE_ _ _-_ _ CA_ T _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

= THE-CAT

_T_H_E_-_C_A_T_
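The "sum over all paths" can be made concrete with a deliberately brute-force toy (illustrative only; it enumerates every length-T path, which is exactly what the efficient algorithm on the next slides avoids):

```python
from itertools import product

def ctc_collapse(path, blank='_'):
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return ''.join(out)

def total_probability(probs, labels, target, blank='_'):
    """Sum p(path) over every path whose collapse equals target.
    probs[t][i] = network output probability for labels[i] at time t."""
    total = 0.0
    T = len(probs)
    for path in product(range(len(labels)), repeat=T):
        p = 1.0
        for t, i in enumerate(path):
            p *= probs[t][i]
        if ctc_collapse(''.join(labels[i] for i in path), blank) == target:
            total += p
    return total
```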

Page 24

Training CTC with Gradient Descent

• It is computationally expensive to sum over all possible paths

• The solution: use the Forward-Backward algorithm to efficiently sum over the possible paths, iteratively expanding prefixes and suffixes that match the desired output sequence:
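The forward half of that algorithm can be sketched as follows (a simplified version of the recursion in Graves et al., 2006, in plain probabilities rather than log space): the target is interleaved with blanks, and alpha[t][s] accumulates the probability of all path prefixes that reach extended-label position s at time t, in O(T·|target|) instead of enumerating all paths.

```python
def ctc_forward(probs, labels, target, blank='_'):
    """p(target | x) via the CTC forward (alpha) recursion.
    probs[t][i] = network output probability for labels[i] at time t."""
    # extend the target with blanks: _l1_l2..._lL_
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(probs)
    idx = {l: i for i, l in enumerate(labels)}
    alpha = [[0.0] * S for _ in range(T)]
    # a path can start with a blank or with the first real label
    alpha[0][0] = probs[0][idx[ext[0]]]
    if S > 1:
        alpha[0][1] = probs[0][idx[ext[1]]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s > 0:
                a += alpha[t - 1][s - 1]
            # skipping a blank is allowed when the current label is not blank
            # and differs from the label two positions back
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][idx[ext[s]]]
    # paths may end on the last label or on the trailing blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
```

On the tiny two-step example it agrees exactly with the brute-force sum over all paths.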

Page 25

Training CTC with Gradient Descent

• We get that:

• So, to compute the probability of the correct output, we can (efficiently) calculate:

• And the loss function we would like to minimize over the training set while performing SGD is:

_ _ _ _TH_ _ _ _EEE_ _-_ _ _C_ _ _AAA_ _ _ _ _TT_ _-

_ _TTT H_ _ _ _E_ _-_ _ CC_ _ _A_A_ _ _ TT_ T_ _ _ _ -

_ _THE_ _ _-_ _ _ _ _ _ _ _ _ _ CA_ T _ _ _ _ _ _ _ _ _ -_

_THE_ _ _-_ _ CA_ T _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

= THE-CAT

_T_H_E_-_C_A_T_

Page 26

GPU Parallelization

• 5 billion connections in the network

• Data Parallelism

• Model Parallelism

• Resulting in a 2x speedup

[Figure: the network split across GPU I and GPU II]
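Data parallelism boils down to "each replica computes a gradient on its own shard of the minibatch, gradients are averaged, and all replicas take the same step." This is a schematic sketch of that idea, not the paper's CUDA implementation:

```python
import numpy as np

def data_parallel_step(w, grads_per_gpu, lr=0.01):
    # average the per-replica gradients (as if all-reduced across GPUs),
    # then apply one synchronized weight update shared by every replica
    avg_grad = np.mean(grads_per_gpu, axis=0)
    return w - lr * avg_grad
```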

Page 27

Synthetic Data

Page 28

Synthetic Data

“Inflated” the training data with synthesized data, going from about 7,000 hours to over 100,000 hours
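One common way to synthesize such data is to superimpose noise tracks on clean utterances at a chosen signal-to-noise ratio. This sketch (an illustration of the idea, not the paper's pipeline) scales the noise so the mixture hits a target SNR in dB:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    # scale the noise so that 10*log10(P_speech / P_noise_scaled) == snr_db,
    # then superimpose it on the clean signal
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```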

Page 29

Results

• Measured in Word Error Rate (WER)

• Over the Switchboard Hub5’00 dataset

• State of the art, esp. on noisy data
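WER is the word-level edit distance (substitutions + insertions + deletions) between the hypothesis and the reference transcript, divided by the number of reference words. A straightforward dynamic-programming sketch:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```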

Page 30

Conclusions

• “Traditional” speech recognition systems are complex

• End-to-end deep learning models are simpler, but require lots of training data and are computationally expensive

• By creating synthetic data and using multiple GPUs, state-of-the-art results are achieved

Page 31

Questions?

Page 32

References

• Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks (Graves et al., 2006)

• Deep Speech: Scaling up end-to-end speech recognition (Hannun et al., 2014)

• Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Amodei et al., 2015)

• Stanford Seminar - Awni Hannun of Baidu Research