Deep Speech: Scaling up end-to-end deep learning for speech

Deep Learning: Speech Recognition

Awni Hannun

Data (audio)

Output

Phoneme Model

Language Model

Acoustic model

“The quick brown fox jumps over the lazy dog”

ðə kwɪk brawn fɑks dʒəmps ovər ðə lezi dɒg.

Data (audio)

Output

Phoneme Model

Language Model

Deep Neural Network Acoustic Model

P(y | x; M) = P(ah | x0) * P(ax | x1) * P(eh | x2)
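The factorization above scores a fixed frame-level labeling by multiplying one posterior per frame. A minimal sketch of that product, assuming made-up posterior values (the numbers and the names `frame_posteriors` and `alignment_probability` are hypothetical, not from the talk):

```python
import math

# Hypothetical per-frame phoneme posteriors P(phoneme | x_t) from an acoustic model.
frame_posteriors = [
    {"ah": 0.7, "ax": 0.2, "eh": 0.1},  # frame x0
    {"ah": 0.1, "ax": 0.8, "eh": 0.1},  # frame x1
    {"ah": 0.2, "ax": 0.2, "eh": 0.6},  # frame x2
]


def alignment_probability(posteriors, labels):
    """Product of per-frame probabilities for one frame-level labeling."""
    if len(posteriors) != len(labels):
        raise ValueError("need exactly one label per frame")
    # Sum logs instead of multiplying directly to avoid underflow on long utterances.
    log_p = sum(math.log(frame[label]) for frame, label in zip(posteriors, labels))
    return math.exp(log_p)


# P(ah | x0) * P(ax | x1) * P(eh | x2) = 0.7 * 0.8 * 0.6
p = alignment_probability(frame_posteriors, ["ah", "ax", "eh"])
```

Note that this requires a label for every frame, which is why the alignment question on the next slide matters.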

DNN Acoustic Model – Alignment

“the quick brown”

Where do the labels come from?

th ah qw eh ke ba ra ow en

Expand to phonemes

Bootstrapped Recognizer

+ audio

th th th ah ah qw qw eh eh ke ke ke ba ba ba ra ra ra ow ow ow en en en

alignment

Retrain recognizer
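The alignment step turns a short phoneme sequence into one label per audio frame. A minimal sketch of that expansion, assuming the per-phoneme frame counts are already known (in the pipeline above they would come from the bootstrapped recognizer; here `durations` is hard-coded to reproduce the slide, and `expand_alignment` is a hypothetical helper name):

```python
def expand_alignment(phonemes, durations):
    """Repeat each phoneme for the number of frames it spans."""
    if len(phonemes) != len(durations):
        raise ValueError("need one duration per phoneme")
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


phonemes = "th ah qw eh ke ba ra ow en".split()
# Frame counts read off the slide's alignment; a real system would estimate these.
durations = [3, 2, 2, 2, 3, 3, 3, 3, 3]
frame_labels = expand_alignment(phonemes, durations)
print(" ".join(frame_labels))
```

The resulting frame labels are the training targets for the DNN acoustic model, so errors in the bootstrapped alignment carry over into the retrained recognizer.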

Data (audio)

Output

Phoneme Model

Language Model

Data (audio)

Output

Deep Speech – Key ingredients

• Model
  • No alignment needed, using objective from [Graves, Fernandez, Gomez and Schmidhuber, 2006]

• Data

• Computation (GPUs)

Two hard modeling problems

y = “the quick brown …”

1. Loss function – compute P(y | x; M)

2. Inference – Find y* = argmax_y P(y | x; M)

Must handle variable length input and output

Deep Speech – Recurrent Neural Network

Output: alphabet, space, & blank (e.g. t, h, _ where _ is the blank)

Deep Speech – CTC

No alignment needed!

P(_ _ T H _ _ _ _ E _ – _ C _ _ A A A _ _ T T _ _ –)

P(_ T _ _ H _ _ E E _ _ – _ C _ _ A A _ _ T _ _ _ –)

P(THE–CAT–)
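Both frame-level paths on this slide reduce to the same transcript once repeated symbols are merged and blanks are dropped, and CTC scores a transcript by summing over all such paths. A minimal sketch of the collapse rule and the standard forward recursion for that sum, assuming `_` denotes the blank and `-` stands in for the space symbol (the function names and the list-of-lists `probs` layout are hypothetical, not the talk's implementation):

```python
from itertools import groupby

BLANK = "_"


def collapse(path):
    """Merge repeated symbols, then remove blanks."""
    merged = [symbol for symbol, _group in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != BLANK)


def ctc_probability(probs, target, alphabet):
    """P(target | x), summed over every path that collapses to `target`.

    probs[t][k] is the network's probability of alphabet[k] at frame t.
    """
    index = {symbol: k for k, symbol in enumerate(alphabet)}
    # Interleave blanks around the target: "AB" becomes _ A _ B _
    extended = [BLANK]
    for symbol in target:
        extended += [symbol, BLANK]

    alpha = [0.0] * len(extended)
    alpha[0] = probs[0][index[extended[0]]]
    if len(extended) > 1:
        alpha[1] = probs[0][index[extended[1]]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * len(extended)
        for s, symbol in enumerate(extended):
            total = alpha[s]  # stay on the same position
            if s >= 1:
                total += alpha[s - 1]  # advance by one position
            # Skipping the blank between two symbols is allowed only if they differ.
            if s >= 2 and symbol != BLANK and symbol != extended[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][index[symbol]]
        alpha = new_alpha

    # Valid paths end on the last symbol or on the trailing blank.
    return alpha[-1] + (alpha[-2] if len(extended) > 1 else 0.0)


# The two paths from the slide, written without separators.
path_1 = "__TH____E_-_C__AAA__TT__-"
path_2 = "_T__H__EE__-_C__AA__T___-"
print(collapse(path_1), collapse(path_2))
```

The recursion runs in time proportional to the number of frames times the transcript length, instead of enumerating every path, which is what makes the loss practical to compute during training.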

Deep Speech – Data

Translated house, reflected house, rotated house

Data – Synthetic

Deep Speech – Data

Speech

Noisy Speech
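Noisy speech can be synthesized by adding a noise recording to clean speech. A minimal sketch, assuming the noise is scaled to hit a chosen signal-to-noise ratio (the slides do not say how the mixing level was set, so the SNR parameter, the sine-wave stand-in for speech, and the name `mix_at_snr` are all assumptions for illustration):

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mix has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent")
    # Target noise power is speech_power / 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


rng = random.Random(0)
sample_rate = 16000
# Stand-ins for real recordings: one second of a 440 Hz tone and of Gaussian noise.
speech = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Because one clean utterance can be paired with many noise clips and SNR levels, this kind of mixing multiplies the amount of training data without new transcription.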

Deep Speech – Hours of speech data

WSJ: 80
Switchboard: 300
Fisher: 2000
Deep Speech: >100,000 (synthesized data)

Deep Speech – Mandarin

Mandarin is a tonal language

Model can learn pitch from Spectrogram

Thousands of characters! > 80K

Pinyin?
