Deep Neural Network and Its Application in Speech Recognition
Dong Yu
Microsoft Research
Thanks to my collaborators:
Li Deng, Frank Seide, Gang Li, Mike Seltzer, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Jian Xue, Yan Huang, Adam Eversole, George Dahl, Abdel-rahman Mohamed, Xie Chen, Hang Su, Ossama Abdel-Hamid, Eric Wang, Andrew Maas, and many more
Deep Learning – What, Why, and How
– A Tutorial Given at NLP&CC 2013 –
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Deep Neural Network
A fancy name for a multi-layer perceptron (MLP) with many hidden layers.
Each sigmoidal hidden neuron can be viewed as following a Bernoulli distribution.
The last layer (softmax layer) follows a multinomial distribution:
$p(l = k \mid \mathbf{h}; \theta) = \frac{\exp\left(\sum_{i=1}^{H} \lambda_{ik} h_i + a_k\right)}{Z(\mathbf{h})}$
Training can be difficult and tricky; the choice of optimization algorithm and strategy matters.
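To make the notation concrete, here is a minimal sketch of the forward pass (numpy, with made-up layer sizes): sigmoid hidden layers feeding the softmax defined above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dnn_forward(x, weights, biases):
    """Forward pass: sigmoid hidden layers, softmax output.

    weights/biases hold one (W, b) pair per layer; the last pair
    (lambda, a in the slide's notation) feeds the softmax.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)                    # hidden activations in (0, 1)
    return softmax(weights[-1] @ h + biases[-1])  # p(l = k | h)

# Toy example with random parameters: 3 inputs -> 4 hidden -> 2 classes.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(dnn_forward(rng.standard_normal(3), weights, biases))
```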
Restricted Boltzmann Machine (Hinton, Osindero, Teh 2006)
The joint distribution $p(\mathbf{v}, \mathbf{h}; \theta)$ is defined in terms of an energy function $E(\mathbf{v}, \mathbf{h}; \theta)$:
$p(\mathbf{v}, \mathbf{h}; \theta) = \frac{\exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right)}{Z}$
$p(\mathbf{v}; \theta) = \frac{\sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right)}{Z} = \frac{\exp\left(-F(\mathbf{v}; \theta)\right)}{Z}$
Conditional independence:
$p(\mathbf{h}|\mathbf{v}) = \prod_{j=0}^{H-1} p(h_j|\mathbf{v})$
$p(\mathbf{v}|\mathbf{h}) = \prod_{i=0}^{V-1} p(v_i|\mathbf{h})$
[Figure: RBM with a visible layer and a hidden layer; no within-layer connections]
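The factorized conditionals are cheap to compute. A minimal sketch (numpy), assuming a binary-binary RBM with weight matrix W and biases a, b:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary-binary RBM: energy E(v, h) = -a.v - b.h - h^T W v
rng = np.random.default_rng(0)
V, H = 6, 4
W = 0.1 * rng.standard_normal((H, V))
a, b = np.zeros(V), np.zeros(H)

def p_h_given_v(v):
    # Factorizes over hidden units: p(h_j = 1 | v) = sigmoid(W v + b)_j
    return sigmoid(W @ v + b)

def p_v_given_h(h):
    # Factorizes over visible units: p(v_i = 1 | h) = sigmoid(W^T h + a)_i
    return sigmoid(W.T @ h + a)

v = rng.integers(0, 2, V).astype(float)
h = (p_h_given_v(v) > rng.random(H)).astype(float)  # one Gibbs half-step
print(p_v_given_h(h))
```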
Generative Pretraining a DNN
First learn with all the weights tied
◦ equivalent to learning an RBM
Then freeze the first layer of weights and learn the remaining weights (still tied together)
◦ equivalent to learning another RBM, using the aggregated conditional probabilities on $\mathbf{h}^0$ as the data
◦ Continue the process to train the next layer
Intuitively, $\log p(\mathbf{v})$ improves as each new layer is added and trained.
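A minimal sketch of the greedy procedure (numpy; CD-1 updates, made-up layer sizes and learning rate, biases omitted for brevity): each new RBM is trained on the hidden probabilities produced by the frozen stack below it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm_cd1(data, n_hidden, lr=0.1, epochs=10, rng=None):
    """Train one RBM with 1-step contrastive divergence; biases omitted."""
    rng = rng or np.random.default_rng(0)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W)                              # up
            h0 = (ph0 > rng.random(n_hidden)).astype(float)    # sample h
            v1 = sigmoid(W @ h0)                               # reconstruct
            ph1 = sigmoid(v1 @ W)                              # up again
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))  # CD-1 update
    return W

def pretrain_stack(data, layer_sizes):
    """Greedy layerwise pretraining: each new RBM sees the hidden
    probabilities of the (frozen) stack below as its 'data'."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm_cd1(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)   # aggregated conditional probabilities on h
    return weights

data = (np.random.default_rng(1).random((50, 8)) > 0.5).astype(float)
stack = pretrain_stack(data, [16, 16, 8])   # initializes a 3-layer DNN
```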
Discriminative Pretraining
Train a single-hidden-layer DNN using BP (without running to convergence)
Insert a new hidden layer and train it using BP (without running to convergence)
Repeat until the predefined number of layers is reached
Jointly fine-tune all layers until convergence
Can reduce the gradient diffusion problem
Guaranteed to help if done right
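A structural sketch of the recipe (numpy shapes only; `brief_bp` is a stand-in for a few epochs of backpropagation, not shown): a fresh hidden layer is inserted just below the softmax layer before each round of training.

```python
import numpy as np

def brief_bp(weights, data, labels, steps=100):
    """Placeholder for a short run of backprop, stopped well before
    convergence (the actual trainer is not shown in this sketch)."""
    return weights

def discriminative_pretrain(n_in, n_hidden, n_out, n_layers, data, labels):
    rng = np.random.default_rng(0)
    # Start with one hidden layer plus the softmax layer.
    weights = [rng.standard_normal((n_hidden, n_in)),
               rng.standard_normal((n_out, n_hidden))]
    weights = brief_bp(weights, data, labels)
    for _ in range(n_layers - 1):
        # Insert a fresh hidden layer just below the softmax layer ...
        weights.insert(-1, rng.standard_normal((n_hidden, n_hidden)))
        # ... and train again, still without running to convergence.
        weights = brief_bp(weights, data, labels)
    # Finally, fine-tune all layers jointly until convergence.
    return brief_bp(weights, data, labels, steps=10_000)

stack = discriminative_pretrain(39, 512, 1000, n_layers=5,
                                data=None, labels=None)
```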
CD-DNN-HMM: Three Key Components (Dahl, Yu, Deng, Acero 2012)
◦ Model senones (tied triphone states) directly
◦ Many layers of nonlinear feature transformation
◦ Long window of frames
Modeling Senones is Critical
An ML-trained CD-GMM-HMM alignment was used to generate senone and monophone labels for training the DNNs.

Table: 24-hr Voice Search (760 24-mixture senones)
Model            monophone   senone
GMM-HMM MPE      -           36.2
DNN-HMM 1×2k     41.7        31.9
DNN-HMM 3×2k     35.8        30.4

Table: 309-hr SWB (9k 40-mixture senones)
Model            monophone   senone
GMM-HMM BMMI     -           23.6
DNN-HMM 7×2k     34.9        17.1
Exploiting Neighbor Frames
An ML-trained CD-GMM-HMM alignment was used to generate senone labels for training the DNNs.
23.2% may seem only slightly better than 23.6%, but note that the DNN is not trained with sequence-discriminative training while the GMM is.
To exploit information in neighboring frames, GMM systems need fMPE, region-dependent transformations, or tandem structures.

Table: 309-hr SWB (GMM-HMM BMMI = 23.6%)
Model               1 frame   11 frames
CD-DNN-HMM 1×4634   26.0      22.4
CD-DNN-HMM 7×2k     23.2      17.1
Deeper Model is More Powerful (Seide, Li, Yu 2011; Seide, Li, Chen, Yu 2011)

Table: 309-hr SWB (GMM-HMM BMMI = 23.6%), WER (%) with DBN pretraining
Deep (L×N)   WER   |  Shallow (1×N)   WER
1×2k         24.2  |  1×2k            24.2
2×2k         20.4  |  -               -
3×2k         18.4  |  -               -
4×2k         17.8  |  -               -
5×2k         17.2  |  1×3772          22.5
7×2k         17.1  |  1×4634          22.6
9×2k         17.0  |  -               -
9×1k         17.9  |  -               -
5×3k         17.0  |  -               -
-            -     |  1×16k           22.1
Pretraining Helps but Not Critical (Seide, Li, Yu 2011; Seide, Li, Chen, Yu 2011)

Table: 309-hr SWB WER (%) by training strategy
L×N    DBN-Pretrain   BP     LBP    Discriminative Pretrain
1×2k   24.2           24.3   24.3   24.1
2×2k   20.4           22.2   20.7   20.4
3×2k   18.4           20.0   18.9   18.6
4×2k   17.8           18.7   17.8   17.8
5×2k   17.2           18.2   17.4   17.1
7×2k   17.1           17.4   17.4   16.8
9×2k   17.0           16.9   16.9   -
9×1k   17.9           -      -      -
5×3k   17.0           -      -      -

• Stochastic gradient alleviates the optimization problem.
• A large amount of training data alleviates the overfitting problem.
• Pretraining helps make BP more robust.
CD-DNN-HMM Performance Summary (Dahl, Yu, Deng, Acero 2012; Seide, Li, Yu 2011; Chen et al. 2012)

Table: Voice Search SER (24 hours training)
AM Setup                        Test
GMM-HMM MPE (760 24-mixture)    36.2%
DNN-HMM 5 layers × 2048         30.1% (-17%)

Table: Switchboard WER (309 hours training)
AM Setup                        Hub5'00-SWB    RT03S-FSH
GMM-HMM BMMI (9K 40-mixture)    23.6%          27.4%
DNN-HMM 7 × 2048                15.8% (-33%)   18.5% (-33%)

Table: Switchboard WER (2000 hours training)
AM Setup                             Hub5'00-SWB                RT03S-FSH
GMM-HMM (A) BMMI (18K 72-mixture)    21.7%                      23.0%
GMM-HMM (B) BMMI + fMPE              19.6%                      20.5%
DNN-HMM 7 × 3076                     14.4% (A: -34%, B: -27%)   15.6% (A: -32%, B: -24%)
CD-DNN-HMM Real-World Deployments
Microsoft audio video indexing service (Knies, 2012)
◦ "It's a big deal. … tests have demonstrated that the algorithm provides a 10- to 20-percent relative error reduction and uses about 30 percent less processing time than the best-of-breed speech-recognition algorithms based on so-called Gaussian Mixture Models."
Google voice search (Simonite, 2012)
◦ "Google is now using these neural networks to recognize speech more accurately, … 'We got between 20 and 25 percent improvement in terms of words that are wrong,' says Vincent Vanhoucke, a leader of Google's speech-recognition efforts."
Microsoft Bing voice search (Knies, 2013)
◦ "With the judicious use of DNNs, that service has seen its speed double in recent weeks, and its word-error rate has improved by 15 percent."
Other companies: Baidu, IBM, Nuance, iFlytek, etc.
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Decoding Speed (Senior et al. 2011)
Well within real time with careful engineering.
Setup: DNN 440:2000×5:7969; single CPU; GPU: NVIDIA Tesla C2070.
Technique                  Real-time factor   Note
Floating-point baseline    3.89
Floating-point SSE2        1.36               4-way parallel (16 bytes)
8-bit quantization         1.52               hidden: unsigned char, weight: signed char
Integer SSSE3              0.51               16-way parallel
Integer SSE4               0.47               faster 16-to-32-bit conversion
Batching                   0.36               batches over tens of ms
Lazy evaluation            0.26               assumes 30% active senones
Batched lazy evaluation    0.21               combines both
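A sketch of the 8-bit quantization row above (numpy; the scale factors and rounding are simplified relative to the paper): weights become signed bytes and hidden activations unsigned bytes, with the scales recovered after the integer product.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048)).astype(np.float32)
h = rng.random(2048).astype(np.float32)        # sigmoid outputs in [0, 1]

# Quantize: weights -> signed char, activations -> unsigned char.
w_scale = np.abs(W).max() / 127.0
Wq = np.round(W / w_scale).astype(np.int8)
hq = np.round(h * 255.0).astype(np.uint8)      # activation scale = 1/255

# Integer product (done 16-way with SSE in the paper), then rescale.
y = (Wq.astype(np.int32) @ hq.astype(np.int32)) * (w_scale / 255.0)
print(np.abs(y - W @ h).max())                 # small quantization error
```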
Decoding Speed (Xue et al. 2013)
Observations
◦ Weight sparseness: only 20-30% of the weights need to be nonzero to get the same performance
◦ The sparse pattern is random: hard to exploit
Low-rank factorization
◦ Model the weight matrix as a product of two matrices
◦ SVD: $W = U \Sigma V^T = (U \Sigma^{1/2})(\Sigma^{1/2} V^T)$
◦ Keep only the top singular values
◦ Model size compressed to 20-30%
◦ An additional three-fold speedup
◦ Faster than GMM decoding
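A sketch of the SVD restructuring (numpy; the rank k here is arbitrary): one wide layer becomes two thin layers, and in practice the factored model is fine-tuned afterwards to recover accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048))

# W = U diag(s) V^T; keep only the top-k singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 256                                   # 2*2048*256 / 2048^2 = 25% of the parameters
A = U[:, :k] * np.sqrt(s[:k])             # U Sigma^{1/2},   shape (2048, k)
B = np.sqrt(s[:k])[:, None] * Vt[:k]      # Sigma^{1/2} V^T, shape (k, 2048)

h = rng.standard_normal(2048)
y_full, y_lowrank = W @ h, A @ (B @ h)    # one wide layer -> two thin layers
# After replacing W with A @ B, the factored network is fine-tuned.
```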
Training (Single GPU) (Chen et al. 2012)
Relative runtime for different minibatch sizes and GPU/server model types, and corresponding frame accuracy measured after seeing 12 hours of data (429:2k×7:9304).
[Figure: frame accuracy and relative runtime vs. batch size]
Training (Multi-GPU): Pipeline (Chen et al. 2012)
[Figure: pipeline parallelism — the input, hidden, and output layers are partitioned across GPU1-GPU3]
Will cause a delayed-update problem, since the forward pass of a new batch is calculated on old weights.
Passing hidden activations is much more efficient than passing weights or weight gradients.
Training (Multi-GPU): Pipeline (Chen et al. 2012)
[Table: training runtimes in minutes per 24h of data for different parallelization configurations, including multi-GPU with pipeline; [[·]] denotes divergence, and [·] denotes a WER loss > 0.1% points on the Hub5 set (429:2k×7:9304)]
Training (CPU Cluster) (Dean et al. 2012, picture courtesy of Erdinc Basci)
[Figure: asynchronous SGD with a parameter-server pool — parameters P1-P12 sharded across parameter-server machines — and several model-worker pools; each pool holds a model replica partitioned across worker machines, processes its own data partition, and exchanges W, ∆W, and W′ with the parameter servers]
Asynchronous stochastic gradient update
Lower communication cost when updating weights
Training (CPU or GPU Cluster) (Kingsbury et al. 2012, Martens 2010)
Use algorithms that are effective with large batches.
◦ L-BFGS (works well with full batches)
◦ Hessian-free optimization
Simple data parallelization would work.
Key: the communication cost is small compared to the computation.
Better and more scalable training algorithms are still needed.
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
What Makes ASR Difficult?
Variability, Variability, Variability
Speaker
• Accents
• Dialect
• Style
• Emotion
• Coarticulation
• Reduction
• Pronunciation
• Hesitation
• ⋯
Environment
• Noise
• Side talk
• Reverberation
• ⋯
Device
• Headphone
• Landline phone
• Speakerphone
• Cell phone
• ⋯
Interactions between these factors are complicated and nonlinear.
DNN Is Powerful and Efficient
A desirable model should be both powerful and efficient in representing complex structures.
◦ A DNN can model any mapping (powerful): it is a universal approximator → same as shallow models
◦ A DNN is efficient in representation: it needs fewer computational units for the same function by sharing lower-layer results → better than shallow models
A DNN learns invariant and discriminative features.
DNN Learns Invariant and Discriminative Features
Joint feature learning and classifier design
◦ Bottleneck or tandem features do not have this property
Many simple nonlinearities = one complicated nonlinearity
Features at higher layers are more invariant and discriminative than those at lower layers
[Figure: many layers of nonlinear feature transformation feeding a log-linear classifier]
Higher Layer Features More Invariant (Yu et al. 2013)
$\delta^{l+1} = \sigma\!\left(z^l(v_t^l + \delta^l)\right) - \sigma\!\left(z^l(v_t^l)\right) \approx \mathrm{diag}\!\left(\sigma'(z^l(v_t^l))\right) (w^l)^T \delta^l$
$\left\| \delta^{l+1} \right\| \le \left\| \mathrm{diag}\!\left(\sigma'(z^l(v_t^l))\right) (w^l)^T \right\| \left\| \delta^l \right\|$
Higher Layer Features More Invariant (Yu et al. 2013)
[Figure: weight magnitude (log scale) vs. the percentage of weights whose magnitude is below the threshold, for layers 1-7]
Higher Layer Features More Invariant (Yu et al. 2013)
[Figure: percentage of saturated hidden units (h > 0.99 or h < 0.01) at layers 1-6; units with h < 0.01 are inactive, and higher layers are more sparse]
Note: $\sigma'(z) \le 0.25$, and it is smaller when units are saturated.
Higher Layer Features More Invariant (Yu et al. 2013)
If the norm of $\mathrm{diag}\!\left(\sigma'(z^l(v_t^l))\right) (w^l)^T$ is less than 1, the variation shrinks one layer higher.
[Figure: average and maximum norm of this matrix at layers 1-6]
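A numeric sketch of the quantity plotted above (numpy, with random weights purely for illustration): propagate a small perturbation through one sigmoid layer and compare the actual variation against the norm bound.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = 0.05 * rng.standard_normal((2048, 2048))
v = rng.standard_normal(2048)
delta = 0.01 * rng.standard_normal(2048)

z = W @ v
d_out = sigmoid(W @ (v + delta)) - sigmoid(z)       # actual output variation
J = (sigmoid(z) * (1 - sigmoid(z)))[:, None] * W    # diag(sigma'(z)) W
gain = np.linalg.norm(J, 2)                          # operator norm of the bound
print(np.linalg.norm(d_out) / np.linalg.norm(delta), gain)
# When many units are saturated, sigma' << 0.25 and the gain drops below 1,
# so the variation shrinks from one layer to the next.
```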
Noise Robustness (Seltzer, Yu, Wang 2013)
The DNN converts input features into more invariant and discriminative features, robust to environment and speaker variations.
Aurora 4: a 16-kHz medium-vocabulary noise-robustness task
◦ Training: 7137 utterances from 83 speakers
◦ Test: 330 utterances from 8 speakers
Setup                         Set A   Set B   Set C   Set D   Avg
GMM-HMM (baseline)            12.5    18.3    20.5    31.9    23.9
GMM-HMM (MPE + VAT)           7.2     12.8    11.5    19.7    15.3
GMM-HMM + Structured SVM      7.4     12.6    10.7    19.0    14.8
GMM-HMM (VAT-Joint)           5.6     11.0    8.8     17.8    13.4
CD-DNN-HMM (2k×7)             5.6     8.8     8.9     20.0    13.4
CD-DNN-HMM + NAT + dropout    5.4     8.3     7.6     18.5    12.4

Table: WER (%) comparison on the Aurora 4 (16 kHz) dataset.
Balance Overfitting and Underfitting
Achieved by adjusting width and depth → shallow models lack the depth adjustment
A DNN adds constraints to the space of transformations → less likely to overfit
Larger variability → wider layers
Small training set → narrower layers
A good system is deep and wide.
[Figure: spectrum from prone to overfitting to prone to underfitting]
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Flexible in Using Features (Mohamed et al. 2012, Li et al. 2012)

Setup                               WER (%)
CD-GMM-HMM (MFCC, fMPE+BMMI)        34.66 (baseline)
CD-DNN-HMM (MFCC)                   31.63 (-8.7%)
CD-DNN-HMM (24 log filter banks)    30.11 (-13.1%)
CD-DNN-HMM (29 log filter banks)    30.11 (-13.1%)
CD-DNN-HMM (40 log filter banks)    29.86 (-13.8%)
CD-DNN-HMM (256 log FFT bins)       32.26 (-6.9%)
Training set: VS-1, 72 hours of audio.
Test set: VS-T (26757 words in 9562 utterances).
Both the training and test sets were collected at a 16-kHz sampling rate.
Information and features that cannot be effectively exploited within the GMM framework can now be exploited.
Table: comparison of different input features for the DNN. All input features are mean-normalized and include dynamic features. Relative WER reduction in parentheses.
Mixed Bandwidth ASR (J. Li et al. 2012)
[Figure: DNN training/testing with 16-kHz sampling data — each of the 9-13 input frames contains 22 log filter banks covering 0-4 kHz and 7 covering 4-8 kHz]
Mixed Bandwidth ASR (J. Li et al. 2012)
[Figure: DNN training/testing with 8-kHz sampling data — the 22 0-4 kHz log filter banks are computed as usual, while the 7 4-8 kHz filter banks are padded with zeros or with their mean]
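A sketch of the padding step (numpy; the 22 + 7 filter-bank split follows the figure, and the mean estimate would come from wideband training data): narrowband frames receive real 0-4 kHz features and padded 4-8 kHz features, so both kinds of data share one input layer.

```python
import numpy as np

N_LOW, N_HIGH = 22, 7     # log filter banks: 0-4 kHz and 4-8 kHz

def make_input(fbank_low, fbank_high=None, pad="zero", high_mean=None):
    """Build one frame of DNN input from mixed-bandwidth data.

    16-kHz data supplies both bands; for 8-kHz data the 4-8 kHz block
    is padded with zeros (ZP) or with the wideband training mean (MP).
    """
    if fbank_high is None:                      # 8-kHz (narrowband) frame
        if pad == "zero":
            fbank_high = np.zeros(N_HIGH)
        else:                                   # mean padding
            fbank_high = high_mean
    return np.concatenate([fbank_low, fbank_high])

rng = np.random.default_rng(0)
wideband = make_input(rng.random(N_LOW), rng.random(N_HIGH))
narrowband = make_input(rng.random(N_LOW), pad="zero")
assert wideband.shape == narrowband.shape == (N_LOW + N_HIGH,)
```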
Mixed Bandwidth ASR (J. Li et al. 2012)

Table: DNN performance on wideband and narrowband test sets using mixed-bandwidth training data.

Training Data                     WER (16-kHz VS-T)   WER (8-kHz VS-T)
16-kHz VS-1 (B1)                  29.96               71.23
8-kHz VS-1 + 8-kHz VS-2 (B2)      -                   28.98
16-kHz VS-1 + 8-kHz VS-2 (ZP)     28.27               29.33
16-kHz VS-1 + 8-kHz VS-2 (MP)     28.36               29.37
16-kHz VS-1 + 16-kHz VS-2 (UB)    27.47               53.51

B1: baseline 1; B2: baseline 2; ZP: zero padding; MP: mean padding; UB: upper bound
Mixed-bandwidth training recovers 2/3 of (UB − B1) and 1/2 of (UB − B2).
Mixed Bandwidth ASR (J. Li et al. 2012)

Table: Euclidean distance (ED) between the output vectors at each hidden layer (L1-L7), and KL divergence (in nats) between the posterior vectors at the top layer, for 8-kHz vs. 16-kHz input features.

         16-kHz DNN (UB)            Data-mix DNN (ZP)
Layer    Mean (ED)   Variance (ED)  Mean (ED)   Variance (ED)
L1       13.28       3.90           7.32        3.62
L2       10.38       2.47           5.39        1.28
L3       8.04        1.77           4.49        1.27
L4       8.53        2.33           4.74        1.85
L5       9.01        2.96           5.39        2.30
L6       8.46        2.60           4.75        1.57
L7       5.27        1.85           3.12        0.93

Layer        16-kHz DNN (UB) Mean (KL)   Data-mix DNN (ZP) Mean (KL)
Top layer    2.03                        0.22
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Multilingual DNN (Huang et al. 2013)
[Figure: shared-hidden-layer multilingual DNN (SHL-MDNN) — hidden layers shared across languages, with a separate softmax layer per language]
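A structural sketch of the SHL-MDNN idea (numpy shapes only, made-up sizes): every language shares the hidden stack, and each language owns its softmax layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
shared = [rng.standard_normal((2048, 429))] + \
         [rng.standard_normal((2048, 2048)) for _ in range(4)]
heads = {lang: rng.standard_normal((n_senones, 2048))
         for lang, n_senones in [("FRA", 3000), ("DEU", 3000), ("ENU", 1800)]}

def forward(x, lang):
    h = x
    for W in shared:                  # hidden layers shared across languages
        h = sigmoid(W @ h)
    return softmax(heads[lang] @ h)   # language-specific softmax layer

# Training interleaves minibatches from all languages through the shared
# stack; transferring to a new language = adding (and training) a new head.
p = forward(np.random.default_rng(1).standard_normal(429), "FRA")
```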
Sum Is Greater Than One

Table: monolingual DNN vs. shared-hidden-layer multilingual DNN (SHL-MDNN), WER (%)
                             FRA    DEU    ESP    ITA
Test Set Size (Words)        40k    37k    18k    31k
Monolingual DNN (%)          28.1   24.0   30.6   24.3
SHL-MDNN (%)                 27.1   22.7   29.4   23.5
Relative WER Reduction (%)   3.5    5.4    4.0    3.4
Hidden Layers Are Transferable

Table: ENU WER with and without hidden layers (HLs) transferred from the FRA DNN
                                  WER (%)
Baseline (9-hr ENU)               30.9
FRA HLs + Train All Layers        30.6
FRA HLs + Train Softmax Layer     27.3
SHL-MDNN + Train Softmax Layer    25.3

Table: effect of target-language training set size, WER (%), when shared hidden layers are transferred from the SHL-MDNN
ENU training data (hours)               3      9      36
Baseline DNN (no transfer)              38.9   30.9   23.0
SHL-MDNN + Train Softmax Layer          28.0   25.3   22.4
SHL-MDNN + Train All Layers             33.4   28.9   21.6
Best-case relative WER reduction (%)    28.2   18.0   6.3
Hidden Layers Are Transferable

Table: effectiveness of cross-lingual model transfer on CHN, measured in CER (%)
CHN training set (hours)      3      9      36     139
Baseline - CHN only           45.1   40.3   31.7   29.0
SHL-MDNN model transfer       35.6   33.9   28.4   26.6
Relative CER reduction (%)    21.0   15.7   10.3   8.1

Table: features learned from multilingual data with and without label information, on ENU data
                                               WER (%)
Baseline (9-hr ENU)                            30.9
SHL-MDNN + Train Softmax Layers (no label)     38.7
SHL-MDNN + Train All Layers (no label)         30.2
SHL-MDNN + Train Softmax Layers (use label)    25.3

But only if trained with labeled data.
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Sequence Discriminative Training
Sequence-discriminative training can achieve additional gains, similar to MPE and BMMI for GMMs.
Criteria: state-level minimum Bayes risk (sMBR), MMI, and BMMI.

Table: Broadcast News Dev-04f (Sainath et al. 2011)
Training Criterion                 1504 senones
Frame-level Cross Entropy          18.5%
Sequence-level Criterion (sMBR)    17.0%

Table: SWB (309-hr) (Kingsbury et al. 2012)
AM Setup                        Hub5'00-SWB   RT03S-FSH
SI GMM-HMM BMMI+fMPE            18.9%         22.6%
SI DNN-HMM 7×2048 (frame CE)    16.1%         18.9%
SA GMM-HMM BMMI+fMPE            15.1%         17.6%
SI DNN-HMM 7×2048 (sMBR)        13.3%         16.4%
Sequence Discriminative Training
Making it work is not easy. The main problem is overfitting:
◦ The lattice is sparse: only ~300 of 9000 senones are seen in the lattice
◦ The silence model can absorb frames from other models: deletions increase
◦ It depends on information (e.g., the LM, surrounding labels) that may be highly mismatched
Solutions:
◦ F-smoothing: a hybrid training criterion, sequence-level + frame-level
◦ Frame dropout: remove frames where the reference hypothesis is missing from the lattice

Table: SWB (309-hr, 7×2048, 0.9s + 0.1f) (Su et al. 2013)
AM Setup                         Hub5'00-SWB   RT03S-FSH
SI DNN-HMM frame CE              16.2%         19.6%
SI DNN-HMM MMI w/ F-smoothing    13.5%         17.1%
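A sketch of F-smoothing (numpy; the sequence-level gradient is a placeholder here, and the 0.9/0.1 interpolation weights follow the table caption): the error signal fed to backpropagation mixes the sequence-level and frame-level criteria.

```python
import numpy as np

def f_smoothed_error(seq_grad, posteriors, labels, h_seq=0.9):
    """Interpolate sequence- and frame-level error signals (F-smoothing).

    seq_grad:   dL_seq/dz from the lattice-based criterion (placeholder here)
    posteriors: DNN softmax outputs, shape (T, n_senones)
    labels:     frame senone targets, shape (T,)
    """
    frame_grad = posteriors.copy()                     # cross-entropy gradient:
    frame_grad[np.arange(len(labels)), labels] -= 1.0  # p - 1 at the target
    return h_seq * seq_grad + (1.0 - h_seq) * frame_grad  # 0.9s + 0.1f

T, K = 4, 10
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(K), size=T)
err = f_smoothed_error(0.01 * rng.standard_normal((T, K)), post,
                       rng.integers(0, K, T))
```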
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
KL-Regularized Adaptation (Yu et al. 2013)
Huge number of parameters vs. a very small adaptation set.
Trick: conservative adaptation
◦ The senone posterior distribution estimated by the adapted model should not deviate too far from that estimated by the unadapted model
◦ Use KL-divergence regularization on the DNN output
◦ Equivalent to changing the target distribution to $\hat{p}(y|x_t) \triangleq (1-\rho)\, p(y|x_t) + \rho\, p_{SI}(y|x_t)$
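A sketch of the modified target (numpy, assuming one-hot frame labels): the adaptation target interpolates the supervision with the SI model's posteriors, which is what the KL regularizer amounts to.

```python
import numpy as np

def kld_adaptation_target(labels, p_si, rho=0.5):
    """Target distribution for KL-regularized adaptation:
    (1 - rho) * supervision + rho * SI-model posterior.

    labels: frame senone targets, shape (T,)
    p_si:   posteriors from the unadapted (SI) DNN, shape (T, K)
    rho:    regularization weight; rho = 1 means 'never deviate'.
    """
    hard = np.zeros_like(p_si)
    hard[np.arange(len(labels)), labels] = 1.0
    return (1.0 - rho) * hard + rho * p_si   # train with CE against this

T, K = 5, 8
rng = np.random.default_rng(0)
target = kld_adaptation_target(rng.integers(0, K, T),
                               rng.dirichlet(np.ones(K), size=T), rho=0.5)
```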
KL-Regularized Adaptation
[Figure: WERs on the SMD dataset using KLD-regularized adaptation, supervised and unsupervised, for different regularization weights ρ (numbers in parentheses); the dashed line is the SI DNN baseline]
KL-Regularized Adaptation

Table: WER and relative WER reduction (in parentheses) on a lecture transcription task
                   SI DNN   Supervised      Unsupervised   SD DNN
Dev Set            16.0%    14.3% (10.6%)   14.9% (6.9%)   15.0%
Test Set           20.9%    19.1% (8.6%)    19.4% (7.2%)   20.2%
Channel Mismatch   14.2%    16.0%           14.6%

• SI DNN: trained with 2000-hr SWB
• Adaptation set: 6 lectures, or 3.8 hours
• Dev set: 1 lecture, 5612 words
• Test set: 1 lecture, 8481 words

Problems:
◦ Cannot factorize the impact of speaker and channel
◦ The footprint per speaker is too large (there are many solutions to this)
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Summary
DNNs already outperform GMMs in many tasks
◦ Deep neural networks are more powerful than shallow models, including GMMs
◦ Features learned by DNNs are more invariant and selective
◦ DNNs can exploit information and features that are difficult to exploit in the GMM framework
Many speech groups (Microsoft, Google, IBM) are adopting it.
Commercial deployment of DNN systems is practical now
◦ Many obstacles once considered blockers for adopting DNNs have been removed
◦ Already commercially deployed by Microsoft and Google
Other Advances
Bottleneck or tandem features extracted from deep neural networks, used as features in conventional GMM-HMMs
Convolutional Neural Network
Deep Tensor Neural Network
Deep Segmental Neural Network
Recurrent Neural Network
To Build a State-of-the-Art System
[Figure: GMM pipeline — Wave → FFT → Filter Bank → MFCC → HLDA → VTLN → fMPE → BMMI → GMM]
Better Accuracy and Simpler
[Figure: DNN pipeline — Wave → FFT → Filter Bank → DNN with sequence training; the MFCC, HLDA, VTLN, and fMPE stages are no longer needed]
Questions?
References

X. Chen, A. Eversole, G. Li, D. Yu, F. Seide (2012), "Pipelined Back-Propagation for Context-Dependent Deep Neural Networks", Interspeech 2012.
G. E. Dahl, D. Yu, L. Deng, A. Acero (2012), "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition", IEEE Transactions on Audio, Speech, and Language Processing, Jan 2012.
J. Dean et al. (2012), "Large Scale Distributed Deep Networks", NIPS 2012.
A.-r. Mohamed, G. Hinton, G. Penn (2012), "Understanding how Deep Belief Networks perform acoustic modelling", ICASSP 2012.
G. E. Hinton, S. Osindero, Y. Teh (2006), "A fast learning algorithm for deep belief nets", Neural Computation, vol. 18, pp. 1527-1554, 2006.
J.-T. Huang, J. Li, D. Yu, L. Deng, Y. Gong (2013), "Cross-Language Knowledge Transfer Using Multilingual Deep Neural Network With Shared Hidden Layers", ICASSP 2013.
B. Kingsbury, T. N. Sainath, H. Soltau (2012), "Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization", Interspeech 2012.
R. Knies (2012), "Deep-Neural-Network Speech Recognition Debuts".
R. Knies (2013), "DNN Research Improves Bing Voice Search".
J. Li, D. Yu, J.-T. Huang, Y. Gong (2012), "Improving Wideband Speech Recognition Using Mixed-Bandwidth Training Data in CD-DNN-HMM", SLT 2012.
G. Li, H. Zhu, G. Cheng, K. Thambiratnam, B. Chitsaz, D. Yu, F. Seide (2012), "Context-Dependent Deep Neural Networks for Audio Indexing of Real-Life Data", SLT 2012.
J. Martens (2010), "Deep Learning via Hessian-free Optimization", ICML 2010.
T. N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, A.-r. Mohamed (2011), "Making Deep Belief Networks Effective for Large Vocabulary Continuous Speech Recognition", ASRU 2011.
F. Seide, G. Li, D. Yu (2011), "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks", Interspeech 2011, pp. 437-440.
F. Seide, G. Li, X. Chen, D. Yu (2011), "Feature engineering in context-dependent deep neural networks for conversational speech transcription", ASRU 2011, pp. 24-29.
A. Senior, V. Vanhoucke, M. Z. Mao (2011), "Improving the speed of neural networks on CPUs", in Proc. Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, B. Ramabhadran (2013), "Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets", ICASSP 2013.
J. Xue, J. Li, Y. Gong (2013), "Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition", Interspeech 2013.
T. Simonite (2012), "Google Puts Its Virtual Brain Technology to Work".
M. Seltzer, D. Yu, Y. Wang (2013), "An Investigation of Deep Neural Networks for Noise Robust Speech Recognition", ICASSP 2013.
H. Su, G. Li, D. Yu, F. Seide (2013), "Error Back Propagation for Sequence Training of Context-Dependent Deep Networks for Conversational Speech Transcription", ICASSP 2013.
D. Yu, L. Deng, F. Seide (2012), "Large Vocabulary Speech Recognition Using Deep Tensor Neural Networks", Interspeech 2012.
D. Yu, F. Seide, G. Li, L. Deng (2012), "Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition", ICASSP 2012.
D. Yu, S. Siniscalchi, L. Deng, C.-H. Lee (2012), "Boosting Attribute and Phone Estimation Accuracies with Deep Neural Networks for Detection-Based Speech Recognition", ICASSP 2012, pp. 4169-4172.
D. Yu, K. Yao, H. Su, G. Li, F. Seide (2013), "KL-Divergence Regularized Deep Neural Network Adaptation for Improved Large Vocabulary Speech Recognition", ICASSP 2013.
D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, F. Seide (2013), "Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks", ICLR 2013.