Deep Neural Network and Its Application in Speech Recognition
Dong Yu
Microsoft Research
Thanks to my collaborators:
Li Deng, Frank Seide, Gang Li, Mike Seltzer, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Jian Xue, Yan Huang, Adam Eversole, George Dahl, Abdel-rahman Mohamed, Xie Chen, Hang Su, Ossama Abdel-Hamid, Eric Wang, Andrew Maas, and many more
Deep Learning – What, Why, and How
– A Tutorial Given at NLP&CC 2013 –
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Deep Neural Network
A fancy name for a multi-layer perceptron (MLP) with many hidden layers.
Each sigmoidal hidden neuron can be viewed as following a Bernoulli distribution.
The last layer (softmax layer) follows a multinomial distribution:
$p(l = k \mid \mathbf{h}; \theta) = \frac{\exp\left(\sum_{i=1}^{H} \lambda_{ik} h_i + a_k\right)}{Z(\mathbf{h})}$
Training can be difficult and tricky; the choice of optimization algorithm and strategy matters.
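To make the notation concrete, here is a minimal sketch of the forward pass (numpy, with made-up layer sizes): sigmoid hidden layers feeding the softmax defined above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dnn_forward(x, weights, biases):
    """Forward pass: sigmoid hidden layers, softmax output.

    weights/biases hold one (W, b) pair per layer; the last pair
    (lambda, a in the slide's notation) feeds the softmax.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)                    # hidden activations in (0, 1)
    return softmax(weights[-1] @ h + biases[-1])  # p(l = k | h)

# Toy example with random parameters: 3 inputs -> 4 hidden -> 2 classes.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
print(dnn_forward(rng.standard_normal(3), weights, biases))
```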
Restricted Boltzmann Machine (Hinton, Osindero, Teh 2006)
The joint distribution $p(\mathbf{v}, \mathbf{h}; \theta)$ is defined in terms of an energy function $E(\mathbf{v}, \mathbf{h}; \theta)$:
$p(\mathbf{v}, \mathbf{h}; \theta) = \frac{\exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right)}{Z}$
$p(\mathbf{v}; \theta) = \frac{\sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right)}{Z} = \frac{\exp\left(-F(\mathbf{v}; \theta)\right)}{Z}$
Conditional independence:
$p(\mathbf{h}|\mathbf{v}) = \prod_{j=0}^{H-1} p(h_j|\mathbf{v})$
$p(\mathbf{v}|\mathbf{h}) = \prod_{i=0}^{V-1} p(v_i|\mathbf{h})$
[Figure: RBM with a visible layer and a hidden layer; no within-layer connections]
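The factorized conditionals are cheap to compute. A minimal sketch (numpy), assuming a binary-binary RBM with weight matrix W and biases a, b:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary-binary RBM: energy E(v, h) = -a.v - b.h - h^T W v
rng = np.random.default_rng(0)
V, H = 6, 4
W = 0.1 * rng.standard_normal((H, V))
a, b = np.zeros(V), np.zeros(H)

def p_h_given_v(v):
    # Factorizes over hidden units: p(h_j = 1 | v) = sigmoid(W v + b)_j
    return sigmoid(W @ v + b)

def p_v_given_h(h):
    # Factorizes over visible units: p(v_i = 1 | h) = sigmoid(W^T h + a)_i
    return sigmoid(W.T @ h + a)

v = rng.integers(0, 2, V).astype(float)
h = (p_h_given_v(v) > rng.random(H)).astype(float)  # one Gibbs half-step
print(p_v_given_h(h))
```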
Generative Pretraining a DNN
First learn with all the weights tied
◦ equivalent to learning an RBM
Then freeze the first layer of weights and learn the remaining weights (still tied together)
◦ equivalent to learning another RBM, using the aggregated conditional probabilities on $\mathbf{h}^0$ as the data
◦ Continue the process to train the next layer
Intuitively, $\log p(\mathbf{v})$ improves as each new layer is added and trained.
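A minimal sketch of the greedy procedure (numpy; CD-1 updates, made-up layer sizes and learning rate, biases omitted for brevity): each new RBM is trained on the hidden probabilities produced by the frozen stack below it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm_cd1(data, n_hidden, lr=0.1, epochs=10, rng=None):
    """Train one RBM with 1-step contrastive divergence; biases omitted."""
    rng = rng or np.random.default_rng(0)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W)                              # up
            h0 = (ph0 > rng.random(n_hidden)).astype(float)    # sample h
            v1 = sigmoid(W @ h0)                               # reconstruct
            ph1 = sigmoid(v1 @ W)                              # up again
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))  # CD-1 update
    return W

def pretrain_stack(data, layer_sizes):
    """Greedy layerwise pretraining: each new RBM sees the hidden
    probabilities of the (frozen) stack below as its 'data'."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm_cd1(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)   # aggregated conditional probabilities on h
    return weights

data = (np.random.default_rng(1).random((50, 8)) > 0.5).astype(float)
stack = pretrain_stack(data, [16, 16, 8])   # initializes a 3-layer DNN
```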
Discriminative Pretraining
Train a single-hidden-layer DNN using BP (without running to convergence)
Insert a new hidden layer and train it using BP (without running to convergence)
Repeat until the predefined number of layers is reached
Jointly fine-tune all layers until convergence
Can reduce the gradient diffusion problem
Guaranteed to help if done right
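A structural sketch of the recipe (numpy shapes only; `brief_bp` is a stand-in for a few epochs of backpropagation, not shown): a fresh hidden layer is inserted just below the softmax layer before each round of training.

```python
import numpy as np

def brief_bp(weights, data, labels, steps=100):
    """Placeholder for a short run of backprop, stopped well before
    convergence (the actual trainer is not shown in this sketch)."""
    return weights

def discriminative_pretrain(n_in, n_hidden, n_out, n_layers, data, labels):
    rng = np.random.default_rng(0)
    # Start with one hidden layer plus the softmax layer.
    weights = [rng.standard_normal((n_hidden, n_in)),
               rng.standard_normal((n_out, n_hidden))]
    weights = brief_bp(weights, data, labels)
    for _ in range(n_layers - 1):
        # Insert a fresh hidden layer just below the softmax layer ...
        weights.insert(-1, rng.standard_normal((n_hidden, n_hidden)))
        # ... and train again, still without running to convergence.
        weights = brief_bp(weights, data, labels)
    # Finally, fine-tune all layers jointly until convergence.
    return brief_bp(weights, data, labels, steps=10_000)

stack = discriminative_pretrain(39, 512, 1000, n_layers=5,
                                data=None, labels=None)
```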
CD-DNN-HMM: Three Key Components (Dahl, Yu, Deng, Acero 2012)
◦ Model senones (tied triphone states) directly
◦ Many layers of nonlinear feature transformation
◦ Long window of frames
Modeling Senones is Critical
An ML-trained CD-GMM-HMM alignment was used to generate senone and monophone labels for training the DNNs.

Table: 24-hr Voice Search (760 24-mixture senones)
Model            monophone   senone
GMM-HMM MPE      -           36.2
DNN-HMM 1×2k     41.7        31.9
DNN-HMM 3×2k     35.8        30.4

Table: 309-hr SWB (9k 40-mixture senones)
Model            monophone   senone
GMM-HMM BMMI     -           23.6
DNN-HMM 7×2k     34.9        17.1
Exploiting Neighbor Frames
An ML-trained CD-GMM-HMM alignment was used to generate senone labels for training the DNNs.
23.2% may seem only slightly better than 23.6%, but note that the DNN is not trained with sequence-discriminative training while the GMM is.
To exploit information in neighboring frames, GMM systems need fMPE, region-dependent transformations, or tandem structures.

Table: 309-hr SWB (GMM-HMM BMMI = 23.6%)
Model               1 frame   11 frames
CD-DNN-HMM 1×4634   26.0      22.4
CD-DNN-HMM 7×2k     23.2      17.1
Deeper Model is More Powerful (Seide, Li, Yu 2011; Seide, Li, Chen, Yu 2011)

Table: 309-hr SWB (GMM-HMM BMMI = 23.6%), WER (%) with DBN pretraining
Deep (L×N)   WER   |  Shallow (1×N)   WER
1×2k         24.2  |  1×2k            24.2
2×2k         20.4  |  -               -
3×2k         18.4  |  -               -
4×2k         17.8  |  -               -
5×2k         17.2  |  1×3772          22.5
7×2k         17.1  |  1×4634          22.6
9×2k         17.0  |  -               -
9×1k         17.9  |  -               -
5×3k         17.0  |  -               -
-            -     |  1×16k           22.1
Pretraining Helps but Not Critical (Seide, Li, Yu 2011; Seide, Li, Chen, Yu 2011)

Table: 309-hr SWB WER (%) by training strategy
L×N    DBN-Pretrain   BP     LBP    Discriminative Pretrain
1×2k   24.2           24.3   24.3   24.1
2×2k   20.4           22.2   20.7   20.4
3×2k   18.4           20.0   18.9   18.6
4×2k   17.8           18.7   17.8   17.8
5×2k   17.2           18.2   17.4   17.1
7×2k   17.1           17.4   17.4   16.8
9×2k   17.0           16.9   16.9   -
9×1k   17.9           -      -      -
5×3k   17.0           -      -      -

• Stochastic gradient alleviates the optimization problem.
• A large amount of training data alleviates the overfitting problem.
• Pretraining helps make BP more robust.
CD-DNN-HMM Performance Summary (Dahl, Yu, Deng, Acero 2012; Seide, Li, Yu 2011; Chen et al. 2012)

Table: Voice Search SER (24 hours training)
AM Setup                        Test
GMM-HMM MPE (760 24-mixture)    36.2%
DNN-HMM 5 layers × 2048         30.1% (-17%)

Table: Switchboard WER (309 hours training)
AM Setup                        Hub5'00-SWB    RT03S-FSH
GMM-HMM BMMI (9K 40-mixture)    23.6%          27.4%
DNN-HMM 7 × 2048                15.8% (-33%)   18.5% (-33%)

Table: Switchboard WER (2000 hours training)
AM Setup                             Hub5'00-SWB                RT03S-FSH
GMM-HMM (A) BMMI (18K 72-mixture)    21.7%                      23.0%
GMM-HMM (B) BMMI + fMPE              19.6%                      20.5%
DNN-HMM 7 × 3076                     14.4% (A: -34%, B: -27%)   15.6% (A: -32%, B: -24%)
CD-DNN-HMM Real-World Deployments
Microsoft audio video indexing service (Knies, 2012)
◦ "It's a big deal. … tests have demonstrated that the algorithm provides a 10- to 20-percent relative error reduction and uses about 30 percent less processing time than the best-of-breed speech-recognition algorithms based on so-called Gaussian Mixture Models."
Google voice search (Simonite, 2012)
◦ "Google is now using these neural networks to recognize speech more accurately, … 'We got between 20 and 25 percent improvement in terms of words that are wrong,' says Vincent Vanhoucke, a leader of Google's speech-recognition efforts."
Microsoft Bing voice search (Knies, 2013)
◦ "With the judicious use of DNNs, that service has seen its speed double in recent weeks, and its word-error rate has improved by 15 percent."
Other companies: Baidu, IBM, Nuance, iFlytek, etc.
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Decoding Speed (Senior et al. 2011)
Well within real time with careful engineering.
Setup: DNN 440:2000×5:7969; single CPU; GPU: NVIDIA Tesla C2070.
Technique                  Real-time factor   Note
Floating-point baseline    3.89
Floating-point SSE2        1.36               4-way parallel (16 bytes)
8-bit quantization         1.52               hidden: unsigned char, weight: signed char
Integer SSSE3              0.51               16-way parallel
Integer SSE4               0.47               faster 16-to-32-bit conversion
Batching                   0.36               batches over tens of ms
Lazy evaluation            0.26               assumes 30% active senones
Batched lazy evaluation    0.21               combines both
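A sketch of the 8-bit quantization row above (numpy; the scale factors and rounding are simplified relative to the paper): weights become signed bytes and hidden activations unsigned bytes, with the scales recovered after the integer product.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048)).astype(np.float32)
h = rng.random(2048).astype(np.float32)        # sigmoid outputs in [0, 1]

# Quantize: weights -> signed char, activations -> unsigned char.
w_scale = np.abs(W).max() / 127.0
Wq = np.round(W / w_scale).astype(np.int8)
hq = np.round(h * 255.0).astype(np.uint8)      # activation scale = 1/255

# Integer product (done 16-way with SSE in the paper), then rescale.
y = (Wq.astype(np.int32) @ hq.astype(np.int32)) * (w_scale / 255.0)
print(np.abs(y - W @ h).max())                 # small quantization error
```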
Decoding Speed (Xue et al. 2013)
Observations
◦ Weight sparseness: only 20-30% of the weights need to be nonzero to get the same performance
◦ The sparse pattern is random: hard to exploit
Low-rank factorization
◦ Model the weight matrix as a product of two matrices
◦ SVD: $W = U \Sigma V^T = (U \Sigma^{1/2})(\Sigma^{1/2} V^T)$
◦ Keep only the top singular values
◦ Model size compressed to 20-30%
◦ An additional three-fold speedup
◦ Faster than GMM decoding
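A sketch of the SVD restructuring (numpy; the rank k here is arbitrary): one wide layer becomes two thin layers, and in practice the factored model is fine-tuned afterwards to recover accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 2048))

# W = U diag(s) V^T; keep only the top-k singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 256                                   # 2*2048*256 / 2048^2 = 25% of the parameters
A = U[:, :k] * np.sqrt(s[:k])             # U Sigma^{1/2},   shape (2048, k)
B = np.sqrt(s[:k])[:, None] * Vt[:k]      # Sigma^{1/2} V^T, shape (k, 2048)

h = rng.standard_normal(2048)
y_full, y_lowrank = W @ h, A @ (B @ h)    # one wide layer -> two thin layers
# After replacing W with A @ B, the factored network is fine-tuned.
```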
Training (Single GPU) (Chen et al. 2012)
Relative runtime for different minibatch sizes and GPU/server model types, and corresponding frame accuracy measured after seeing 12 hours of data (429:2k×7:9304).
[Figure: frame accuracy and relative runtime vs. batch size]
Training (Multi-GPU): Pipeline (Chen et al. 2012)
[Figure: pipeline parallelism — the input, hidden, and output layers are partitioned across GPU1-GPU3]
Will cause a delayed-update problem, since the forward pass of a new batch is calculated on old weights.
Passing hidden activations is much more efficient than passing weights or weight gradients.
Training (Multi-GPU): Pipeline (Chen et al. 2012)
[Table: training runtimes in minutes per 24h of data for different parallelization configurations, including multi-GPU with pipeline; [[·]] denotes divergence, and [·] denotes a WER loss > 0.1% points on the Hub5 set (429:2k×7:9304)]
Training (CPU Cluster) (Dean et al. 2012, picture courtesy of Erdinc Basci)
[Figure: asynchronous SGD with a parameter-server pool — parameters P1-P12 sharded across parameter-server machines — and several model-worker pools; each pool holds a model replica partitioned across worker machines, processes its own data partition, and exchanges W, ∆W, and W′ with the parameter servers]
Asynchronous stochastic gradient update
Lower communication cost when updating weights
Training (CPU or GPU Cluster) (Kingsbury et al. 2012, Martens 2010)
Use algorithms that are effective with large batches.
◦ L-BFGS (works well with full batches)
◦ Hessian-free optimization
Simple data parallelization would work.
Key: the communication cost is small compared to the computation.
Better and more scalable training algorithms are still needed.
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
What Makes ASR Difficult?
Variability, Variability, Variability
Speaker
• Accents
• Dialect
• Style
• Emotion
• Coarticulation
• Reduction
• Pronunciation
• Hesitation
• ⋯
Environment
• Noise
• Side talk
• Reverberation
• ⋯
Device
• Headphone
• Landline phone
• Speakerphone
• Cell phone
• ⋯
Interactions between these factors are complicated and nonlinear.
DNN Is Powerful and Efficient
A desirable model should be both powerful and efficient in representing complex structures.
◦ A DNN can model any mapping (powerful): it is a universal approximator → same as shallow models
◦ A DNN is efficient in representation: it needs fewer computational units for the same function by sharing lower-layer results → better than shallow models
A DNN learns invariant and discriminative features.
DNN Learns Invariant and Discriminative Features
Joint feature learning and classifier design
◦ Bottleneck or tandem features do not have this property
Many simple nonlinearities = one complicated nonlinearity
Features at higher layers are more invariant and discriminative than those at lower layers
[Figure: many layers of nonlinear feature transformation feeding a log-linear classifier]
Higher Layer Features More Invariant (Yu et al. 2013)
$\delta^{l+1} = \sigma\!\left(z^l(v_t^l + \delta^l)\right) - \sigma\!\left(z^l(v_t^l)\right) \approx \mathrm{diag}\!\left(\sigma'(z^l(v_t^l))\right) (w^l)^T \delta^l$
$\left\| \delta^{l+1} \right\| \le \left\| \mathrm{diag}\!\left(\sigma'(z^l(v_t^l))\right) (w^l)^T \right\| \left\| \delta^l \right\|$
Higher Layer Features More Invariant (Yu et al. 2013)
[Figure: weight magnitude (log scale) vs. the percentage of weights whose magnitude is below the threshold, for layers 1-7]
Higher Layer Features More Invariant (Yu et al. 2013)
[Figure: percentage of saturated hidden units (h > 0.99 or h < 0.01) at layers 1-6; units with h < 0.01 are inactive, and higher layers are more sparse]
Note: $\sigma'(z) \le 0.25$, and it is smaller when units are saturated.
Higher Layer Features More Invariant (Yu et al. 2013)
If the norm of $\mathrm{diag}\!\left(\sigma'(z^l(v_t^l))\right) (w^l)^T$ is less than 1, the variation shrinks one layer higher.
[Figure: average and maximum norm of this matrix at layers 1-6]
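A numeric sketch of the quantity plotted above (numpy, with random weights purely for illustration): propagate a small perturbation through one sigmoid layer and compare the actual variation against the norm bound.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = 0.05 * rng.standard_normal((2048, 2048))
v = rng.standard_normal(2048)
delta = 0.01 * rng.standard_normal(2048)

z = W @ v
d_out = sigmoid(W @ (v + delta)) - sigmoid(z)       # actual output variation
J = (sigmoid(z) * (1 - sigmoid(z)))[:, None] * W    # diag(sigma'(z)) W
gain = np.linalg.norm(J, 2)                          # operator norm of the bound
print(np.linalg.norm(d_out) / np.linalg.norm(delta), gain)
# When many units are saturated, sigma' << 0.25 and the gain drops below 1,
# so the variation shrinks from one layer to the next.
```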
Noise Robustness (Seltzer, Yu, Wang 2013)
The DNN converts input features into more invariant and discriminative features, robust to environment and speaker variations.
Aurora 4: a 16-kHz medium-vocabulary noise-robustness task
◦ Training: 7137 utterances from 83 speakers
◦ Test: 330 utterances from 8 speakers
Setup                         Set A   Set B   Set C   Set D   Avg
GMM-HMM (baseline)            12.5    18.3    20.5    31.9    23.9
GMM-HMM (MPE + VAT)           7.2     12.8    11.5    19.7    15.3
GMM-HMM + Structured SVM      7.4     12.6    10.7    19.0    14.8
GMM-HMM (VAT-Joint)           5.6     11.0    8.8     17.8    13.4
CD-DNN-HMM (2k×7)             5.6     8.8     8.9     20.0    13.4
CD-DNN-HMM + NAT + dropout    5.4     8.3     7.6     18.5    12.4

Table: WER (%) comparison on the Aurora 4 (16 kHz) dataset.
Balance Overfitting and Underfitting
Achieved by adjusting width and depth → shallow models lack the depth adjustment
A DNN adds constraints to the space of transformations → less likely to overfit
Larger variability → wider layers
Small training set → narrower layers
A good system is deep and wide.
[Figure: spectrum from prone to overfitting to prone to underfitting]
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Flexible in Using Features (Mohamed et al. 2012, Li et al. 2012)

Setup                               WER (%)
CD-GMM-HMM (MFCC, fMPE+BMMI)        34.66 (baseline)
CD-DNN-HMM (MFCC)                   31.63 (-8.7%)
CD-DNN-HMM (24 log filter banks)    30.11 (-13.1%)
CD-DNN-HMM (29 log filter banks)    30.11 (-13.1%)
CD-DNN-HMM (40 log filter banks)    29.86 (-13.8%)
CD-DNN-HMM (256 log FFT bins)       32.26 (-6.9%)
Training set: VS-1, 72 hours of audio.
Test set: VS-T (26757 words in 9562 utterances).
Both the training and test sets were collected at a 16-kHz sampling rate.
Information and features that cannot be effectively exploited within the GMM framework can now be exploited.
Table: comparison of different input features for the DNN. All input features are mean-normalized and include dynamic features. Relative WER reduction in parentheses.
Mixed Bandwidth ASR (J. Li et al. 2012)
[Figure: DNN training/testing with 16-kHz sampling data — each of the 9-13 input frames contains 22 log filter banks covering 0-4 kHz and 7 covering 4-8 kHz]
Mixed Bandwidth ASR (J. Li et al. 2012)
[Figure: DNN training/testing with 8-kHz sampling data — the 22 0-4 kHz log filter banks are computed as usual, while the 7 4-8 kHz filter banks are padded with zeros or with their mean]
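A sketch of the padding step (numpy; the 22 + 7 filter-bank split follows the figure, and the mean estimate would come from wideband training data): narrowband frames receive real 0-4 kHz features and padded 4-8 kHz features, so both kinds of data share one input layer.

```python
import numpy as np

N_LOW, N_HIGH = 22, 7     # log filter banks: 0-4 kHz and 4-8 kHz

def make_input(fbank_low, fbank_high=None, pad="zero", high_mean=None):
    """Build one frame of DNN input from mixed-bandwidth data.

    16-kHz data supplies both bands; for 8-kHz data the 4-8 kHz block
    is padded with zeros (ZP) or with the wideband training mean (MP).
    """
    if fbank_high is None:                      # 8-kHz (narrowband) frame
        if pad == "zero":
            fbank_high = np.zeros(N_HIGH)
        else:                                   # mean padding
            fbank_high = high_mean
    return np.concatenate([fbank_low, fbank_high])

rng = np.random.default_rng(0)
wideband = make_input(rng.random(N_LOW), rng.random(N_HIGH))
narrowband = make_input(rng.random(N_LOW), pad="zero")
assert wideband.shape == narrowband.shape == (N_LOW + N_HIGH,)
```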
Mixed Bandwidth ASR (J. Li et al. 2012)

Table: DNN performance on wideband and narrowband test sets using mixed-bandwidth training data.

Training Data                     WER (16-kHz VS-T)   WER (8-kHz VS-T)
16-kHz VS-1 (B1)                  29.96               71.23
8-kHz VS-1 + 8-kHz VS-2 (B2)      -                   28.98
16-kHz VS-1 + 8-kHz VS-2 (ZP)     28.27               29.33
16-kHz VS-1 + 8-kHz VS-2 (MP)     28.36               29.37
16-kHz VS-1 + 16-kHz VS-2 (UB)    27.47               53.51

B1: baseline 1; B2: baseline 2; ZP: zero padding; MP: mean padding; UB: upper bound
Mixed-bandwidth training recovers 2/3 of (UB − B1) and 1/2 of (UB − B2).
Mixed Bandwidth ASR (J. Li et al. 2012)

Table: Euclidean distance (ED) between the output vectors at each hidden layer (L1-L7), and KL divergence (in nats) between the posterior vectors at the top layer, for 8-kHz vs. 16-kHz input features.

         16-kHz DNN (UB)            Data-mix DNN (ZP)
Layer    Mean (ED)   Variance (ED)  Mean (ED)   Variance (ED)
L1       13.28       3.90           7.32        3.62
L2       10.38       2.47           5.39        1.28
L3       8.04        1.77           4.49        1.27
L4       8.53        2.33           4.74        1.85
L5       9.01        2.96           5.39        2.30
L6       8.46        2.60           4.75        1.57
L7       5.27        1.85           3.12        0.93

Layer        16-kHz DNN (UB) Mean (KL)   Data-mix DNN (ZP) Mean (KL)
Top layer    2.03                        0.22
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Multilingual DNN (Huang et al. 2013)
[Figure: shared-hidden-layer multilingual DNN (SHL-MDNN) — hidden layers shared across languages, with a separate softmax layer per language]
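A structural sketch of the SHL-MDNN idea (numpy shapes only, made-up sizes): every language shares the hidden stack, and each language owns its softmax layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
shared = [rng.standard_normal((2048, 429))] + \
         [rng.standard_normal((2048, 2048)) for _ in range(4)]
heads = {lang: rng.standard_normal((n_senones, 2048))
         for lang, n_senones in [("FRA", 3000), ("DEU", 3000), ("ENU", 1800)]}

def forward(x, lang):
    h = x
    for W in shared:                  # hidden layers shared across languages
        h = sigmoid(W @ h)
    return softmax(heads[lang] @ h)   # language-specific softmax layer

# Training interleaves minibatches from all languages through the shared
# stack; transferring to a new language = adding (and training) a new head.
p = forward(np.random.default_rng(1).standard_normal(429), "FRA")
```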
Sum Is Greater Than One

Table: monolingual DNN vs. shared-hidden-layer multilingual DNN (SHL-MDNN), WER (%)
                             FRA    DEU    ESP    ITA
Test Set Size (Words)        40k    37k    18k    31k
Monolingual DNN (%)          28.1   24.0   30.6   24.3
SHL-MDNN (%)                 27.1   22.7   29.4   23.5
Relative WER Reduction (%)   3.5    5.4    4.0    3.4
Hidden Layers Are Transferable

Table: ENU WER with and without hidden layers (HLs) transferred from the FRA DNN
                                  WER (%)
Baseline (9-hr ENU)               30.9
FRA HLs + Train All Layers        30.6
FRA HLs + Train Softmax Layer     27.3
SHL-MDNN + Train Softmax Layer    25.3

Table: effect of target-language training set size, WER (%), when shared hidden layers are transferred from the SHL-MDNN
ENU training data (hours)               3      9      36
Baseline DNN (no transfer)              38.9   30.9   23.0
SHL-MDNN + Train Softmax Layer          28.0   25.3   22.4
SHL-MDNN + Train All Layers             33.4   28.9   21.6
Best-case relative WER reduction (%)    28.2   18.0   6.3
Hidden Layers Are Transferable

Table: effectiveness of cross-lingual model transfer on CHN, measured in CER (%)
CHN training set (hours)      3      9      36     139
Baseline - CHN only           45.1   40.3   31.7   29.0
SHL-MDNN model transfer       35.6   33.9   28.4   26.6
Relative CER reduction (%)    21.0   15.7   10.3   8.1

Table: features learned from multilingual data with and without label information, on ENU data
                                               WER (%)
Baseline (9-hr ENU)                            30.9
SHL-MDNN + Train Softmax Layers (no label)     38.7
SHL-MDNN + Train All Layers (no label)         30.2
SHL-MDNN + Train Softmax Layers (use label)    25.3

But only if trained with labeled data.
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Sequence Discriminative Training
Sequence-discriminative training can achieve additional gains, similar to MPE and BMMI for GMMs.
Criteria: state-level minimum Bayes risk (sMBR), MMI, and BMMI.

Table: Broadcast News Dev-04f (Sainath et al. 2011)
Training Criterion                 1504 senones
Frame-level Cross Entropy          18.5%
Sequence-level Criterion (sMBR)    17.0%

Table: SWB (309-hr) (Kingsbury et al. 2012)
AM Setup                        Hub5'00-SWB   RT03S-FSH
SI GMM-HMM BMMI+fMPE            18.9%         22.6%
SI DNN-HMM 7×2048 (frame CE)    16.1%         18.9%
SA GMM-HMM BMMI+fMPE            15.1%         17.6%
SI DNN-HMM 7×2048 (sMBR)        13.3%         16.4%
Sequence Discriminative Training
Making it work is not easy. The main problem is overfitting:
◦ The lattice is sparse: only ~300 of 9000 senones are seen in the lattice
◦ The silence model can absorb frames from other models: deletions increase
◦ It depends on information (e.g., the LM, surrounding labels) that may be highly mismatched
Solutions:
◦ F-smoothing: a hybrid training criterion, sequence-level + frame-level
◦ Frame dropout: remove frames where the reference hypothesis is missing from the lattice

Table: SWB (309-hr, 7×2048, 0.9s + 0.1f) (Su et al. 2013)
AM Setup                         Hub5'00-SWB   RT03S-FSH
SI DNN-HMM frame CE              16.2%         19.6%
SI DNN-HMM MMI w/ F-smoothing    13.5%         17.1%
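A sketch of F-smoothing (numpy; the sequence-level gradient is a placeholder here, and the 0.9/0.1 interpolation weights follow the table caption): the error signal fed to backpropagation mixes the sequence-level and frame-level criteria.

```python
import numpy as np

def f_smoothed_error(seq_grad, posteriors, labels, h_seq=0.9):
    """Interpolate sequence- and frame-level error signals (F-smoothing).

    seq_grad:   dL_seq/dz from the lattice-based criterion (placeholder here)
    posteriors: DNN softmax outputs, shape (T, n_senones)
    labels:     frame senone targets, shape (T,)
    """
    frame_grad = posteriors.copy()                     # cross-entropy gradient:
    frame_grad[np.arange(len(labels)), labels] -= 1.0  # p - 1 at the target
    return h_seq * seq_grad + (1.0 - h_seq) * frame_grad  # 0.9s + 0.1f

T, K = 4, 10
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(K), size=T)
err = f_smoothed_error(0.01 * rng.standard_normal((T, K)), post,
                       rng.integers(0, K, T))
```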
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
KL-Regularized Adaptation (Yu et al. 2013)
Huge number of parameters vs. a very small adaptation set.
Trick: conservative adaptation
◦ The senone posterior distribution estimated by the adapted model should not deviate too far from that estimated by the unadapted model
◦ Use KL-divergence regularization on the DNN output
◦ Equivalent to changing the target distribution to $\hat{p}(y|x_t) \triangleq (1-\rho)\, p(y|x_t) + \rho\, p_{SI}(y|x_t)$
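A sketch of the modified target (numpy, assuming one-hot frame labels): the adaptation target interpolates the supervision with the SI model's posteriors, which is what the KL regularizer amounts to.

```python
import numpy as np

def kld_adaptation_target(labels, p_si, rho=0.5):
    """Target distribution for KL-regularized adaptation:
    (1 - rho) * supervision + rho * SI-model posterior.

    labels: frame senone targets, shape (T,)
    p_si:   posteriors from the unadapted (SI) DNN, shape (T, K)
    rho:    regularization weight; rho = 1 means 'never deviate'.
    """
    hard = np.zeros_like(p_si)
    hard[np.arange(len(labels)), labels] = 1.0
    return (1.0 - rho) * hard + rho * p_si   # train with CE against this

T, K = 5, 8
rng = np.random.default_rng(0)
target = kld_adaptation_target(rng.integers(0, K, T),
                               rng.dirichlet(np.ones(K), size=T), rho=0.5)
```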
KL-Regularized Adaptation
[Figure: WERs on the SMD dataset using KLD-regularized adaptation, supervised and unsupervised, for different regularization weights ρ (numbers in parentheses); the dashed line is the SI DNN baseline]
KL-Regularized Adaptation

Table: WER and relative WER reduction (in parentheses) on a lecture transcription task
                   SI DNN   Supervised      Unsupervised   SD DNN
Dev Set            16.0%    14.3% (10.6%)   14.9% (6.9%)   15.0%
Test Set           20.9%    19.1% (8.6%)    19.4% (7.2%)   20.2%
Channel Mismatch   14.2%    16.0%           14.6%

• SI DNN: trained with 2000-hr SWB
• Adaptation set: 6 lectures, or 3.8 hours
• Dev set: 1 lecture, 5612 words
• Test set: 1 lecture, 8481 words

Problems:
◦ Cannot factorize the impact of speaker and channel
◦ The footprint per speaker is too large (there are many solutions to this)
Part 3: Outline
CD-DNN-HMM
Training and Decoding Speed
Invariant Features
Mixed-bandwidth ASR
Multi-lingual ASR
Sequence Discriminative Training
Adaptation
Summary
Summary
DNNs already outperform GMMs in many tasks
◦ Deep neural networks are more powerful than shallow models, including GMMs
◦ Features learned by DNNs are more invariant and selective
◦ DNNs can exploit information and features that are difficult to exploit in the GMM framework
Many speech groups (Microsoft, Google, IBM) are adopting it.
Commercial deployment of DNN systems is practical now
◦ Many obstacles once considered blockers for adopting DNNs have been removed
◦ Already commercially deployed by Microsoft and Google
Other Advances
Bottleneck or tandem features extracted from deep neural networks, used as features in conventional GMM-HMMs
Convolutional Neural Network
Deep Tensor Neural Network
Deep Segmental Neural Network
Recurrent Neural Network
To Build a State-of-the-Art System
[Figure: GMM pipeline — Wave → FFT → Filter Bank → MFCC → HLDA → VTLN → fMPE → BMMI → GMM]
Better Accuracy and Simpler
[Figure: DNN pipeline — Wave → FFT → Filter Bank → DNN with sequence training; the MFCC, HLDA, VTLN, and fMPE stages are no longer needed]
Questions?
References

X. Chen, A. Eversole, G. Li, D. Yu, F. Seide (2012), "Pipelined Back-Propagation for Context-Dependent Deep Neural Networks", Interspeech 2012.
G. E. Dahl, D. Yu, L. Deng, A. Acero (2012), "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition", IEEE Transactions on Audio, Speech, and Language Processing, Jan 2012.
J. Dean et al. (2012), "Large Scale Distributed Deep Networks", NIPS 2012.
A.-r. Mohamed, G. Hinton, G. Penn (2012), "Understanding how Deep Belief Networks perform acoustic modelling", ICASSP 2012.
G. E. Hinton, S. Osindero, Y. Teh (2006), "A fast learning algorithm for deep belief nets", Neural Computation, vol. 18, pp. 1527-1554, 2006.
J.-T. Huang, J. Li, D. Yu, L. Deng, Y. Gong (2013), "Cross-Language Knowledge Transfer Using Multilingual Deep Neural Network With Shared Hidden Layers", ICASSP 2013.
B. Kingsbury, T. N. Sainath, H. Soltau (2012), "Scalable Minimum Bayes Risk Training of Deep Neural Network Acoustic Models Using Distributed Hessian-free Optimization", Interspeech 2012.
R. Knies (2012), "Deep-Neural-Network Speech Recognition Debuts".
R. Knies (2013), "DNN Research Improves Bing Voice Search".
J. Li, D. Yu, J.-T. Huang, Y. Gong (2012), "Improving Wideband Speech Recognition Using Mixed-Bandwidth Training Data in CD-DNN-HMM", SLT 2012.
G. Li, H. Zhu, G. Cheng, K. Thambiratnam, B. Chitsaz, D. Yu, F. Seide (2012), "Context-Dependent Deep Neural Networks for Audio Indexing of Real-Life Data", SLT 2012.
J. Martens (2010), "Deep Learning via Hessian-free Optimization", ICML 2010.
T. N. Sainath, B. Kingsbury, B. Ramabhadran, P. Fousek, P. Novak, A.-r. Mohamed (2011), "Making Deep Belief Networks Effective for Large Vocabulary Continuous Speech Recognition", ASRU 2011.
F. Seide, G. Li, D. Yu (2011), "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks", Interspeech 2011, pp. 437-440.
F. Seide, G. Li, X. Chen, D. Yu (2011), "Feature engineering in context-dependent deep neural networks for conversational speech transcription", ASRU 2011, pp. 24-29.
A. Senior, V. Vanhoucke, M. Z. Mao (2011), "Improving the speed of neural networks on CPUs", in Proc. Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, B. Ramabhadran (2013), "Low-Rank Matrix Factorization for Deep Neural Network Training with High-Dimensional Output Targets", ICASSP 2013.
J. Xue, J. Li, Y. Gong (2013), "Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition", Interspeech 2013.
T. Simonite (2012), "Google Puts Its Virtual Brain Technology to Work".
M. Seltzer, D. Yu, Y. Wang (2013), "An Investigation of Deep Neural Networks for Noise Robust Speech Recognition", ICASSP 2013.
H. Su, G. Li, D. Yu, F. Seide (2013), "Error Back Propagation for Sequence Training of Context-Dependent Deep Networks for Conversational Speech Transcription", ICASSP 2013.
D. Yu, L. Deng, F. Seide (2012), "Large Vocabulary Speech Recognition Using Deep Tensor Neural Networks", Interspeech 2012.
D. Yu, F. Seide, G. Li, L. Deng (2012), "Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition", ICASSP 2012.
D. Yu, S. Siniscalchi, L. Deng, C.-H. Lee (2012), "Boosting Attribute and Phone Estimation Accuracies with Deep Neural Networks for Detection-Based Speech Recognition", ICASSP 2012, pp. 4169-4172.
D. Yu, K. Yao, H. Su, G. Li, F. Seide (2013), "KL-Divergence Regularized Deep Neural Network Adaptation for Improved Large Vocabulary Speech Recognition", ICASSP 2013.
D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, F. Seide (2013), "Feature Learning in Deep Neural Networks - Studies on Speech Recognition Tasks", ICLR 2013.