
Page 1:

Sparse Models for Speech Recognition

Weibin Zhang and Pascale Fung

Human Language Technology Center

Hong Kong University of Science and Technology

Page 2:

Outline

• Introduction to speech recognition

• Motivations for sparse models

• Maximum likelihood training of sparse models

• ML training of sparse banded models

• Discriminative training of sparse models

• Conclusions


Page 3:

Speech Recognition & its Applications

1. Automatic Speech Recognition (ASR):
   • Automatically convert speech waveforms into text

2. Applications:
   • Office/business systems
   • Manufacturing
   • Telecommunications
   • Mobile telephony
   • Home automation
   • Navigation
   • …


Page 4:

History of ASR

• Technical Point of View


Page 5:

ASR Research -- Overview

• Statistical approaches lead in all areas.

• There is still a big gap between human and machine performance.

• However, useful systems have been built that are changing the way we interact with the world.


...within five years people will discard their keyboards and interact with computers using touch-screens and voice controls...

Bill Gates, Feb 2008


Page 6:

Statistical speech recognition system


Page 7:

Statistical speech recognition system

• Language Model:
  – P("recognize speech") >> P("wreck a nice beach")

• Dictionary:
  – Wreck → r e k
  – Beach → b i ch

• Acoustic Model:
  – P(O | "recognize speech")
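These three knowledge sources combine in the standard Bayes decision rule for ASR (implied by the slides rather than written out): the recognizer searches for the word sequence

    \hat{W} = \arg\max_{W} P(W \mid O)
            = \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}

with the dictionary mapping words to phone sequences so that the acoustic model can score the observations O.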


Page 8:

Statistical speech recognition system


Page 9:

Acoustic modeling

• Left-to-right hidden Markov models (HMMs)

• GMM-HMM based acoustic models

• Emission density of state s_j (a GMM with M components):

  p(\mathbf{o}_t \mid s_j) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(\mathbf{o}_t;\, \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})

• Model parameters:

  \Theta = \{a_{ij}, b_j(\mathbf{o}_t)\} = \{a_{ij}, c_{jm}, \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}\}
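As a concrete illustration of evaluating one state's emission likelihood, here is a minimal Python sketch (my own naming, not the authors' code; `weights`, `means`, and `covs` stand in for the mixture parameters c_{jm}, μ_{jm}, Σ_{jm} of a single state):

    import numpy as np
    from scipy.stats import multivariate_normal

    def emission_log_likelihood(o_t, weights, means, covs):
        # log p(o_t | s_j): log of the weighted sum of M Gaussian densities
        densities = [w * multivariate_normal.pdf(o_t, mean=mu, cov=cov)
                     for w, mu, cov in zip(weights, means, covs)]
        return np.log(np.sum(densities))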


Page 10:

Evaluation of ASR system

• Word error rate (WER) = (substitutions + deletions + insertions) / #reference words = 1 − accuracy

• Real time factor (RTF): processing time divided by audio duration
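For reference, WER can be computed with a word-level edit distance. A minimal sketch (my illustration, independent of any toolkit):

    def wer(ref, hyp):
        """Word error rate: (S + D + I) / len(ref), via edit distance."""
        r, h = ref.split(), hyp.split()
        # d[i][j] = edit distance between r[:i] and h[:j]
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i                       # deletions
        for j in range(len(h) + 1):
            d[0][j] = j                       # insertions
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                sub = d[i-1][j-1] + (r[i-1] != h[j-1])
                d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
        return d[len(r)][len(h)] / len(r)     # assumes a non-empty reference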


Page 11:

Covariance modeling


Page 12:

Covariance modeling

[Figure: covariance modeling with less training data]

Page 13:

ML training of sparse models

• Maximum likelihood (ML) training:

  \hat{\Theta} = \arg\max_{\Theta} \{\log P(\mathbf{O} \mid \Theta)\}

• The proposed new objective function adds an \ell_1 penalty on the precision matrices C_{im} (Gaussian m of emitting state i):

  L(\Theta) = \log P(\mathbf{O} \mid \Theta) - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \| C_{im} \|_1


Page 14:

Maximizing the auxiliary function


• The precision matrices can be updated using

  \hat{C}_{im} = \arg\max_{C_{im} \succ 0} \{\log\det C_{im} - \operatorname{trace}(S_{im} C_{im}) - \lambda \| C_{im} \|_1\}

  – \lambda = 2\rho / \gamma_{im} (with \gamma_{im} the occupation count) and S_{im} is the sample covariance matrix.

  – Solvable by generic convex optimization or by more efficient methods (e.g., the graphical lasso)
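This per-Gaussian subproblem has exactly the graphical-lasso form, so off-the-shelf solvers apply. A minimal sketch with scikit-learn (my illustration, not the authors' implementation; `S_im` and `lam` are stand-in names):

    import numpy as np
    from sklearn.covariance import graphical_lasso

    # S_im: sample covariance of the data softly assigned to Gaussian (i, m)
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 39))   # stand-in for occupancy-weighted features
    S_im = np.cov(X, rowvar=False)

    lam = 0.1                            # plays the role of 2*rho / gamma_im
    cov_im, C_im = graphical_lasso(S_im, alpha=lam)
    print("zero entries in precision:", np.sum(C_im == 0))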


Page 15:

Experiments on the WSJ data

• Experimental setup
  – Training, development and testing data sets
  – Standard bigram language model
  – Feature vector: 39-dimension MFCC
  – 39 phonemes for English (39³ possible triphones)
  – 2843 tied HMM states


Page 16:

Tuning results on the dev. data set


Page 17:

WER on the testing data set

• Our result of 8.77% WER is comparable to the 8.6% WER reported in (Ko & Mak, 2011), which used a similar testing configuration but 70 hours of training data

Model type   #Gaussians   WER (%)   Rel. improv.   Significant?
Full          2           10.5      −7.1%          No
Diagonal     10            9.84     ----           ----
Sparse        4            8.77     10.9%          Yes


Page 18:

Sparse banded models

[Figure: sparse models → (feature reordering) → sparse banded models]


Page 19:

Training of sparse banded models

• Weighted lasso penalty:

  f(C_{im}) = -\| H * C_{im} \|_1

  where * is the element-wise (Hadamard) product with a weight matrix H

• H(k, l) = \infty \implies C_{im}(k, l) = 0
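A banded structure can thus be encoded in H by giving within-band entries a finite weight and out-of-band entries an infinite one. A minimal numpy sketch (my illustration; `band_weight_matrix` is a hypothetical helper, and a weighted graphical-lasso solver supporting per-entry penalties would be needed to use it, since scikit-learn's graphical_lasso accepts only a scalar penalty):

    import numpy as np

    def band_weight_matrix(d, b, rho):
        # H[k, l] = rho inside the band (|k - l| <= b), infinity outside,
        # which forces the corresponding precision entries to zero.
        idx = np.arange(d)
        return np.where(np.abs(idx[:, None] - idx[None, :]) <= b, rho, np.inf)

    H = band_weight_matrix(d=39, b=8, rho=0.1)   # the "Band8" model in the results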

[Figure: weight-matrix patterns for diagonal, sparse banded, and full models]

Page 20:

Importance of the feature order

• \mathbf{O} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}); C = \boldsymbol{\Sigma}^{-1}; C_{ij} = 0 \implies o_i and o_j are conditionally independent (CI) given the other variables.

• Rearrange the feature order so that o_i and o_j are CI if |i − j| > b

• Three orders are investigated (a reordering sketch follows the list):
  – HTK order: m_1 ⋯ m_13, Δm_1 ⋯ Δm_13, ΔΔm_1 ⋯ ΔΔm_13
  – Knowledge-based order: m_1 Δm_1 ΔΔm_1, ⋯, m_13 Δm_13 ΔΔm_13
  – Data-driven order: m_1 ΔΔm_1 ⋯ Δm_6 Δm_10
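As an illustration of such a reordering (a minimal numpy sketch under my own naming, not the authors' code), the knowledge-based order is a fixed permutation of the 39-dimensional HTK feature vector:

    import numpy as np

    # HTK stacks the 39-dim MFCC vector as [m1..m13, Δm1..Δm13, ΔΔm1..ΔΔm13].
    # The knowledge-based order interleaves each coefficient with its deltas:
    # [m1, Δm1, ΔΔm1, m2, Δm2, ΔΔm2, ...], so strongly correlated features sit
    # near the diagonal and the precision matrix becomes approximately banded.
    htk_index = np.arange(39)
    knowledge_order = np.stack(
        [htk_index[:13], htk_index[13:26], htk_index[26:]], axis=1
    ).ravel()

    o = np.random.randn(39)            # a stand-in MFCC feature vector
    o_reordered = o[knowledge_order]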


Page 21:

Results on the development data


Page 22:

Results on the test data

Model type   #Gaussians   WER (%)   Rel. improv.   Significant?
Full          2           10.5      −7.1%          No
Diagonal     10            9.84     ----           ----
Sparse        4            8.77     10.9%          Yes
Band8         4            8.91      9.5%          Yes


Page 23:

Decoding time

• Sparse banded models are the fastest since: 1) they allow smaller search beam-widths; 2) they have fewer model parameters.


Page 24:

Discriminative training

• MMI objective function:

  \hat{\Theta} = \arg\max_{\Theta} \log P(\mathbf{w}_r \mid \mathbf{O}, \Theta)

• New objective function (the same \ell_1 regularization as in ML training):

  L(\Theta) = \log P(\mathbf{w}_r \mid \mathbf{O}, \Theta) - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \| C_{im} \|_1

• A valid weak-sense auxiliary function is

  Q(\Theta; \Theta') = Q^n(\Theta; \Theta') - Q^d(\Theta; \Theta') + Q^s(\Theta; \Theta') + Q^I(\Theta; \Theta') - \rho \sum_{i=2}^{S-1} \sum_{m=1}^{M} \| C_{im} \|_1

  – Per the slide's callouts: the numerator/denominator terms Q^n and Q^d come from the MMI criterion, Q^s ensures stability, Q^I improves generalization, and the final \ell_1 term is the regularization, handled the same way as in ML training.
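For reference, the posterior in the MMI criterion expands by Bayes' rule over all competing hypotheses (standard MMI; this expansion is implied rather than written on the slide):

    P(\mathbf{w}_r \mid \mathbf{O}, \Theta) = \frac{P(\mathbf{O} \mid \mathbf{w}_r, \Theta)\, P(\mathbf{w}_r)}{\sum_{\mathbf{w}} P(\mathbf{O} \mid \mathbf{w}, \Theta)\, P(\mathbf{w})}

so maximizing it boosts the correct transcription w_r while suppressing competing word sequences.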

Page 25:

Results on the WSJ testing data

Model type       #Gaussians   ML training WER (%)   MMI WER (%)
Full              2           11.68                 9.18
Diagonal         10            9.84                 9.04
Diagonal + STC   10            9.26                 8.66
Sparse            4            8.55                 8.05


Page 26:

Summary

• Sparse models effectively address the problems that conventional diagonal and full covariance models face: computational cost, incorrect model assumptions, and over-fitting when training data is insufficient.

• We derive the overall training process under the HMM framework using both maximum likelihood training and discriminative training.

• The proposed sparse models subsume the traditional diagonal and full covariance models as special cases.


Page 27:

Thank you!
