PhD Candidate: Tao Ma Advised by: Dr. Joseph Picone Institute for Signal and Information Processing (ISIP) Mississippi State University Linear Dynamic Model (LDM) for Automatic Speech Recognition

Page 1: Linear Dynamic Model (LDM) for Automatic Speech Recognition

PhD Candidate: Tao Ma

Advised by: Dr. Joseph Picone

Institute for Signal and Information Processing (ISIP)

Mississippi State University

Linear Dynamic Model (LDM) for Automatic Speech Recognition

Institute for Signal and Information Processing (ISIP) Page 2 of 20

An Example of a Kalman Filter (another name for the LDM)

Observation

A Kalman Filter models the position evolution

• In control systems engineering, the Kalman filter is used successfully to model systems with noisy observations

Filtering: Position at present time (remove noise effect)

Predicting: Position at a future time

Smoothing: Position at a time in the past
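
The filtering and predicting operations above can be sketched for a one-dimensional position tracker. This is a minimal illustration only: the constant-velocity motion model, the noise levels `q` and `r`, and the name `kalman_track` are assumptions and do not come from the slides. Smoothing is not shown because it needs an additional backward pass over the data.

```python
import numpy as np

def kalman_track(z, dt=1.0, q=1e-3, r=1.0):
    """Filter noisy 1-D position measurements z with a constant-velocity model."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition for [position, velocity]
    H = np.array([[1.0, 0.0]])              # only the position is observed
    Q = q * np.eye(2)                       # process noise covariance (assumed)
    R = np.array([[r]])                     # measurement noise covariance (assumed)
    x = np.array([z[0], 0.0])               # initial state guess
    P = 10.0 * np.eye(2)                    # initial uncertainty
    filtered = []
    for zt in z:
        # Predict: propagate the state and its covariance one step ahead.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update: correct the prediction with the new measurement.
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.array([zt]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        filtered.append(x[0])
    predicted_next = (F @ x)[0]             # position one step into the future
    return np.array(filtered), predicted_next

# Synthetic data: an object moving at 0.5 units per step, observed with unit-variance noise.
rng = np.random.default_rng(0)
true_pos = 0.5 * np.arange(200)
z = true_pos + rng.normal(0.0, 1.0, size=200)
filtered, predicted_next = kalman_track(z)
```

Comparing `filtered` with `z` against `true_pos` shows the noise reduction from filtering, and `predicted_next` is the one-step-ahead position estimate.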

Outline

• Why Linear Dynamic Model (LDM)?

• Linear Dynamic Model

• Pilot experiment: LDM phone classification on Aurora 4

• Hybrid HMM/LDM decoder architecture for LVCSR

• Future work

HMM & Speech Recognition System

Hidden Markov Models

Is HMM a perfect model for ASR?

• Progress on improving the accuracy of HMM-based systems has slowed in the past decade

• Theoretical drawbacks of HMMs

– False assumption that frames are independent and stationary

– Spatial correlation is ignored (diagonal covariance matrix)

– Limited discrete state space

[Figure: accuracy versus time, with curves labeled "Clean" and "Noisy"]

Motivation of Linear Dynamic Model (LDM) Research

• Motivation

– A model that reflects the characteristics of speech signals should ultimately lead to large improvements in ASR performance

– LDM incorporates frame-correlation information from speech signals, which has the potential to increase recognition accuracy

– The “filter” characteristic of LDM has the potential to improve the noise robustness of speech recognition

– Fast-growing computational capacity (thanks to Intel) makes it realistic to build a two-way HMM/LDM hybrid speech engine

State Space Model

• The Linear Dynamic Model (LDM) is derived from the state space model

• Equations of the state space model:

y: observation feature vector

x: corresponding internal state vector

h(): relationship function between y and x at current time

f(): relationship function between current state and all previous states

epsilon: noise component

eta: noise component
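
The equations themselves are not reproduced in this text. A general form consistent with the symbol definitions above might be written as follows. The time index t and the pairing of epsilon with the observation equation and eta with the state equation are assumptions.

```latex
\begin{align}
y_t &= h(x_t, \epsilon_t) \\
x_t &= f(x_{t-1}, x_{t-2}, \ldots, x_1, \eta_t)
\end{align}
```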

Linear Dynamic Model

• Equations of the Linear Dynamic Model (LDM)

– Current state is only determined by previous state

– H, F are linear transform matrices

– Epsilon and Eta are driving components

y: observation feature vector

x: corresponding internal state vector

H: linear transform matrix between y and x

F: linear transform matrix between current state and previous state

epsilon: driving component

eta: driving component
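
With the symbols defined above, an LDM can be simulated in a few lines. The sketch below assumes the common form x_t = F x_{t-1} + eta_t and y_t = H x_t + epsilon_t with zero-mean Gaussian driving components. The covariance names `Q` and `R`, the function name `simulate_ldm`, and all numeric values are illustrative assumptions rather than the author's settings.

```python
import numpy as np

def simulate_ldm(F, H, Q, R, x0, T, seed=0):
    """Draw T frames from an LDM (zero-mean Gaussian noise is an assumption):
         x_t = F x_{t-1} + eta_t,  eta_t ~ N(0, Q)   (state evolution)
         y_t = H x_t     + eps_t,  eps_t ~ N(0, R)   (observation)
    """
    rng = np.random.default_rng(seed)
    p, n = H.shape                      # observation and state dimensions
    x = np.zeros((T, n))
    y = np.zeros((T, p))
    prev = x0
    for t in range(T):
        # The current state depends only on the previous state.
        x[t] = F @ prev + rng.multivariate_normal(np.zeros(n), Q)
        # The observation is a noisy linear transform of the current state.
        y[t] = H @ x[t] + rng.multivariate_normal(np.zeros(p), R)
        prev = x[t]
    return x, y

# Toy example: a 2-dimensional hidden state observed through 3-dimensional features.
F = np.array([[0.9, 0.1], [0.0, 0.8]])
H = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]])
x, y = simulate_ldm(F, H, 0.01 * np.eye(2), 0.1 * np.eye(3), np.ones(2), T=50)
```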

Kalman filtering for state inference (E-Step of EM training)

[Figure: for a speech sound, panels labeled "Human Being Sound System" and "Kalman Filtering Estimation"]

RTS smoother for better inference

[Figure: standard Kalman filter compared with Kalman filter with RTS smoother]

• Rauch-Tung-Striebel (RTS) smoother

– Additional backward pass to minimize inference error

– During EM training, computes the expectations of state statistics
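
The forward Kalman pass and the backward RTS pass can be sketched as below. This uses the standard textbook recursions with zero-mean noise; the function names, the toy random-walk data, and the covariance names `Q` and `R` are assumptions for illustration, not the author's implementation.

```python
import numpy as np

def kalman_filter(y, F, H, Q, R, x0, P0):
    """Forward pass: predicted and filtered state means and covariances."""
    T, n = len(y), len(x0)
    xp = np.zeros((T, n)); Pp = np.zeros((T, n, n))   # one-step predictions
    xf = np.zeros((T, n)); Pf = np.zeros((T, n, n))   # filtered estimates
    x, P = x0, P0                                     # prior on the first state
    for t in range(T):
        if t > 0:
            x, P = F @ x, F @ P @ F.T + Q             # predict
        xp[t], Pp[t] = x, P
        S = H @ P @ H.T + R                           # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
        x = x + K @ (y[t] - H @ x)                    # update with observation t
        P = P - K @ H @ P
        xf[t], Pf[t] = x, P
    return xp, Pp, xf, Pf

def rts_smoother(xp, Pp, xf, Pf, F):
    """Backward pass: refine each filtered estimate using later observations."""
    T = len(xf)
    xs, Ps = xf.copy(), Pf.copy()                     # last frame is already optimal
    for t in range(T - 2, -1, -1):
        J = Pf[t] @ F.T @ np.linalg.inv(Pp[t + 1])    # smoother gain
        xs[t] = xf[t] + J @ (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + J @ (Ps[t + 1] - Pp[t + 1]) @ J.T
    return xs, Ps

# Toy data: a scalar random walk observed in unit-variance noise.
rng = np.random.default_rng(1)
T = 500
F = np.array([[1.0]]); H = np.array([[1.0]])
Q = np.array([[0.1]]); R = np.array([[1.0]])
x_true = np.cumsum(rng.normal(0.0, np.sqrt(0.1), T))
y = (x_true + rng.normal(0.0, 1.0, T)).reshape(T, 1)
xp, Pp, xf, Pf = kalman_filter(y, F, H, Q, R, np.zeros(1), np.eye(1))
xs, Ps = rts_smoother(xp, Pp, xf, Pf, F)
```

The smoothed covariances `Ps` are never larger than the filtered covariances `Pf`, which is the sense in which the backward pass reduces inference error.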

Maximum Likelihood Parameter Estimation (M-Step of EM training)

Nothing but matrix multiplication!
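
As a sketch of why the M-step is cheap, the standard closed-form updates for a zero-mean-noise LDM reduce to sums of outer products, matrix products, and two small matrix inverses. The function name `m_step`, the argument layout, and the covariance names `Q` and `R` are assumptions; the slides may use a different parameterization (for example, non-zero noise means).

```python
import numpy as np

def m_step(y, Ex, Exx, Exx_lag):
    """ML parameter updates from E-step expectations (zero-mean noise assumed).
    y:       (T, p)       observations
    Ex:      (T, n)       E[x_t | y]
    Exx:     (T, n, n)    E[x_t x_t^T | y]
    Exx_lag: (T-1, n, n)  E[x_t x_{t-1}^T | y] for t = 1..T-1
    """
    T = len(y)
    yx = y.T @ Ex                                   # sum_t y_t E[x_t]^T
    H = yx @ np.linalg.inv(Exx.sum(axis=0))         # observation matrix
    R = (y.T @ y - H @ yx.T) / T                    # observation noise covariance
    lag = Exx_lag.sum(axis=0)                       # sum_t E[x_t x_{t-1}^T]
    F = lag @ np.linalg.inv(Exx[:-1].sum(axis=0))   # state transition matrix
    Q = (Exx[1:].sum(axis=0) - F @ lag.T) / (T - 1) # state noise covariance
    return F, H, Q, R

# Sanity check on noise-free data, where the states are known exactly and the
# expectations reduce to plain outer products.
theta = 0.3
F_true = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                          [np.sin(theta),  np.cos(theta)]])
H_true = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x = np.zeros((30, 2)); x[0] = [1.0, 0.0]
for t in range(1, 30):
    x[t] = F_true @ x[t - 1]
y = x @ H_true.T
Exx = np.einsum('ti,tj->tij', x, x)
Exx_lag = np.einsum('ti,tj->tij', x[1:], x[:-1])
F_hat, H_hat, Q_hat, R_hat = m_step(y, x, Exx, Exx_lag)
```

In real EM training, `Ex`, `Exx`, and `Exx_lag` would come from the smoother rather than from known states.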

LDM parameters (one set per phone): aa, ae, ah, ao, aw, ay, b, ch, d, dh, eh, er, ……

LDM for Speech Classification

[Figure: two panels, "HMM-Based Recognition" and "LDM-Based Recognition". Labels in each panel: MFCC Feature; phone models aa, ch, eh, …; x, y; Hypothesis. The diagram also shows state estimates x^.]
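
One plausible way to turn per-phone LDMs into a classifier is to score a segment of feature frames under each model and pick the most likely phone. The sketch below computes the log-likelihood from Kalman filter innovations. Maximum-likelihood selection, the function names, and the two toy scalar models labeled "aa" and "ch" are assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def ldm_log_likelihood(y, F, H, Q, R, x0, P0):
    """Log-likelihood of a segment y (T, p) under one LDM, via Kalman innovations."""
    x, P, ll = x0, P0, 0.0
    p = y.shape[1]
    for t in range(len(y)):
        if t > 0:
            x, P = F @ x, F @ P @ F.T + Q           # predict the next state
        e = y[t] - H @ x                            # innovation (prediction error)
        S = H @ P @ H.T + R                         # innovation covariance
        ll += -0.5 * (p * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]
                      + e @ np.linalg.solve(S, e))
        K = P @ H.T @ np.linalg.inv(S)              # Kalman gain
        x, P = x + K @ e, P - K @ H @ P             # update
    return ll

def classify_segment(y, models):
    """Return the label whose LDM assigns the segment the highest log-likelihood."""
    scores = {phone: ldm_log_likelihood(y, *params) for phone, params in models.items()}
    return max(scores, key=scores.get), scores

# Two hypothetical scalar phone models that differ only in their dynamics F.
I = np.eye(1)
models = {
    "aa": (0.95 * I, I, 0.1 * I, 0.1 * I, np.zeros(1), I),
    "ch": (-0.5 * I, I, 0.1 * I, 0.1 * I, np.zeros(1), I),
}

# Generate a segment from the "aa" dynamics and classify it.
rng = np.random.default_rng(0)
state, seg = 0.0, []
for _ in range(200):
    state = 0.95 * state + rng.normal(0.0, np.sqrt(0.1))
    seg.append([state + rng.normal(0.0, np.sqrt(0.1))])
label, scores = classify_segment(np.array(seg), models)
```

Because the whole segment is scored at once, the segment boundaries must be known before classification.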

Challenges of Applying LDM to ASR

• Segment-based model

– Frame-to-phoneme information is needed before classification

• EM training is sensitive to state initialization

– Each phoneme is modeled by an LDM; EM training finds a set of parameters for each specific LDM

– No good mechanism for state initialization yet

• More parameters than an HMM (2 to 3 times)

– The current model is monophone; building a triphone model for LVCSR would need more training data

Pilot experiment: phone classification on Aurora 4

• Aurora 4: Wall Street Journal + six kinds of noise

– Airport, Babble, Car, Restaurant, Street, and Train

• Frame-to-phone alignment is generated by the ISIP decoder (forced-alignment mode)

– Adding a language model gives 93% accuracy on clean data

• 40 phones, one-vs.-all classifier

Model    Clean dataset (Acc)    Noisy dataset (Acc)

HMM      46.9%                  36.8%

LDM      49.2%                  39.2%

Hybrid HMM/LDM decoder architecture for LVCSR

[Figure: hybrid decoder diagram with blocks labeled "Confidence Measurement" and "Best Hypothesis"]

Status and future work

• The development of the HMM/LDM hybrid decoder is still in progress

– The HMM/LDM hybrid decoder is expected to be done in 2009

– The ISIP HMM/SVM hybrid decoder acts as the reference for implementation

• Future work

– Research has demonstrated nonlinear effects in speech signals

– Investigate the possibility of replacing Kalman filtering with nonlinear filtering (such as the Unscented Kalman Filter or the Extended Kalman Filter)

Thank you!

Questions?

References

• Digalakis, V., “Segment-based Stochastic Models of Spectral Dynamics for Continuous Speech Recognition,” Ph.D. Dissertation, Boston University, Boston, Massachusetts, USA, 1992.

• Digalakis, V., Rohlicek, J. and Ostendorf, M., “ML Estimation of a Stochastic Linear System with the EM Algorithm and Its Application to Speech Recognition,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 4, pp. 431–442, October 1993.

• Frankel, J., “Linear Dynamic Models for Automatic Speech Recognition,” Ph.D. Dissertation, The Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK, 2003.