Linear Dynamic Model (LDM) for Automatic Speech Recognition

PhD Candidate: Tao MaAdvised by: Dr. Joseph Picone

Institute for Signal and Information Processing (ISIP)Mississippi State University

Linear Dynamic Model (LDM) for Automatic Speech Recognition

Institute for Signal and Information Processing (ISIP) Page 2 of 20

An Example of Kalman Filter (another name of LDM)

Observation

A Kalman Filter models the position evolution

• In control system engineering, Kalman Filter succeeds to model a system with noisy observations

Filtering: Position at present time (remove noise effect)

Predicting: Position at a future time

Smoothing: Position at a time in the past


Outline

• Why Linear Dynamic Model (LDM)?

• Linear Dynamic Model

• Pilot experiment: LDM phone classification on Aurora 4

• Hybrid HMM/LDM decoder architecture for LVCSR

• Future work


HMM & Speech Recognition System

Hidden Markov Models


Is HMM a perfect model for ASR?

• Progress on improving the accuracy of HMM-based system has slowed in the past decade

• Theory drawbacks of HMM– False assumption that frames are independent and stationary– Spatial correlation is ignored (diagonal covariance matrix)– Limited discrete state space

Accuracy

Time

Clean

Noisy


Motivation of Linear Dynamic Model (LDM) Research

• Motivation– A model which reflects the characteristics of speech signals will

ultimately lead to great ASR performance improvement

– LDM incorporates frame correlation information of speech signals, which is potential to increase recognition accuracy

– “Filter” characteristic of LDM has potential to improve noise robustness of speech recognition

– Fast growing computation capacity (thanks to Intel) make it realistic to build a two-way HMM/LDM hybrid speech engine


State Space Model

• Linear Dynamic Model (LDM) is derived from State Space Model

• Equations of State Space Model:

y: observation feature vector

x: corresponding internal state vector

h(): relationship function between y and x at current time

f(): relationship function between current state and all previous states

epsilon: noise component

eta: noise component


Linear Dynamic Model

• Equations of Linear Dynamic Model (LDM)– Current state is only determined by previous state– H, F are linear transform matrices– Epsilon and Eta are driving components

y: observation feature vector

x: corresponding internal state vector

H: linear transform matrix between y and x

F: linear transform matrix between current state and previous state

epsilon: driving component

eta: driving component


Kalman filtering for state inference (E-Step of EM training)

Human Being Sound System

Kalman Filtering Estimation

e

For a speech sound,


RTS smoother for better inference

Standard Kalman Filter Kalman Filter with RTS smoother

• Rauch-Tung-Striebel (RTS) smoother–Additional backward pass to minimize inference error–During EM training, computes the expectations of state statistics


Maximum Likelihood Parameter Estimation (M-Step of EM training)

Nothing but matrix multiplication!

LDM Parametersaa

ae

ah

ao

aw

ay

b

ch

d

dh

eh

er……

…


LDM for Speech Classification

MFCC Feature

………

aa

ch

eh

x y

HMM-Based Recognition

LDM-Based Recognition

MFCC Feature

………

aa

ch

eh

x y

Hypothesis

x^

x^

x^

x^

x^

x^Hypothesis


Challenges of Applying LDM to ASR

• Segment-based model–frame-to-phoneme information is needed before classification

• EM training is sensitive to state initialization–Each phoneme is modeled by a LDM, EM training is to find a set of parameters for a specific LDM–No good mechanism for state initialization yet

• More parameters than HMM (2~3x)–Currently mono-phone model, to build a tri-phone model for LVCSR would need more training data


Pilot experiment: phone classification on Aurora 4

• Aurora 4: Wall Street Journal + six kinds of noises–Airport, Babble, Car, Restaurant, Street, and Train

• Frame-to-phone alignment is generated by ISIP decoder (force align mode)

– Adding language model will get 93% accuracy for clean data

• 40 phones, one vs. all classifier

modelclean dataset

(Acc)noisy dataset

(Acc)

HMM 46.9% 36.8%

LDM 49.2% 39.2%


Hybrid HMM/LDM decoder architecture for LVCSR

Confidence Measurement

Best Hypothesis


Status and future work

• The development of HMM/LDM hybrid decoder is still in progress

–HMM/LDM hybrid decoder is Expected to be done in 2009–ISIP HMM/SVM hybrid decoder acts as the reference for implementation

• Future work–Research has proved the nonlinear effects in speech signals–Investigate the probability of replacing Kalman filtering with nonlinear filtering (such as Unscented Kalman Filter, Extended Kalman Filter)


Thank you!

Questions?


References

• Digalakis, V., “Segment-based Stochastic Models of Spectral Dynamics for Continuous Speech Recognition,” Ph.D. Dissertation, Boston University, Boston, Massachusetts, USA, 1992.

• Digalakis, V., Rohlicek, J. and Ostendorf, M., “ML Estimation of a Stochastic Linear System with the EM Algorithm and Its Application to Speech Recognition,” IEEE Transactions on Speech and Audio Processing, vol. 1, no. 4, pp. 431–442, October 1993.

• Frankel, J., “Linear Dynamic Models for Automatic Speech Recognition,” Ph.D. Dissertation, The Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK, 2003.

Documents

Linear Dynamic Model (LDM) for Automatic Speech Recognition