SPEECH RECOGNITION: An Overview of Statistical Modeling of Acoustics

Joseph Picone, PhD

Intelligent Electronic Systems

Human and Systems Engineering

Department of Electrical and Computer Engineering

JHU Summer School on Human Language Technology (2005)

Abstract and Biography

ABSTRACT: Speech technology has quietly become a pervasive influence in our daily lives despite widespread concerns about research progress over the past 20 years. Central to this progress has been the use of advanced statistical models such as hidden Markov models to explain (and predict) variations in the acoustic signal. Generative models that attempt to explain variation in the training data have given way to discriminative models that attempt to directly optimize objective measures such as word error rate. In this talk, we will present a unified view of the acoustic modeling problem and describe typical components in a state of the art speech recognition system.

BIOGRAPHY: Joseph Picone is a Professor in the Department of Electrical and Computer Engineering at Mississippi State University, where he also directs the Intelligent Electronic Systems program at the Center for Advanced Vehicular Systems. He is currently on sabbatical with the Department of Defense. His principal research interests are the development of new statistical modeling techniques for speech recognition. He has previously been employed by Texas Instruments and AT&T Bell Laboratories. Dr. Picone received his Ph.D. in Electrical Engineering from Illinois Institute of Technology in 1983. He is a Senior Member of the IEEE.


Fundamental Challenges: Generalization and Risk

• Fundamental challenge: diversity of data that often defies mathematical descriptions or physical constraints.

• Solution: Can we integrate multiple knowledge sources using principles of risk minimization?

• Why research human language technology?

“Language is the preeminent trait of the human species.”

“I never met someone who wasn’t interested in language.”

“I decided to work on language because it seemed to be the hardest problem to solve.”


Speech Recognition Is Information Extraction

• Traditional Output: best word sequence, time alignment of information

• Other Outputs: word graphs, N-best sentences, confidence measures, metadata such as speaker identity, accent, and prosody

• Applications: information localization, data mining, emotional state, stress, fatigue, deception


What Makes Acoustic Modeling So Challenging?


Variations in Signal Measurements Are Real

• Comparison of “aa” in “lOck” and “iy” in “bEAt” for conversational speech

• Regions of overlap represent classification error

• Reduce overlap by introducing acoustic and linguistic context


Statistical Approach: Noisy Communication Channel Model


Information Theoretic Basis

• Given an observation sequence, O, and a word sequence, W, we want minimal uncertainty about the correct answer (i.e., minimize the conditional entropy):

$$H(W|O) = -\sum_{w,o} P(W=w, O=o) \log P(W=w \mid O=o)$$

• To accomplish this, the probability of the word sequence given the observation must increase.

• The mutual information, I(W;O), between W and O:

$$I(W;O) = H(W) - H(W|O)$$

$$H(W|O) = H(W) - I(W;O)$$

• Two choices: minimize H(W) or maximize I(W;O)
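The identities above can be checked numerically. The following is a minimal sketch, assuming a hypothetical joint distribution over two words and two observation classes; the probabilities and the names `joint`, `marginal`, `entropy`, and `conditional_entropy` are illustrative and do not come from the talk.

```python
import math

# Hypothetical joint distribution P(W=w, O=o); illustrative numbers only.
joint = {
    ("yes", "o1"): 0.4,
    ("yes", "o2"): 0.1,
    ("no", "o1"): 0.1,
    ("no", "o2"): 0.4,
}


def marginal(joint, axis):
    """Sum the joint distribution over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out


def entropy(dist):
    """H(X) = -sum_x P(x) log2 P(x), in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def conditional_entropy(joint):
    """H(W|O) = -sum_{w,o} P(w,o) log2 P(w|o), in bits."""
    p_o = marginal(joint, 1)
    return -sum(
        p * math.log2(p / p_o[o]) for (w, o), p in joint.items() if p > 0
    )


h_w = entropy(marginal(joint, 0))         # uncertainty about W alone
h_w_given_o = conditional_entropy(joint)  # uncertainty left after seeing O
mutual_info = h_w - h_w_given_o           # I(W;O) = H(W) - H(W|O)

print(f"H(W) = {h_w:.4f} bits")
print(f"H(W|O) = {h_w_given_o:.4f} bits")
print(f"I(W;O) = {mutual_info:.4f} bits")
```

In this toy case the observation removes part of the uncertainty about the word, so H(W|O) is smaller than H(W) and the mutual information is positive.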


Relationship to Maximum Likelihood Methods

• Maximizing the mutual information is equivalent to choosing the parameter set to maximize:

$$F_{MMIE} = \sum_{t=1}^{R} \log \frac{P(O_t \mid M_{w_t})\, P(w_t)}{\sum_{w} P(O_t \mid M_w)\, P(w)}$$

• Maximization implies increasing the numerator term (maximum likelihood estimation – MLE) or decreasing the denominator term (maximum mutual information estimation – MMIE)

• The latter is accomplished by reducing the probabilities of incorrect, or competing, hypotheses.
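A small numerical sketch can make the two routes to maximization concrete. It assumes a single utterance with two competing hypotheses and made-up likelihoods and priors; the function name `mmie_objective` and the data layout are hypothetical, chosen only for illustration.

```python
import math


def mmie_objective(utterances):
    """Sum over utterances of log[P(O|M_w)P(w) / sum_w' P(O|M_w')P(w')].

    Each utterance is (correct_word, scores), where scores maps every
    hypothesis w to a pair (P(O | M_w), P(w)).
    """
    total = 0.0
    for correct, scores in utterances:
        likelihood, prior = scores[correct]
        numerator = likelihood * prior
        # The denominator sums over all hypotheses, including the correct one.
        denominator = sum(lik * pri for lik, pri in scores.values())
        total += math.log(numerator / denominator)
    return total


# Illustrative values only: one utterance whose correct transcription is "yes".
baseline = [("yes", {"yes": (0.02, 0.5), "no": (0.01, 0.5)})]
# Raise the likelihood of the correct hypothesis (the MLE direction).
higher_numerator = [("yes", {"yes": (0.04, 0.5), "no": (0.01, 0.5)})]
# Lower the likelihood of the competing hypothesis (the MMIE direction).
lower_competitor = [("yes", {"yes": (0.02, 0.5), "no": (0.005, 0.5)})]

f_base = mmie_objective(baseline)
f_num = mmie_objective(higher_numerator)
f_den = mmie_objective(lower_competitor)

print(f_base, f_num, f_den)
```

Both changes increase the objective relative to the baseline; the second does so without touching the model of the correct word, which is the sense in which discriminative training penalizes competing hypotheses.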


Speech Recognition Architectures

Core components:

• transduction

• feature extraction

• acoustic modeling (hidden Markov models)

• language modeling (statistical N-grams)

• search (Viterbi beam)

• knowledge sources

Our focus will be on the acoustic modeling components of the system.


Signal Processing in Speech Recognition


Feature Extraction in Speech Recognition


Adding More Knowledge to the Front End


Noise Compensation Techniques


Acoustic Modeling: Hidden Markov Models


Markov Chains and Hidden Markov Models


Why “Hidden” Markov Models?


Doubly Stochastic Systems

• The 1-coin model is observable because the output sequence can be mapped to a specific sequence of state transitions

• The remaining models are hidden because the underlying state sequence cannot be directly inferred from the output sequence
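To illustrate what "hidden" means in practice, here is a minimal sketch of a two-coin model in which the state is the coin being tossed and only heads or tails is observed. Because the state sequence cannot be read off the output, the likelihood of an observation sequence is obtained by summing over all state paths with the forward algorithm. All probabilities and names below are assumptions made for the example, not values from the talk.

```python
def forward_likelihood(obs, initial, transition, emission):
    """Return P(obs | model) using the forward algorithm.

    initial[s]        : probability of starting in state s
    transition[p][s]  : probability of moving from state p to state s
    emission[s][sym]  : probability that state s emits symbol sym
    """
    states = range(len(initial))
    # alpha[s] = P(observations so far, current state = s)
    alpha = [initial[s] * emission[s][obs[0]] for s in states]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in states)
            * emission[s][symbol]
            for s in states
        ]
    return sum(alpha)


# Hypothetical two-coin model: coin 0 is fair, coin 1 is biased toward heads.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
emission = [{"H": 0.5, "T": 0.5}, {"H": 0.9, "T": 0.1}]

likelihood = forward_likelihood("HHT", initial, transition, emission)
print(likelihood)
```

The same output sequence can be produced by many different coin sequences, so the state sequence has to be inferred rather than observed; the forward recursion accounts for every such path at a cost linear in the sequence length.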


Discrete Markov Models


Markov Models Are Computationally Simple


Training Recipes Are Complex And Iterative


Bootstrapping Is Key In Parameter Reestimation


The Expectation-Maximization Algorithm (EM)


Controlling Model Complexity


Data-Driven Parameter Sharing Is Crucial


Context-Dependent Acoustic Units


Machine Learning in Acoustic Modeling

• Structural optimization often guided by an Occam’s Razor approach

• Trading goodness of fit and model complexity
  – Examples: MDL, BIC, AIC, Structural Risk Minimization, Automatic Relevance Determination

[Figure: error versus model complexity, showing the training set error and open-loop error curves and the optimum.]
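Two of the criteria named above, AIC and BIC, can be sketched in a few lines. The example assumes three hypothetical Gaussian mixture configurations with made-up training log-likelihoods; the parameter counts assume 39-dimensional diagonal-covariance Gaussians, which is an assumption for illustration rather than something stated in the talk.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_samples):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return num_params * math.log(num_samples) - 2 * log_likelihood


# Hypothetical candidates: name -> (number of parameters, training log-likelihood).
# The training log-likelihood always improves as the model grows.
candidates = {
    "1 mixture": (78, -52000.0),
    "4 mixtures": (315, -50500.0),
    "16 mixtures": (1263, -50100.0),
}
num_samples = 10000  # assumed number of training frames

best_by_fit = max(candidates, key=lambda name: candidates[name][1])
best_by_aic = min(
    candidates, key=lambda name: aic(candidates[name][1], candidates[name][0])
)
best_by_bic = min(
    candidates,
    key=lambda name: bic(candidates[name][1], candidates[name][0], num_samples),
)

print(best_by_fit, best_by_aic, best_by_bic)
```

With these numbers, goodness of fit alone favors the largest model, while both penalized criteria select the intermediate one, which mirrors the gap between training set error and open-loop error.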


Summary

• What we haven’t talked about: duration models, adaptation, normalization, confidence measures, posterior-based scoring, hybrid systems, discriminative training, and much, much more…

• Applications of these models to language (Hazen), dialog (Phillips, Seneff), machine translation (Vogel, Papineni), and other HLT applications

• Machine learning approaches to human language technology are still in their infancy (Bilmes)

• A mathematical framework for integration of knowledge and metadata will be critical in the next 10 years.

• Information extraction in a multilingual environment -- a time of great opportunity!


Appendix: Relevant Publications

Useful textbooks:

1. X. Huang, A. Acero, and H.W. Hon, Spoken Language Processing - A Guide to Theory, Algorithm, and System Development, Prentice Hall, ISBN: 0-13-022616-5, 2001.

2. D. Jurafsky and J.H. Martin, SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice-Hall, ISBN: 0-13-095069-6, 2000.

3. F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, ISBN: 0-262-10066-5, 1998.

4. L.R. Rabiner and B.W. Juang, Fundamentals of Speech Recognition, Prentice-Hall, ISBN: 0-13-015157-2, 1993.

5. J. Deller, et al., Discrete-Time Processing of Speech Signals, MacMillan Publishing Co., ISBN: 0-7803-5386-2, 2000.

6. R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Second Edition, Wiley Interscience, ISBN: 0-471-05669-3, 2000 (supporting material available at http://rii.ricoh.com/~stork/DHS.html).

7. D. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.

Relevant online resources:

1. “Intelligent Electronic Systems,” http://www.cavs.msstate.edu/hse/ies, Center for Advanced Vehicular Systems, Mississippi State University, Mississippi State, Mississippi, USA, June 2005.

2. “Internet-Accessible Speech Recognition Technology,” http://www.cavs.msstate.edu/hse/ies/projects/speech, June 2005.

3. “Speech and Signal Processing Demonstrations,” http://www.cavs.msstate.edu/hse/ies/projects/speech/software/demonstrations, June 2005.

4. “Fundamentals of Speech Recognition,” http://www.isip.msstate.edu/publications/courses/ece_8463, September 2004.


Appendix: Relevant Resources

• Foundation Classes: generic C++ implementations of many popular statistical modeling approaches

• Fun Stuff: have you seen our campus bus tracking system? Or our Home Shopping Channel commercial?

• Interactive Software: Java applets, GUIs, dialog systems, code generators, and more

• Speech Recognition Toolkits: compare SVMs and RVMs to standard approaches using a state of the art ASR toolkit


Appendix: Public Domain Speech Recognition Technology

• Speech recognition: state of the art, statistical (e.g., HMM), continuous speech, large vocabulary, speaker independent

• Goal: accelerate research; flexible, extensible, modular; efficient (C++, parallel processing); easy to use (documentation); toolkits, GUIs

• Benefit: technology, standard benchmarks, conversational speech


Appendix: IES Is More Than Just Software

• Extensive online software documentation, tutorials, and training materials

• Graduate courses and web-based instruction

• Self-documenting software

• Summer workshops at which students receive intensive hands-on training

• Jointly develop advanced prototypes in partnerships with commercial entities


Appendix: Nonlinear Statistical Modeling of Speech

Expected outcomes:

• Reduced complexity of statistical models for speech (two orders of magnitude reduction)

• High performance channel-independent text-independent speaker verification/identification

“Though linear statistical models have dominated the literature for the past 100 years, they have yet to explain simple physical phenomena.”

• Motivated by a phase-locked loop analogy

• Application of principles of chaos and strange attractor theory to acoustic modeling in speech

• Baseline comparisons to other nonlinear methods


Appendix: An Algorithm Retrospective of HLT

[Figure: timeline from 1950 to 2020 with the labels Analog Systems, Open Loop Analysis, Discriminative Methods, Expert Systems, Statistical Methods (Generative), and Knowledge Integration.]

Observations:

• Information theory preceded modern computing.

• Early research focused on basic science.

• Computing capacity has enabled engineering methods.

• We are now “knowledge-challenged.”


A Historical Perspective of Prominent Disciplines

[Figure: timeline from 1950 to 2020 with the following disciplines.]

Physical Sciences: Physics, Acoustics, Linguistics

Cognitive Sciences: Psychology, Neurophysiology

Engineering Sciences: EE, CPE, Human Factors

Computing Sciences: Comp. Sci., Comp. Ling.

Observations:

• Field continually accumulating new expertise.

• As obvious mathematical techniques have been exhausted (“low-hanging fruit”), there will be a return to basic science (e.g., fMRI brain activity imaging).


Evolution of Knowledge and Intelligence in HLT Systems

[Figure: performance versus source of knowledge.]

• A priori expert knowledge created a generation of highly constrained systems (e.g., isolated word recognition, parsing of written text, fixed-font OCR).

• Statistical methods created a generation of data-driven approaches that supplanted expert systems (e.g., conversational speech to text, speech synthesis, machine translation from parallel text).

… but that isn’t the end of the story …

• A number of fundamental problems still remain (e.g., channel and noise robustness, less dense or less common languages).

• The solution will require approaches that use expert knowledge from related, more dense domains (e.g., similar languages) and the ability to learn from small amounts of target data (e.g., autonomic).


Appendix: The Impact of Supercomputers on Research

• Total available cycles for speech research from 1983 to 1993: 90 TeraMIPS

• A Day in a Life: 24 hours of idle time on a modern supercomputer is equivalent to 10 years of speech research at Texas Instruments!

• MS State Empire cluster (1,000 1 GHz processors): 90 TeraMIPS per day

• Cost: $1M is the nominal cost for scientific computing (from a 1 MIP VAX in 1983 to a 1,000-node supercomputer)
