LML Speech Recognition 2008

Speech Recognition: Introduction I
E.M. Bakker

Speech Recognition
• Some Applications
• An Overview
• General Architecture
• Speech Production
• Speech Perception
Speech Recognition
Goal: Automatically extract the string of words spoken from the speech signal.

Speech Signal → Speech Recognition → Words (“How are you?”)
Speech Recognition
Goal: Automatically extract the string of words spoken from the speech signal.

Speech Signal → Speech Recognition → Words (“How are you?”)

How is SPEECH produced? => Characteristics of the Acoustic Signal
Speech Recognition
Goal: Automatically extract the string of words spoken from the speech signal.

Speech Signal → Speech Recognition → Words (“How are you?”)

How is SPEECH perceived? => Important Features
Speech Recognition
Goal: Automatically extract the string of words spoken from the speech signal.

Speech Signal → Speech Recognition → Words (“How are you?”)

What LANGUAGE is spoken? => Language Model
Speech Recognition
Goal: Automatically extract the string of words spoken from the speech signal.

Speech Signal → Speech Recognition → Words (“How are you?”)

What is in the BOX?

Input Speech → Acoustic Front-end → Search → Recognized Utterance
• Acoustic Models P(A|W)
• Language Model P(W)
Important Components of a General SR Architecture
• Speech Signals
• Signal Processing Functions
• Parameterization
• Acoustic Modeling (Learning Phase)
• Language Modeling (Learning Phase)
• Search Algorithms and Data Structures
• Evaluation
Recognition Architectures: A Communication Theoretic Approach

Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel
Observable: Message → Words → Sounds → Features

Speech recognition problem: find P(W|A), where A is the acoustic signal and W the words spoken.

Objective: minimize the word error rate.
Approach: maximize P(W|A) during training.

Components:
• P(A|W): acoustic model (hidden Markov models, mixtures)
• P(W): language model (statistical, finite-state networks, etc.)

The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).

Bayesian formulation for speech recognition:
• P(W|A) = P(A|W) P(W) / P(A), where A is the acoustic signal and W the words spoken.
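Since P(A) does not depend on the hypothesized word sequence W, the Bayesian formulation above reduces at decoding time to the standard argmax:

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid A)
        = \operatorname*{arg\,max}_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        = \operatorname*{arg\,max}_{W} P(A \mid W)\,P(W)
```

This is why the recognizer only needs the acoustic model P(A|W) and the language model P(W); the evidence term P(A) is a constant for a given utterance.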
Recognition Architectures

Input Speech → Acoustic Front-end → Search → Recognized Utterance
(Acoustic Models P(A|W) and the Language Model P(W) feed the search.)

Acoustic Front-end:
• The signal is converted to a sequence of feature vectors based on spectral and temporal measurements.

Acoustic Models P(A|W):
• Acoustic models represent sub-word units, such as phonemes, as finite-state machines in which:
  • states model spectral structure, and
  • transitions model temporal structure.

Search:
• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.
• The language model P(W) predicts the next set of words and controls which models are hypothesized.
ASR Architecture
• Common Base Classes, Configuration and Specification
• Speech Database, I/O
• Feature Extraction
• Recognition: Searching Strategies
• Evaluators
• Language Models
• HMM Initialisation and Training
Signal Processing Functionality
• Acoustic Transducers
• Sampling and Resampling
• Temporal Analysis
• Frequency Domain Analysis
• Cepstral Analysis
• Linear Prediction and LP-Based Representations
• Spectral Normalization
Acoustic Modeling: Feature Extraction

Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Energy + Mel-Spaced Cepstrum
  → Time Derivative → Delta Energy + Delta Cepstrum
  → Time Derivative → Delta-Delta Energy + Delta-Delta Cepstrum

• Incorporate knowledge of the nature of speech sounds in the measurement of the features.
• Utilize rudimentary models of human perception.
• Measure features 100 times per second.
• Use a 25 msec window for frequency-domain analysis.
• Include absolute energy and 12 spectral measurements.
• Use time derivatives to model spectral change.
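The framing and time-derivative steps above can be sketched as follows. The 25 ms window and 100 frames/s come from the slide; the spectral "features" here (a few magnitude-FFT bins) are only a stand-in for a real mel-cepstrum front-end:

```python
import numpy as np

def frame_signal(x, sr, win_ms=25.0, hop_ms=10.0):
    """Split a waveform into overlapping analysis frames (25 ms window, 100 frames/s)."""
    win = int(sr * win_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])

def deltas(feats):
    """First-order time derivative of a feature track (simple +/-1 frame difference)."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

# Toy example: 1 second of noise at 16 kHz -> ~100 frames of 13 static features.
sr = 16000
x = np.random.randn(sr)
frames = frame_signal(x, sr)
energy = np.log(np.sum(frames**2, axis=1, keepdims=True))          # absolute energy term
feats = np.hstack([energy, np.abs(np.fft.rfft(frames))[:, :12]])   # stand-in for 12 cepstra
full = np.hstack([feats, deltas(feats), deltas(deltas(feats))])    # 13 + 13 + 13 = 39 dims
```

The resulting 39-dimensional vector (static + delta + delta-delta) matches the dimensionality that the slide's pipeline produces.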
Acoustic Modeling
• Dynamic Programming
• Markov Models
• Parameter Estimation
• HMM Training
• Continuous Mixtures
• Decision Trees
• Limitations and Practical Issues of HMMs
Acoustic Modeling: Hidden Markov Models

• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures.
• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
• Sharing model parameters is a common strategy to reduce complexity.
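A left-to-right topology of this kind can be illustrated with a small forward-algorithm sketch. The transition and emission probabilities below are invented for illustration, and a discrete two-symbol alphabet stands in for the Gaussian mixture emission densities a real system uses:

```python
import numpy as np

# 3-state left-to-right HMM: self-loop, step, and skip transitions (rows sum to 1).
A = np.array([[0.6, 0.3, 0.1],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
# Emission probabilities over a toy discrete alphabet of 2 symbols.
B = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3]])
pi = np.array([1.0, 0.0, 0.0])  # left-to-right: must start in the first state

def forward(obs):
    """P(observation sequence | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

p = forward([0, 1, 0])  # likelihood of a short observation sequence
```

The same recursion, run over all competing word models, is what the Forward-Backward training mentioned on the next slide builds on.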
Acoustic Modeling: Parameter Estimation

• Closed-loop, data-driven modeling, supervised from a word-level transcription.
• The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
• Computationally efficient training algorithms (Forward-Backward) have been crucial.
• Batch-mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter sharing, system complexity, and the use of additional linguistic knowledge.

Training schedule:
• Initialization
• Single Gaussian estimation
• 2-way split
• Mixture distribution re-estimation
• 4-way split
• Re-estimation
• …
Language Modeling
• Formal Language Theory
• Context-Free Grammars
• N-Gram Models and Complexity
• Smoothing
Language Modeling: N-Grams

Unigrams (SWB):
• Most common: “I”, “and”, “the”, “you”, “a”
• Rank 100: “she”, “an”, “going”
• Least common: “Abraham”, “Alastair”, “Acura”

Bigrams (SWB):
• Most common: “you know”, “yeah SENT!”, “!SENT um-hum”, “I think”
• Rank 100: “do it”, “that we”, “don’t think”
• Least common: “raw fish”, “moisture content”, “Reagan Bush”

Trigrams (SWB):
• Most common: “!SENT um-hum SENT!”, “a lot of”, “I don’t know”
• Rank 100: “it was a”, “you know that”
• Least common: “you have parents”, “you seen Brooklyn”
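Tables like these come from simple frequency counts over a corpus. A minimal sketch of maximum-likelihood bigram estimation, on an invented toy corpus and without the smoothing the outline mentions:

```python
from collections import Counter

# Toy corpus standing in for Switchboard transcriptions.
corpus = "you know I think you know that I think".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # adjacent word pairs
unigrams = Counter(corpus)                   # single-word counts

def p_bigram(w2, w1):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

p = p_bigram("know", "you")
```

Unseen pairs get probability zero under this estimate, which is precisely why smoothing is needed in practice.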
LM: Integration of Natural Language

• Natural language constraints can be easily incorporated.
• Lack of punctuation and the size of the search space pose problems.
• Speech recognition typically produces a word-level, time-aligned annotation.
• Time alignments for other levels of information are also available.
Search Algorithms and Data Structures
• Basic Search Algorithms
• Time-Synchronous Search
• Stack Decoding
• Lexical Trees
• Efficient Trees
Dynamic Programming-Based Search

• Search is time-synchronous and left-to-right.
• Arbitrary amounts of silence must be permitted between words.
• Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
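Time-synchronous, left-to-right search over a single model is essentially the Viterbi algorithm. A minimal sketch over a toy left-to-right HMM, with all probabilities invented for illustration:

```python
import numpy as np

# Toy 3-state left-to-right model (same shape as a phoneme HMM).
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7],
              [0.6, 0.4]])
pi = np.array([1.0, 0.0, 0.0])

def viterbi(obs):
    """Best state path, computed time-synchronously, left to right."""
    delta = np.log(pi + 1e-300) + np.log(B[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(A + 1e-300)  # all predecessor extensions
        back.append(scores.argmax(axis=0))            # best predecessor per state
        delta = scores.max(axis=0) + np.log(B[:, o])
    # Trace back the best path from the best final state.
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

path = viterbi([0, 1, 1, 0])
```

A full decoder runs this recursion over networks of word models in parallel, with the language model scoring the word transitions; that combination is what makes the search space large.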
Recognition Architectures

Input Speech → Acoustic Front-end → Search → Recognized Utterance
(Acoustic Models P(A|W) and the Language Model P(W) feed the search.)

Acoustic Front-end:
• The signal is converted to a sequence of feature vectors based on spectral and temporal measurements.

Acoustic Models P(A|W):
• Acoustic models represent sub-word units, such as phonemes, as finite-state machines in which:
  • states model spectral structure, and
  • transitions model temporal structure.

Search:
• Search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence.
• The language model P(W) predicts the next set of words and controls which models are hypothesized.
Speech Recognition
Goal: Automatically extract the string of words spoken from the speech signal.

Speech Signal → Speech Recognition → Words (“How are you?”)

How is SPEECH produced?