T-61.184 Informaatiotekniikan erikoiskurssi IV
HMMs and Speech Recognition
Jaakko Peltonen, October 31, 2001
based on chapter 7 of D. Jurafsky, J. Martin: Speech and Language Processing
1 Contents
• speech recognition architecture
• HMM, Viterbi, A*
• speech acoustics & features
• computing acoustic probabilities
• speech synthesis
2 Speech Recognition Architecture
• Application: LVCSR (Large Vocabulary Continuous Speech Recognition)
• Large vocabulary: dictionary size 5,000–60,000 words
• Continuous speech (words not separated)
• Speaker-independent

Pipeline (figure): Speech Waveform → Feature Extraction (Signal Processing) → Spectral Feature Vectors → Phone Likelihood Estimation (Gaussians or Neural Networks) → Phone Likelihoods P(o|q) → Decoding (Viterbi or Stack Decoder) → Words. Knowledge sources: Neural Net, N-gram Grammar, HMM Lexicon.
3 Noisy Channel Model revisited
• acoustic input considered a noisy version of a source sentence
• decoding: find the sentence that most probably generated the input
• problems:
  - what metric for selecting the best match?
  - what efficient algorithm for finding the best match?
4 Bayes revisited
• acoustic input: symbol sequence
• sentence: string of words
• best match metric: probability
• Bayes’ rule:

    Ŵ = argmax_{W ∈ L} P(W|O)

  where O = o_1, o_2, o_3, ..., o_t is the acoustic observation sequence and W = w_1, w_2, w_3, ..., w_n a word string, so

    Ŵ = argmax_{W ∈ L} P(O|W) P(W) / P(O) = argmax_{W ∈ L} P(O|W) P(W)

• P(O|W): observation likelihood → acoustic model
• P(W): prior probability → language model
5 Hidden Markov Models (HMMs)
• previously, Markov chains used to model pronunciation
• forward algorithm → phone sequence likelihood
• real input is not symbolic: spectral features
• input symbols do not correspond to machine states
• HMM definition:
• state set Q
• observation symbols O ≠ Q
• transition probabilities A
• observation likelihoods B, not limited to 1 and 0
• start and end state(s)
6 HMMs, continued
Figure: word model for "need" — states start0, n1, iy2, d3, end4; transition probabilities a01, a12, a23, a34, self-loops a11, a22, a33, and skip transition a24; observation sequence o1 ... o6, scored by observation likelihoods such as b1(o1) ... b1(o6).
7 The Viterbi Algorithm
• word boundaries unknown → segmentation problem
• [ay d ih s hh er d s ah m th ih ng ax b aw ...] → "I just heard something about ..."
    viterbi[t, j] = max_{q_1,...,q_{t-1}} P(q_1 ... q_{t-1}, q_t = q_j, o_1 o_2 ... o_t)

    viterbi[t, j] = max_i viterbi[t-1, i] a_ij b_j(o_t)
• assumption: dynamic programming invariant
• If the ultimate best path for o includes state qi, it includes the best path up to and including qi.
• does not work for all grammars
8 Viterbi, continued
function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s′ from s specified by state-graph
        new-score ← viterbi[s,t] * a[s,s′] * b_s′(o_t)
        if ((viterbi[s′,t+1] = 0) || (new-score > viterbi[s′,t+1])) then
          viterbi[s′,t+1] ← new-score
          back-pointer[s′,t+1] ← s
  Backtrace from highest probability state in the final column of viterbi[] and return path.
• single automaton: combine single-word networks, add word transition probabilities = bigram probabilities
• states correspond to subphones & context, e.g. b(ax,aw)left, b(ax,aw)middle, b(ax,aw)right
• beam search
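The recurrence and pseudocode above can be sketched in Python. This is a minimal dense-matrix version, not the slide's exact +2-padded matrix layout; the 2-state HMM used in any example is a generic toy, not the word model from slide 6.

```python
import numpy as np

def viterbi(A, B, pi):
    """Dense sketch of the Viterbi recurrence
    v[t, j] = max_i v[t-1, i] * a_ij * b_j(o_t).
    A:  (N, N) transition probabilities a_ij
    B:  (T, N) observation likelihoods b_j(o_t) per frame
    pi: (N,)   initial state probabilities
    Returns (best state path, its probability)."""
    T, N = B.shape
    v = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    v[0] = pi * B[0]
    for t in range(1, T):
        # scores[i, j]: probability of reaching state j at t via state i
        scores = v[t - 1][:, None] * A * B[t][None, :]
        back[t] = scores.argmax(axis=0)   # best predecessor of each state
        v[t] = scores.max(axis=0)
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):         # backtrace via the pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(v[-1].max())
```

Real decoders additionally work in log space to avoid underflow and prune hypotheses with beam search, as the slide notes.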
9 Other Decoders
• Viterbi has problems:
• computes most probable state sequence, not word sequence
• Cannot be used with all language models (only bigrams)
• Solution 1: multiple-pass decoding
• N-best-Viterbi: return N best sentences, sort with more complex model
• word lattice: return directed word graph + word observation likelihoods → refine with more complex model
10 A* Decoder
• Viterbi uses an approximation of the forward algorithm: max instead of sum
• A* uses the complete forward algorithm → correct observation likelihoods, can use any language model
• ’Best-first’ search of word sequence tree:
• priority queue of scored paths to extend
• Algorithm:
  1. select highest-priority path (pop queue)
  2. create possible extensions (if none, stop)
  3. calculate scores for extended paths (from forward algorithm and language model)
  4. add scored paths to queue
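The loop above is best-first search over partial word strings; a minimal sketch with Python's heapq follows. The scoring and extension functions are illustrative stand-ins for the forward-algorithm and language-model scores, and "no extensions" stands in for real end-of-utterance detection.

```python
import heapq

def stack_decode(score, extensions, start=()):
    """Best-first (A*-style) search over word sequences.
    score(path): combined score of a partial path, higher is better
                 (in a real decoder: forward probability * LM * h estimate);
    extensions(path): words that may extend path (empty => path complete)."""
    queue = [(-score(start), start)]        # heapq is a min-heap, so negate
    while queue:
        neg, path = heapq.heappop(queue)    # 1. pop highest-priority path
        exts = extensions(path)
        if not exts:                        # 2. no extensions: stop
            return path
        for w in exts:                      # 3./4. score extensions, push
            new = path + (w,)
            heapq.heappush(queue, (-score(new), new))
    return None
```

With the word scores from the tree on the next slide, the search pops (none), then Alice (40), then if (30), and terminates at "if music" (32).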
11 A* Decoder, continued
Figure: best-first search over the word-sequence tree, each partial path carrying a score — root (none) 1; first words: if 30, Alice 40, Every 25, In 4; extensions of "Alice": was 29, wants 24, walls 2; extensions of "if": music 32, muscle 31, messy 25. Scores combine forward probabilities, e.g. p(acoustic|if) and p(acoustic|music), with language-model probabilities, e.g. p(if|START) and p(music|if).
12 A* Decoder, continued
• score of word string w is not p(y|w) p(w) (y is the acoustic string)
• reason: a path prefix would always have a higher score than its extensions
• score: A* evaluation function

    f*(p) = g(p) + h*(p)

• g(p): score from start to current string end, g(p) = P(A|W) P(W)
• h*(p): estimated score of the best extension to the utterance end
13 Acoustic Processing of Speech
• wave characteristics: frequency → pitch, amplitude → loudness
• visible information: vowel/consonant, voicing, length, fricatives, stop closure
• spectral features: Fourier spectrum / LPC spectrum - peaks characteristic of different sounds → formants
• spectrogram: changes over time
• digitization: sampling, quantization
• processing → cepstral features / PLP features
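The digitization → Fourier spectrum → spectrogram chain can be sketched with numpy. The frame and hop durations below are typical choices, not values from the slide.

```python
import numpy as np

def spectrogram(signal, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Short-time Fourier analysis: cut the sampled signal into
    overlapping windowed frames and take the log-magnitude spectrum
    of each; stacking the spectra over time gives the spectrogram."""
    frame = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop = int(sample_rate * hop_ms / 1000)       # samples between frames
    window = np.hamming(frame)                   # taper to reduce leakage
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame + 1, hop)]
    return np.array([20 * np.log10(np.abs(np.fft.rfft(f)) + 1e-10)
                     for f in frames])           # shape: (time, frequency)
```

Peaks along the frequency axis of each row are the spectral peaks from which formants are read off.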
14 Computing Acoustic Probabilities
• simple way: vector quantization (cluster feature vectors & count cluster occurrences)
• continuous approach: calculate probability density function (pdf) over observations
• Gaussian pdf: trained with forward-backward algorithm
• Gaussian mixtures, parameter tying
• Multi-layer perceptron (MLP) pdf: trained with error back-propagation
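A minimal sketch of the continuous approach: a diagonal-covariance Gaussian pdf over feature vectors, and a Gaussian mixture on top of it. In a real recognizer the means, variances, and mixture weights would come from forward-backward training; any values used here are illustrative.

```python
import numpy as np

def log_gaussian_lik(o, mean, var):
    """log b_j(o) for a diagonal-covariance Gaussian observation pdf."""
    o, mean, var = map(np.asarray, (o, mean, var))
    return float(-0.5 * np.sum(np.log(2 * np.pi * var)
                               + (o - mean) ** 2 / var))

def log_mixture_lik(o, weights, means, variances):
    """Gaussian mixture: log sum_k w_k N(o; mu_k, var_k),
    computed stably via the log-sum-exp trick."""
    logs = np.array([np.log(w) + log_gaussian_lik(o, m, v)
                     for w, m, v in zip(weights, means, variances)])
    m = logs.max()
    return float(m + np.log(np.exp(logs - m).sum()))
```

Working in log probabilities here matches the decoders above, which multiply many per-frame likelihoods and would otherwise underflow.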
15 Training A Speech Recognizer
• evaluation metric: word error rate
  1. Compute minimum edit distance between hypothesized and correct string
  2. Compute

     Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)

• e.g. correct: "I went to a party", hypothesis: "Eye went two a bar tea" → 3 substitutions, 1 insertion → word error rate 80%
• State of the art: word error rate 20% on natural-speech tasks
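The two steps above as code: a standard dynamic-programming word-level minimum edit distance, then the error-rate formula.

```python
def word_error_rate(correct, hypothesis):
    """100 * (insertions + substitutions + deletions) / words in correct
    transcript, via word-level Levenshtein distance."""
    ref, hyp = correct.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("I went to a party", "Eye went two a bar tea"))  # 80.0
```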
16 Embedded Training
• models to be trained:
  - language model: p(w_i | w_{i-1} w_{i-2})
  - observation likelihoods: b_j(o_t)
  - transition probabilities: a_ij
  - pronunciation lexicon: HMM state graph
• training data:
  - corpus of speech wavefiles + word transcriptions
  - large text corpus for language model training
  - smaller corpus of phonetically labeled speech
• N-gram language model: trained as in Chapter 6
• HMM lexicon structure: built by hand - PRONLEX, CMUdict "off-the-shelf" pronunciation dictionaries
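N-gram training as in Chapter 6 amounts to counting and dividing. A minimal sketch for bigrams rather than the slide's trigram, without the smoothing a real model needs; the tiny corpus below is illustrative.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram model: p(w | prev) = C(prev, w) / C(prev),
    with <s> and </s> marking sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(words[:-1])                 # contexts
        bigrams.update(zip(words[:-1], words[1:]))  # adjacent pairs
    return lambda w, prev: bigrams[(prev, w)] / unigrams[prev]

p = train_bigram(["if music be the food of love", "if only"])
print(p("music", "if"))  # 0.5: "if" occurs twice, once followed by "music"
```

An unsmoothed model assigns probability zero to unseen bigrams, which is why Chapter 6 smoothing is applied in practice.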
17 Embedded Training, continued
• HMM parameters:
  - initial estimate: equal transition probabilities, observation probabilities bootstrapped (labeled speech → label for each frame → initial Gaussian means / variances)
  - MLP systems: forced Viterbi alignment - features & correct words given → best states → labels for each input → retrain MLP
  - Gaussian systems: forward-backward algorithm - compute forward & backward probabilities, re-estimate a and b; correct words known → prune model
18 Speech Synthesis
• text-to-speech (TTS) system: output is a phone sequence with durations and an F0 pitch contour
• waveform concatenation: based on a recorded speech database, segmented into short units
• simplest: 1 unit / phone, join units & smooth edges
• triphone models: too many combinations → diphones used
• diphones start/end midway through a phone for stability
• does not model pitch & duration changes (prosody)
19 Speech Synthesis, continued
• use signal processing to change prosody
• LPC model separates pitch from spectral envelope:
  - to modify pitch: generate pulses at the desired pitch, re-excite the LPC coefficients → modified wave
  - to modify duration: contract/expand coefficient frames
• TD-PSOLA: frames centered around pitchmarks
  - to change pitch: move pitchmarks closer together / further apart
  - to change duration: duplicate / leave out frames
  - recombine: overlap and add frames
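A simplified sketch of the TD-PSOLA duration operation: windowed frames centred on pitchmarks are duplicated or dropped, then overlap-added at the original spacing so the pitch is preserved. It assumes evenly spaced pitchmarks given by a known period; a real system locates pitchmarks from the waveform.

```python
import numpy as np

def change_duration(signal, period, factor):
    """TD-PSOLA-style duration change, simplified: extract Hann-windowed
    frames two periods wide centred on evenly spaced pitchmarks,
    duplicate (factor > 1) or drop (factor < 1) frames, then
    overlap-add at the original pitchmark spacing."""
    marks = np.arange(period, len(signal) - period, period)
    frames = [signal[m - period:m + period] * np.hanning(2 * period)
              for m in marks]
    n_out = int(round(len(frames) * factor))
    out = np.zeros(period * n_out + 2 * period)
    for k in range(n_out):
        src = min(int(k / factor), len(frames) - 1)  # nearest source frame
        pos = k * period                             # keep original spacing
        out[pos:pos + 2 * period] += frames[src]     # overlap and add
    return out
```

Because frames are both extracted and re-placed at multiples of the pitch period, a periodic signal stays phase-aligned across the joins.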
20 Speech Synthesis, continued
• problems with speech synthesis:
  - 1 example/diphone is insufficient
  - signal processing distortion
  - subtle effects not modeled
• unit selection: collect several examples/unit with different pitch/duration/linguistic situation
• selection method: F0 contour with 3 values/phone, large unit corpus
  1. find candidates (closest phone, duration & F0), rank them by target cost (closeness)
  2. measure join quality of neighbouring candidates, rank joins by concatenation cost
  - pick best unit set → more natural speech
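Unit selection can be sketched as dynamic programming over the candidate lattice, minimizing total target cost plus concatenation cost; it is structurally the same search as Viterbi decoding. Both cost functions below are illustrative stand-ins for the closeness and join-quality measures on the slide.

```python
def select_units(candidates, target_cost, concat_cost):
    """Pick one unit per position minimising
    sum_t target_cost(t, unit_t) + sum_t concat_cost(unit_{t-1}, unit_t),
    by dynamic programming over the candidate lattice.
    candidates: list (per position) of lists of candidate units.
    Returns (best unit sequence, its total cost)."""
    best = {u: target_cost(0, u) for u in candidates[0]}
    back = [{}]
    for t in range(1, len(candidates)):
        new, bp = {}, {}
        for u in candidates[t]:
            prev, cost = min(((p, c + concat_cost(p, u) + target_cost(t, u))
                              for p, c in best.items()),
                             key=lambda x: x[1])
            new[u], bp[u] = cost, prev
        best, back = new, back + [bp]
    u = min(best, key=best.get)      # cheapest final unit
    total = best[u]
    path = [u]
    for bp in reversed(back[1:]):    # backtrace
        u = bp[u]
        path.append(u)
    return path[::-1], total
```

A cheap join can outweigh a worse target match, which is exactly the trade-off the target/concatenation cost split expresses.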
21 Human Speech Recognition
• PLP analysis inspired by the human auditory system
• lexical access has common properties:
  - frequency
  - parallelism
  - neighborhood effects
  - cue-based processing (phoneme restoration): formant structure, timing, voicing, lexical cues, word association, repetition priming
• differences:
  - time-course: human processing is on-line
  - other cues: prosody
22 Exercises
1. Hand-simulate the Viterbi algorithm: use the automaton in Figure 7.8, on input [aa n n ax n iy d]. What is the most probable string of words?
2. Suggest two h*(p) functions for use in A* decoding. What criteria should the function satisfy for the search to work (i.e. to return the best path)?