T-61.184 Informaatiotekniikan erikoiskurssi IV
HMMs and Speech Recognition
Jaakko Peltonen, October 31, 2001
based on chapter 7 of D. Jurafsky, J. Martin: Speech and Language Processing
1 Contents
• speech recognition architecture
• HMM, Viterbi, A*
• speech acoustics & features
• computing acoustic probabilities
• speech synthesis
2 Speech Recognition Architecture
• Application: LVCSR (Large Vocabulary Continuous Speech Recognition)
• Large vocabulary: dictionary size 5,000–60,000 words
• Continuous speech (words not separated)
• Speaker-independent

Pipeline (figure): Speech Waveform → Feature Extraction (Signal Processing) → Spectral Feature Vectors → Phone Likelihood Estimation (Gaussians or Neural Networks) → Phone Likelihoods P(o|q) → Decoding (Viterbi or Stack Decoder) → Words. Knowledge sources: Neural Net, N-gram Grammar, HMM Lexicon.
3 Noisy Channel Model revisited
• acoustic input considered a noisy version of a source sentence
• decoding: find the sentence that most probably generated the input
• problems:
  - what metric for selecting the best match?
  - what efficient algorithm for finding the best match?
4 Bayes revisited
• acoustic input: symbol sequence
• sentence: string of words
• best match metric: probability
• Bayes’ rule:

    Ŵ = argmax_{W ∈ L} P(W|O)

  where O = o_1, o_2, o_3, ..., o_t is the acoustic observation sequence and W = w_1, w_2, w_3, ..., w_n a word string, so

    Ŵ = argmax_{W ∈ L} P(O|W) P(W) / P(O) = argmax_{W ∈ L} P(O|W) P(W)

• P(O|W): observation likelihood → acoustic model
• P(W): prior probability → language model
5 Hidden Markov Models (HMMs)
• previously, Markov chains used to model pronunciation
• forward algorithm → phone sequence likelihood
• real input is not symbolic: spectral features
• input symbols do not correspond to machine states
• HMM definition:
• state set Q
• observation symbols O ≠ Q
• transition probabilities A
• observation likelihoods B, not limited to 1 and 0
• start and end state(s)
6 HMMs, continued
Figure: word model for "need" — states start0, n1, iy2, d3, end4; transition probabilities a01, a12, a23, a34, self-loops a11, a22, a33, and skip transition a24; observation sequence o1 ... o6, scored by observation likelihoods such as b1(o1) ... b1(o6).
7 The Viterbi Algorithm
• word boundaries unknown → segmentation problem
• [ay d ih s hh er d s ah m th ih ng ax b aw ...] → "I just heard something about ..."
    viterbi[t, j] = max_{q_1,...,q_{t-1}} P(q_1 ... q_{t-1}, q_t = q_j, o_1 o_2 ... o_t)

    viterbi[t, j] = max_i viterbi[t-1, i] a_ij b_j(o_t)
• assumption: dynamic programming invariant
• If the ultimate best path for o includes state qi, it includes the best path up to and including qi.
• does not work for all grammars
8 Viterbi, continued
function VITERBI(observations of len T, state-graph) returns best-path
  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s′ from s specified by state-graph
        new-score ← viterbi[s,t] * a[s,s′] * b_s′(o_t)
        if ((viterbi[s′,t+1] = 0) || (new-score > viterbi[s′,t+1])) then
          viterbi[s′,t+1] ← new-score
          back-pointer[s′,t+1] ← s
  Backtrace from highest probability state in the final column of viterbi[] and return path.
• single automaton: combine single-word networks, add word transition probabilities = bigram probabilities
• states correspond to subphones & context, e.g. b(ax,aw)left, b(ax,aw)middle, b(ax,aw)right
• beam search
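The recurrence and pseudocode above can be sketched in Python. This is a minimal dense-matrix version, not the slide's exact +2-padded matrix layout; the 2-state HMM used in any example is a generic toy, not the word model from slide 6.

```python
import numpy as np

def viterbi(A, B, pi):
    """Dense sketch of the Viterbi recurrence
    v[t, j] = max_i v[t-1, i] * a_ij * b_j(o_t).
    A:  (N, N) transition probabilities a_ij
    B:  (T, N) observation likelihoods b_j(o_t) per frame
    pi: (N,)   initial state probabilities
    Returns (best state path, its probability)."""
    T, N = B.shape
    v = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    v[0] = pi * B[0]
    for t in range(1, T):
        # scores[i, j]: probability of reaching state j at t via state i
        scores = v[t - 1][:, None] * A * B[t][None, :]
        back[t] = scores.argmax(axis=0)   # best predecessor of each state
        v[t] = scores.max(axis=0)
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):         # backtrace via the pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(v[-1].max())
```

Real decoders additionally work in log space to avoid underflow and prune hypotheses with beam search, as the slide notes.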
9 Other Decoders
• Viterbi has problems:
• computes most probable state sequence, not word sequence
• Cannot be used with all language models (only bigrams)
• Solution 1: multiple-pass decoding
• N-best-Viterbi: return N best sentences, sort with more complex model
• word lattice: return directed word graph + word observation likelihoods → refine with more complex model
10 A* Decoder
• Viterbi uses an approximation of the forward algorithm: max instead of sum
• A* uses the complete forward algorithm → correct observation likelihoods, can use any language model
• ’Best-first’ search of word sequence tree:
• priority queue of scored paths to extend
• Algorithm:
  1. select highest-priority path (pop queue)
  2. create possible extensions (if none, stop)
  3. calculate scores for extended paths (from forward algorithm and language model)
  4. add scored paths to queue
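The loop above is best-first search over partial word strings; a minimal sketch with Python's heapq follows. The scoring and extension functions are illustrative stand-ins for the forward-algorithm and language-model scores, and "no extensions" stands in for real end-of-utterance detection.

```python
import heapq

def stack_decode(score, extensions, start=()):
    """Best-first (A*-style) search over word sequences.
    score(path): combined score of a partial path, higher is better
                 (in a real decoder: forward probability * LM * h estimate);
    extensions(path): words that may extend path (empty => path complete)."""
    queue = [(-score(start), start)]        # heapq is a min-heap, so negate
    while queue:
        neg, path = heapq.heappop(queue)    # 1. pop highest-priority path
        exts = extensions(path)
        if not exts:                        # 2. no extensions: stop
            return path
        for w in exts:                      # 3./4. score extensions, push
            new = path + (w,)
            heapq.heappush(queue, (-score(new), new))
    return None
```

With the word scores from the tree on the next slide, the search pops (none), then Alice (40), then if (30), and terminates at "if music" (32).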
11 A* Decoder, continued
Figure: best-first search over the word-sequence tree, each partial path carrying a score — root (none) 1; first words: if 30, Alice 40, Every 25, In 4; extensions of "Alice": was 29, wants 24, walls 2; extensions of "if": music 32, muscle 31, messy 25. Scores combine forward probabilities, e.g. p(acoustic|if) and p(acoustic|music), with language-model probabilities, e.g. p(if|START) and p(music|if).
12 A* Decoder, continued
• score of word string w is not p(y|w) p(w) (y is the acoustic string)
• reason: a path prefix would always have a higher score than its extensions
• score: A* evaluation function

    f*(p) = g(p) + h*(p)

• g(p): score from start to current string end, g(p) = P(A|W) P(W)
• h*(p): estimated score of the best extension to the utterance end
13 Acoustic Processing of Speech
• wave characteristics: frequency → pitch, amplitude → loudness
• visible information: vowel/consonant, voicing, length, fricatives, stop closure
• spectral features: Fourier spectrum / LPC spectrum - peaks characteristic of different sounds → formants
• spectrogram: changes over time
• digitization: sampling, quantization
• processing → cepstral features / PLP features
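The digitization → Fourier spectrum → spectrogram chain can be sketched with numpy. The frame and hop durations below are typical choices, not values from the slide.

```python
import numpy as np

def spectrogram(signal, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Short-time Fourier analysis: cut the sampled signal into
    overlapping windowed frames and take the log-magnitude spectrum
    of each; stacking the spectra over time gives the spectrogram."""
    frame = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop = int(sample_rate * hop_ms / 1000)       # samples between frames
    window = np.hamming(frame)                   # taper to reduce leakage
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame + 1, hop)]
    return np.array([20 * np.log10(np.abs(np.fft.rfft(f)) + 1e-10)
                     for f in frames])           # shape: (time, frequency)
```

Peaks along the frequency axis of each row are the spectral peaks from which formants are read off.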
14 Computing Acoustic Probabilities
• simple way: vector quantization (cluster feature vectors & count cluster occurrences)
• continuous approach: calculate probability density function (pdf) over observations
• Gaussian pdf: trained with forward-backward algorithm
• Gaussian mixtures, parameter tying
• Multi-layer perceptron (MLP) pdf: trained with error back-propagation
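A minimal sketch of the continuous approach: a diagonal-covariance Gaussian pdf over feature vectors, and a Gaussian mixture on top of it. In a real recognizer the means, variances, and mixture weights would come from forward-backward training; any values used here are illustrative.

```python
import numpy as np

def log_gaussian_lik(o, mean, var):
    """log b_j(o) for a diagonal-covariance Gaussian observation pdf."""
    o, mean, var = map(np.asarray, (o, mean, var))
    return float(-0.5 * np.sum(np.log(2 * np.pi * var)
                               + (o - mean) ** 2 / var))

def log_mixture_lik(o, weights, means, variances):
    """Gaussian mixture: log sum_k w_k N(o; mu_k, var_k),
    computed stably via the log-sum-exp trick."""
    logs = np.array([np.log(w) + log_gaussian_lik(o, m, v)
                     for w, m, v in zip(weights, means, variances)])
    m = logs.max()
    return float(m + np.log(np.exp(logs - m).sum()))
```

Working in log probabilities here matches the decoders above, which multiply many per-frame likelihoods and would otherwise underflow.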
15 Training A Speech Recognizer
• evaluation metric: word error rate
  1. Compute minimum edit distance between hypothesized and correct string
  2. Compute

     Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)

• e.g. correct: "I went to a party", hypothesis: "Eye went two a bar tea" → 3 substitutions, 1 insertion → word error rate 80%
• State of the art: word error rate 20% on natural-speech tasks
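The two steps above as code: a standard dynamic-programming word-level minimum edit distance, then the error-rate formula.

```python
def word_error_rate(correct, hypothesis):
    """100 * (insertions + substitutions + deletions) / words in correct
    transcript, via word-level Levenshtein distance."""
    ref, hyp = correct.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("I went to a party", "Eye went two a bar tea"))  # 80.0
```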
16 Embedded Training
• models to be trained:
  - language model: p(w_i | w_{i-1} w_{i-2})
  - observation likelihoods: b_j(o_t)
  - transition probabilities: a_ij
  - pronunciation lexicon: HMM state graph
• training data:
  - corpus of speech wavefiles + word transcriptions
  - large text corpus for language model training
  - smaller corpus of phonetically labeled speech
• N-gram language model: trained as in Chapter 6
• HMM lexicon structure: built by hand - PRONLEX, CMUdict "off-the-shelf" pronunciation dictionaries
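N-gram training as in Chapter 6 amounts to counting and dividing. A minimal sketch for bigrams rather than the slide's trigram, without the smoothing a real model needs; the tiny corpus below is illustrative.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram model: p(w | prev) = C(prev, w) / C(prev),
    with <s> and </s> marking sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(words[:-1])                 # contexts
        bigrams.update(zip(words[:-1], words[1:]))  # adjacent pairs
    return lambda w, prev: bigrams[(prev, w)] / unigrams[prev]

p = train_bigram(["if music be the food of love", "if only"])
print(p("music", "if"))  # 0.5: "if" occurs twice, once followed by "music"
```

An unsmoothed model assigns probability zero to unseen bigrams, which is why Chapter 6 smoothing is applied in practice.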
17 Embedded Training, continued
• HMM parameters:
  - initial estimate: equal transition probabilities, observation probabilities bootstrapped (labeled speech → label for each frame → initial Gaussian means / variances)
  - MLP systems: forced Viterbi alignment - features & correct words given → best states → labels for each input → retrain MLP
  - Gaussian systems: forward-backward algorithm - compute forward & backward probabilities, re-estimate a and b; correct words known → prune model
18 Speech Synthesis
• text-to-speech (TTS) system: output is a phone sequence with durations and an F0 pitch contour
• waveform concatenation: based on a recorded speech database, segmented into short units
• simplest: 1 unit / phone, join units & smooth edges
• triphone models: too many combinations → diphones used
• diphones start/end midway through a phone for stability
• does not model pitch & duration changes (prosody)
19 Speech Synthesis, continued
• use signal processing to change prosody
• LPC model separates pitch from spectral envelope:
  - to modify pitch: generate pulses at the desired pitch, re-excite the LPC coefficients → modified wave
  - to modify duration: contract/expand coefficient frames
• TD-PSOLA: frames centered around pitchmarks
  - to change pitch: move pitchmarks closer together / further apart
  - to change duration: duplicate / leave out frames
  - recombine: overlap and add frames
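A simplified sketch of the TD-PSOLA duration operation: windowed frames centred on pitchmarks are duplicated or dropped, then overlap-added at the original spacing so the pitch is preserved. It assumes evenly spaced pitchmarks given by a known period; a real system locates pitchmarks from the waveform.

```python
import numpy as np

def change_duration(signal, period, factor):
    """TD-PSOLA-style duration change, simplified: extract Hann-windowed
    frames two periods wide centred on evenly spaced pitchmarks,
    duplicate (factor > 1) or drop (factor < 1) frames, then
    overlap-add at the original pitchmark spacing."""
    marks = np.arange(period, len(signal) - period, period)
    frames = [signal[m - period:m + period] * np.hanning(2 * period)
              for m in marks]
    n_out = int(round(len(frames) * factor))
    out = np.zeros(period * n_out + 2 * period)
    for k in range(n_out):
        src = min(int(k / factor), len(frames) - 1)  # nearest source frame
        pos = k * period                             # keep original spacing
        out[pos:pos + 2 * period] += frames[src]     # overlap and add
    return out
```

Because frames are both extracted and re-placed at multiples of the pitch period, a periodic signal stays phase-aligned across the joins.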
20 Speech Synthesis, continued
• problems with speech synthesis:
  - 1 example/diphone is insufficient
  - signal processing distortion
  - subtle effects not modeled
• unit selection: collect several examples/unit with different pitch/duration/linguistic situation
• selection method: F0 contour with 3 values/phone, large unit corpus
  1. find candidates (closest phone, duration & F0), rank them by target cost (closeness)
  2. measure join quality of neighbouring candidates, rank joins by concatenation cost
  - pick best unit set → more natural speech
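Unit selection can be sketched as dynamic programming over the candidate lattice, minimizing total target cost plus concatenation cost; it is structurally the same search as Viterbi decoding. Both cost functions below are illustrative stand-ins for the closeness and join-quality measures on the slide.

```python
def select_units(candidates, target_cost, concat_cost):
    """Pick one unit per position minimising
    sum_t target_cost(t, unit_t) + sum_t concat_cost(unit_{t-1}, unit_t),
    by dynamic programming over the candidate lattice.
    candidates: list (per position) of lists of candidate units.
    Returns (best unit sequence, its total cost)."""
    best = {u: target_cost(0, u) for u in candidates[0]}
    back = [{}]
    for t in range(1, len(candidates)):
        new, bp = {}, {}
        for u in candidates[t]:
            prev, cost = min(((p, c + concat_cost(p, u) + target_cost(t, u))
                              for p, c in best.items()),
                             key=lambda x: x[1])
            new[u], bp[u] = cost, prev
        best, back = new, back + [bp]
    u = min(best, key=best.get)      # cheapest final unit
    total = best[u]
    path = [u]
    for bp in reversed(back[1:]):    # backtrace
        u = bp[u]
        path.append(u)
    return path[::-1], total
```

A cheap join can outweigh a worse target match, which is exactly the trade-off the target/concatenation cost split expresses.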
21 Human Speech Recognition
• PLP analysis inspired by the human auditory system
• lexical access has common properties:
  - frequency
  - parallelism
  - neighborhood effects
  - cue-based processing (phoneme restoration): formant structure, timing, voicing, lexical cues, word association, repetition priming
• differences:
  - time-course: human processing is on-line
  - other cues: prosody
22 Exercises
1. Hand-simulate the Viterbi algorithm: use the automaton in Figure 7.8, on input [aa n n ax n iy d]. What is the most probable string of words?
2. Suggest two h*(p) functions for use in A* decoding. What criteria should the function satisfy for the search to work (i.e. to return the best path)?