E6820 SAPR - Dan Ellis L10 - Sequence recognition 2003-04-21 - 1
EE E6820: Speech & Audio Processing & Recognition
Lecture 10: ASR: Sequence Recognition
1. Signal template matching
2. Statistical sequence recognition
3. Acoustic modeling
4. The Hidden Markov Model (HMM)
Dan Ellis http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering, Spring 2003
Signal template matching
• Framewise comparison of unknown word and stored templates:
- distance metric?
- comparison across templates?
- constraints?
[Figure: frame-by-frame distance matrix between a test input (time in frames) and concatenated reference templates ONE, TWO, THREE, FOUR, FIVE]
Dynamic Time Warp (DTW)
• Find lowest-cost constrained path:
- matrix d(i,j) of distances between input frame f_i and reference frame r_j
- allowable predecessors & transition costs T_xy

• Best path via traceback from final state
- store predecessors for each (i,j)

Lowest cost to (i,j) = local match cost plus best predecessor (including transition cost):

D(i,j) = d(i,j) + min{ D(i-1,j) + T_10,  D(i,j-1) + T_01,  D(i-1,j-1) + T_11 }

[Figure: DTW grid of input frames f_i against reference frames r_j, showing the three allowed predecessors of each cell]
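The D(i,j) recursion above is straightforward to implement. Below is a minimal Python sketch (not from the original slides): it assumes a precomputed distance matrix d(i,j) and scalar transition costs T10, T01, T11, and recovers the best path by traceback of stored predecessors.

```python
import numpy as np

def dtw(dist, T10=1.0, T01=1.0, T11=0.0):
    """Dynamic time warp over a precomputed distance matrix.

    dist[i, j] = d(i, j), the local match cost between input frame f_i
    and reference frame r_j.  T10/T01/T11 are the transition costs for
    the horizontal, vertical, and diagonal predecessors.
    Returns total path cost and the best path found by traceback.
    """
    I, J = dist.shape
    D = np.full((I, J), np.inf)        # lowest cost to reach (i, j)
    pred = np.zeros((I, J, 2), int)    # best predecessor of each cell
    D[0, 0] = dist[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            # candidate predecessors with their transition costs
            cands = []
            if i > 0:
                cands.append((D[i - 1, j] + T10, (i - 1, j)))
            if j > 0:
                cands.append((D[i, j - 1] + T01, (i, j - 1)))
            if i > 0 and j > 0:
                cands.append((D[i - 1, j - 1] + T11, (i - 1, j - 1)))
            best, pred[i, j] = min(cands, key=lambda c: c[0])
            D[i, j] = dist[i, j] + best
    # traceback from the final cell
    path, ij = [], (I - 1, J - 1)
    while ij != (0, 0):
        path.append(ij)
        ij = tuple(pred[ij])
    path.append((0, 0))
    return D[I - 1, J - 1], path[::-1]

# toy 3x3 example: matching along the diagonal costs nothing
cost, path = dtw(np.array([[0.0, 1, 1], [1, 0, 1], [1, 1, 0]]))
```

With a cheap diagonal (T11 = 0) the best path hugs the diagonal, as expected.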
DTW-based recognition
• Reference templates for each possible word
• For isolated words:
- mark endpoints of input word
- calculate scores through each template (+prune)
- choose best
• For continuous speech
- one matrix of template slices; special-case constraints at word ends
[Figure: continuous-speech DTW grid: stacked reference templates ONE, TWO, THREE, FOUR against input frames]
DTW-based recognition (2)
+ Successfully handles timing variation
+ Able to recognize speech at reasonable cost
- Distance metric?
- pseudo-Euclidean space?
- Warp penalties?
- How to choose templates?
- several templates per word?
- choose ‘most representative’?
- align and average?
→ need a rigorous foundation...
Outline
1. Signal template matching
2. Statistical sequence recognition
   - state-based modeling
3. Acoustic modeling
4. The Hidden Markov Model (HMM)
Statistical sequence recognition
• DTW limited because it’s hard to optimize
- interpretation of distance, transition costs?
• Need a theoretical foundation: Probability
• Formulate recognition as MAP choice among models:
- X = observed features
- M_j = word-sequence models
- Θ = all current parameters

M* = argmax_{M_j} p(M_j | X, Θ)
Statistical formulation (2)
• Can rearrange via Bayes’ rule (& drop p(X)):

M* = argmax_{M_j} p(M_j | X, Θ)
   = argmax_{M_j} p(X | M_j, Θ_A) · p(M_j | Θ_L)

- p(X | M_j) = likelihood of observations under model
- p(M_j) = prior probability of model
- Θ_A = acoustics-related model parameters
- Θ_L = language-related model parameters

• Questions:
- what form of model to use for p(X | M_j, Θ_A)?
- how to find Θ_A (training)?
- how to solve for M_j (decoding)?
State-based modeling
• Assume discrete-state model for the speech:
- observations are divided up into time frames
- model → states → observations:

  Model M_j → states Q_k : q_1 q_2 q_3 q_4 q_5 q_6 ...
            → N observed feature vectors X_1^N : x_1 x_2 x_3 x_4 x_5 x_6 ... (over time)

• Probability of observations given model is:

p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)

- sum over all possible state sequences Q_k

• How do observations depend on states? How do state sequences depend on model?
The speech recognition chain
• After frame-level classification, we still have the problem of classifying the sequences of frames:
• Questions
- what to use for the acoustic classifier?
- how to represent ‘model’ sequences?
- how to score matches?
[Figure: sound → Feature calculation → feature vectors → Acoustic classifier (Network weights) → phone probabilities → HMM decoder (Word models, Language model) → phone & word labeling]
Outline
1. Signal template matching
2. Statistical sequence recognition
3. Acoustic modeling
   - defining targets
   - neural networks & Gaussian models
4. The Hidden Markov Model (HMM)
Acoustic Modeling
• Goal: Convert features into probabilities of particular labels, i.e. find p(q_n^i | X_n) over some state set {q^i}
- conventional statistical classification problem

• Classifier construction is data-driven
- assume we can get examples of known good Xs for each of the q^i s
- calculate model parameters by standard training scheme

• Various classifiers can be used
- GMMs model the distribution under each state
- Neural Nets directly estimate posteriors

• Different classifiers have different properties
- features, labels limit ultimate performance
Defining classifier targets
• Choice of {q^i} can make a big difference
- must support recognition task
- must be a practical classification task

• Hand-labeling is one source...
- ‘experts’ mark spectrogram boundaries

• ...Forced alignment is another
- ‘best guess’ with existing classifiers, given words

• Result is targets for each training frame:

[Figure: feature vectors over time, with per-frame training targets (phone labels, e.g. g ... w eh n)]
Forced alignment
• Best labeling given existing classifier, constrained by known word sequence

[Figure: feature vectors → existing classifier → phone posterior probabilities → constrained alignment (known word sequence “ow th r iy ...”, via dictionary) → training targets (e.g. ow th r iy n s per frame) → classifier training]
Gaussian Mixture Models vs. Neural Nets
• GMMs fit distribution of features under states:

p(x | q_k) = 1 / ((2π)^d |Σ_k|)^{1/2} · exp( -(1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k) )

- separate ‘likelihood’ model for each state q_k
- match any distribution given enough data

• Neural nets estimate posteriors directly:

p(q_k | x) = F[ Σ_j w_jk · F[ Σ_i w_ij x_i ] ]

- parameters set to discriminate classes

• Posteriors & likelihoods related by Bayes’ rule:

p(q_k | x) = p(x | q_k) · Pr(q_k) / ( Σ_j p(x | q_j) · Pr(q_j) )
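The Bayes’ rule relation above is a one-liner in code. A small sketch (illustrative, with made-up likelihood and prior values):

```python
import numpy as np

def posteriors(lik, prior):
    """Bayes' rule: p(q_k|x) = p(x|q_k) Pr(q_k) / sum_j p(x|q_j) Pr(q_j)."""
    joint = lik * prior          # p(x|q_j) * Pr(q_j) per state
    return joint / joint.sum()   # normalize over all states

# hypothetical two-state example: likelihoods 2.5 and 0.1, equal priors
post = posteriors(np.array([2.5, 0.1]), np.array([0.5, 0.5]))
```

Note that likelihoods are densities and need not be below 1, but the resulting posteriors always sum to 1.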
Outline
1. Signal template matching
2. Statistical sequence recognition
3. Acoustic classification
4. The Hidden Markov Model (HMM)
   - generative Markov models
   - hidden Markov models
   - model fit likelihood
   - HMM examples
Markov models
• A (first order) Markov model is a finite-state system whose behavior depends only on the current state
• E.g. generative Markov model with states S, A, B, C, E and transition probabilities p(q_{n+1} | q_n):

           q_{n+1}:  S    A    B    C    E
  q_n = S            0    1    0    0    0
  q_n = A            0   .8   .1   .1    0
  q_n = B            0   .1   .8   .1    0
  q_n = C            0   .1   .1   .7   .1
  q_n = E            0    0    0    0    1

A typical generated state sequence:
S A A A A A A A A B B B B B B B B B C C C C B B B B B B C E
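A generative Markov model like the one above can be simulated by repeatedly sampling the next state given only the current one. A sketch, with the transition probabilities reconstructed from the slide’s diagram:

```python
import random

# transition probabilities reconstructed from the slide (E is absorbing)
P = {'S': {'A': 1.0},
     'A': {'A': 0.8, 'B': 0.1, 'C': 0.1},
     'B': {'A': 0.1, 'B': 0.8, 'C': 0.1},
     'C': {'A': 0.1, 'B': 0.1, 'C': 0.7, 'E': 0.1}}

def generate(P, start='S', end='E', seed=0):
    """Sample one state sequence: each next state depends only on the
    current state (the first-order Markov property)."""
    rng = random.Random(seed)
    seq, q = [start], start
    while q != end:
        nxt = list(P[q])
        q = rng.choices(nxt, weights=[P[q][s] for s in nxt])[0]
        seq.append(q)
    return seq

seq = generate(P)
```

Runs of repeated states (from the strong self-loops) produce sequences like the S A A A ... E example on the slide.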
Hidden Markov models
• = Markov model where state sequence Q = {q_n} is not directly observable (= ‘hidden’)

• But, observations X do depend on Q:
- x_n is a random variable that depends on the current state: p(x | q)
- can still tell something about the state sequence...

[Figure: emission distributions p(x | q) for q = A, B, C; an observation sequence x_n over time steps n generated by the underlying state sequence AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]
(Generative) Markov models (2)
• HMM is specified by:
- states q^i
- transition probabilities a_ij ≡ p(q_n^j | q_{n-1}^i)
- emission distributions b_i(x) ≡ p(x | q^i)
+ (initial state probabilities π_i ≡ p(q_1^i))

[Figure: example “kat” word model with states k, a, t, • (end), its transition matrix (rows 0.9 0.1 0.0 0.0 / 1.0 0.0 0.0 0.0 / 0.0 0.9 0.1 0.0 / 0.0 0.0 0.9 0.1), and per-state emission distributions p(x | q)]
Markov models for speech
• Speech models M_j
- usually left-to-right HMMs (sequence constraint)
- observation x_n & evolution q_{n+1} are conditionally independent of the rest given (hidden) state q_n
- self-loops for time dilation

[Figure: left-to-right phone model S → ae1 → ae2 → ae3 → E with self-loops; graphical model of states q_1 ... q_5 each emitting an observation x_1 ... x_5]
Markov models for sequence recognition
• Independence of observations:
- observation x_n depends only on the current state q_n

p(X | Q) = p(x_1, x_2, ..., x_N | q_1, q_2, ..., q_N)
         = p(x_1 | q_1) · p(x_2 | q_2) · ... · p(x_N | q_N)
         = ∏_{n=1}^{N} p(x_n | q_n) = ∏_{n=1}^{N} b_{q_n}(x_n)

• Markov transitions:
- transition to the next state depends only on the current state

p(Q | M) = p(q_1, q_2, ..., q_N | M)
         = p(q_N | q_1 ... q_{N-1}) · p(q_{N-1} | q_1 ... q_{N-2}) · ... · p(q_2 | q_1) · p(q_1)
         = p(q_N | q_{N-1}) · p(q_{N-1} | q_{N-2}) · ... · p(q_2 | q_1) · p(q_1)
         = p(q_1) ∏_{n=2}^{N} p(q_n | q_{n-1}) = π_{q_1} ∏_{n=2}^{N} a_{q_{n-1} q_n}
Model-fit calculation
• From ‘state-based modeling’:

p(X | M_j) = Σ_{all Q_k} p(X_1^N | Q_k, M_j) · p(Q_k | M_j)

• For HMMs:

p(X | Q) = ∏_{n=1}^{N} b_{q_n}(x_n)        p(Q | M) = π_{q_1} · ∏_{n=2}^{N} a_{q_{n-1} q_n}

• Hence, solve for M*:
- calculate p(X | M_j) for each available model, scale by prior p(M_j) → p(M_j | X)

• Sum over all Q_k ???
Summing over all paths
• Model M1 has states S, A, B, E with transitions (rows q_{n-1}, columns q_n):

       S    A    B    E
  S    ·   0.9  0.1   ·
  A    ·   0.7  0.2  0.1
  B    ·    ·   0.8  0.2
  E    ·    ·    ·    1

Observation likelihoods p(x | q) for observations x_1, x_2, x_3:

       x1   x2   x3
  A   2.5  0.2  0.1
  B   0.1  2.2  2.3

All possible 3-emission paths Q_k from S to E, with
p(Q | M) = ∏_n p(q_n | q_{n-1})  and  p(X | Q, M) = ∏_n p(x_n | q_n):

  Q_k            p(Q | M)                    p(X | Q, M)              p(X, Q | M)
  S A A A E   .9 × .7 × .7 × .1 = 0.0441   2.5 × 0.2 × 0.1 = 0.05    0.0022
  S A A B E   .9 × .7 × .2 × .2 = 0.0252   2.5 × 0.2 × 2.3 = 1.15    0.0290
  S A B B E   .9 × .2 × .8 × .2 = 0.0288   2.5 × 2.2 × 2.3 = 12.65   0.3643
  S B B B E   .1 × .8 × .8 × .2 = 0.0128   0.1 × 2.2 × 2.3 = 0.506   0.0065
              Σ = 0.1109                                             Σ = p(X | M) = 0.4020

[Figure: emission distributions p(x|A), p(x|B) with observations x_1, x_2, x_3; trellis of states S, A, B, E against time n = 0..4 showing the length-3 paths]
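The path-summing table above can be checked by brute-force enumeration. A sketch using the slide’s transition and emission values:

```python
from itertools import product

# Model M1 from the slide: transition probs and observation likelihoods
trans = {('S', 'A'): 0.9, ('S', 'B'): 0.1,
         ('A', 'A'): 0.7, ('A', 'B'): 0.2, ('A', 'E'): 0.1,
         ('B', 'B'): 0.8, ('B', 'E'): 0.2}
emit = {'A': [2.5, 0.2, 0.1],     # p(x1|A), p(x2|A), p(x3|A)
        'B': [0.1, 2.2, 2.3]}     # p(x1|B), p(x2|B), p(x3|B)

total = 0.0
for mid in product('AB', repeat=3):          # all length-3 emitting sequences
    path = ('S',) + mid + ('E',)
    p_Q = 1.0                                # p(Q|M) = prod p(q_n | q_{n-1})
    for a, b in zip(path, path[1:]):
        p_Q *= trans.get((a, b), 0.0)        # disallowed transitions -> 0
    p_X_Q = 1.0                              # p(X|Q,M) = prod p(x_n | q_n)
    for n, s in enumerate(mid):
        p_X_Q *= emit[s][n]
    total += p_Q * p_X_Q                     # p(X,Q|M), summed over paths

# total ≈ 0.4020 = p(X|M), matching the slide
```

Only the four S-to-E paths in the table contribute; the other state sequences hit a zero-probability transition.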
The ‘forward recursion’
• Dynamic-programming-like technique to calculate the sum over all Q_k

• Define α_n(i) as the probability of getting to state q^i at time step n (by any path):

α_n(i) = p(x_1, x_2, ..., x_n, q_n = q^i) ≡ p(X_1^n, q_n^i)

• Then α_{n+1}(j) can be calculated recursively:

α_{n+1}(j) = [ Σ_{i=1}^{S} α_n(i) · a_ij ] · b_j(x_{n+1})

[Figure: trellis of model states q^i against time steps n: α_n(i) and α_n(i+1) feed α_{n+1}(j) via transitions a_ij, a_{i+1,j} and emission b_j(x_{n+1})]
Forward recursion (2)
• Initialize:

α_1(i) = π_i · b_i(x_1)

• Then total probability:

p(X_1^N | M) = Σ_{i=1}^{S} α_N(i)

→ Practical way to solve for p(X | M_j) and hence perform recognition

[Figure: observations X scored by each model to give p(X | M1)·p(M1), p(X | M2)·p(M2), ...; choose best]
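The forward recursion takes only a few lines. The sketch below re-uses the model from the ‘summing over all paths’ example; since that model ends with an explicit transition into E, the exit probabilities are folded in after the last frame (an assumption of this toy setup, not something the slide spells out):

```python
import numpy as np

def forward(pi, A, B):
    """Forward recursion: alpha_1(i) = pi_i * b_i(x1);
    alpha_{n+1}(j) = (sum_i alpha_n(i) * a_ij) * b_j(x_{n+1})."""
    alpha = pi * B[0]
    for n in range(1, len(B)):
        alpha = (alpha @ A) * B[n]
    return alpha                           # alpha_N(i) for each state

# Model M1 from the path-summing example (emitting states A, B)
pi = np.array([0.9, 0.1])                  # p(S->A), p(S->B)
A = np.array([[0.7, 0.2],                  # A->A, A->B
              [0.0, 0.8]])                 # B->A, B->B
B = np.array([[2.5, 0.1],                  # b_A(x_n), b_B(x_n) per frame
              [0.2, 2.2],
              [0.1, 2.3]])
to_E = np.array([0.1, 0.2])                # p(A->E), p(B->E)

p_X = float(forward(pi, A, B) @ to_E)      # ≈ 0.4020, matching the slide
```

Three α updates replace the explicit sum over all 2³ state sequences; the cost grows linearly in N rather than exponentially.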
Optimal path
• May be interested in actual q_n assignments
- which state was ‘active’ at each time frame
- e.g. phone labelling (for training?)

• Total probability is over all paths...

• ...but can also solve for the single best path = “Viterbi” state sequence

• Probability along best path to state q_{n+1}^j:

α*_{n+1}(j) = max_i { α*_n(i) · a_ij } · b_j(x_{n+1})

- backtrack from final state to get best path
- final probability is a product only (no sum)
→ log-domain calculation is just summation

• Total probability often dominated by best path:

p(X, Q* | M) ≈ p(X | M)
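The Viterbi recursion differs from the forward recursion only in replacing the sum with a max and storing arg-max predecessors. A sketch on the same toy model (states A and B from the path-summing example):

```python
import numpy as np

def viterbi(pi, A, B):
    """Best-path recursion alpha*_{n+1}(j) = max_i{alpha*_n(i) a_ij} b_j(x_{n+1}),
    storing arg-max predecessors for traceback."""
    N, _ = B.shape
    alpha = pi * B[0]
    back = np.zeros((N, len(pi)), dtype=int)
    for n in range(1, N):
        scores = alpha[:, None] * A        # alpha*_n(i) * a_ij for all i, j
        back[n] = scores.argmax(axis=0)    # best predecessor of each state j
        alpha = scores.max(axis=0) * B[n]
    path = [int(alpha.argmax())]           # traceback from best final state
    for n in range(N - 1, 0, -1):
        path.append(int(back[n, path[-1]]))
    return float(alpha.max()), path[::-1]

# same toy model as the path-summing example (emitting states A, B)
pi = np.array([0.9, 0.1])
A = np.array([[0.7, 0.2], [0.0, 0.8]])
B = np.array([[2.5, 0.1], [0.2, 2.2], [0.1, 2.3]])

score, path = viterbi(pi, A, B)
labels = [['A', 'B'][s] for s in path]     # the slide's best path S A B B E
```

Folding in the exits to E (0.1 for A, 0.2 for B) gives a best-path probability of about 0.3643, most of the total 0.4020, illustrating why the best path often dominates.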
Interpreting the Viterbi path
• Viterbi path assigns each x_n to a state q^i
- performing classification based on b_i(x)
- ...at the same time as applying transition constraints a_ij

• Can be used for segmentation
- train an HMM with ‘garbage’ and ‘target’ states
- decode on new data to find ‘targets’, boundaries

• Can use for (heuristic) training
- e.g. train classifiers based on labels...

[Figure: observation sequence x_n with inferred classification; Viterbi labels AAAAAAAABBBBBBBBBBBCCCCBBBBBBBC]
Recognition with HMMs
• Isolated word
- choose best p(M | X) ∝ p(X | M) · p(M)

[Figure: input scored against Model M1 (“w ah n”), Model M2 (“th r iy”), Model M3 (“t uw”), each giving p(X | Mi)·p(Mi) = ...]

• Continuous speech
- Viterbi decoding of one large HMM gives words

[Figure: one large HMM combining word models “w ah n”, “th r iy”, “t uw” with priors p(M1), p(M2), p(M3) and a shared “sil” state]
HMM examples: Different state sequences
[Figure: Model M1: S → K → A → T → E and Model M2: S → K → O → T → E, with self-loop/exit probabilities 0.8/0.2 (K), 0.9/0.1 (A or O), 0.8/0.2 (T); emission distributions p(x | K), p(x | T), p(x | A), p(x | O) shown over x = 0..4]
Model matching: Emission probabilities
[Figure: an observation sequence x_n over time steps n = 0..18, decoded by both models: state alignment, log transition probability, and log observation likelihood per step.
Model M1: log p(X,Q* | M) = -33.5, log p(X | M) = -32.1, log p(Q* | M) = -7.5, log p(X | Q*,M) = -26.0
Model M2: log p(X,Q* | M) = -47.5, log p(X | M) = -47.0, log p(Q* | M) = -8.3, log p(X | Q*,M) = -39.2]
Model matching:Transition probabilities
[Figure: Models M'1 and M'2 each have states S, K, A, O, T, E with both A and O reachable from K: M'1 has K→A = 0.18, K→O = 0.02; M'2 has K→A = 0.05, K→O = 0.15 (self-loops as before).
M'1: log p(X,Q* | M) = -33.6, log p(X | M) = -32.2, log p(Q* | M) = -7.6, log p(X | Q*,M) = -26.0
M'2: log p(X,Q* | M) = -34.9, log p(X | M) = -33.5, log p(Q* | M) = -8.9]
Validity of HMM assumptions
• Key assumption is conditional independence: given q^i, future evolution & observation distribution are independent of previous events
- duration behavior: self-loops imply an exponential (geometric) duration distribution,
  p(N = n) = γ^{n-1} · (1 − γ)  for self-loop probability γ
- independence of successive x_n s?
  p(X) = ∏_n p(x_n | q^i)

[Figure: geometric duration distribution p(N = n) against n (values 1−γ, γ(1−γ), ...); successive observations x_n drawn repeatedly from the same p(x_n | q^i)]
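The duration claim can be checked numerically: a self-loop of probability γ gives p(N = n) = γ^{n-1}(1−γ), a geometric distribution with mean 1/(1−γ). A small sketch with an illustrative γ (not a value from the slides):

```python
gamma = 0.8                                    # self-loop probability
N = 500                                        # truncation; tail is negligible

# p(N = n) = gamma^(n-1) * (1 - gamma): stay n-1 times, then leave once
p = [gamma ** (n - 1) * (1 - gamma) for n in range(1, N)]
mean_dur = sum(n * pn for n, pn in zip(range(1, N), p))

# sum(p) ≈ 1 and mean_dur ≈ 1 / (1 - gamma) = 5 frames
```

The monotonically decaying shape is the point of the slide’s question mark: real phone durations peak away from 1 frame, so the self-loop model is only a crude fit.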
Recap: Recognizer Structure
• We now know how to execute each stage
• ...but how to train HMMs?
• ...where to get word/language models?
[Figure: sound → Feature calculation → feature vectors → Acoustic classifier (Network weights) → phone probabilities → HMM decoder (Word models, Language model) → phone & word labeling]
Summary
• Speech is modeled as a sequence of features
- need a temporal aspect to recognition
- best time-alignment of templates = DTW

• Hidden Markov models are a rigorous solution
- self-loops allow temporal dilation
- exact, efficient likelihood calculations

Parting thought: How to set the HMM parameters? (training)