Minimum Phone Error (MPE) Model and Feature Training


ShihHsiang 2006


The derivation flow of the various training criteria

[Figure: flow diagram of the derivation of the various training criteria]

Difference

• MPE vs. ORCE
– ORCE focuses on word error rate and is implemented on N-best results
– MPE focuses on phone accuracy and is implemented on a word graph; it also introduces a prior distribution over the newly estimated models (I-smoothing)

• MPE vs. MMI
– MMI treats the correct transcription as the numerator lattice and the whole word graph (the competing sequences) as the denominator lattice
– MPE treats all possible correct sequences on the word graph as the numerator lattice, and all possible wrong sequences as the denominator lattice
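The contrast between the two criteria can be illustrated on a toy list of competing hypotheses. The sketch below assumes the usual forms of the criteria: MMI as the log posterior of the reference transcription, and MPE as the posterior-weighted average phone accuracy. The scores, the phone accuracies, and all function names are hypothetical values chosen for illustration, not taken from the slides.

```python
import math

def hypothesis_posteriors(scores):
    """Normalise joint scores p(O|u)P(u) over the competing hypotheses."""
    total = sum(scores)
    return [s / total for s in scores]

def mmi_objective(scores, ref_index):
    """Log posterior of the reference transcription (assumed MMI form)."""
    return math.log(hypothesis_posteriors(scores)[ref_index])

def mpe_objective(scores, phone_acc):
    """Posterior-weighted average phone accuracy over all hypotheses."""
    post = hypothesis_posteriors(scores)
    return sum(p * a for p, a in zip(post, phone_acc))

# Hypothetical toy data: three hypotheses, the first one is the reference.
scores = [0.5, 0.25, 0.25]      # p(O|u)P(u) for each hypothesis u
phone_acc = [4.0, 3.0, 1.0]     # raw phone accuracy of each hypothesis

mmi_value = mmi_objective(scores, ref_index=0)
mpe_value = mpe_objective(scores, phone_acc)
print(mmi_value, mpe_value)
```

In this toy setting MMI rewards only the posterior mass on the reference, while MPE also gives partial credit to the second hypothesis because most of its phones are correct.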


fMPE (cont.)

• Feature-space minimum phone error (fMPE) is a discriminative training method which adds an offset to the old feature

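$$y_t = o_t + M h_t$$

where $o_t$ is the current feature, $M$ is the transform matrix, and $h_t$ is the high-dimensional feature.

[Figure: $h_t$ is built from the posterior vector of the current frame and from averages over neighbouring frames]

• Each vector contains 10,000 Gaussian posterior probabilities, and the Gaussian likelihoods are evaluated with no priors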
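A minimal sketch of the fMPE feature update for a single frame, assuming the new feature is the old feature plus the transform matrix applied to the high-dimensional posterior vector. The function name and the toy sizes (2 feature dimensions, 3 posteriors) are hypothetical; a real system would use thousands of posteriors.

```python
def fmpe_transform(o_t, M, h_t):
    """Return y_t = o_t + M h_t for one frame."""
    # Each row of M produces the offset for one feature dimension.
    offset = [sum(m_ij * h_j for m_ij, h_j in zip(row, h_t)) for row in M]
    return [o + d for o, d in zip(o_t, offset)]

# Hypothetical toy values.
o_t = [1.0, -2.0]                # current feature (p = 2)
h_t = [0.5, 0.25, 0.25]          # Gaussian posteriors (q = 3)
M = [[0.0, 4.0, 0.0],            # transform matrix (p x q)
     [2.0, 0.0, 0.0]]

y_t = fmpe_transform(o_t, M, h_t)
print(y_t)
```

With an all-zero matrix the transform leaves the feature unchanged, so training can start from the original features.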


fMPE (cont.)

• Objective Function

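$$F_{MPE} = \sum_{r} \sum_{u} \frac{P(O_r \mid u)\, P(u)}{\sum_{v \in \text{lattice}} P(O_r \mid v)\, P(v)}\, \mathrm{Acc}(u, s_r)$$

• The transformation matrix is updated using gradient descent

$$\frac{\partial F_{MPE}}{\partial M_{ij}} = \sum_{t=1}^{T} \frac{\partial F_{MPE}}{\partial y_{ti}} \frac{\partial y_{ti}}{\partial M_{ij}} = \sum_{t=1}^{T} \frac{\partial F_{MPE}}{\partial y_{ti}}\, h_{tj}$$

• Direct differential

$$\frac{\partial F^{direct}}{\partial y_{ti}} = \sum_{s=1}^{S} \sum_{m=1}^{M} \frac{\partial F}{\partial l_{smt}} \frac{\partial l_{smt}}{\partial y_{ti}}$$

• Indirect differential

$$\frac{\partial F^{indirect}}{\partial y_{ti}} = \sum_{s=1}^{S} \sum_{m=1}^{M} \frac{\gamma_{sm}(t)}{\gamma_{sm}} \left( \frac{\partial F}{\partial \mu_{smi}} + 2 \frac{\partial F}{\partial \sigma^2_{smi}} \left( y_{ti} - \mu_{smi} \right) \right)$$

• Total differential and update

$$\frac{\partial F_{MPE}}{\partial y_{ti}} = \frac{\partial F^{direct}}{\partial y_{ti}} + \frac{\partial F^{indirect}}{\partial y_{ti}}$$

$$M_{ij} \leftarrow M_{ij} + \nu_{ij} \frac{\partial F_{MPE}}{\partial M_{ij}}$$

where $l_{smt}$ is the log-likelihood of Gaussian $m$ of state $s$ at time $t$, $\mu_{smi}$ and $\sigma^2_{smi}$ are its mean and variance in dimension $i$, $\gamma_{sm}(t)$ is its occupancy at time $t$ with $\gamma_{sm} = \sum_t \gamma_{sm}(t)$, and $\nu_{ij}$ is the learning rate.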
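The matrix update can be sketched as follows, assuming the gradient with respect to each matrix element is the sum over frames of the feature gradient times the corresponding element of the high-dimensional vector, and that the feature gradient is the sum of a direct and an indirect part. The per-frame gradients below are hypothetical numbers; in practice they come from lattice statistics. The step is written with a plus sign on the assumption that the objective is being increased.

```python
def grad_wrt_M(dF_dy, H):
    """dF/dM[i][j] = sum over t of dF/dy[t][i] * h[t][j]."""
    T, p, q = len(dF_dy), len(dF_dy[0]), len(H[0])
    return [[sum(dF_dy[t][i] * H[t][j] for t in range(T)) for j in range(q)]
            for i in range(p)]

def update_M(M, grad, nu):
    """One step M[i][j] += nu[i][j] * grad[i][j] with per-element rates."""
    return [[m + v * g for m, v, g in zip(m_row, v_row, g_row)]
            for m_row, v_row, g_row in zip(M, nu, grad)]

# Hypothetical toy values: T = 2 frames, p = 2 feature dims, q = 2 posteriors.
dF_dy_direct = [[1.0, 0.0], [0.0, 2.0]]
dF_dy_indirect = [[0.5, 0.0], [0.0, -1.0]]
# Total differential = direct part + indirect part, frame by frame.
dF_dy = [[a + b for a, b in zip(d_row, i_row)]
         for d_row, i_row in zip(dF_dy_direct, dF_dy_indirect)]

H = [[1.0, 0.0], [0.5, 0.5]]             # h_t for each frame
nu = [[0.5, 0.5], [0.5, 0.5]]            # per-element learning rates
grad = grad_wrt_M(dF_dy, H)
M_new = update_M([[0.0, 0.0], [0.0, 0.0]], grad, nu)
print(grad, M_new)
```

Keeping a separate learning rate per matrix element mirrors the per-element rate in the update rule; a single global rate would be the simpler special case.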


fMPE (cont.)

• When only the direct differential is used to update the transformation matrix, significant improvements are obtained, but they are lost very quickly once the acoustic model is retrained with ML

• The indirect differential therefore aims to reflect the model change caused by ML training with the new features


offset fMPE

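• Offset fMPE differs from the original fMPE in the definition of the high-dimensional vector $h_t$ of posterior probabilities

$$h_t = \left[\, 5.0\,\gamma_t^{1},\ \gamma_t^{1}\frac{x_{t1}-\mu_{11}}{\sigma_{11}},\ \gamma_t^{1}\frac{x_{t2}-\mu_{12}}{\sigma_{12}},\ \ldots,\ 5.0\,\gamma_t^{N},\ \gamma_t^{N}\frac{x_{t1}-\mu_{N1}}{\sigma_{N1}},\ \gamma_t^{N}\frac{x_{t2}-\mu_{N2}}{\sigma_{N2}},\ \ldots \,\right]^{T}$$

where $\gamma_t^{i}$ represents the posterior of the $i$-th Gaussian at time $t$:

$$\gamma_t^{i} = \frac{p(O_t \mid g_i)}{\sum_{j=1}^{N} p(O_t \mid g_j)}$$

– the offset terms are dimension dependent
– size of $h_t$: $N(d+1)$

• The number of Gaussians needed is about 1,000, which is significantly lower than the 100,000 for the original fMPE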
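A minimal sketch of how the offset fMPE vector could be assembled for one frame, assuming diagonal-covariance Gaussians, posteriors computed without priors, and a layout of one scaled posterior followed by the posterior-weighted normalised offsets for each Gaussian. The scale of 5.0 follows the constant in the vector definition; the function names and the toy Gaussians are hypothetical.

```python
import math

def diag_gaussian(x, mean, std):
    """Density of a diagonal-covariance Gaussian at x."""
    density = 1.0
    for xi, m, s in zip(x, mean, std):
        density *= math.exp(-0.5 * ((xi - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    return density

def offset_fmpe_vector(x, means, stds, scale=5.0):
    """Build h_t: for each Gaussian, [scale*post, post*(x_k - mu_k)/sigma_k ...]."""
    likes = [diag_gaussian(x, m, s) for m, s in zip(means, stds)]
    total = sum(likes)              # no priors: plain likelihood normalisation
    h = []
    for like, mean, std in zip(likes, means, stds):
        post = like / total
        h.append(scale * post)
        h.extend(post * (xi - m) / s for xi, m, s in zip(x, mean, std))
    return h

# Hypothetical toy setup: N = 2 Gaussians, d = 1 dimension.
x = [0.0]
means = [[-1.0], [1.0]]
stds = [[1.0], [1.0]]
h_t = offset_fmpe_vector(x, means, stds)
print(h_t)
```

The length of the result is the number of Gaussians times one more than the feature dimension, which is why far fewer Gaussians are needed than when each Gaussian contributes only a single posterior.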


Dimension-weighted offset fMPE

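• Unlike offset fMPE, which gives the same weight to each dimension of the feature offset vector, dimension-weighted offset fMPE
– calculates the posterior probability on each dimension of the feature offset vector

$$h_t = \left[\, 5.0\,\gamma_t^{1},\ \gamma_t^{1}(1)\frac{x_{t1}-\mu_{11}}{\sigma_{11}},\ \gamma_t^{1}(2)\frac{x_{t2}-\mu_{12}}{\sigma_{12}},\ \ldots,\ 5.0\,\gamma_t^{N},\ \gamma_t^{N}(1)\frac{x_{t1}-\mu_{N1}}{\sigma_{N1}},\ \gamma_t^{N}(2)\frac{x_{t2}-\mu_{N2}}{\sigma_{N2}},\ \ldots \,\right]^{T}$$

$$\gamma_t^{i}(d) = \frac{p(O_t(d) \mid g_i(d))}{\sum_{j=1}^{N} p(O_t(d) \mid g_j(d))}$$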
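The per-dimension posterior can be sketched as below, assuming each dimension of each Gaussian is treated as a one-dimensional Gaussian and normalised across Gaussians independently of the other dimensions. Only the weighted offset terms are computed; the leading scaled-posterior term of each block is left out. All names and toy values are hypothetical.

```python
import math

def gaussian_1d(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def dimension_posteriors(x, means, stds):
    """post[i][d] = p(x[d] | g_i(d)) / sum_j p(x[d] | g_j(d))."""
    n_gauss, dim = len(means), len(x)
    post = [[0.0] * dim for _ in range(n_gauss)]
    for d in range(dim):
        likes = [gaussian_1d(x[d], means[i][d], stds[i][d]) for i in range(n_gauss)]
        total = sum(likes)
        for i in range(n_gauss):
            post[i][d] = likes[i] / total
    return post

def weighted_offsets(x, means, stds):
    """Offset of each dimension weighted by that dimension's own posterior."""
    post = dimension_posteriors(x, means, stds)
    return [[post[i][d] * (x[d] - means[i][d]) / stds[i][d] for d in range(len(x))]
            for i in range(len(means))]

# Hypothetical toy setup: 2 Gaussians, 2 dimensions.
x = [0.0, 5.0]
means = [[-1.0, 5.0], [1.0, 0.0]]
stds = [[1.0, 1.0], [1.0, 1.0]]
post = dimension_posteriors(x, means, stds)
offsets = weighted_offsets(x, means, stds)
print(post, offsets)
```

In the toy data the two Gaussians are equally likely in the first dimension but not in the second, so the weights differ by dimension, which a single per-Gaussian posterior cannot express.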


Experiments (on MATBN)

• Error rates (%) for MPE and fMPE for different features, on different acoustic levels.


Experiments (cont.)

• CER(%) for offset fMPE and dimension-weighted offset fMPE with different features


Connect to SPLICE

• Decomposition Scheme 1

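$$\underset{p \times 1}{y_t} = \underset{p \times 1}{o_t} + \underset{p \times q}{M}\ \underset{q \times 1}{h_t}$$

$$M = \left[\, M^{(1)}\ \cdots\ M^{(n)} \,\right],\quad M^{(i)}:\ p \times p \qquad h_t = \left[\, h_t^{(1)};\ \ldots;\ h_t^{(n)} \,\right],\quad h_t^{(i)}:\ p \times 1$$

$$y_t = o_t + \sum_{i=1}^{n} M^{(i)} h_t^{(i)}$$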
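The first decomposition can be checked numerically: splitting the transform matrix into square column blocks and the high-dimensional vector into matching sub-vectors gives the same offset as the full matrix-vector product. The helper names and the toy sizes (p = 2, n = 2) are hypothetical.

```python
def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def split_blocks(M, h, p):
    """Split M (p x np) into n blocks of size p x p and h into n sub-vectors."""
    n = len(h) // p
    return [([row[i * p:(i + 1) * p] for row in M], h[i * p:(i + 1) * p])
            for i in range(n)]

# Hypothetical toy values: p = 2, n = 2, so q = 4.
M = [[1.0, 0.0, 0.0, 2.0],
     [0.0, 1.0, 3.0, 0.0]]
h_t = [1.0, 2.0, 3.0, 4.0]

full_offset = matvec(M, h_t)
block_terms = [matvec(M_i, h_i) for M_i, h_i in split_blocks(M, h_t, p=2)]
block_offset = [sum(term[k] for term in block_terms) for k in range(2)]
print(full_offset, block_terms, block_offset)
```

Each entry of `block_terms` is one bias vector, so the total offset is a sum of n bias vectors.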


Connect to SPLICE (cont.)

• Compensation of the original feature is carried out by adding a large number of bias vectors, each of which is computed as a full-rank rotation of a small set of posterior probabilities

• Maximum-Likelihood estimation

$$y_t = o_t + \sum_{i=1}^{n} M^{(i)} h_t^{(i)} \approx o_t + M^{(i^*)} h_t^{(i^*)}$$

where $i^*$ denotes the term that is greater than the remaining $(n-1)$ terms
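A small sketch of the single-term approximation, in which the sum of bias vectors is replaced by its largest term. Measuring "largest" by Euclidean norm is an assumption made here for illustration, as are the function names and the toy bias vectors.

```python
import math

def dominant_term(bias_terms):
    """Index of the bias vector with the largest Euclidean norm (assumed criterion)."""
    norms = [math.sqrt(sum(b * b for b in term)) for term in bias_terms]
    return max(range(len(bias_terms)), key=norms.__getitem__)

def compensate(o_t, bias_terms, only_dominant=False):
    """Add all bias vectors to o_t, or only the dominant one."""
    if only_dominant:
        bias_terms = [bias_terms[dominant_term(bias_terms)]]
    return [o + sum(term[k] for term in bias_terms) for k, o in enumerate(o_t)]

# Hypothetical bias vectors M^(i) h_t^(i) for n = 3 blocks.
o_t = [0.0, 0.0]
bias_terms = [[0.01, -0.02], [1.0, 2.0], [0.03, 0.0]]

i_star = dominant_term(bias_terms)
y_exact = compensate(o_t, bias_terms)
y_approx = compensate(o_t, bias_terms, only_dominant=True)
print(i_star, y_exact, y_approx)
```

The approximation is close only when one term clearly outweighs the rest, as in this toy data.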



• Decomposition Scheme 2

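$$\underset{p \times 1}{y_t} = \underset{p \times 1}{o_t} + \underset{p \times q}{M}\ \underset{q \times 1}{h_t}$$

$$M = \left[\, m_1\ m_2\ m_3\ \cdots\ m_q \,\right],\quad m_k:\ p \times 1 \qquad h_t = \left[\, h_{1t},\ h_{2t},\ \ldots,\ h_{qt} \,\right]^{T}$$

$$y_t = o_t + M h_t = o_t + \sum_{k=1}^{q} h_{kt}\, m_k = o_t + \sum_{k=1}^{q} p(k \mid x_t)\, m_k$$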
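The second decomposition reads the same matrix-vector product column by column: the offset is a posterior-weighted sum of the columns of the transform matrix. The sketch below checks this on hypothetical toy values (p = 2, q = 3); the function names are illustrative only.

```python
def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def columns(M):
    """Return the columns m_1 ... m_q of M as lists."""
    return [list(col) for col in zip(*M)]

def weighted_column_sum(M, h):
    """sum_k h[k] * m_k, with the posteriors h[k] acting as weights."""
    p = len(M)
    return [sum(h_k * m_k[i] for h_k, m_k in zip(h, columns(M))) for i in range(p)]

# Hypothetical toy values.
M = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 2.0]]
h_t = [0.5, 0.25, 0.25]          # posteriors p(k | x_t), summing to one

by_product = matvec(M, h_t)
by_columns = weighted_column_sum(M, h_t)
print(by_product, by_columns)
```

Each column acts as a frame-independent correction vector; only the weights change from frame to frame.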



• The compensation vector consists of a linear weighted sum of a set of frame-independent correction vectors, where the weight is the posterior probability associated with the corresponding correction vector

• The key difference is
– the bias vector for compensation in fMPE is specific to each time frame $t$
– the bias vector in feature-space stochastic matching is common over all frames in the utterance
