Minimum Phone Error (MPE) Model and Feature Training
ShihHsiang 2006
The derivation flow of the various training criteria
[Figure: derivation flow of the various training criteria]
Difference
• MPE vs. ORCE
– ORCE focuses on word error rate and is implemented on N-best results.
– MPE focuses on phone accuracy and is implemented on a word graph. It also introduces a prior distribution over the newly estimated models (I-smoothing).
• MPE vs. MMI
– MMI treats the correct transcription as the numerator lattice and the whole word graph (the competing sequences) as the denominator lattice.
– MPE treats all possible correct sequences in the word graph as the numerator lattice and all possible wrong sequences as the denominator lattice.
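The following is a minimal sketch that makes the MMI/MPE contrast concrete. It uses a hypothetical toy utterance in which the word graph is replaced by a short list of hypotheses. Each hypothesis carries a joint log-score log P(O|u)P(u) and a phone accuracy against the reference. The numbers and function names are illustrative assumptions. A real implementation would accumulate these quantities over lattice arcs with forward-backward, not over whole paths.

```python
import math

def posteriors(log_scores):
    """Normalize joint log-scores log P(O|u)P(u) into hypothesis posteriors."""
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]  # subtract max for stability
    total = sum(weights)
    return [w / total for w in weights]

def mmi_objective(log_scores, ref_index):
    """MMI: log posterior of the correct transcription against all hypotheses."""
    return math.log(posteriors(log_scores)[ref_index])

def mpe_objective(log_scores, accuracies):
    """MPE: expected phone accuracy, weighting every hypothesis by its posterior."""
    return sum(p * a for p, a in zip(posteriors(log_scores), accuracies))

# Hypothetical toy utterance: hypothesis 0 is the reference (3 of 3 phones correct).
log_scores = [-10.0, -10.5, -12.0]
accuracies = [3.0, 2.0, 0.0]
print(mmi_objective(log_scores, 0))
print(mpe_objective(log_scores, accuracies))
```

MMI only rewards probability mass on the single reference. MPE also gives partial credit to a wrong hypothesis that gets most phones right.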
fMPE (cont.)
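• Feature-space minimum phone error (fMPE) is a discriminative training method that adds an offset to the original feature:
$$y_t = o_t + M h_t$$
where $o_t$ is the current feature vector at frame $t$, $M$ is the transform matrix, and $h_t$ is a high-dimensional feature vector for the current frame.
• $h_t$ is built from vectors that each contain 10,000 Gaussian posterior probabilities, some of them averaged over neighbouring frames. The Gaussian likelihoods are evaluated with no priors.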
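The transform $y_t = o_t + M h_t$ can be sketched in a few lines. This is a toy illustration under stated assumptions:
- two hypothetical diagonal Gaussians stand in for the 10,000-Gaussian set;
- $h_t$ holds only the current frame's posteriors, with no context averaging;
- the helper names are invented for the example.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def posteriors_no_prior(o_t, gaussians):
    """h_t: posterior of each Gaussian given o_t, with no mixture priors (equal weights)."""
    logs = [gaussian_loglik(o_t, m, v) for m, v in gaussians]
    top = max(logs)
    w = [math.exp(l - top) for l in logs]
    total = sum(w)
    return [wi / total for wi in w]

def fmpe_transform(o_t, M, h_t):
    """y_t = o_t + M h_t, where M has len(o_t) rows and len(h_t) columns."""
    return [o + sum(M[i][j] * h_t[j] for j in range(len(h_t)))
            for i, o in enumerate(o_t)]

# Hypothetical 2-D example with two Gaussians.
gaussians = [([0.0, 0.0], [1.0, 1.0]), ([5.0, 5.0], [1.0, 1.0])]
o_t = [1.0, 2.0]
h_t = posteriors_no_prior(o_t, gaussians)
print(fmpe_transform(o_t, [[0.0, 0.0], [0.0, 0.0]], h_t))  # M = 0 leaves o_t unchanged
```

With $M = 0$ every feature passes through unchanged. The offset $M h_t$ only grows as the gradient updates move $M$ away from zero.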
fMPE (cont.)
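• Objective function
$$F_{\text{MPE}} = \sum_{r} \frac{\sum_{u \in \text{lattice}_r} P(O_r \mid u)\,P(u)\,\mathrm{Acc}(u, s_r)}{\sum_{v \in \text{lattice}_r} P(O_r \mid v)\,P(v)}$$
• The transformation matrix is updated by gradient ascent, since $F_{\text{MPE}}$ is maximized. By the chain rule through $y_t = o_t + M h_t$:
$$\frac{\partial F_{\text{MPE}}}{\partial M_{ij}} = \sum_{t=1}^{T} \frac{\partial F_{\text{MPE}}}{\partial y_{ti}}\,\frac{\partial y_{ti}}{\partial M_{ij}} = \sum_{t=1}^{T} \frac{\partial F_{\text{MPE}}}{\partial y_{ti}}\,h_{tj}$$
• Direct differential. Here $S$ is the number of states, $M$ is the number of Gaussians per state, and $l_{smt} = \log p(y_t \mid s, m)$:
$$\left(\frac{\partial F_{\text{MPE}}}{\partial y_{ti}}\right)^{\text{direct}} = \sum_{s=1}^{S}\sum_{m=1}^{M} \frac{\partial F_{\text{MPE}}}{\partial l_{smt}}\,\frac{\partial l_{smt}}{\partial y_{ti}}$$
• Indirect differential, through the Gaussian means and variances:
$$\left(\frac{\partial F_{\text{MPE}}}{\partial y_{ti}}\right)^{\text{indirect}} = \sum_{s=1}^{S}\sum_{m=1}^{M} \left(\frac{\partial F_{\text{MPE}}}{\partial \mu_{smi}}\,\frac{\partial \mu_{smi}}{\partial y_{ti}} + \frac{\partial F_{\text{MPE}}}{\partial \sigma^{2}_{smi}}\,\frac{\partial \sigma^{2}_{smi}}{\partial y_{ti}}\right)$$
• Total differential and update, with a per-element learning rate $\nu_{ij}$:
$$\frac{\partial F_{\text{MPE}}}{\partial y_{ti}} = \left(\frac{\partial F_{\text{MPE}}}{\partial y_{ti}}\right)^{\text{direct}} + \left(\frac{\partial F_{\text{MPE}}}{\partial y_{ti}}\right)^{\text{indirect}}, \qquad M_{ij} \leftarrow M_{ij} + \nu_{ij}\,\frac{\partial F_{\text{MPE}}}{\partial M_{ij}}$$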
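The chain rule on this slide can be sketched as follows for diagonal-covariance Gaussians, where $\partial l_{smt}/\partial y_{ti} = (\mu_{smi} - y_{ti})/\sigma^2_{smi}$. The sketch rests on these assumptions:
- the per-Gaussian values $\partial F/\partial l_{smt}$ (the hypothetical `dF_dl_t` argument) come from an MPE lattice pass, which is not shown;
- the indirect part is left to the caller;
- all function names are hypothetical.

```python
def direct_dF_dy(y_t, gaussians, dF_dl_t):
    """Direct part for one frame: sum over Gaussians of dF/dl_smt * dl_smt/dy_ti.

    gaussians: (mean, var) for every Gaussian (s, m), flattened into one list.
    dF_dl_t:   dF/dl_smt for this frame, one value per Gaussian (assumed given).
    """
    grad = [0.0] * len(y_t)
    for (mean, var), d in zip(gaussians, dF_dl_t):
        for i in range(len(y_t)):
            grad[i] += d * (mean[i] - y_t[i]) / var[i]  # dl/dy_i for a diagonal Gaussian
    return grad

def matrix_gradient(dF_dy, h):
    """dF/dM_ij = sum over frames t of dF/dy_ti * h_tj."""
    rows, cols = len(dF_dy[0]), len(h[0])
    G = [[0.0] * cols for _ in range(rows)]
    for g_t, h_t in zip(dF_dy, h):
        for i in range(rows):
            for j in range(cols):
                G[i][j] += g_t[i] * h_t[j]
    return G

def update_matrix(M, G, nu):
    """Gradient-ascent step M_ij <- M_ij + nu_ij * dF/dM_ij."""
    return [[M[i][j] + nu[i][j] * G[i][j] for j in range(len(M[0]))]
            for i in range(len(M))]

# Two frames, 2-D features, 1-D h_t (toy values).
G = matrix_gradient([[1.0, 2.0], [3.0, 4.0]], [[1.0], [0.5]])
print(update_matrix([[0.0], [0.0]], G, [[0.1], [0.1]]))
```

In a full system, each row of `dF_dy` would be the sum of the direct part above and the indirect part.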
fMPE (cont.)
• When only the direct differential is used to update the transformation matrix, significant improvements are obtained. However, they are soon lost when the acoustic model is retrained with ML.
• The indirect differential therefore aims to reflect the model change caused by ML training on the new features.
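The indirect differential needs $\partial\mu_{smi}/\partial y_{ti}$ and $\partial\sigma^2_{smi}/\partial y_{ti}$, that is, how the ML re-estimates move when a training feature moves. Here is a minimal one-dimensional sketch, assuming the occupancies $\gamma(t)$ are held fixed. The ML mean and variance give:
- $\partial\mu/\partial y_t = \gamma(t)/\sum\gamma$;
- $\partial\sigma^2/\partial y_t = 2\gamma(t)(y_t - \mu)/\sum\gamma$.

The code checks both against central finite differences. Function names are illustrative.

```python
def ml_mean_var(y, gamma):
    """ML re-estimates of a 1-D Gaussian from features y and fixed occupancies gamma."""
    occ = sum(gamma)
    mu = sum(g * v for g, v in zip(gamma, y)) / occ
    var = sum(g * (v - mu) ** 2 for g, v in zip(gamma, y)) / occ
    return mu, var

def analytic_derivs(y, gamma, t):
    """d mu / d y_t and d var / d y_t for the ML estimates above."""
    occ = sum(gamma)
    mu, _ = ml_mean_var(y, gamma)
    return gamma[t] / occ, 2.0 * gamma[t] * (y[t] - mu) / occ

def numeric_derivs(y, gamma, t, eps=1e-6):
    """Central finite differences, used to check the analytic expressions."""
    up, down = list(y), list(y)
    up[t] += eps
    down[t] -= eps
    (mu_u, var_u), (mu_d, var_d) = ml_mean_var(up, gamma), ml_mean_var(down, gamma)
    return (mu_u - mu_d) / (2 * eps), (var_u - var_d) / (2 * eps)

y = [0.5, 1.5, -0.3, 2.0]
gamma = [0.2, 0.9, 0.4, 0.7]
for t in range(len(y)):
    print(analytic_derivs(y, gamma, t), numeric_derivs(y, gamma, t))
```

The variance derivative has no extra term from the moving mean. This is because $\sum_s \gamma(s)(y_s - \mu) = 0$ at the ML estimate.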
offset fMPE
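• Offset fMPE differs from the original fMPE in the definition of the high-dimensional vector $h_t$ of posterior probabilities:
$$h_t = \left[\,0.5\,\gamma_{t1},\ \gamma_{t1}\frac{x_{t1}-\mu_{11}}{\sigma_{11}},\ \ldots,\ \gamma_{t1}\frac{x_{td}-\mu_{1d}}{\sigma_{1d}},\ \ldots,\ 0.5\,\gamma_{tN},\ \gamma_{tN}\frac{x_{t1}-\mu_{N1}}{\sigma_{N1}},\ \ldots,\ \gamma_{tN}\frac{x_{td}-\mu_{Nd}}{\sigma_{Nd}}\,\right]^{T}, \qquad h_t : N(d+1) \times 1$$
Here the offset terms are dimension dependent, and $\gamma_{ti}$ is the posterior of the $i$-th Gaussian at time $t$:
$$\gamma_{ti} = \frac{p(o_t \mid g_i)}{\sum_{j=1}^{N} p(o_t \mid g_j)}$$
• The number of Gaussians needed is about 1,000, significantly fewer than the 100,000 of the original fMPE.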
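Here is a sketch of building the offset-fMPE vector for one frame. It assumes diagonal Gaussians and treats $x_t$ and $o_t$ as the same feature vector (the slide writes both). Function names are hypothetical, and the 0.5 scale follows the formula above.

```python
import math

def diag_gauss_loglik(x, mean, var):
    """log p(x | g) for a diagonal-covariance Gaussian g."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def offset_fmpe_h(x_t, gaussians, scale=0.5):
    """Offset-fMPE vector h_t of size N * (d + 1).

    For each Gaussian g_i it appends scale * gamma_ti followed by
    gamma_ti * (x_tk - mu_ik) / sigma_ik for k = 1..d, where gamma_ti is
    the posterior of g_i computed with no priors.
    """
    logs = [diag_gauss_loglik(x_t, m, v) for m, v in gaussians]
    top = max(logs)
    weights = [math.exp(l - top) for l in logs]  # log-sum-exp style normalization
    total = sum(weights)
    h = []
    for (mean, var), w in zip(gaussians, weights):
        gamma = w / total
        h.append(scale * gamma)
        h.extend(gamma * (x - m) / math.sqrt(v) for x, m, v in zip(x_t, mean, var))
    return h

gaussians = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 2.0], [4.0, 4.0])]
print(offset_fmpe_h([1.0, 1.0], gaussians))
```

Each Gaussian contributes $d + 1$ entries, so $N \approx 1000$ Gaussians still keeps $h_t$ far smaller than a 100,000-entry posterior vector.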
Dimension-weighted offset fMPE
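• Offset fMPE gives the same weight to every dimension of the feature offset vector. Dimension-weighted offset fMPE instead calculates a separate posterior probability for each dimension of the feature offset vector:
$$h_t = \left[\,0.5\,\gamma_{t1},\ \gamma_{t1}^{(1)}\frac{x_{t1}-\mu_{11}}{\sigma_{11}},\ \ldots,\ \gamma_{t1}^{(d)}\frac{x_{td}-\mu_{1d}}{\sigma_{1d}},\ \ldots,\ 0.5\,\gamma_{tN},\ \gamma_{tN}^{(1)}\frac{x_{t1}-\mu_{N1}}{\sigma_{N1}},\ \ldots,\ \gamma_{tN}^{(d)}\frac{x_{td}-\mu_{Nd}}{\sigma_{Nd}}\,\right]^{T}$$
$$\gamma_{ti}^{(k)} = \frac{p(o_t^{(k)} \mid g_i^{(k)})}{\sum_{j=1}^{N} p(o_t^{(k)} \mid g_j^{(k)})}$$
Here $o_t^{(k)}$ is the $k$-th dimension of $o_t$, and $g_i^{(k)}$ is the $k$-th dimension of Gaussian $g_i$.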
Experiments (on MATBN)
• Error rates (%) of MPE and fMPE with different features, at different acoustic levels.
Experiments (cont.)
• CER (%) for offset fMPE and dimension-weighted offset fMPE with different features
Connect to SPLICE
• Decomposition Scheme 1
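$$\underbrace{y_t}_{p\times 1} = \underbrace{o_t}_{p\times 1} + \underbrace{M}_{p\times q}\,\underbrace{h_t}_{q\times 1}$$
$M = [M^{(1)} \cdots M^{(n)}]$ is split into $n$ blocks of size $p \times p$, and $h_t = [h_t^{(1)T} \cdots h_t^{(n)T}]^T$ into $n$ blocks of size $p \times 1$, so that $q = np$ and
$$y_t = o_t + \sum_{i=1}^{n} M^{(i)} h_t^{(i)}$$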
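Decomposition Scheme 1 is plain block-matrix multiplication. Splitting $M$ into $n$ blocks of $p$ columns and $h_t$ into $n$ blocks of $p$ entries gives $M h_t = \sum_i M^{(i)} h_t^{(i)}$. Below is a small check with made-up integer values; the helper names are hypothetical.

```python
def matvec(A, v):
    """Plain matrix-vector product A v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def split_blocks(M, h, p):
    """Split M (p x np) into n blocks M^(i) (p x p) and h (np entries) into n blocks h^(i)."""
    n = len(h) // p
    Ms = [[row[i * p:(i + 1) * p] for row in M] for i in range(n)]
    hs = [h[i * p:(i + 1) * p] for i in range(n)]
    return Ms, hs

def block_offset(M, h, p):
    """Compute sum_i M^(i) h^(i), which equals M h."""
    Ms, hs = split_blocks(M, h, p)
    total = [0.0] * p
    for Mi, hi in zip(Ms, hs):
        total = [a + b for a, b in zip(total, matvec(Mi, hi))]
    return total

M = [[1, 2, 3, 4], [5, 6, 7, 8]]  # p = 2, n = 2, q = 4
h = [1, 0, 2, 1]
print(block_offset(M, h, 2), matvec(M, h))
```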
Connect to SPLICE (cont.)
• Compensation of the original feature is carried out by adding a large number of bias vectors, each of which is computed as a full-rank rotation of a small set of posterior probabilities
• Maximum-Likelihood estimation
$$y_t = o_t + \sum_{i=1}^{n} M^{(i)} h_t^{(i)} \approx o_t + M^{(i^*)} h_t^{(i^*)}$$
where $i^*$ denotes the term that dominates the remaining $(n-1)$ terms.
Connect to SPLICE (cont.)
• Decomposition Scheme 2
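$$\underbrace{y_t}_{p\times 1} = \underbrace{o_t}_{p\times 1} + \underbrace{M}_{p\times q}\,\underbrace{h_t}_{q\times 1}$$
$M = [m_1\ m_2\ \cdots\ m_q]$ is written column by column (each $m_k$ is $p \times 1$), and $h_t = [h_{t1}\ h_{t2}\ \cdots\ h_{tq}]^T$, so that
$$y_t = o_t + M h_t = o_t + \sum_{k=1}^{q} h_{tk}\, m_k = o_t + \sum_{k=1}^{q} p(k \mid x_t)\, m_k$$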
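Decomposition Scheme 2 reads $M$ column by column, so $M h_t$ becomes a posterior-weighted sum of correction vectors, which is the SPLICE-style form. The sketch computes the compensation both ways. The posteriors $p(k \mid x_t)$ are passed in as hypothetical toy values rather than computed from a model, and the function names are illustrative.

```python
def splice_style_compensation(o_t, posts, corrections):
    """y_t = o_t + sum_k p(k | x_t) m_k, one correction vector m_k per component."""
    y = list(o_t)
    for p_k, m_k in zip(posts, corrections):
        y = [yi + p_k * mi for yi, mi in zip(y, m_k)]
    return y

def as_matrix_form(o_t, posts, corrections):
    """The same compensation written as o_t + M h_t, with columns m_k and h_t = posteriors."""
    M = [[m_k[i] for m_k in corrections] for i in range(len(o_t))]
    return [o + sum(M[i][k] * posts[k] for k in range(len(posts)))
            for i, o in enumerate(o_t)]

o_t = [1.0, 2.0]
posts = [0.25, 0.75]                     # hypothetical p(k | x_t), sums to 1
corrections = [[4.0, 0.0], [0.0, -4.0]]  # hypothetical m_1, m_2
print(splice_style_compensation(o_t, posts, corrections))
print(as_matrix_form(o_t, posts, corrections))
```

The correction vectors $m_k$ are shared by all frames. Only the weights $p(k \mid x_t)$ change from frame to frame.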
Connect to SPLICE (cont.)
• The compensation vector is a linearly weighted sum of a set of frame-independent correction vectors. Each weight is the posterior probability associated with the corresponding correction vector.
• The key difference:
– the bias vector for compensation in fMPE is specific to each time frame t;
– the bias vector in feature-space stochastic matching is common to all frames in the utterance.