Cepstral Vector Normalization based On Stereo Data for Robust Speech Recognition
Presenter: Shih-Hsiang Lin
Luis Buera, Eduardo Lleida, Antonio Miguel, Alfonso Ortega, and Óscar Saz
Communication Technologies Group (GTC), Aragon Institute of Engineering Research (I3A), University of Zaragoza, Spain
IEEE Transactions on Audio, Speech and Language Processing, Feb. 2007
Reference
• “Cepstral Vector Normalization based On Stereo Data for Robust Speech Recognition,” IEEE Trans on Audio, Speech and Language Processing, 2007
• “Multi-environment models based linear normalization for robust speech recognition in car conditions,” in Proc. ICASSP 2004
• “Multi-environment models based linear normalization for robust speech recognition,” in Proc. SPECOM 2004
• “Robust speech recognition in cars using phoneme dependent multi-environment linear normalization,” in Proc. EuroSpeech 2005
• “Recent advances in PD-MEMLIN for speech recognition in car conditions,” in Proc. ASUR 2005
Outline
• Introduction
• Approaches
– Multi-Environment Model-based Linear Normalization (MEMLIN)
– Polynomial MEMLIN (P-MEMLIN)
– Multi-Environment Model-based Histogram Normalization (MEMHIN)
– Phoneme-Dependent MEMLIN (PD-MEMLIN)
– Blind PD-MEMLIN
• Experimental Results
• Conclusions
Introduction
• Robustness techniques have been developed along two main lines of research
– Acoustic model adaptation methods
• Require more data and computing time
– MAP, MLLR, PMC
– Feature vector adaptation/normalization methods
• Map recognition-space feature vectors to the training space
– High-pass filtering
» The results produced by these methods are limited individually
» CMN, RASTA processing
– Model-based techniques
» VTS, CDCN
– Empirical compensation
» Entirely data-driven
» Need a training phase where some transformations are estimated by computing the frame-by-frame differences between the stereo data
» SPLICE, POF
Introduction (cont.)
• This paper focuses on empirical feature vector normalization based on stereo data and the MMSE estimator
– Based on the joint modeling of the clean and noisy spaces
• which splits the noisy space into several basic environments and models each basic noisy feature space and the clean feature space using GMMs
– A transformation between clean and noisy feature vectors is learned for each pair of clean and noisy model Gaussians
[Figure: noisy basic environment spaces and the clean space]
Noise Effects
Convolutional noise: shifts the mean of the coefficients
Additive noise: modifies the PDF, reducing the variances of the coefficients
Real car environment: modifies the mean and the variance jointly
MMSE-based Feature Vector Normalization Methods
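• Given the noisy feature vector $y_t$, the estimated clean feature vector $\hat{x}_t$ is obtained by using the MMSE criterion as
$$\hat{x}_t = E[x \mid y_t] = \int_X x \, p(x \mid y_t) \, dx$$
– Method 1: CMN
• No assumptions are made in estimating $p(x \mid y_t)$
• The clean feature vector is approximated as
$$x \approx \Psi(y_t, r) = y_t - r$$
$$\hat{x}_t = (y_t - r) \int_X p(x \mid y_t) \, dx = y_t - r$$
• To estimate the bias vector transformation $r$, the mean square error $\varepsilon$ is defined and minimized w.r.t. $r$
$$\varepsilon = E[(x_t - \hat{x}_t)^2], \qquad r = \arg\min_r \varepsilon = \arg\min_r E[(x_t - \hat{x}_t)^2] = E[y_t] - E[x_t]$$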
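As a rough illustration of the single-bias (CMN-style) case, the sketch below estimates r = E[y] - E[x] from a few stereo frames and applies the normalization "estimate = y - r". The function names and the toy frames are assumptions made for this example; they do not come from the paper.

```python
def estimate_bias(clean, noisy):
    """Per-dimension bias r = E[y] - E[x] from stereo (clean, noisy) frames."""
    n = len(clean)
    dims = len(clean[0])
    mean_x = [sum(frame[d] for frame in clean) / n for d in range(dims)]
    mean_y = [sum(frame[d] for frame in noisy) / n for d in range(dims)]
    return [my - mx for mx, my in zip(mean_x, mean_y)]


def normalize(frame, bias):
    """Apply x_hat = y - r to one noisy frame."""
    return [y - r for y, r in zip(frame, bias)]


# Toy stereo data (hypothetical): the noisy channel is the clean one plus a fixed shift.
clean = [[1.0, 2.0], [3.0, 4.0]]
noisy = [[2.0, 2.5], [4.0, 4.5]]
bias = estimate_bias(clean, noisy)
restored = [normalize(frame, bias) for frame in noisy]
```

With a pure mean shift the restored frames coincide with the clean ones; any change in variance is left untouched, which is the limitation the later methods address.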
MMSE-based Feature Vector Normalization Methods (cont.)
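• In some cases, the mean of the clean feature vectors is removed before training the acoustic model, so the bias vector transformation is computed as
$$r = E[y_t]$$
– Method 2: Multivariate Gaussian-based Cepstral Normalization (RATZ)
• Model the clean space using a GMM
$$p(x) = \sum_{s_x} p(x \mid s_x) \, p(s_x), \qquad p(x \mid s_x) = N(x; \mu_{s_x}, \Sigma_{s_x})$$
• Approximate the clean feature vector as
$$x \approx \Psi(y_t, r_{s_x}) = y_t - r_{s_x}$$
$$\hat{x}_t = \int_X \sum_{s_x} (y_t - r_{s_x}) \, p(x, s_x \mid y_t) \, dx = y_t - \sum_{s_x} r_{s_x} \, p(s_x \mid y_t)$$
• The estimation of $p(s_x \mid y_t)$ can produce a mismatch, since the clean-space Gaussians are evaluated on the noisy vector $y_t$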
MMSE-based Feature Vector Normalization Methods (cont.)
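– Method 3: SPLICE
• Model the noisy space, instead of the clean one, using a GMM
$$p(y_t) = \sum_{s_y} p(y_t \mid s_y) \, p(s_y), \qquad p(y_t \mid s_y) = N(y_t; \mu_{s_y}, \Sigma_{s_y})$$
• Approximate the clean feature vector as
$$x \approx \Psi(y_t, r_{s_y}) = y_t - r_{s_y}$$
$$\hat{x}_t = \int_X \sum_{s_y} (y_t - r_{s_y}) \, p(x, s_y \mid y_t) \, dx = y_t - \sum_{s_y} r_{s_y} \, p(s_y \mid y_t)$$
– Furthermore, extensions for several acoustic conditions have been developed
• Interpolated RATZ (IRATZ)
• SPLICE with environmental model selection (SPLICE MS)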
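A minimal one-dimensional sketch of a SPLICE-style estimate, assuming a noisy-space GMM with one bias per Gaussian: the clean estimate is the noisy value minus the posterior-weighted sum of the biases. All names, the GMM parameters, and the bias values are hypothetical.

```python
import math


def gauss(y, mean, var):
    """Univariate Gaussian density N(y; mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(y, weights, means, variances):
    """p(s_y | y) for every Gaussian of the noisy-space GMM (Bayes rule)."""
    joint = [w * gauss(y, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def splice_normalize(y, weights, means, variances, biases):
    """x_hat = y - sum over s_y of r[s_y] * p(s_y | y)."""
    post = posteriors(y, weights, means, variances)
    return y - sum(r * p for r, p in zip(biases, post))


# Hypothetical two-Gaussian noisy model with one bias per Gaussian.
weights, means, variances = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
biases = [1.0, 2.0]
midpoint_estimate = splice_normalize(5.0, weights, means, variances, biases)
```

At the midpoint between the two Gaussians both posteriors are 0.5, so the applied correction is the average of the two biases.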
Multi-Environment Model-based Linear Normalization (MEMLIN)
• MEMLIN Approximation
– The noisy space is divided into several basic environments $e$, and the noisy feature vectors are modeled as a GMM for each basic environment
$$p_e(y_t) = \sum_{s_y^e} p(y_t \mid s_y^e) \, p(s_y^e), \qquad p(y_t \mid s_y^e) = N(y_t; \mu_{s_y^e}, \Sigma_{s_y^e})$$
– Clean feature vectors are modeled using a GMM
$$p(x) = \sum_{s_x} p(x \mid s_x) \, p(s_x), \qquad p(x \mid s_x) = N(x; \mu_{s_x}, \Sigma_{s_x})$$
– Clean feature vectors can be approximated as a linear function of the noisy feature vector, which depends on the basic environment and on the clean and noisy model Gaussians
$$x \approx \Psi(y_t, s_x, s_y^e) = y_t - r_{s_x, s_y^e}$$
where $r_{s_x, s_y^e}$ is the bias vector transformation
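MEMLIN (Cont.)
• MEMLIN Enhancement
– Given the noisy feature vector $y_t$, the estimated clean feature vector $\hat{x}_t$ is obtained by using the MMSE criterion as
$$\hat{x}_t = \int_X \sum_e \sum_{s_y^e} \sum_{s_x} (y_t - r_{s_x, s_y^e}) \, p(x, s_x, e, s_y^e \mid y_t) \, dx = y_t - \sum_e \sum_{s_y^e} \sum_{s_x} r_{s_x, s_y^e} \, p(e \mid y_t) \, p(s_y^e \mid y_t, e) \, p(s_x \mid y_t, e, s_y^e)$$
– Calculate $p(e \mid y_t)$
$$p(e \mid y_t) = \beta \, p(e \mid y_{t-1}) + (1 - \beta) \frac{p_e(y_t)}{\sum_{e'} p_{e'}(y_t)}$$
$p(e \mid y_0)$ is considered to be uniformly distributed over all the environments; $\beta$ has to be close to 1
– Calculate $p(s_y^e \mid y_t, e)$
$$p(s_y^e \mid y_t, e) = \frac{p(y_t \mid s_y^e) \, p(s_y^e)}{\sum_{s_y^{e\prime}} p(y_t \mid s_y^{e\prime}) \, p(s_y^{e\prime})}$$
– $r_{s_x, s_y^e}$ and $p(s_x \mid y_t, e, s_y^e)$ are estimated in a training phase using stereo data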
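The recursive environment weighting can be sketched as follows, assuming the update is a convex combination of the previous environment posterior and the normalized per-environment likelihood of the current frame. The smoothing factor name `beta`, its default value 0.98, and the likelihood values are assumptions for illustration only.

```python
def update_env_posterior(prev_posterior, likelihoods, beta=0.98):
    """One step: beta * p(e | y_{t-1}) + (1 - beta) * p_e(y_t) / sum over e' of p_e'(y_t)."""
    total = sum(likelihoods)
    return [beta * p + (1.0 - beta) * lik / total
            for p, lik in zip(prev_posterior, likelihoods)]


# Start from a uniform distribution over two hypothetical environments.
posterior = [0.5, 0.5]
for frame_likelihoods in [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]:
    posterior = update_env_posterior(posterior, frame_likelihoods)
```

Because both terms are probability distributions over the environments, the result stays normalized; a value of beta close to 1 makes the environment estimate change slowly from frame to frame.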
MEMLIN (Cont.)
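• MEMLIN Training
– Given a stereo data corpus for each environment
$$(X_e, Y_e) = \{(x_1^e, y_1^e), \ldots, (x_{T_e}^e, y_{T_e}^e)\}$$
– The bias vector transformation $r_{s_x, s_y^e}$ is estimated by minimizing the weighted mean squared error $\varepsilon_{s_x, s_y^e}$
$$\varepsilon_{s_x, s_y^e} = \sum_{t_e} p(s_x \mid x_{t_e}^e, e) \, p(s_y^e \mid y_{t_e}^e, e) \, (x_{t_e}^e - y_{t_e}^e + r_{s_x, s_y^e})^2$$
$$r_{s_x, s_y^e} = \arg\min_{r_{s_x, s_y^e}} \varepsilon_{s_x, s_y^e}, \qquad \frac{\partial \varepsilon_{s_x, s_y^e}}{\partial r_{s_x, s_y^e}} = 0 \;\Rightarrow\; r_{s_x, s_y^e} = \frac{\sum_{t_e} p(s_x \mid x_{t_e}^e, e) \, p(s_y^e \mid y_{t_e}^e, e) \, (y_{t_e}^e - x_{t_e}^e)}{\sum_{t_e} p(s_x \mid x_{t_e}^e, e) \, p(s_y^e \mid y_{t_e}^e, e)}$$
where
$$p(s_x \mid x_{t_e}^e, e) = \frac{p(x_{t_e}^e \mid s_x) \, p(s_x)}{\sum_{s_x'} p(x_{t_e}^e \mid s_x') \, p(s_x')}, \qquad p(s_y^e \mid y_{t_e}^e, e) = \frac{p(y_{t_e}^e \mid s_y^e) \, p(s_y^e)}{\sum_{s_y^{e\prime}} p(y_{t_e}^e \mid s_y^{e\prime}) \, p(s_y^{e\prime})}$$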
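A minimal sketch of the bias estimate for one pair of Gaussians, assuming scalar features and precomputed Gaussian posteriors for each stereo frame: the bias is the posterior-weighted average of the noisy-minus-clean differences. The function name and the toy values are hypothetical.

```python
def train_bias(clean, noisy, post_x, post_y):
    """r = sum_t w_t * (y_t - x_t) / sum_t w_t, with w_t = p(s_x | x_t) * p(s_y | y_t)."""
    weights = [px * py for px, py in zip(post_x, post_y)]
    total = sum(weights)
    return sum(w * (y - x) for w, x, y in zip(weights, clean, noisy)) / total


# Hypothetical stereo frames; the third frame does not belong to the clean Gaussian.
clean = [0.0, 1.0, 2.0]
noisy = [1.0, 2.0, 4.0]
bias_pair = train_bias(clean, noisy, post_x=[1.0, 1.0, 0.0], post_y=[1.0, 1.0, 1.0])
bias_all = train_bias(clean, noisy, post_x=[1.0, 1.0, 1.0], post_y=[1.0, 1.0, 1.0])
```

Frames with a zero posterior for either Gaussian do not contribute, so each Gaussian pair learns its own shift.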
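MEMLIN (Cont.)
– Calculate $p(s_x \mid y_t, e, s_y^e)$
• The cross probability is simplified by avoiding the time dependence given by the noisy feature vector
$$p(s_x \mid y_t, e, s_y^e) \approx p(s_x \mid s_y^e, e)$$
• The term $p(s_x \mid s_y^e, e)$ can be estimated using either a hard decision or a soft decision
– Hard decision (using relative frequency)
$$p(s_x \mid s_y^e, e) = \frac{C_N(s_x \mid s_y^e)}{N_{s_y^e}}$$
where $C_N(s_x \mid s_y^e)$ counts the frames whose most probable pair of Gaussians is $(s_x, s_y^e)$, and $N_{s_y^e}$ counts the frames whose most probable noisy Gaussian is $s_y^e$
– Soft decision
$$p(s_x \mid s_y^e, e) = \frac{\sum_{t_e} p(x_{t_e}^e \mid s_x) \, p(y_{t_e}^e \mid s_y^e) \, p(s_x) \, p(s_y^e)}{\sum_{t_e} \sum_{s_x'} p(x_{t_e}^e \mid s_x') \, p(y_{t_e}^e \mid s_y^e) \, p(s_x') \, p(s_y^e)}$$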
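The hard (relative-frequency) estimate of the cross probability can be sketched as below, assuming each stereo frame has already been assigned its most probable clean Gaussian and its most probable noisy Gaussian. The function name and the index sequences are hypothetical.

```python
from collections import Counter


def hard_cross_probability(best_clean, best_noisy):
    """p(s_x | s_y) = count of frames with pair (s_x, s_y) / count of frames with s_y."""
    pair_counts = Counter(zip(best_clean, best_noisy))
    noisy_counts = Counter(best_noisy)
    return {(sx, sy): count / noisy_counts[sy]
            for (sx, sy), count in pair_counts.items()}


# Most probable Gaussian index per frame (hypothetical assignments).
best_clean = [0, 0, 1, 1]
best_noisy = [0, 0, 0, 1]
cross = hard_cross_probability(best_clean, best_noisy)
```

Pairs that never co-occur are simply absent from the dictionary, which corresponds to a cross probability of zero.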
Experimental Results Using Basic MMSE-based Methods
• A set of experiments was performed using the Spanish SpeechDat Car database
– Seven basic environments were defined
• E1: car stopped, motor running
• E2: town traffic, closed windows, and climatizer off (silent conditions)
• E3: town traffic and noisy conditions (windows open and/or climatizer on)
• E4: low speed, rough road, and silent conditions
• E5: low speed, rough road, and noisy conditions
• E6: high speed, good road, and silent conditions
• E7: high speed, good road, and noisy conditions
– Two channels were used (stereo data)
• Close-talk channel (CLK)
• Hands-free channel (HF)
• The recognition task is isolated and continuous digit recognition
Experimental Results (cont.)
$$\mathrm{MIMP} = 100 \cdot \frac{\mathrm{MWER} - \mathrm{MWER}_{CLK\text{-}HF}}{\mathrm{MWER}_{CLK\text{-}CLK} - \mathrm{MWER}_{CLK\text{-}HF}}$$
• The SPLICE MS method always produces better results than IRATZ, because of the assumption made about the a posteriori probability
• MEMLIN performed better than IRATZ and SPLICE MS
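A small sketch of the MIMP figure of merit, assuming it is defined as 100 · (MWER − MWER_CLK-HF) / (MWER_CLK-CLK − MWER_CLK-HF), so that 0 corresponds to the unnormalized hands-free result and 100 to the close-talk result. The WER values below are made up for the example and are not results from the paper.

```python
def mimp(mwer, mwer_clk_clk, mwer_clk_hf):
    """Mean improvement (in %) of a method relative to the CLK-HF and CLK-CLK reference WERs."""
    return 100.0 * (mwer - mwer_clk_hf) / (mwer_clk_clk - mwer_clk_hf)


# Hypothetical WERs: 2% for CLK-CLK, 20% for CLK-HF, and 11% after normalization.
halfway = mimp(11.0, mwer_clk_clk=2.0, mwer_clk_hf=20.0)
```

A method that recovers half of the gap between the two reference conditions scores 50.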
Improvement Over MEMLIN
• There are two important approximations in the MEMLIN expressions that can affect the final performance of the method
– The selection of the linear model for $x$ associated with a pair of Gaussians
$$x \approx \Psi(y_t, s_x, s_y^e) = y_t - r_{s_x, s_y^e}$$
• compensates for the mean shift, but not for the modification of the variance
– Treating all of the sounds in the same way
• There is always a bias vector transformation which maps from a noisy model Gaussian to every clean model one
– e.g. non-silence noisy feature vectors are mapped towards the clean silence
Polynomial MEMLIN (P-MEMLIN)
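• The transformation function for P-MEMLIN is
$$x(i) \approx \Psi(y_t(i), s_x, s_y^e) = a_{s_x, s_y^e}(i) \, y_t(i) + b_{s_x, s_y^e}(i), \qquad i: \text{dimension index}$$
– Given the noisy feature vector $y_t$, the estimated clean feature vector $\hat{x}_t$ is obtained by using the MMSE criterion as
$$\hat{x}_t(i) = \sum_e \sum_{s_y^e} \sum_{s_x} \left( a_{s_x, s_y^e}(i) \, y_t(i) + b_{s_x, s_y^e}(i) \right) p(e \mid y_t) \, p(s_y^e \mid y_t, e) \, p(s_x \mid y_t, e, s_y^e)$$
– $a_{s_x, s_y^e}(i)$ and $b_{s_x, s_y^e}(i)$ are computed in the training phase using stereo data
$$a_{s_x, s_y^e}(i) = \frac{\sigma^x_{s_x, s_y^e}(i)}{\sigma^y_{s_x, s_y^e}(i)}, \qquad b_{s_x, s_y^e}(i) = \mu^x_{s_x, s_y^e}(i) - \frac{\sigma^x_{s_x, s_y^e}(i)}{\sigma^y_{s_x, s_y^e}(i)} \, \mu^y_{s_x, s_y^e}(i)$$
where, for $z \in \{x, y\}$,
$$\mu^z_{s_x, s_y^e}(i) = \frac{\sum_{t_e} p(s_x \mid x_{t_e}^e) \, p(s_y^e \mid y_{t_e}^e) \, z_{t_e}^e(i)}{\sum_{t_e} p(s_x \mid x_{t_e}^e) \, p(s_y^e \mid y_{t_e}^e)}$$
$$\left( \sigma^z_{s_x, s_y^e}(i) \right)^2 = \frac{\sum_{t_e} p(s_x \mid x_{t_e}^e) \, p(s_y^e \mid y_{t_e}^e) \, \left( z_{t_e}^e(i) - \mu^z_{s_x, s_y^e}(i) \right)^2}{\sum_{t_e} p(s_x \mid x_{t_e}^e) \, p(s_y^e \mid y_{t_e}^e)}$$
• If the standard deviation terms are equal, the algorithm expressions are the same as those in MEMLIN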
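A hedged one-dimensional sketch of the first-order polynomial fit for one pair of Gaussians, assuming the slope is the ratio of the weighted clean and noisy standard deviations and the offset aligns the weighted means. Function names, weights, and data are hypothetical.

```python
import math


def weighted_stats(values, weights):
    """Weighted mean and standard deviation of a list of scalars."""
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, math.sqrt(var)


def train_polynomial(clean, noisy, weights):
    """Return (a, b) such that x is approximated by a * y + b."""
    mu_x, sd_x = weighted_stats(clean, weights)
    mu_y, sd_y = weighted_stats(noisy, weights)
    a = sd_x / sd_y
    b = mu_x - a * mu_y
    return a, b


clean = [0.0, 2.0, 4.0]
weights = [1.0, 1.0, 1.0]
# Noise that doubles the spread and shifts the mean: y = 2x + 1.
a_scaled, b_scaled = train_polynomial(clean, [1.0, 5.0, 9.0], weights)
# Noise that only shifts the mean: y = x + 3.
a_shift, b_shift = train_polynomial(clean, [3.0, 5.0, 7.0], weights)
```

In the shift-only case the slope is 1 and the offset is minus the mean difference, so the transformation reduces to subtracting a bias, as in MEMLIN.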
Multi-Environment Model-based Histogram Normalization (MEMHIN)
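• Sometimes noise can produce a more complex modification of the clean and noisy feature pdfs associated with a pair of Gaussians
– In that case, the linear approximation for $\Psi(y_t, s_x, s_y^e)$ of MEMLIN or P-MEMLIN is not the best option
– Therefore, a nonlinear model based on histogram equalization is used
• The transformation function $\Psi(y_t, s_x, s_y^e)$ for MEMHIN is expressed as
$$x \approx \Psi(y_t, s_x, s_y^e) = C^{-1}_{x, s_x, s_y^e}\left( C_{y, s_x, s_y^e}(y_t) \right)$$
$$\hat{x}_t = \sum_e \sum_{s_y^e} \sum_{s_x} \Psi(y_t, s_x, s_y^e) \, p(e \mid y_t) \, p(s_y^e \mid y_t, e) \, p(s_x \mid y_t, e, s_y^e)$$
where $C_{x, s_x, s_y^e}$ and $C_{y, s_x, s_y^e}$ are the cumulative histograms of the clean and noisy features
– $n$-band histograms associated with $s_x$ and $s_y^e$ for each component of the noisy and clean feature vectors are obtained in the training phase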
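A simplified sketch of histogram equalization for a single feature component, assuming the mapping composes the noisy cumulative distribution with the inverse of the clean one. For brevity it uses empirical CDFs over sorted samples rather than n-band histograms, and it ignores the per-Gaussian weighting; names and samples are hypothetical.

```python
import bisect


def equalize(y, noisy_samples, clean_samples):
    """Map y through the noisy empirical CDF, then through the inverse clean empirical CDF."""
    noisy_sorted = sorted(noisy_samples)
    clean_sorted = sorted(clean_samples)
    # rank / len(noisy_sorted) is the empirical CDF of the noisy data at y.
    rank = bisect.bisect_right(noisy_sorted, y)
    # Integer ceiling of CDF * len(clean_sorted), converted to a 0-based index.
    index = -(-rank * len(clean_sorted) // len(noisy_sorted)) - 1
    index = min(len(clean_sorted) - 1, max(0, index))
    return clean_sorted[index]


noisy_samples = [10.0, 20.0, 30.0, 40.0]
clean_samples = [1.0, 2.0, 3.0, 4.0]
mapped = equalize(30.0, noisy_samples, clean_samples)
```

Values below or above the observed noisy range are clamped to the smallest or largest clean sample.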
Results from modifications
• P-MEMLIN and MEMHIN provide a significant improvement over MEMLIN when few Gaussians are considered
– 33.87% MIMP for MEMLIN
– 39.12% MIMP for P-MEMLIN
– 37.82% MIMP for MEMHIN
– However, if the algorithms are evaluated using more than eight Gaussians per environment, the mean results are very similar among the three models
• Then, additive car noise was added to the clean signals of the Spanish SpeechDat Car database (5 dB noise)
[Figure: $\Psi(y_t, s_x, s_y^e)$, 4 Gaussians per environment]
PD-MEMLIN (Cont.)
• PD-MEMLIN Approximation
– The noisy space is split into several basic environments $e$. The noisy feature vectors associated with the different phonemes $ph$ of each basic environment are modeled as a GMM
$$p_{e,ph}(y_t) = \sum_{s_y^{e,ph}} p(y_t \mid s_y^{e,ph}) \, p(s_y^{e,ph}), \qquad p(y_t \mid s_y^{e,ph}) = N(y_t; \mu_{s_y^{e,ph}}, \Sigma_{s_y^{e,ph}})$$
– The clean feature vectors of each phoneme are modeled as a GMM
$$p_{ph}(x) = \sum_{s_x^{ph}} p(x \mid s_x^{ph}) \, p(s_x^{ph}), \qquad p(x \mid s_x^{ph}) = N(x; \mu_{s_x^{ph}}, \Sigma_{s_x^{ph}})$$
– The clean feature vector can be approximated by a linear function that depends on the environment and on the phoneme-dependent Gaussians of the clean and noisy models
$$x \approx \Psi(y_t, s_x^{ph}, s_y^{e,ph}) = y_t - r_{s_x^{ph}, s_y^{e,ph}}$$
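PD-MEMLIN (Cont.)
• PD-MEMLIN Enhancement
– Given the noisy feature vector $y_t$, the estimated clean feature vector $\hat{x}_t$ is obtained by using the MMSE criterion as
$$\hat{x}_t = y_t - \sum_e \sum_{ph} \sum_{s_y^{e,ph}} \sum_{s_x^{ph}} r_{s_x^{ph}, s_y^{e,ph}} \, p(e \mid y_t) \, p(ph \mid y_t, e) \, p(s_y^{e,ph} \mid y_t, e, ph) \, p(s_x^{ph} \mid y_t, e, ph, s_y^{e,ph})$$
– Calculate $p(ph \mid y_t, e)$
$$p(ph \mid y_t, e) = \frac{p_{e,ph}(y_t)}{\sum_{ph'} p_{e,ph'}(y_t)}$$
• PD-MEMLIN Training
$$\varepsilon_{s_x^{ph}, s_y^{e,ph}} = \sum_{t_{e,ph}} p(s_x^{ph} \mid x_{t_{e,ph}}^{e,ph}, e, ph) \, p(s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph) \, \left( x_{t_{e,ph}}^{e,ph} - y_{t_{e,ph}}^{e,ph} + r_{s_x^{ph}, s_y^{e,ph}} \right)^2$$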
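The phoneme weighting used in the enhancement step can be sketched as a simple normalization, assuming p(ph | y_t, e) is the likelihood of the frame under the phoneme-dependent noisy GMM divided by the sum of the likelihoods over all phonemes of that environment. The phoneme labels and likelihood values are hypothetical.

```python
def phoneme_posteriors(likelihoods):
    """p(ph | y_t, e) = p_{e,ph}(y_t) / sum over ph' of p_{e,ph'}(y_t)."""
    total = sum(likelihoods.values())
    return {ph: lik / total for ph, lik in likelihoods.items()}


# Likelihood of one frame under each phoneme-dependent noisy GMM (made-up values).
frame_likelihoods = {"a": 3.0, "sil": 1.0}
ph_post = phoneme_posteriors(frame_likelihoods)
```

The resulting weights decide how strongly the bias vectors of each phoneme contribute to the normalized frame.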
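PD-MEMLIN (Cont.)
– Cross probability
$$p(s_x^{ph} \mid y_t, e, ph, s_y^{e,ph}) \approx p(s_x^{ph} \mid s_y^{e,ph}, e, ph)$$
Hard solution
$$p(s_x^{ph} \mid s_y^{e,ph}, e, ph) = \frac{C_N(s_x^{ph} \mid s_y^{e,ph})}{N_{s_y^{e,ph}}}$$
Soft solution
$$p(s_x^{ph} \mid s_y^{e,ph}, e, ph) = \frac{\sum_{t_{e,ph}} p(x_{t_{e,ph}}^{e,ph} \mid s_x^{ph}) \, p(y_{t_{e,ph}}^{e,ph} \mid s_y^{e,ph}) \, p(s_x^{ph}) \, p(s_y^{e,ph})}{\sum_{t_{e,ph}} \sum_{s_x^{ph\prime}} p(x_{t_{e,ph}}^{e,ph} \mid s_x^{ph\prime}) \, p(y_{t_{e,ph}}^{e,ph} \mid s_y^{e,ph}) \, p(s_x^{ph\prime}) \, p(s_y^{e,ph})}$$
– Bias vector transformation
$$r_{s_x^{ph}, s_y^{e,ph}} = \arg\min_{r_{s_x^{ph}, s_y^{e,ph}}} \varepsilon_{s_x^{ph}, s_y^{e,ph}} = \frac{\sum_{t_{e,ph}} p(s_x^{ph} \mid x_{t_{e,ph}}^{e,ph}, e, ph) \, p(s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph) \, \left( y_{t_{e,ph}}^{e,ph} - x_{t_{e,ph}}^{e,ph} \right)}{\sum_{t_{e,ph}} p(s_x^{ph} \mid x_{t_{e,ph}}^{e,ph}, e, ph) \, p(s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph)}$$
For a more detailed description, please refer to the paper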
Results from PD-MEMLIN
• 25 Spanish phonemes and the silence
– To make a fair comparison between the two methods, the results have been plotted as a function of the number of Transformations per basic Environment (TpE)
$$\mathrm{TpE} = \log_{10}\left( n_{s_y^{ph}} \cdot n_{s_x^{ph}} \cdot n_{ph} \right)$$
• $n_{s_y^{ph}}$: the number of noisy Gaussians for phoneme $ph$
• $n_{s_x^{ph}}$: the number of clean Gaussians for phoneme $ph$ (2, 4, 8, 16, 32)
• $n_{ph}$: the number of phonemes (1 for MEMLIN)
• The results show that PD-MEMLIN makes significant improvements relative to MEMLIN, especially when more than four Gaussians per phoneme are used
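As a quick worked example of the comparison axis, the sketch below assumes TpE is the base-10 logarithm of the product of the number of noisy Gaussians, the number of clean Gaussians, and the number of phoneme classes (1 for MEMLIN). The function name and the chosen configurations are illustrative only.

```python
import math


def tpe(n_noisy, n_clean, n_phonemes=1):
    """log10 of the number of bias transformations per basic environment."""
    return math.log10(n_noisy * n_clean * n_phonemes)


# MEMLIN with 10 noisy and 10 clean Gaussians per environment: 100 transformations.
tpe_memlin = tpe(10, 10)
# PD-MEMLIN with 4 Gaussians per phoneme and 26 classes (25 phonemes plus silence).
tpe_pd = tpe(4, 4, 26)
```

Plotting against this quantity compares the two methods at an equal number of bias vectors rather than at an equal number of Gaussians per model.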
Results from PD-MEMLIN (cont.)
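• To estimate the limit of PD-MEMLIN
– Each frame was normalized using only the bias vector transformation of the “correct” phoneme (KPD-MEMLIN)
KPD-MEMLIN
$$\hat{x}_t = y_t - \sum_e \sum_{s_y^{e,ph}} \sum_{s_x^{ph}} r_{s_x^{ph}, s_y^{e,ph}} \, p(e \mid y_t) \, p(s_y^{e,ph} \mid y_t, e, ph) \, p(s_x^{ph} \mid y_t, e, ph, s_y^{e,ph})$$
PD-MEMLIN
$$\hat{x}_t = y_t - \sum_e \sum_{ph} \sum_{s_y^{e,ph}} \sum_{s_x^{ph}} r_{s_x^{ph}, s_y^{e,ph}} \, p(e \mid y_t) \, p(ph \mid y_t, e) \, p(s_y^{e,ph} \mid y_t, e, ph) \, p(s_x^{ph} \mid y_t, e, ph, s_y^{e,ph})$$
MCP: mean correct phoneme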
Blind PD-MEMLIN
• In many cases, stereo data are not available
– An iterative “blind” training procedure is needed
– Assume that the noisy training feature vectors and the phoneme-dependent clean and noisy GMMs are available
– The problem is to estimate the cross probability and the bias vector transformation
– It consists of an initialization and an iterative process
Blind PD-MEMLIN (cont.)
• Initialization
– The cross probability $p_0(s_x^{ph} \mid s_y^{e,ph}, e, ph)$ is estimated using a modified Kullback-Leibler distance
• Gives a similarity measure of $s_x^{ph}$ and $s_y^{e,ph}$ without considering the effects of the noise
– Assume that the noise mainly modifies the mean vectors of the Gaussian model
• So, the similarity is computed in terms of the a priori probabilities and the diagonal covariance matrices of the corresponding Gaussians
$$D_{KL}(X, Y) = \frac{1}{2} \left[ \log\frac{\sigma_Y^2}{\sigma_X^2} + \frac{\sigma_X^2}{\sigma_Y^2} + \frac{(\mu_X - \mu_Y)^2}{\sigma_Y^2} - 1 \right]$$
$$d_{KL}(s_x^{ph}, s_y^{e,ph}) = p(s_x^{ph}) \log\frac{p(s_x^{ph})}{p(s_y^{e,ph})} + \frac{1}{2} \sum_i \left[ \log\frac{\Sigma_{s_y^{e,ph}, ii}}{\Sigma_{s_x^{ph}, ii}} + \frac{\Sigma_{s_x^{ph}, ii}}{\Sigma_{s_y^{e,ph}, ii}} - 1 \right]$$
• Since the KL distance is not symmetric and is not proportional to the likelihood, a pseudo-likelihood $pl_{KL}(s_x^{ph}, s_y^{e,ph})$ is defined
$$pl_{KL}(s_x^{ph}, s_y^{e,ph}) = \frac{1}{d_{KL}(s_x^{ph}, s_y^{e,ph}) + d_{KL}(s_y^{e,ph}, s_x^{ph})}$$
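A hedged sketch of the similarity measure used for the blind initialization. It assumes the modified distance keeps a prior term and the variance terms of the Gaussian KL distance, drops the mean term, and is turned into a symmetric pseudo-likelihood by inverting the sum of the two directed distances. This exact form is an assumption made for the example; the function names and parameter values are hypothetical.

```python
import math


def modified_kl(prior_a, var_a, prior_b, var_b):
    """Directed distance d(a, b): prior term plus per-dimension variance terms (no mean term)."""
    prior_term = prior_a * math.log(prior_a / prior_b)
    var_term = 0.5 * sum(math.log(vb / va) + va / vb - 1.0
                         for va, vb in zip(var_a, var_b))
    return prior_term + var_term


def pseudo_likelihood(prior_x, var_x, prior_y, var_y):
    """Symmetric similarity 1 / (d(x, y) + d(y, x)); undefined for identical Gaussians."""
    forward = modified_kl(prior_x, var_x, prior_y, var_y)
    backward = modified_kl(prior_y, var_y, prior_x, var_x)
    return 1.0 / (forward + backward)


# One clean Gaussian compared with two noisy Gaussians (diagonal variances, made-up values).
near = pseudo_likelihood(0.5, [1.0], 0.5, [2.0])
far = pseudo_likelihood(0.5, [1.0], 0.5, [4.0])
```

The noisy Gaussian whose variance is closer to that of the clean Gaussian receives the larger pseudo-likelihood, regardless of how far apart the means are.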
Blind PD-MEMLIN (cont.)
• Finally, $p_0(s_x^{ph} \mid s_y^{e,ph}, e, ph)$ is estimated as
$$p_0(s_x^{ph} \mid s_y^{e,ph}, e, ph) = \frac{pl_{KL}(s_x^{ph}, s_y^{e,ph})}{\sum_{s_x^{ph\prime}} pl_{KL}(s_x^{ph\prime}, s_y^{e,ph})}$$
– On the other hand, $r_{0, s_x^{ph}, s_y^{e,ph}}$ is defined as
$$r_{0, s_x^{ph}, s_y^{e,ph}} = \frac{\sum_{t_{e,ph}} p(s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph) \, y_{t_{e,ph}}^{e,ph}}{\sum_{t_{e,ph}} p(s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph)} - \mu_{s_x^{ph}}$$
– The mean improvement in WER over the seven basic environments, with four Gaussians per phoneme-dependent GMM, was 20.2%
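The two initialization steps can be sketched for scalar features as follows, assuming the initial cross probabilities are pseudo-likelihoods normalized over the clean Gaussians, and the initial bias is the posterior-weighted mean of the noisy frames minus the mean of the clean Gaussian. Function names and numbers are hypothetical.

```python
def initial_cross_probability(pseudo_likelihoods):
    """Normalize the pseudo-likelihoods of all clean Gaussians for one noisy Gaussian."""
    total = sum(pseudo_likelihoods)
    return [pl / total for pl in pseudo_likelihoods]


def initial_bias(noisy, post_y, clean_mean):
    """r_0 = weighted mean of the noisy frames for s_y minus the mean of the clean Gaussian."""
    weighted_mean = sum(p * y for p, y in zip(post_y, noisy)) / sum(post_y)
    return weighted_mean - clean_mean


p0 = initial_cross_probability([4.0, 1.0, 3.0])
r0 = initial_bias(noisy=[2.0, 4.0], post_y=[1.0, 1.0], clean_mean=1.0)
```

No clean frames are needed here: only the noisy training data and the two sets of Gaussian parameters enter the computation.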
Blind PD-MEMLIN (cont.)
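• Iterative process
– Objective function
$$Q(\phi_{new}^{e,ph}) = \sum_{t_{e,ph}} \sum_{s_x^{ph}} \sum_{s_y^{e,ph}} p(s_x^{ph}, s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph, \phi^{e,ph}) \, \log p(y_{t_{e,ph}}^{e,ph}, s_x^{ph}, s_y^{e,ph} \mid e, ph, \phi_{new}^{e,ph})$$
$$= -\sum_{t_{e,ph}} \sum_{s_x^{ph}} \sum_{s_y^{e,ph}} p(s_x^{ph}, s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph, \phi^{e,ph}) \left[ \frac{1}{2} \log\left| \Sigma_{s_y^{e,ph}} \right| + \frac{1}{2} \left( y_{t_{e,ph}}^{e,ph} - r_{new, s_x^{ph}, s_y^{e,ph}} - \mu_{s_x^{ph}} \right)^T \Sigma_{s_y^{e,ph}}^{-1} \left( y_{t_{e,ph}}^{e,ph} - r_{new, s_x^{ph}, s_y^{e,ph}} - \mu_{s_x^{ph}} \right) \right] + \mathrm{const}$$
– The value of $r_{new, s_x^{ph}, s_y^{e,ph}}$ is obtained by taking the partial derivatives of $Q$ and setting them equal to zero
$$r_{new, s_x^{ph}, s_y^{e,ph}} = \frac{\sum_{t_{e,ph}} p(s_x^{ph}, s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph, \phi^{e,ph}) \, \left( y_{t_{e,ph}}^{e,ph} - \mu_{s_x^{ph}} \right)}{\sum_{t_{e,ph}} p(s_x^{ph}, s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph, \phi^{e,ph})}$$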
Blind PD-MEMLIN (cont.)
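– Update the cross probability and the bias vector transformation
$$p(s_x^{ph} \mid y_{t_{e,ph}}^{e,ph}, s_y^{e,ph}, n-1) = \frac{N\left( y_{t_{e,ph}}^{e,ph}; \; \mu_{s_x^{ph}} + r_{n-1, s_x^{ph}, s_y^{e,ph}}, \; \Sigma_{s_y^{e,ph}} \right) p(s_x^{ph})}{\sum_{s_x^{ph\prime}} N\left( y_{t_{e,ph}}^{e,ph}; \; \mu_{s_x^{ph\prime}} + r_{n-1, s_x^{ph\prime}, s_y^{e,ph}}, \; \Sigma_{s_y^{e,ph}} \right) p(s_x^{ph\prime})}$$
$$r_{n, s_x^{ph}, s_y^{e,ph}} = \frac{\sum_{t_{e,ph}} p(s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph) \, p(s_x^{ph} \mid y_{t_{e,ph}}^{e,ph}, s_y^{e,ph}, n-1) \, \left( y_{t_{e,ph}}^{e,ph} - \mu_{s_x^{ph}} \right)}{\sum_{t_{e,ph}} p(s_y^{e,ph} \mid y_{t_{e,ph}}^{e,ph}, e, ph) \, p(s_x^{ph} \mid y_{t_{e,ph}}^{e,ph}, s_y^{e,ph}, n-1)}$$
– The mean improvement in WER in this case was 41.03% for $n = 1$ and 46.90% for $n = 10$ iterations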
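One iteration of a blind update can be sketched for scalar features and a single noisy Gaussian, assuming each clean Gaussian shifted by its current bias predicts the noisy frames with the noisy Gaussian's variance. The clean-Gaussian posteriors are recomputed from the previous biases, and each bias becomes the weighted mean of the noisy frame minus the clean mean. This is an EM-style illustration under those assumptions, not the paper's exact procedure; all names and values are hypothetical.

```python
import math


def gauss(y, mean, var):
    """Univariate Gaussian density N(y; mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def blind_iteration(noisy, post_y, clean_means, clean_priors, noisy_var, biases):
    """Return updated biases r[s_x] for one noisy Gaussian s_y (no stereo data needed)."""
    num = [0.0] * len(clean_means)
    den = [0.0] * len(clean_means)
    for y, py in zip(noisy, post_y):
        # Score each clean Gaussian shifted by its previous bias against the noisy frame.
        joint = [p * gauss(y, m + r, noisy_var)
                 for p, m, r in zip(clean_priors, clean_means, biases)]
        total = sum(joint)
        for k, j in enumerate(joint):
            weight = py * j / total  # p(s_y | y_t) * p(s_x | y_t, s_y, previous iteration)
            num[k] += weight * (y - clean_means[k])
            den[k] += weight
    # Keep the previous bias if a Gaussian received no weight at all.
    return [n / d if d > 0.0 else r for n, d, r in zip(num, den, biases)]


single = blind_iteration([2.0, 4.0], [1.0, 1.0], [0.0], [1.0], 1.0, [0.0])
double = blind_iteration([1.0, 11.0], [1.0, 1.0], [0.0, 10.0], [0.5, 0.5], 1.0, [1.0, 1.0])
```

Calling the function repeatedly, feeding each result back in as `biases`, corresponds to running several iterations.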