Speech Communication 48 (2006) 1528–1544
www.elsevier.com/locate/specom
Least squares filtering of speech signals for robust ASR ☆
Vivek Tyagi *, Christian Wellekens, Dirk T.M. Slock
Institute Eurecom, Multimedia Communications, 2229, route des cretes, P.O. Box 193, 06094 Sophia-Antipolis, PACA, France
Swiss Federal Institute of Technology, Lausanne, Switzerland
Received 31 July 2005; received in revised form 24 July 2006; accepted 28 July 2006
Abstract
The behavior of the least squares filter (LeSF) is analyzed for a class of non-stationary signals that are either (a) composed of multiple sinusoids (voiced speech) whose frequencies, phases and amplitudes may vary from block to block or (b) the output of an all-pole filter excited by white noise input (unvoiced speech segments), and which are embedded in white noise. In this work, analytic expressions for the weights and the output of the LeSF are derived as a function of the block length and the signal SNR computed over the corresponding block. We have used the LeSF filter estimated on each block to enhance speech signals embedded in white noise as well as other realistic noises, such as factory noise and aircraft cockpit noise [Varga, A., Steeneken, H., Tomlinson, M., Jones, D., 1992. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England]. Automatic speech recognition (ASR) experiments on a connected numbers task, OGI Numbers95, show that the proposed LeSF based features provide a significant improvement in speech recognition accuracies in various non-stationary noise conditions when compared directly to the un-enhanced speech, spectral subtraction and noise robust CJ-RASTA-PLP features.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Least squares; Adaptive filtering; Speech enhancement; Robust speech recognition
1. Introduction
Speech enhancement, amongst other signal de-noising techniques, has been a topic of great
0167-6393/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2006.07.010
☆ This work was supported by the European Commission's 6th Framework Program project DIVINES, under Contract No. FP6-002034.
* Corresponding author. Address: Institute Eurecom, Multimedia Communications, 2229, route des cretes, P.O. Box 193, 06094 Sophia-Antipolis, PACA, France. Tel.: +33 493002670; fax: +33 493002627.
E-mail addresses: [email protected] (V. Tyagi), [email protected] (C. Wellekens), [email protected] (D.T.M. Slock).
interest for the past several decades. The importance of such techniques in speech coding and automatic speech recognition systems can hardly be overstated. Towards this end, adaptive filtering techniques have been shown to be quite effective in various signal de-noising applications. Some representative examples are echo cancellation (Sondhi and Berkley, 1980), data equalization (Gersho, 1969; Satorius and Alexander, 1979; Satorius and Pack, 1981), narrow-band signal enhancement (Widrow et al., 1975; Bershad et al., 1980), beamforming (Griffiths, 1969; Frost, 1972; Compton, 1980), radar clutter rejection (Gibson and Haykin, 1980), system identification
(Rabiner et al., 1978; Marple, 1981) and speech processing (Widrow et al., 1975).
Most of the above mentioned representative examples require an explicit external noise reference to remove additive noise from the desired signal, as discussed in Widrow et al. (1975). In situations where an external noise reference for the additive noise is not available, the interfering noise may be suppressed using a Wiener linear prediction filter (for stationary input signal and stationary noise) if there is a significant difference in the bandwidth of the signal and the additive noise (Widrow et al., 1975; Zeidler et al., 1978; Anderson et al., 1983). One of the earliest uses of least mean square (LMS) filtering for speech enhancement is due to Sambur (1978). In his work, the step size of the LMS filter was chosen to be 1% of the reciprocal of the largest eigenvalue of the correlation matrix of the first voiced frame. However, speech being a non-stationary signal, the estimation of the step size based on the correlation matrix of just a single frame of the speech signal may lead to divergence of the LMS filter output. Nevertheless, the exposition in Sambur (1978) helped to illustrate the efficacy of the LMS algorithm for enhancing naturally occurring signals such as speech. In (Zeidler et al., 1978), Zeidler et al. have analyzed the steady state behavior of the adaptive line enhancer (ALE), an implementation of the least mean square algorithm that has applications in detecting and tracking narrow-band signals in broad-band noise. Specifically, they have shown that for a stationary input consisting of multiple ('N') sinusoids in white noise, the L-weight ALE can be modeled by the L × L Wiener–Hopf matrix equation and that this matrix can be transformed into a set of 2N coupled linear equations. They have derived the analytical expression for the steady-state L-weight ALE filter as a function of input SNR and the interference between the input sinusoids. It has been shown that the coupling terms between the input sinusoid pairs approach zero as the ALE filter length increases.

Fig. 1. The basic operation of the LeSF. The input to the filter is noisy speech, (x(n) = s(n) + u(n)), delayed by bulk delay = P. The filter weights w_k are estimated using the least squares algorithm based on the samples in the current frame. The output of the filter y(n) is the enhanced signal.
In (Anderson et al., 1983), Anderson et al. extended the above mentioned analysis to a stationary input consisting of finite-bandwidth signals in white noise. These signals consist of white Gaussian noise (WGN) passed through a filter whose bandwidth a is quite small relative to the Nyquist frequency, but generally comparable to the bin width 1/L. They have derived analytic expressions for the weights and the output of the LMS adaptive filter as a function of input signal bandwidth and SNR, as well as the LMS filter length and bulk delay 'z^{−P}' (please refer to Fig. 1).
In this paper, we extend the previous work in (Anderson et al., 1983; Zeidler et al., 1978) to enhancing a class of non-stationary signals that are either (a) composed of multiple sinusoids (voiced speech) whose frequencies and amplitudes may vary from block to block or (b) the output of an all-pole filter excited by white noise input (unvoiced speech segments), and which are embedded in white noise. The number of sinusoids may also vary from block to block. The key difference in the approach proposed in this paper is that we relax the assumption of the input signal being stationary. The method of least squares may be viewed as an alternative to Wiener filter theory (Haykin, 1993, p. 483). Wiener filters are derived from ensemble averages and they require good
¹ For example, just by the design of the speech databases, the initial few frames always correspond to silence and hence can be used for noise PSD estimation. However, in a realistic ASR task such assumptions cannot be made.
² Such as raising the Wiener filter or spectral subtraction gain function to a certain power which is empirically tuned, dependent on the SNR conditions.
estimates of the clean signal power spectral density (PSD) as well as the noise PSD. Consequently, one filter (optimum in a probabilistic sense) is obtained for all realizations of the operational environment, assumed to be wide-sense stationary. On the other hand, the method of least squares is deterministic in approach. Specifically, it involves the use of time averages over a block of data, with the result that the filter depends on the number of samples used in the computation. Moreover, the method of least squares does not require a noise PSD estimate. Therefore the input signal is blocked into frames and we analyze an L-weight least squares filter (LeSF), estimated on each frame, which consists of N samples of the input signal.
Working under the assumptions that the clean signal spectral vector and noise spectral vector are Gaussian distributed with the kth spectral value independent of the jth spectral value, Ephraim and Malah derived the optimum minimum mean square error (MMSE) estimator of the clean speech's spectral amplitude (MMSE-STA) (Ephraim and Malah, 1984) and its log spectral amplitude (MMSE-LSA) (Ephraim and Malah, 1985). This assumption is valid only if the clean signal and the noise are both stationary processes and the spectrum is estimated over an infinitely long window. Clearly the speech signal is neither a stationary process nor does it have a Gaussian distributed spectrum. Moreover, in most situations, the noise is not a stationary process. Besides this, MMSE-LSA, MMSE-STA, spectral subtraction (SS) and Wiener filter (WF) based techniques need a good estimate of the noise spectrum. It is often claimed that the estimate of the noise PSD can be obtained from "non-speech" frames which can be detected using a pre-tuned threshold (Ephraim and Malah, 1985, 1984). However, if the noise power changes (varying SNR conditions), there is no single threshold which can detect the non-speech frames. Moreover, if the noise is non-stationary, the noise PSD estimate obtained through "non-speech" frames may not be able to track the noise statistics well, as it is dependent on the availability of non-speech frames, which are unevenly distributed in an utterance. Martin (2001) has proposed a noise PSD estimator based on minimum statistics. However, even this technique relies on certain parameters which need to be tuned depending on the degree of non-stationarity of the noise. Several researchers have tried to use a multitude of well-tailored tuning parameters dependent on the a priori knowledge of non-speech frames,¹ the highest and lowest SNR range, and several other ad hoc weighting factors² (Kim and Rose, 2003, p. 438) to achieve noise robustness in ASR.
Therefore it is desirable to develop a new enhancement technique that does not require an explicit noise PSD estimate. The least squares filter (LeSF) based techniques fall into this category as they do not explicitly require a noise PSD estimate. Although the LeSF is optimal only in the case of the additive noise being white, the speech recognition experiments reported in this paper indicate that the LeSF is also effective in the case of non-white additive noises such as the factory noise and the aircraft cockpit noise. MFCC features computed from the LeSF enhanced speech signal lead to significant ASR accuracy improvements in various noise as well as SNR conditions. We have derived the analytical expressions for the impulse response of the L-weight least squares filter (LeSF) as a function of the input SNR (computed over the current frame), the effective bandwidth of the signal (due to finite frame length), the filter length 'L' and the frame length 'N'.
2. Least squares filter (LeSF) for signal enhancement
The basic operation of the LeSF is illustrated in Fig. 1 and it can be understood intuitively as follows. The autocorrelation sequence of the additive noise u(n), which is broad-band, decays much faster at higher lags than that of the speech signal. Therefore the use of a large filter length ('L') and the delay P causes de-correlation between the noise components of the input signal, namely (u(n − L − P + 1), u(n − L − P + 2), ..., u(n − P)), and the noise component of the reference signal, namely u(n). It is worth noting that a longer filter length L will also help to cause a de-correlation between the noise appearing at the kth filter tap, namely u(n − k − P + 1) (where k ≤ L), and the noise component of the reference signal, namely u(n). This is due to the fact that the broad-band noise's
³ For interpretation of color in Figs. 2 and 3, the reader is referred to the web version of this article.
auto-correlation coefficients decay quite rapidly at higher lags. The LeSF filter responds by adaptively forming a frequency response which has pass-bands centered at the frequencies of the formants of the speech signal while rejecting as much of the broad-band noise (whose spectrum lies away from the formant positions) as possible. Denoting the clean and the additive noise signals by s(n) and u(n) respectively, we obtain the noisy signal x(n)
x(n) = s(n) + u(n)    (1)

The LeSF filter consists of L weights, and the filter coefficients w_k for k ∈ [0, 1, 2, ..., L − 1] are estimated by minimizing the energy of the error signal e(n) over the current frame, n ∈ [0, N − 1]

e(n) = x(n) − y(n)    (2)

where

y(n) = Σ_{i=0}^{L−1} w(i) x(n − P − i)    (3)
Let A denote the (N + L) × L data matrix (Haykin, 1993) of the input frame x = [x(0), x(1), ..., x(N − 1)] and d denote the (N + L) × 1 desired signal vector, which in this case is the signal x appended by L zeros. The LeSF weight vector w is then given by
w = (Aᴴ A)⁻¹ Aᴴ d    (4)
As is well known, AᴴA is a symmetric L × L Toeplitz matrix whose (i, j) element is the temporal autocorrelation of the signal vector x estimated over the frame length (Haykin, 1993)
[AᴴA]_{i,j} = r(|i − j|)    (5)
           = Σ_{n=0}^{N−|i−j|} x(n) x(n + |i − j|)    (6)
In practice, AᴴA can always be assumed to be non-singular, due to the presence of additive noise (Haykin, 1993), for filter length L < N. The weight vector w in (4) can be obtained using the Levinson–Durbin algorithm (Haykin, 1993) without incurring a significant computational cost.
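As a concrete illustration of Eqs. (4)–(6) and the frame-by-frame operation of Fig. 1, a minimal Python sketch follows. The function names, the direct `np.linalg.solve` call (used in place of Levinson–Durbin for clarity) and all parameter values are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def lesf_weights(x, L=32, P=1):
    """Sketch of Eqs. (4)-(6): build the L x L Toeplitz matrix A^H A from
    the frame's temporal autocorrelation r(k) and solve the Normal
    equations with right-hand side r(l + P).  A direct solve is used for
    clarity; Levinson-Durbin would be cheaper.  The additive noise keeps
    the matrix non-singular for L < N."""
    N = len(x)
    # r(k) = sum_n x(n) x(n + k), the temporal autocorrelation of the frame
    r = np.array([np.dot(x[:N - k], x[k:]) for k in range(L + P)])
    R = np.array([[r[abs(i - j)] for j in range(L)] for i in range(L)])
    return np.linalg.solve(R, r[P:P + L])

def lesf_filter(x, w, P=1):
    """y(n) = sum_i w(i) x(n - P - i): bulk delay z^-P followed by the taps."""
    h = np.concatenate([np.zeros(P), w])
    return np.convolve(x, h)[:len(x)]
```

On a synthetic frame of a sinusoid in white noise, the weights obtained this way raise the output SNR above the input SNR, consistent with the pass-band behavior described in Section 3.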
3. LeSF applied to speech
In this section, we will analytically solve (4) to obtain the LeSF w. We model voiced speech using the sinusoidal model (McAulay and Quatieri, 1986), while unvoiced speech is modeled by a source–filter model. However, we show that the functional form of the equations remains the same except for a change in the parameter values.
3.1. Voiced speech
As proposed in McAulay and Quatieri (1986), voiced speech signals can be modeled as a sum of multiple sinusoids whose amplitudes, phases and frequencies can vary from frame to frame. Let us assume that a given frame of the speech signal s(n) can be approximated as a sum of M sinusoids. The number of sinusoids M may vary from block to block. Then the noisy signal x(n) can be expressed as
x(n) = Σ_{i=1}^{M} A_i cos(ω_i n + φ_i) + u(n)    (7)
where n ∈ [0, N − 1] and u(n) is a realization of white noise. Then the kth lag autocorrelation can be shown to be
r(k) = Σ_{n=0}^{N−k−1} x(n) x(n + k) ≃ Σ_{i=1}^{M} (N − k) A_i² cos(ω_i k) + N σ² δ(k)    (8)
where it is assumed that the noise u(n) is white, ergodic and uncorrelated with the signal s(n), and that N ≫ 1/(f_i − f_j) for all frequency pairs (i, j). The latter condition ensures that all the interference terms between all the sinusoid pairs (i, j) sum up to zero. The LeSF weight vector w(k) is then obtained as the solution of the Normal equations

Σ_{k=0}^{L−1} r(l − k) w(k) = r(l + P),  l ∈ [0, 1, 2, ..., L − 1]    (9)

An enhancement example is illustrated in Fig. 2. The first pane displays the magnitude spectrum of a clean, voiced speech frame. The second pane shows the spectrum of the same segment embedded in white noise (red curve³) and the spectrum of the enhanced signal (blue curve). As can be noticed, the broad-band white noise has been attenuated while the harmonics have been retained. The third pane shows the magnitude response of the LeSF filter used for enhancing this segment; it was estimated over the noisy segment itself. As can be
Fig. 2. The first pane displays the spectral magnitude of a clean speech segment. The second pane displays the spectral magnitude of the same segment corrupted by white noise (red curve), whereas the blue curve corresponds to the spectral magnitude of the enhanced signal. The third pane displays the frequency response of the LeSF filter that enhanced the noisy segment and was estimated over the noisy segment itself.
1532 V. Tyagi et al. / Speech Communication 48 (2006) 1528–1544
noticed, the LeSF filter automatically puts the pass-bands around the harmonics, thus enhancing the signal while rejecting the broad-band noise. In the sections to follow, we will present further examples and performance specifications of the LeSF filter for enhancing noisy speech signals.
The set of L linear equations described in (9) can be solved by elementary methods if the z-transform S_xx(z) of the symmetric autocorrelation sequence r(k) is a rational function of 'z' (Satorius et al., 1978). S_xx(z) is given by
S_xx(z) = Σ_{k=−∞}^{∞} r(k) z^{−k}    (10)
Consider then a real symmetric rational z-transform with M pairs of zeros and M pairs of poles
S_xx(z) = G [∏_{m=1}^{M} (z − e^{−b_m + jΨ_m})(z⁻¹ − e^{−b_m − jΨ_m})] / [∏_{m=1}^{M} (z − e^{−a_m + jω_m})(z⁻¹ − e^{−a_m − jω_m})]    (11)
If the signal x is real, then so is its autocorrelation sequence r(k). In this case the power spectrum, S_xx(z), has quadruplet sets of poles and zeros because of the presence of conjugate pairs at z = exp(±a_m ± jω_m) and z = exp(±b_m ± jΨ_m). Anderson et al. (1983) have derived the general form of the solution to (9) for input signals with rational power spectra such as that described by (11). In this case, the LeSF weights are given by
w(k) = Σ_{m=1}^{M} (B_m e^{−b_m k} cos(Ψ_m k) + C_m e^{+b_m k} cos(Ψ_m k))    (12)
As can be seen, the LeSF consists of an exponentially decaying term and an exponentially growing term, the latter attributed to reflection (Widrow et al., 1975) that occurs due to the finite filter length L. The values of the coefficients B_m and C_m can be determined by solving the set of coupled equations obtained by substituting the expression for w(k) given in (12) into (9).
To be able to use the general form of the solution of the LeSF filter as in (12), we need a pole–zero model of the input autocorrelation in the form described in (11). For sufficiently large frame length N, such that the filter length L ≪ N, we can make the following approximation:
(N − k) ≃ N e^{−k/N},  k ∈ [0, 1, 2, ..., L] and L ≪ N    (13)
The above can be verified by using the Taylor series expansion of N e^{−ak} and retaining only the linear term, as k ≪ N. We denote (a = 1/N) as a_voiced. Using this approximation in (8), we get
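The accuracy of the approximation (13) is easy to check numerically; the sketch below uses the illustrative values N = 500 and L = 100 (our own choices, matching the orders of magnitude used later in the paper):

```python
import numpy as np

# Check (N - k) ~= N * exp(-k/N) for k in [0, L] with L << N  (Eq. (13)).
# N = 500 and L = 100 are illustrative values only.
N, L = 500, 100
k = np.arange(L + 1)
exact = N - k
approx = N * np.exp(-k / N)
rel_err = np.max(np.abs(approx - exact) / exact)
# the worst case is k = L: 500*exp(-0.2) = 409.4 versus 400, about 2.3% off
```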
r(k) = N e^{−a_voiced k} Σ_{i=1}^{M} A_i² cos(ω_i k) + N σ² δ(k)    (14)
In this form, r(k) corresponds to a sum of multiple decaying exponential sequences and its z-transform takes the form
S_xx(z) = Σ_{m=1}^{M} [N A_m² (1 − e^{−2a}) / 2] × [ 1 / ((z − e^{−a_m + jω_m})(z⁻¹ − e^{−a_m − jω_m})) + 1 / ((z − e^{−a_m − jω_m})(z⁻¹ − e^{−a_m + jω_m})) ] + N σ²,
where a_m = a_voiced = 1/N ∀ m ∈ [1, M]    (15)
Fig. 3. An example of a two-formant vocal-tract frequency response which is excited by white noise to synthesize unvoiced speech.
V. Tyagi et al. / Speech Communication 48 (2006) 1528–1544 1533
3.2. Unvoiced speech
We model unvoiced speech s(n) as the output of an all-pole transfer function (whose poles are at z = e^{−a_i^{unvoiced} ± jω_i}) excited by a white noise signal e(n). Specifically,

S(z) = E(z) / [∏_{i=1}^{Q} (z − e^{−a_i^{unvoiced} + jω_i})(z − e^{−a_i^{unvoiced} − jω_i})]    (16)
where S(z) and E(z) are the z-transforms of the unvoiced speech signal s(n) and the white noise excitation signal e(n), respectively. Then it can be shown that the autocorrelation coefficients of the unvoiced speech are also decaying exponentials (Haykin, 1993, p. 118), i.e.,
r_unvoiced(k) = Σ_{i=1}^{Q} e^{−a_i^{unvoiced} k} cos(ω_i k),    (17)
where the decay factor a_i^{unvoiced} > a_voiced = 1/N (where N is the block length). This is due to the fact that voiced speech has sharper spectral peaks than unvoiced speech. Consequently the autocorrelation coefficients of the unvoiced speech decay much faster than those of the voiced speech. However, the functional form of the autocorrelation coefficients of the voiced and unvoiced speech is the same, except that a_voiced < a_unvoiced. In the presence of white noise, the power spectral density of the noisy unvoiced speech segment is given by
S_xx(z) = Σ_{i=1}^{Q} [N A_i² (1 − e^{−2a_i}) / 2] × [ 1 / ((z − e^{−a_i + jω_i})(z⁻¹ − e^{−a_i − jω_i})) + 1 / ((z − e^{−a_i − jω_i})(z⁻¹ − e^{−a_i + jω_i})) ] + N σ²    (18)
where a_i is the decay factor of the ith pole pair. We note that the functional forms of the power spectral densities in (18) and (15) are the same, except that a_i in (18) will in general be greater than a_voiced in (15). Therefore the functional form of the LeSF filter w in (12) remains the same for both voiced and unvoiced speech. It is just that for unvoiced speech the bandwidth of the pass-bands of the LeSF will be wider than for the voiced LeSF. In Fig. 3, we show a transfer function with two complex-pole pairs (at conjugate symmetric positions) that is used to synthesize unvoiced speech by exciting it with white noise. The first pane shows the pole–zero plot. The second pane shows the frequency response of this all-pole model. In the third pane, the blue, red and green curves are the FFT magnitudes of the clean speech, the noisy speech corrupted by white noise at −3 dB SNR and the LeSF enhanced speech, respectively. The fact that the green curve matches the blue curve closely shows that the LeSF has been able to filter out the noise component.
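The unvoiced model of Eq. (16) is straightforward to simulate. In the sketch below, the pole radii, pole angles and signal length are our own illustrative choices (not the values behind Fig. 3): a two-pole-pair all-pole filter is excited with white noise and the output spectrum peaks near the pole angles, as expected.

```python
import numpy as np

rng = np.random.default_rng(0)
Nsamp = 8192
e = rng.standard_normal(Nsamp)          # white-noise excitation e(n)

# two conjugate pole pairs at illustrative "formant" angles 2*pi*0.15 and
# 2*pi*0.3 (normalized frequency), with radii exp(-a_i) of 0.95 and 0.9
poles = []
for radius, f in [(0.95, 0.15), (0.9, 0.3)]:
    poles += [radius * np.exp(2j * np.pi * f), radius * np.exp(-2j * np.pi * f)]
a = np.real(np.poly(poles))             # denominator coefficients of 1/A(z)

# all-pole recursion: s(n) = e(n) - sum_{k>=1} a_k s(n - k)
s = np.zeros(Nsamp)
for n in range(Nsamp):
    acc = e[n]
    for k in range(1, len(a)):
        if n - k >= 0:
            acc -= a[k] * s[n - k]
    s[n] = acc

spec = np.abs(np.fft.rfft(s))
freqs = np.fft.rfftfreq(Nsamp)          # normalized frequency axis, 0..0.5
```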
3.3. Analytic form of LeSF
From now on we will not make any distinction between the exponential decay factors a_voiced and a_unvoiced, as the functional form of the equations remains the same. Therefore the following discussion is valid for both voiced and unvoiced speech.
To be able to use the general form of the solution of the LeSF filter as in (12), we need a pole–zero model of the input autocorrelation in the form described in (11). Under the approximation that the decaying exponentials are widely spaced along the unit circle, the power spectrum S_xx(z) in (15), which consists of a sum of terms, can be
⁴ Due to self-cancelling positive and negative half periods of a sinusoid.
approximated by a ratio of products of terms (of the form (z − e^{q + jθ})), leading to a rational z-transform. Specifically, as explained in (Satorius et al., 1978; Anderson et al., 1983), we make the following assumptions:
• The pole pairs in (15) lie sufficiently close to the unit circle (easily satisfied as a ≃ 0).
• All the frequency pairs (ω_i, ω_j) in (15) are sufficiently separated from each other such that their contributions to the total power spectrum do not overlap significantly.
The z-transform of the total input can then be expressed as
S_xx(z) = σ² ∏_{m=1}^{M} [(z − e^{−b_m + jω_m})(z − e^{+b_m + jω_m})(z − e^{+b_m − jω_m})(z − e^{−b_m − jω_m})] / [(z − e^{−a_m + jω_m})(z − e^{+a_m + jω_m})(z − e^{+a_m − jω_m})(z − e^{−a_m − jω_m})],
where a_m = 1/N    (19)
Corresponding to each of the sinusoidal components in the input signal there are four poles at locations z = e^{±a_m ± jω_m}, and there are four zeros on the same radial lines as the signal poles but at different distances from the unit circle. Using the general solution described in (12), which has been derived at length in (Anderson et al., 1983), the solution of the LeSF weight vector for the present problem is
w(n) = Σ_{m=1}^{M} (B_m e^{−b_m n} + C_m e^{+b_m n}) cos(ω_m (n + P))    (20)
The values of b_m, B_m and C_m can be determined by substituting (20) and (14) in (9). The lth equation in the linear system described in (9) has terms with coefficients exp(−b_m l), exp(+b_m l), exp(−al)cos(ω_m(l + P)) and exp(al)cos(ω_m(l + P)). Besides these, there are two other kinds of terms that can be neglected:

• "Non-stationary" terms that are modulated by a sinusoid at frequency 2ω_m, where m ∈ [1, M]. For ω_m ≠ 0 and ω_m ≠ π, their total contribution is approximately zero.⁴
• Interference terms that are modulated by a sinusoid at frequency Δω = (ω_i − ω_j), where (i, j) ∈ [1, ..., M]. If the filter length L ≫ 2π/Δω, these interference terms approximately sum up to zero and hence can be neglected.

The detailed analytic solution is presented in Appendix A. We note that the analytic solution for the filter weights w(n) has been developed only for the special case when the noise is white. Unfortunately, it is not possible to derive a closed-form (analytic) solution of the filter weights w(n) for non-white noises. However, as the filter weight w(n) is a continuous function of the noise autocorrelation coefficients (A.6), and white noise is the limiting case of broad-band noise as the bandwidth becomes infinite (or equal to the Nyquist frequency for discrete systems, as is the case here), we expect the filter weights w(n) for a non-white, broad-band noise to approximately follow a behavior similar to that of w(n) when the noise is white. However, if the noise is significantly narrow-band, then this discussion does not hold.
The coefficients of the terms exp(−b_m l) and exp(+b_m l) are the same for each of the L equations, and setting them to zero leads to just one equation which relates b_m to a and the SNR. Let q_i denote the "partial" SNR of the sinusoid at frequency ω_i, i.e., q_i = A_i²/σ², and let the complementary signal SNR be denoted as c_i = (Σ_{m=1, m≠i}^{M} A_m²)/σ². Then we have the following relation:

cosh b_i = cosh a + [q_i / (2c_i + q_i + 2)] sinh a    (21)
There are two interesting cases. The first is when the sinusoid at frequency ω_i is significantly stronger than the other sinusoids, such that c_i is quite low. This is illustrated in Fig. 4, where we plot the bandwidth b_i of the LeSF's pass-band centered around ω_i as a function of the partial SNR of the ith sinusoid, q_i. The complementary signal's SNR is quite low, at c_i = −6.99 dB. We plot curves for different "effective" input sinusoid bandwidths a. From (15), we note that a is the reciprocal of the frame length N. The vertical line in Fig. 4 corresponds to the case when q_i = c_i. We note that for a given partial SNR q_i, the LeSF bandwidth becomes narrower as the frame length N increases, indicating a better selectivity of the LeSF filter.
In Fig. 5, we plot the bandwidth b_i as a function of q_i for the cases when the complementary signal SNR is high, at c_i = 10 dB, and low, at c_i = −6.99 dB. The two dots correspond to the case when q_i = c_i. We note that c_i = 10 dB corresponds to a signal
Fig. 5. Plot of the filter bandwidth b_i centered around frequency ω_i as a function of partial sinusoid SNR q_i for given complementary signal SNRs c_i = −6.99 dB and 10 dB, respectively. The "effective" input bandwidth a = 0.01 for both curves. The two dots correspond to the cases when the partial SNR q_i is equal to the complementary signal SNR c_i.
Fig. 4. Plot of the filter bandwidth b_i centered around frequency ω_i as a function of partial sinusoid SNR q_i for a given complementary signal SNR c_i = −6.99 dB and "effective" input bandwidths a = 0.01, 0.005 and 0.001, respectively. The vertical line meets the three curves where q_i = c_i.
with high overall SNR.⁵ Therefore the cross-over point (c_i = q_i) for low c_i occurs at a narrower bandwidth as compared to the high c_i case. This is so because in the former case the overall signal SNR is low, and thus the LeSF filter has to have narrower pass-bands to reject as much of the noise as possible.
⁵ As the overall SNR of the signal = 10 log₁₀(10^{c_i/10} + 10^{q_i/10}).
B_i and C_i in (20) are determined by equating their respective coefficients. The "non-stationary" interference terms between all pairs of frequencies (ω_i, ω_j) can be neglected if (ω_i − ω_j) ≫ 2π/L. This requires that the LeSF's frequency resolution (2π/L) should be able to resolve the constituent sinusoids

B_i = 2 e^{−b_i} e^{−aP} (a + b_i)² (b_i − a) / ((a + b_i)² − e^{−2 b_i L} (b_i − a)²)
C_i = 2 e^{−b_i (2L+1)} e^{−aP} (a + b_i)(b_i − a)² / ((a + b_i)² − e^{−2 b_i L} (b_i − a)²)    (22)
We note from (21) that the various sinusoids are coupled with each other through the dependence of their bandwidths b_i on the complementary signal SNR c_i. As a consequence, B_i and C_i are also indirectly dependent on the powers of the other sinusoids through b_i.
In Fig. 6, the magnitude response of the LeSF filter is plotted for various SNRs. The input in this case consists of three sinusoids at normalized frequencies (0.1, 0.2, 0.4). The frame length is N = 500 and the filter length is L = 100. As the signal SNR decreases, the bandwidth of the LeSF filter starts to decrease in order to reject as much of the noise as possible. The LeSF filter's gain also decreases with decreasing SNR. Similar results were reported in (Anderson et al., 1983; Zeidler et al., 1978) for the case of stationary inputs.
In Fig. 7, we plot the spectrogram of a clean speech utterance. Figs. 8 and 9 display the same
Fig. 6. Plot of the magnitude response of the LeSF filter as a function of the input SNR. The input consists of three sinusoids at normalized frequencies (0.1, 0.2, 0.4) with relative strengths (1:0.6:0.4), respectively.
Fig. 7. Clean spectrogram of an utterance from the OGI Numbers95 database.
Fig. 8. Spectrogram of the utterance corrupted by white noise at 6 dB SNR.
Fig. 9. Spectrogram of the noisy utterance (white noise) enhanced by an (L = 100) tap LeSF filter that has been estimated over blocks of length (N = 500).
1536 V. Tyagi et al. / Speech Communication 48 (2006) 1528–1544
utterance embedded in white noise at 6 dB SNR and its LeSF enhanced version, respectively. As can be seen from these spectrograms, the LeSF has been able to reject the additive white noise to a large extent while retaining most of the speech signal. Figs. 10 and 11 display the same utterance embedded in F16-cockpit noise at 6 dB SNR and its LeSF enhanced version, respectively. As can be seen from the spectrograms, except for the narrow-band noise component centered around 2.5 kHz, the LeSF filter has been able to reject a significant amount of the additive F-16 cockpit noise (Varga et al., 1992) from the noisy speech signal. By design, the LeSF puts pass-bands around those regions of the input signal's spectrum that have high spectral energy density.
Fig. 11. Spectrogram of the noisy utterance (F16-cockpit noise) enhanced by an (L = 100) tap LeSF filter that has been estimated over blocks of length (N = 500).
Fig. 10. Spectrogram of the utterance corrupted by F16-cockpit noise at 6 dB SNR.
Fig. 12. LeSF gain plotted as a function of input SNR for a fixed block length N = 500 and various filter lengths L = 100, 80, 60.
As a consequence, the LeSF has not been able to attenuate the narrow-band noise well in the spectrograms of Figs. 10 and 11. However, it has attenuated the broad-band noise components quite well.
4. Gain of the LeSF filter
The LeSF filter output consists of the filtered sinusoids and the filtered noise signal. For the input signal described by (7), filtered by a LeSF filter with coefficients as in (20), the output filtered signal power P_signal and the output filtered noise power P_noise are approximately⁶ given by
P_signal = Σ_{i=0}^{L−1} Σ_{j=0}^{L−1} w(i) w(j) r(|i − j|)    (23)
         = Σ_{i=0}^{L−1} Σ_{j=0}^{L−1} w(i) w(j) [(N − |i − j|)/N] Σ_{m=1}^{M} A_m² cos(2π f_m (i − j))    (24)

P_noise = Σ_{n=0}^{L−1} σ² w²(n) = Σ_{i=1}^{M} { [B_i² e^{−b_i L} + C_i² e^{b_i L}] σ² sinh(b_i L)/(2 b_i) + σ² B_i C_i L }    (25)
The output SNR of the LeSF filter is given by (23) divided by (25). The LeSF gain is given by the ratio of the output SNR to the input SNR = Σ_{i=1}^{M} A_i² / (2σ²).
In Fig. 12, we plot the LeSF broadband gain as a function of input SNR, for a fixed block length of N = 500 and varying filter lengths (L = 100, 80, 60). As can be noted from Fig. 12, the LeSF broadband gain approaches a horizontal asymptote for decreasing input SNR. This is in agreement with the fact that the bandwidth b_i of the LeSF filter approaches a horizontal asymptote (Fig. 4) for decreasing SNR. We note from Fig. 12 that the LeSF broadband gain increases as the filter length L increases for a fixed block length N = 500. However, since the L-tap LeSF filter is estimated using the N samples from a given block, the filter length cannot be increased arbitrarily and is limited from above by the block length 'N'.
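The quadratic form in Eq. (23) can be checked numerically: for the full (zero-padded) convolution, the output energy equals the quadratic form in the frame autocorrelation exactly. The sketch below uses our own illustrative signal and filter (any L-tap filter works here; it need not be a LeSF):

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 4000, 16
n = np.arange(N)
x = np.cos(2 * np.pi * 0.1 * n) + 0.5 * rng.standard_normal(N)

w = rng.standard_normal(L) / L          # an arbitrary L-tap filter
# r(k) per Eq. (6): r(k) = sum_n x(n) x(n + k)
r = np.array([np.dot(x[:N - k], x[k:]) for k in range(L)])

# Eq. (23): output power as a quadratic form in the autocorrelation
P_quad = sum(w[i] * w[j] * r[abs(i - j)] for i in range(L) for j in range(L))

# energy of the zero-padded convolution y(n) = sum_i w(i) x(n - i)
P_direct = np.sum(np.convolve(x, w) ** 2)
```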
⁶ Assuming N ≫ L, such that the initial L samples can be used to initialize the filter.
In Fig. 13, we plot the LeSF broadband gain as a function of input SNR, for a fixed filter length L = 100 and varying block lengths (N = 300, 400, 500). We note that the LeSF broadband gain increases as the block length N increases. However, for a non-stationary signal such as speech, as the block length increases, the corresponding power spectrum becomes more broad-band. Therefore we will not be able to model the corresponding block as a sum of a small number of sinusoids M, as done in (7). As a result, the number of sinusoids M will be large and possibly closely spaced, leading to significant interference terms between the constituent sinusoids in (8) and (14).
Fig. 13. LeSF gain plotted as a function of input SNR for fixed filter length L = 100 and various block lengths N = 300, 400, 500. (Axes: input SNR in dB vs. LeSF gain in dB.)
Table 1
Word error rate results for factory noise using soft-decision spectral subtraction

SNR     LeSF MFCC   POST-FILT   POWER-FILT   PSIL
Clean   6.6         8.1         8.3          7.1
5. Experiments and results
In order to assess the effectiveness of the proposed algorithm, speech recognition experiments were conducted on the OGI Numbers (Cole et al., 1994) corpus. This database contains spontaneously spoken, free-format connected numbers over a telephone channel. The lexicon consists of 31 words.⁷
Hidden Markov Model and Gaussian Mixture Model (HMM–GMM) based speech recognition systems were trained using the public-domain software HTK (Young et al., 1995) on the clean training set from the original Numbers corpus. The system consisted of 80 tied-state triphone HMMs with 3 emitting states per triphone and 12 mixtures per state. To verify the robustness of the features to noise, the clean test utterances were corrupted using White, Factory and F16-cockpit noise from the Noisex92 (Varga et al., 1992) database.
5.1. Bulk delay P
Noting that the autocorrelation coefficients of a periodic signal are themselves periodic with the same period (hence they do not decay with increasing lag), Sambur (1978) used a bulk delay equal to the pitch period of the voiced speech for its enhancement. However, for unvoiced speech a high bulk delay would result in significant distortion by the LeSF filter, as its autocorrelation coefficients decay much more rapidly than those of voiced speech. Therefore, we kept the bulk delay at P = 1 as a good choice for enhancing both the voiced and the unvoiced speech frames.
5.2. Block length N and filter length L
Speech signals were blocked into frames of N = 500 samples (62.5 ms) each, and an L = 100 tap LeSF filter was derived using (4), through the Levinson–Durbin algorithm, for each frame, which could be either voiced or unvoiced. The relatively high order (L = 100) of the LeSF filter is required for a twofold reason. Firstly, it provides sufficiently high frequency resolution (2π/L) to resolve the constituent sinusoids in the case of voiced speech. Secondly, it causes de-correlation between the noise appearing at the kth filter tap (u(n − k − P + 1),
⁷ With confusable words such as nine, ninety and nineteen; eight, eighty and eighteen; and so forth.
k ≤ L) and the noise component of the reference signal u(n). Each speech frame was then filtered through its corresponding LeSF filter to derive an enhanced speech frame. Finally, an MFCC feature vector was computed from the enhanced speech frame. These enhanced LeSF-MFCC features were compared to the baseline MFCC features and the noise-robust CJ-RASTA-PLP (Hermansky and Morgan, 1994) features. The MFCC feature vector computation is the same for the baseline and the LeSF-MFCC features; the only difference is that the baseline MFCC features are computed directly from the noisy speech, while the LeSF-MFCC features are computed from the LeSF-enhanced speech signal. We also compared our technique with a soft-decision spectral subtraction based technique. In Lathoud et al. (2005), the authors used a speech presence probability in conjunction with spectral subtraction to achieve noise robustness. This can be seen as a soft-decision spectral subtraction, which has been shown to be superior to hard-decision spectral subtraction by McAulay and Malpass (1980). As the training set, test set and the factory noise environment in our experiments and those of Lathoud et al. (2005) are the same, we quote the ASR results for the factory noise directly from Lathoud et al. (2005). The authors in Lathoud et al. (2005) propose three features based on soft-decision spectral subtraction, which vary in their pre-processing steps and are termed POST-FILT, POWER-FILT and PSIL. In Table 1, we quote their ASR word error rates in the factory noise environment, directly from the results reported in Lathoud et al. (2005). We note that the proposed technique outperforms all three soft-decision spectral subtraction variants.
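The per-frame processing described above can be sketched as follows. This is a minimal illustration assuming NumPy/SciPy; the function name and its defaults are ours, not the paper's. SciPy's `solve_toeplitz` applies a Levinson-type recursion to the Toeplitz normal equations (the system written out in (A.6)).

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lesf_enhance_frame(frame, L=100, P=1):
    """Enhance one speech frame with an L-tap LeSF filter and bulk delay P.

    A minimal sketch (names and defaults are ours): the Toeplitz normal
    equations of (A.6) are solved via a Levinson-type recursion, then the
    frame is filtered by the resulting FIR filter applied to the
    P-sample-delayed input.
    """
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    # Biased block autocorrelation, lags 0 .. L+P-1
    r = np.array([frame[:N - k] @ frame[k:] for k in range(L + P)]) / N
    w = solve_toeplitz(r[:L], r[P:P + L])      # R w = [r(P), ..., r(P+L-1)]
    delayed = np.concatenate([np.zeros(P), frame])[:N]
    return lfilter(w, [1.0], delayed)          # enhanced frame
```

On a synthetic sinusoid-in-noise frame, the output tracks the clean sinusoid much more closely than the noisy input once the filter transient has passed.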
The speech recognition results for the baseline MFCC, CJ-RASTA-PLP and the proposed LeSF-MFCC, at various levels of noise, are given in Tables 2–4. All the reported features in this paper have cepstral mean subtraction (CMS). The proposed LeSF-processed MFCC performs significantly
12 dB   11.3        16.2        17.0         15.7
6 dB    20.0        30.7        31.2         28.7
0 dB    41.3        63.1        61.9         58.2

All features have cepstral mean subtraction.
Table 2
Word error rate results for factory noise

SNR     MFCC   CJ-RASTA-PLP   LeSF MFCC
Clean   5.7    7.8            6.6
12 dB   12.3   12.2           11.3
6 dB    27.1   23.8           20.0
0 dB    71.0   59.8           41.3

Parameters of the LeSF filter: L = 100 and N = 500. All features have cepstral mean subtraction.
Table 3
Word error rate results for white noise

SNR     MFCC   CJ-RASTA-PLP   LeSF MFCC
Clean   5.7    7.8            6.6
12 dB   16.4   14.7           17.3
6 dB    34.6   29.0           24.1
0 dB    80.3   66.0           40.4

Parameters of the LeSF filter: L = 100 and N = 500. All features have cepstral mean subtraction.
Table 4
Word error rate results for F16-cockpit noise

SNR     MFCC   CJ-RASTA-PLP   LeSF MFCC
Clean   5.7    7.8            6.6
12 dB   13.6   14.2           12.5
6 dB    28.4   25.3           21.0
0 dB    72.3   59.2           41.0

Parameters of the LeSF filter: L = 100 and N = 500. All features have cepstral mean subtraction.
better than the others in various kinds of heavy noise conditions (SNR of 6 and 0 dB). The slight performance degradation of the LeSF-MFCC in the clean condition is due to the fact that the LeSF filter, being an all-pole filter, does not model the valleys of the clean speech spectrum well. As a result, the LeSF filter sometimes amplifies the low spectral energy regions of the clean spectrum.
Table 5 shows the word error rate of the LeSFenhanced MFCC features for a fixed block length
Table 5
Word error rate results for factory noise for varying length, L = 100, 50, 20, of the LeSF filter

SNR     LeSF L = 20   LeSF L = 50   LeSF L = 100
Clean   9.3           7.3           6.6
12 dB   14.3          12.3          11.3
6 dB    24.4          22.0          20.0
0 dB    46.5          43.0          41.3

The block length N is 500 (62.5 ms).
of 500 samples (62.5 ms) and varying LeSF filter length L. We note that the word error rate decreases as the filter length increases. This is because a higher filter length results in a sharper frequency response of the LeSF filter (narrower passbands), thereby enabling it to reject as much as possible of the broad-band noise that lies away from the frequencies of the constituent sinusoids of the clean signal.
6. Conclusion
We consider a class of non-stationary input signals that are composed of either (a) multiple sinusoids (voiced speech) whose frequencies and amplitudes may vary from block to block or (b) the output of an all-pole filter excited by a white noise input (unvoiced speech segments), and which are embedded in white noise. We have derived analytical expressions for the impulse response of the L-weight least squares filter (LeSF) as a function of the input SNR (computed over the current frame), the effective bandwidth of the signal (due to the finite frame length), the filter length L and the frame length N. Recognizing that such a time-varying sinusoidal model (McAulay and Quatieri, 1986) and the source–filter model are reasonable approximations to voiced and unvoiced speech respectively, we have applied the block-estimated LeSF filter to de-noising speech signals embedded in realistic (Varga et al., 1992) broad-band noise as commonly encountered on a factory floor and in an aircraft cockpit. The proposed technique leads to a significant improvement in ASR performance as compared to the noise-robust CJ-RASTA-PLP (Hermansky and Morgan, 1994), speech presence probability based spectral subtraction (Lathoud et al., 2005) and the MFCC features computed from the unprocessed noisy signal.
Appendix A. Solution to the equation
A.1. Autocorrelation over a block of samples
We consider the signal x(n) as
x(n) = \sum_{i=1}^{M} A_i \cos(\omega_i n + \phi_i) + u(n)    (A.1)

where n ∈ [0, N − 1] and u(n) is a realization of white noise. Then the kth-lag autocorrelation is given by
r(k) = \sum_{n=0}^{N-k-1} x(n)\,x(n+k)
     = \sum_{n=0}^{N-k-1}\Big(\sum_{i=1}^{M} A_i\cos(\omega_i n+\phi_i)+u(n)\Big)\Big(\sum_{j=1}^{M} A_j\cos(\omega_j(n+k)+\phi_j)+u(n+k)\Big)

Expanding the product and applying \cos X\,\cos Y = \frac{1}{2}[\cos(X-Y)+\cos(X+Y)] to the sinusoidal terms gives

r(k) = \underbrace{(N-k)\sum_{i=1}^{M}\frac{A_i^2}{2}\cos(\omega_i k)}_{SIG}
 + \underbrace{\sum_{i=1}^{M}\frac{A_i^2}{2}\sum_{n=0}^{N-k-1}\cos(2\omega_i n+\omega_i k+2\phi_i)}_{A}
 + \underbrace{\sum_{i=1}^{M}\sum_{j=1,\,j\neq i}^{M}\frac{A_iA_j}{2}\sum_{n=0}^{N-k-1}\cos((\omega_i-\omega_j)n-\omega_j k+\phi_i-\phi_j)}_{B}
 + \underbrace{\sum_{i=1}^{M}\sum_{j=1,\,j\neq i}^{M}\frac{A_iA_j}{2}\sum_{n=0}^{N-k-1}\cos((\omega_i+\omega_j)n+\omega_j k+\phi_i+\phi_j)}_{C}
 + \underbrace{\sum_{j=1}^{M}\sum_{n=0}^{N-k-1}A_j\,u(n)\cos(\omega_j(n+k)+\phi_j)}_{D}
 + \underbrace{\sum_{i=1}^{M}\sum_{n=0}^{N-k-1}A_i\,u(n+k)\cos(\omega_i n+\phi_i)}_{E}
 + \underbrace{\sum_{n=0}^{N-k-1}u(n)\,u(n+k)}_{F}    (A.2)
Let us consider the under-braced terms A, B, C. They are the sums of the samples of a cosine wave at frequencies 2\omega_i, \omega_i-\omega_j and \omega_i+\omega_j, respectively. If (N − k) is much greater than 2\pi/\omega_i and 2\pi/(\omega_i-\omega_j) for all frequency pairs (i, j), then the sums A, B, C will contain several periods of their corresponding cosine waves. The sum over Q periods of the samples of a cosine wave at any non-zero frequency is zero. This can be seen from the following integral, which approximates the sums of the samples in A, B, C:

\int_{t=0}^{t=Q\,2\pi/\omega} A \cos(\omega t + \phi)\,dt = 0    (A.3)
This happens because the negative and positive swings of the cosine cancel each other. Therefore A, B, C are approximately zero. Let (N − k) = QT + Δ, where Q is an integer, T is the period of a certain sinusoid at frequency ω, and Δ is the left-over part, as (N − k) is not an exact multiple of T; hence Δ < T. Then we have

\sum_{n=0}^{N-k} A\cos(\omega n+\phi) = \sum_{n=0}^{QT} A\cos(\omega n+\phi) + \sum_{n=QT+1}^{N-k} A\cos(\omega n+\phi)
 = \sum_{n=QT+1}^{N-k} A\cos(\omega n+\phi) \le A\,\Delta \ll A\,(N-k)    (A.4)
This proves that A, B, C can safely be ignored in comparison to the SIG term, which is proportional to (N − k). Moreover, D, E are also approximately zero, as the noise u(n) is assumed uncorrelated with the signal s(n). The term F = Nσ²δ(k), as the noise is assumed to be white. Hence, ignoring A, B, C, D, E, we get

r(k) = \sum_{n=0}^{N-k-1} x(n)\,x(n+k)
 \simeq \sum_{i=1}^{M} (N-k)\,A_i^2\cos(2\pi f_i k) + N\sigma^2\delta(k)
 \simeq \sum_{i=1}^{M} N e^{-\alpha k} A_i^2\cos(2\pi f_i k) + N\sigma^2\delta(k)    (A.5)

where α = 1/N and hence α ≪ 1.
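The approximation (A.5) is easy to check numerically. The following sketch (ours, with arbitrary illustrative parameters) compares the lagged sum over a block against the SIG and F terms of (A.2) for a two-sinusoid signal in white noise:

```python
import numpy as np

# Illustrative two-sinusoid block (parameters are ours, not the paper's)
rng = np.random.default_rng(1)
N, sigma = 4096, 0.5
amps, freqs, phases = [1.0, 0.7], [0.07, 0.19], [0.3, 1.1]
n = np.arange(N)
x = sum(A * np.cos(2 * np.pi * f * n + ph)
        for A, f, ph in zip(amps, freqs, phases))
x = x + sigma * rng.standard_normal(N)

def r_empirical(k):
    # lagged sum over the block, as on the left-hand side of (A.5)
    return x[:N - k] @ x[k:]

def r_model(k):
    # SIG + F terms only; the oscillatory and cross terms average out
    val = (N - k) * sum(0.5 * A**2 * np.cos(2 * np.pi * f * k)
                        for A, f in zip(amps, freqs))
    if k == 0:
        val += N * sigma**2
    return val
```

For lags small compared to the block length, the two agree to within a small fraction of the block energy, as the analysis predicts.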
A.2. Solving the least squares matrix equation
In this section, we will analytically solve the LeSF equation for the form of the autocorrelation coefficients given above in (A.5). The (L × L) matrix LeSF equation is reproduced below:
\begin{pmatrix}
r(0) & r(1) & r(2) & \cdots & r(L-1)\\
r(1) & r(0) & r(1) & \cdots & r(L-2)\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
r(L-1) & r(L-2) & \cdots & \cdots & r(0)
\end{pmatrix}
\begin{pmatrix} w(0)\\ w(1)\\ \vdots\\ w(L-1) \end{pmatrix}
=
\begin{pmatrix} r(P)\\ r(P+1)\\ \vdots\\ r(P+L-1) \end{pmatrix}    (A.6)
where w(0), w(1), …, w(L − 1) are the LeSF filter tap weights and the autocorrelation coefficients r(k) are given by (A.5). In Anderson et al. (1983), it has been shown that, for r(k) of the form (A.5), the functional form of the filter tap weights in (A.6) is given by

w(k) = \sum_{i=1}^{M} \big(B_i e^{-b_i k} + C_i e^{b_i k}\big)\cos(\omega_i(k+P))    (A.7)
where P is the bulk delay. In (A.7), the quantities C_i, B_i, b_i are unknown. Our objective is to solve (A.6) in closed form for the unknown quantities in the filter tap weights w. Towards this end, let us consider the (p + 1)th equation in the system of Eq. (A.6), which is reproduced below:

\sum_{k=0}^{p-1} r(p-k)\,w(k) + r(0)\,w(p) + \sum_{k=p+1}^{L-1} r(k-p)\,w(k) = r(P+p)    (A.8)

Next, we substitute the functional forms of r(k) and w(k) into (A.8). We collect the terms that correspond to the ith sinusoid together and call them ''self-terms'', while the terms that have contributions from both the ith and the jth sinusoid are called ''cross-terms''. As we will show, some of these cross-terms can be ignored in comparison to the self-terms, thus facilitating an analytical solution. The self-terms due to the ith sinusoid on the left-hand side of (A.8) are

\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]\cos(\omega_i(k+P))\,\big[A_i^2 e^{-\alpha(p-k)}\cos(\omega_i(p-k))\big]
 + \big[B_i e^{-b_i p}+C_i e^{b_i p}\big]\cos(\omega_i(p+P))\Big[\sum_{i=1}^{M}A_i^2+\sigma^2\Big]
 + \sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]\cos(\omega_i(k+P))\,\big[A_i^2 e^{-\alpha(k-p)}\cos(\omega_i(k-p))\big]

= \underbrace{\frac{1}{2}A_i^2\cos(\omega_i(p+P))\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}}_{stationary}
 + \underbrace{\frac{1}{2}A_i^2\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}\cos(2\omega_i k-\omega_i(p-P))}_{non\text{-}stationary}
 + \underbrace{\big[B_i e^{-b_i p}+C_i e^{b_i p}\big]\cos(\omega_i(p+P))\Big[\sum_{i=1}^{M}A_i^2+\sigma^2\Big]}_{stationary}
 + \underbrace{\frac{1}{2}A_i^2\cos(\omega_i(p+P))\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}}_{stationary}
 + \underbrace{\frac{1}{2}A_i^2\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}\cos(2\omega_i k-\omega_i(p-P))}_{non\text{-}stationary}    (A.9)
The ''non-stationary'' terms approximately sum to zero, due to the self-cancelling positive and negative swings of the sinusoid at frequency 2\omega_i, and hence can be ignored in comparison to the ''stationary'' terms. Similarly, in the (p + 1)th equation there are ''cross-terms'' that get contributions from the ith and the jth sinusoid. These terms are given below:
\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]\cos(\omega_i(k+P))\,\big[A_j^2 e^{-\alpha(p-k)}\cos(\omega_j(p-k))\big]
 + \sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]\cos(\omega_i(k+P))\,\big[A_j^2 e^{-\alpha(k-p)}\cos(\omega_j(k-p))\big]

= \underbrace{\frac{1}{2}A_j^2\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}\cos((\omega_i-\omega_j)k+\omega_i P+\omega_j p)}_{non\text{-}stationary}
 + \underbrace{\frac{1}{2}A_j^2\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}\cos((\omega_i+\omega_j)k+\omega_i P-\omega_j p)}_{non\text{-}stationary}
 + \underbrace{\frac{1}{2}A_j^2\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}\cos((\omega_i-\omega_j)k+\omega_i P+\omega_j p)}_{non\text{-}stationary}
 + \underbrace{\frac{1}{2}A_j^2\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}\cos((\omega_i+\omega_j)k+\omega_i P-\omega_j p)}_{non\text{-}stationary}    (A.10)
If (\omega_i-\omega_j) \gg 2\pi/L and (\omega_i+\omega_j) \neq 2\pi, then these ''non-stationary'' terms also approximately sum to zero. Therefore, all the cross-terms noted above can be safely neglected on the left-hand side of Eq. (A.8). Neglecting all the non-stationary terms in (A.9) and (A.10), we get
\frac{1}{2}A_i^2\cos(\omega_i(p+P))\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}
 + \big[B_i e^{-b_i p}+C_i e^{b_i p}\big]\cos(\omega_i(p+P))\Big[\sum_{i=1}^{M}A_i^2+\sigma^2\Big]
 + \frac{1}{2}A_i^2\cos(\omega_i(p+P))\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}
 = \sum_{i=1}^{M} e^{-\alpha(p+P)}A_i^2\cos(\omega_i(p+P))    (A.11)

Next, we collect all the terms in (A.11) with the coefficients e^{-b_i p} and e^{+b_i p} for each ith sinusoid and set them to zero, as there are no terms with these coefficients on the right-hand side of (A.11). Consider the terms with the coefficient e^{-b_i p}, which are given below and set to zero as explained above:

\sum_{i=1}^{M}\left[-\frac{A_i^2}{2}\cos(\omega_i(p+P))\,\frac{B_i e^{-b_i p}}{1-e^{-(b_i-\alpha)}}
 + B_i e^{-b_i p}\cos(\omega_i(p+P))\Big(\sum_{i=1}^{M}A_i^2+\sigma^2\Big)\right]
 + \sum_{i=1}^{M}\frac{A_i^2}{2}\,\frac{B_i e^{-b_i p}\,e^{-(\alpha+b_i)}\cos(\omega_i(p+P))}{1-e^{-(\alpha+b_i)}} = 0    (A.12)
V. Tyagi et al. / Speech Communication 48 (2006) 1528–1544 1543
Therefore, for each i, we get the relationship

\cosh b_i = \cosh\alpha + \frac{\rho_i}{2\gamma_i+\rho_i+2}\,\sinh\alpha    (A.13)

where \rho_i denotes the ''partial'' SNR of the sinusoid at frequency \omega_i, i.e., \rho_i = A_i^2/\sigma^2, and the complementary signal SNR is denoted \gamma_i = \big(\sum_{m=1,\,m\neq i}^{M}A_m^2\big)/\sigma^2. The coefficients of the terms e^{-b_i p} and e^{+b_i p} are the same for each of the L equations, and setting them to zero leads to just one equation, which relates b_i to \alpha, \rho_i and \gamma_i. The next step is to solve for B_i and C_i. Towards this end, we equate the coefficients of e^{-\alpha p} on both the left- and right-hand sides of (A.11). This leads to
\frac{A_i^2\cos(\omega_i(p+P))\,B_i e^{-\alpha p}}{1-e^{-(\alpha-b_i)}}
 + \frac{A_i^2\cos(\omega_i(p+P))\,C_i e^{-\alpha p}}{1-e^{-(\alpha+b_i)}}
 = A_i^2 e^{-\alpha p}\,e^{-\alpha P}\cos(\omega_i(p+P))    (A.14)
Similarly, we set the coefficient of e^{+\alpha p} to zero, as there is no term with this coefficient on the right-hand side of (A.11). This leads to

\frac{A_i^2\cos(\omega_i(p+P))\,B_i e^{-(\alpha+b_i)}\,e^{-\alpha L+\alpha p+\alpha-b_i L+b_i}}{1-e^{-(\alpha+b_i)}}
 + \frac{A_i^2\cos(\omega_i(p+P))\,C_i e^{-(\alpha-b_i)}\,e^{-\alpha L+\alpha p+\alpha+b_i L-b_i}}{1-e^{-(\alpha-b_i)}} = 0    (A.15)
There are two unknowns, B_i and C_i, in (A.14) and (A.15). Solving them simultaneously gives

B_i = \frac{2\,e^{-b_i}\,e^{-\alpha P}\,(\alpha+b_i)^2\,(b_i-\alpha)}{(\alpha+b_i)^2 - e^{-2 b_i L}\,(b_i-\alpha)^2},\qquad
C_i = \frac{2\,e^{-b_i(2L+1)+1}\,e^{-\alpha P}\,(\alpha+b_i)\,(b_i-\alpha)^2}{(\alpha+b_i)^2 - e^{-2 b_i L}\,(b_i-\alpha)^2}    (A.16)

This concludes the analytic solution of the LeSF equation.
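As a rough numerical check, the closed-form weights can be evaluated directly from (A.13), (A.16) and (A.7). The sketch below is our illustration, not part of the paper: the function and parameter names are ours, it treats the single-sinusoid case (so \gamma_i may be set to 0), and the constants follow the reconstruction of (A.16) above, so the absolute scaling should be treated as indicative only.

```python
import numpy as np

def lesf_weights(rho, gamma, alpha, L, P, omega):
    """Closed-form LeSF taps for one sinusoid, per (A.7), (A.13), (A.16).

    Illustrative sketch: names are ours, and the constants B, C follow
    the reconstructed (A.16), so treat the scaling as indicative.
    """
    # (A.13): bandwidth parameter b_i from the partial SNR rho = A^2/sigma^2
    b = np.arccosh(np.cosh(alpha) + rho / (2 * gamma + rho + 2) * np.sinh(alpha))
    # (A.16): the two exponential-envelope constants
    D = (alpha + b)**2 - np.exp(-2 * b * L) * (b - alpha)**2
    B = 2 * np.exp(-b) * np.exp(-alpha * P) * (alpha + b)**2 * (b - alpha) / D
    C = (2 * np.exp(-b * (2 * L + 1) + 1) * np.exp(-alpha * P)
         * (alpha + b) * (b - alpha)**2 / D)
    # (A.7): exponential envelope on a cosine at the sinusoid frequency
    k = np.arange(L)
    return (B * np.exp(-b * k) + C * np.exp(b * k)) * np.cos(omega * (k + P))
```

For a strong sinusoid (large ρ) the envelope decays slowly, giving the long, sharply tuned filter discussed in Section 5.2.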
References

Anderson, C.M., Satorius, E.H., Zeidler, J.R., 1983. Adaptive enhancement of finite bandwidth signals in white Gaussian noise. IEEE Trans. ASSP ASSP-31 (1).
Bershad, N., Feintuch, P., Reed, F., Fisher, B., 1980. Tracking characteristics of the LMS adaptive line enhancer – response to a linear chirp signal in noise. IEEE Trans. ASSP ASSP-28, 504–517.
Cole, R.A., Fanty, M., Lander, T., 1994. Telephone speech corpus at CSLU. In: Proc. ICSLP, Yokohama, Japan.
Compton, R., 1980. Pointing accuracy and dynamic range in a steered beam antenna array. IEEE Trans. Aerosp. Electron. Systems AES-16, 280–287.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean square error short-time spectral magnitude estimator. IEEE Trans. ASSP ASSP-32, 1109–1121.
Ephraim, Y., Malah, D., 1985. Speech enhancement using a minimum mean square error log spectral amplitude estimator. IEEE Trans. ASSP ASSP-33 (2).
Frost, O.L., 1972. An algorithm for linearly constrained adaptive array processing. Proc. IEEE 60, 926–935.
Gersho, A., 1969. Adaptive equalization of highly dispersive channels for data transmission. Bell Systems Tech. J. 48, 55–70.
Gibson, C., Haykin, S., 1980. Learning characteristics of adaptive lattice filtering algorithms. IEEE Trans. ASSP ASSP-28, 681–692.
Griffiths, L.J., 1969. A simple adaptive algorithm for real time processing in antenna arrays. Proc. IEEE 57, 1696–1704.
Haykin, S., 1993. Adaptive Filter Theory. Prentice-Hall Publishers, NJ, USA.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. SAP 2 (4).
Kim, H.K., Rose, R.C., 2003. Cepstrum domain acoustic feature compensation based on decomposition of speech and noise for ASR in noisy environments. IEEE Trans. SAP 11 (5).
Lathoud, G., Doss, M.M., Mesot, B., 2005. A spectrogram model for enhanced source localization and noise robust ASR. In: Proc. Eurospeech, Lisbon, Portugal.
Marple, L., 1981. Efficient least squares FIR system identification. IEEE Trans. ASSP ASSP-29, 62–73.
Martin, R., 2001. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. SAP 9 (5).
McAulay, R.J., Malpass, M.L., 1980. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. ASSP ASSP-28 (2).
McAulay, R.J., Quatieri, T.F., 1986. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. ASSP ASSP-34 (4).
Rabiner, L., Crochiere, R., Allen, J., 1978. FIR system modeling and identification in the presence of noise and band-limited inputs. IEEE Trans. ASSP ASSP-26, 319–333.
Sambur, M.R., 1978. Adaptive noise canceling for speech signals. IEEE Trans. ASSP ASSP-26 (5).
Satorius, E., Alexander, S.T., 1979. Channel equalization using adaptive lattice algorithms. IEEE Trans. Comm. 27, 899–905.
Satorius, E., Pack, J., 1981. Application of least squares lattice algorithms for adaptive equalization. IEEE Trans. Comm. COM-29, 136–142.
Satorius, E., Zeidler, J., Alexander, S., 1978. Linear predictive digital filtering of narrowband processes in additive broadband noise. Naval Ocean Systems Center, San Diego, CA, Technical Report 331, November 1978.
Sondhi, M., Berkley, D., 1980. Silencing echoes on the telephone network. Proc. IEEE 68, 948–963.
Varga, A., Steeneken, H., Tomlinson, M., Jones, D., 1992. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England.
Widrow, B. et al., 1975. Adaptive noise cancelling: principles and applications. Proc. IEEE 65, 1692–1716.
Young, S., Odell, J., Ollason, D., Valtchev, V., Woodland, P., 1995. The HTK Book. Cambridge University.
Zeidler, J.R., Satorius, E.H., Chabries, D.M., Wexler, H.T., 1978. Adaptive enhancement of multiple sinusoids in uncorrelated noise. IEEE Trans. ASSP ASSP-26 (3).