Speech Communication 48 (2006) 1528–1544
www.elsevier.com/locate/specom
Least squares filtering of speech signals for robust ASR ☆
Vivek Tyagi *, Christian Wellekens, Dirk T.M. Slock
Institute Eurecom, Multimedia Communications, 2229, route des cretes, P.O. Box 193, 06094 Sophia-Antipolis, PACA, France
Swiss Federal Institute of Technology, Lausanne, Switzerland
Received 31 July 2005; received in revised form 24 July 2006; accepted 28 July 2006
Abstract
The behavior of the least squares filter (LeSF) is analyzed for a class of non-stationary signals that are either (a) composed of multiple sinusoids (voiced speech) whose frequencies, phases and amplitudes may vary from block to block or (b) the output of an all-pole filter excited by white noise input (unvoiced speech segments), and which are embedded in white noise. In this work, analytic expressions for the weights and the output of the LeSF are derived as a function of the block length and the signal SNR computed over the corresponding block. We have used the LeSF filter estimated on each block to enhance speech signals embedded in white noise as well as other realistic noises, such as factory noise and aircraft cockpit noise [Varga, A., Steeneken, H., Tomlinson, M., Jones, D., 1992. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England]. Automatic speech recognition (ASR) experiments on a connected numbers task, OGI Numbers95, show that the proposed LeSF based features provide a significant improvement in speech recognition accuracies in various non-stationary noise conditions when compared directly to the un-enhanced speech, spectral subtraction and noise robust CJ-RASTA-PLP features.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Least squares; Adaptive filtering; Speech enhancement; Robust speech recognition
1. Introduction
Speech enhancement, amongst other signal de-noising techniques, has been a topic of great
0167-6393/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2006.07.010
☆ This work was supported by the European Commission's 6th Framework Program project DIVINES, under Contract No. FP6-002034.
* Corresponding author. Address: Institute Eurecom, Multimedia Communications, 2229, route des cretes, P.O. Box 193, 06094 Sophia-Antipolis, PACA, France. Tel.: +33 493002670; fax: +33 493002627.
E-mail addresses: [email protected] (V. Tyagi), [email protected] (C. Wellekens), [email protected] (D.T.M. Slock).
interest for the past several decades. The importance of such techniques in speech coding and automatic speech recognition systems can hardly be overstated. Towards this end, adaptive filtering techniques have been shown to be quite effective in various signal de-noising applications. Some representative examples are echo cancellation (Sondhi and Berkley, 1980), data equalization (Gersho, 1969; Satorius and Alexander, 1979; Satorius and Pack, 1981), narrow-band signal enhancement (Widrow et al., 1975; Bershad et al., 1980), beamforming (Griffiths, 1969; Frost, 1972; Compton, 1980), radar clutter rejection (Gibson and Haykin, 1980), system identification
(Rabiner et al., 1978; Marple, 1981) and speech processing (Widrow et al., 1975).
Most of the above mentioned representative examples require an explicit external noise reference to remove additive noise from the desired signal, as discussed in Widrow et al. (1975). In situations where an external noise reference for the additive noise is not available, the interfering noise may be suppressed using a Wiener linear prediction filter (for stationary input signal and stationary noise) if there is a significant difference in the bandwidth of the signal and the additive noise (Widrow et al., 1975; Zeidler et al., 1978; Anderson et al., 1983). One of the earliest uses of least mean square (LMS) filtering for speech enhancement is due to Sambur (1978). In his work, the step size of the LMS filter was chosen to be 1% of the reciprocal of the largest eigenvalue of the correlation matrix of the first voiced frame. However, speech being a non-stationary signal, the estimation of the step size based on the correlation matrix of just a single frame of the speech signal may lead to divergence of the LMS filter output. Nevertheless, the exposition in Sambur (1978) helped to illustrate the efficacy of the LMS algorithm for enhancing naturally occurring signals such as speech. In (Zeidler et al., 1978), Zeidler et al. have analyzed the steady state behavior of the adaptive line enhancer (ALE), an implementation of the least mean square algorithm that has applications in detecting and tracking narrow-band signals in broad-band noise. Specifically, they have shown that for a stationary input consisting of multiple ('N') sinusoids in white noise, the L-weight ALE can be modeled by the L × L Wiener–Hopf matrix equation and that this matrix can be transformed into a set of 2N coupled linear equations. They have derived the analytical expression for the steady-state L-weight ALE filter as a function of input SNR and the interference between the input sinusoids. It has been shown that the coupling terms between the input sinusoid pairs approach zero as the ALE filter length increases.

Fig. 1. The basic operation of the LeSF. The input to the filter is noisy speech, (x(n) = s(n) + u(n)), delayed by bulk delay = P. The filter weights w_k are estimated using the least squares algorithm based on the samples in the current frame. The output of the filter y(n) is the enhanced signal.
In (Anderson et al., 1983), Anderson et al. extended the above mentioned analysis to a stationary input consisting of finite-bandwidth signals in white noise. These signals consist of white Gaussian noise (WGN) passed through a filter whose bandwidth a is quite small relative to the Nyquist frequency, but generally comparable to the bin width 1/L. They have derived analytic expressions for the weights and the output of the LMS adaptive filter as a function of input signal bandwidth and SNR, as well as the LMS filter length and bulk delay 'z^{−P}' (please refer to Fig. 1).
In this paper, we extend the previous work in (Anderson et al., 1983; Zeidler et al., 1978) to enhancing a class of non-stationary signals that are either (a) composed of multiple sinusoids (voiced speech) whose frequencies and amplitudes may vary from block to block or (b) the output of an all-pole filter excited by white noise input (unvoiced speech segments), and which are embedded in white noise. The number of sinusoids may also vary from block to block. The key difference in the approach proposed in this paper is that we relax the assumption of the input signal being stationary. The method of least squares may be viewed as an alternative to Wiener filter theory (Haykin, 1993, p. 483). Wiener filters are derived from ensemble averages and they require good
¹ For example, just by the design of the speech databases, the initial few frames always correspond to silence and hence can be used for noise PSD estimation. However, in a realistic ASR task such assumptions cannot be made.
² Such as raising the Wiener filter or spectral subtraction gain function to a certain power which is empirically tuned, dependent on the SNR conditions.
estimates of the clean signal power spectral density (PSD) as well as the noise PSD. Consequently, one filter (optimum in a probabilistic sense) is obtained for all realizations of the operational environment, assumed to be wide-sense stationary. On the other hand, the method of least squares is deterministic in approach. Specifically, it involves the use of time averages over a block of data, with the result that the filter depends on the number of samples used in the computation. Moreover, the method of least squares does not require a noise PSD estimate. Therefore the input signal is blocked into frames and we analyze an L-weight least squares filter (LeSF), estimated on each frame, which consists of N samples of the input signal.
Working under the assumptions that the clean signal spectral vector and noise spectral vector are Gaussian distributed with the kth spectral value independent of the jth spectral value, Ephraim and Malah derived the optimum minimum mean square error (MMSE) estimator of the clean speech's spectral amplitude (MMSE-STA) (Ephraim and Malah, 1984) and its log spectral amplitude (MMSE-LSA) (Ephraim and Malah, 1985). This assumption is valid only if the clean signal and the noise are both stationary processes and the spectrum is estimated over an infinitely long window. Clearly the speech signal is neither a stationary process nor does it have a Gaussian distributed spectrum. Moreover, in most situations, the noise is not a stationary process. Besides this, MMSE-LSA, MMSE-STA, spectral subtraction (SS) and Wiener filter (WF) based techniques need a good estimate of the noise spectrum. It is often claimed that the estimate of the noise PSD can be obtained from "non-speech" frames which can be detected using a pre-tuned threshold (Ephraim and Malah, 1985, 1984). However, if the noise power changes (varying SNR conditions), there is no single threshold which can detect the non-speech frames. Moreover, if the noise is non-stationary, the noise PSD estimate obtained through "non-speech" frames may not be able to track the noise statistics well, as it is dependent on the availability of non-speech frames, which are unevenly distributed in an utterance. Martin (2001) has proposed a noise PSD estimator based on minimum statistics. However, even this technique relies on certain parameters which need to be tuned depending on the degree of non-stationarity of the noise. Several researchers have tried to use a multitude of well-tailored tuning parameters dependent on the a priori knowledge of non-speech frames,¹ the highest and lowest SNR range, and several other ad hoc weighting factors² (Kim and Rose, 2003, p. 438) to achieve noise robustness in ASR.
Therefore it is desirable to develop a new enhancement technique that does not require an explicit noise PSD estimate. The least squares filter (LeSF) based techniques fall into this category as they do not explicitly require a noise PSD estimate. Although the LeSF is optimal only in the case of the additive noise being white, the speech recognition experiments reported in this paper indicate that the LeSF is also effective in the case of non-white additive noises such as the factory noise and the aircraft cockpit noise. MFCC features computed from the LeSF enhanced speech signal lead to significant ASR accuracy improvements in various noise as well as SNR conditions. We have derived the analytical expressions for the impulse response of the L-weight least squares filter (LeSF) as a function of the input SNR (computed over the current frame), the effective bandwidth of the signal (due to finite frame length), the filter length 'L' and the frame length 'N'.
2. Least squares filter (LeSF) for signal enhancement
The basic operation of the LeSF is illustrated in Fig. 1 and it can be understood intuitively as follows. The autocorrelation sequence of the additive noise u(n), which is broad-band, decays much faster at higher lags than that of the speech signal. Therefore the use of a large filter length ('L') and the delay P causes de-correlation between the noise components of the input signal, namely (u(n − L − P + 1), u(n − L − P + 2), ..., u(n − P)), and the noise component of the reference signal, namely u(n). It is worth noting that a longer filter length L will also help to cause a de-correlation between the noise appearing at the kth filter tap, namely u(n − k − P + 1) (where k ≤ L), and the noise component of the reference signal, namely u(n). This is due to the fact that the broad-band noise's
³ For interpretation of color in Figs. 2 and 3, the reader is referred to the web version of this article.
auto-correlation coefficients decay quite rapidly at higher lags. The LeSF filter responds by adaptively forming a frequency response which has pass-bands centered at the frequencies of the formants of the speech signal while rejecting as much of the broad-band noise (whose spectrum lies away from the formant positions) as possible. Denoting the clean and the additive noise signals by s(n) and u(n) respectively, we obtain the noisy signal x(n)
x(n) = s(n) + u(n)    (1)

The LeSF filter consists of L weights, and the filter coefficients w_k for k ∈ [0, 1, 2, ..., L − 1] are estimated by minimizing the energy of the error signal e(n) over the current frame, n ∈ [0, N − 1]

e(n) = x(n) − y(n)    (2)

where

y(n) = Σ_{i=0}^{L−1} w(i) x(n − P − i)    (3)
Let A denote the (N + L) × L data matrix (Haykin, 1993) of the input frame x = [x(0), x(1), ..., x(N − 1)] and d denote the (N + L) × 1 desired signal vector, which in this case is the signal x appended by L zeros. The LeSF weight vector w is then given by
w = (Aᴴ A)⁻¹ Aᴴ d    (4)
As is well known, AᴴA is a symmetric L × L Toeplitz matrix whose (i, j) element is the temporal autocorrelation of the signal vector x estimated over the frame length (Haykin, 1993)
[AᴴA]_{i,j} = r(|i − j|)    (5)
           = Σ_{n=0}^{N−|i−j|} x(n) x(n + |i − j|)    (6)
In practice, AᴴA can always be assumed to be non-singular, due to the presence of additive noise (Haykin, 1993), for filter length L < N. The weight vector w in (4) can be obtained using the Levinson–Durbin algorithm (Haykin, 1993) without incurring a significant computational cost.
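As a concrete illustration of Eqs. (4)–(6) and the frame-by-frame operation of Fig. 1, a minimal Python sketch follows. The function names, the direct `np.linalg.solve` call (used in place of Levinson–Durbin for clarity) and all parameter values are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def lesf_weights(x, L=32, P=1):
    """Sketch of Eqs. (4)-(6): build the L x L Toeplitz matrix A^H A from
    the frame's temporal autocorrelation r(k) and solve the Normal
    equations with right-hand side r(l + P).  A direct solve is used for
    clarity; Levinson-Durbin would be cheaper.  The additive noise keeps
    the matrix non-singular for L < N."""
    N = len(x)
    # r(k) = sum_n x(n) x(n + k), the temporal autocorrelation of the frame
    r = np.array([np.dot(x[:N - k], x[k:]) for k in range(L + P)])
    R = np.array([[r[abs(i - j)] for j in range(L)] for i in range(L)])
    return np.linalg.solve(R, r[P:P + L])

def lesf_filter(x, w, P=1):
    """y(n) = sum_i w(i) x(n - P - i): bulk delay z^-P followed by the taps."""
    h = np.concatenate([np.zeros(P), w])
    return np.convolve(x, h)[:len(x)]
```

On a synthetic frame of a sinusoid in white noise, the weights obtained this way raise the output SNR above the input SNR, consistent with the pass-band behavior described in Section 3.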
3. LeSF applied to speech
In this section, we will analytically solve (4) to obtain the LeSF w. We model voiced speech using the sinusoidal model (McAulay and Quatieri, 1986), while unvoiced speech is modeled by a source–filter model. However, we show that the functional form of the equations remains the same except for a change in the parameter values.
3.1. Voiced speech
As proposed in McAulay and Quatieri (1986), voiced speech signals can be modeled as a sum of multiple sinusoids whose amplitudes, phases and frequencies can vary from frame to frame. Let us assume that a given frame of the speech signal s(n) can be approximated as a sum of M sinusoids. The number of sinusoids M may vary from block to block. Then the noisy signal x(n) can be expressed as
x(n) = Σ_{i=1}^{M} A_i cos(ω_i n + φ_i) + u(n)    (7)
where n ∈ [0, N − 1] and u(n) is a realization of white noise. Then the kth lag autocorrelation can be shown to be
r(k) = Σ_{n=0}^{N−k−1} x(n) x(n + k) ≃ Σ_{i=1}^{M} (N − k) A_i² cos(ω_i k) + N σ² δ(k)    (8)
where it is assumed that the noise u(n) is white, ergodic and uncorrelated with the signal s(n), and that N ≫ 1/(f_i − f_j) for all frequency pairs (i, j). The latter condition ensures that all the interference terms between all the sinusoid pairs (i, j) sum up to zero. The LeSF weight vector w(k) is then obtained as the solution of the Normal equations

Σ_{k=0}^{L−1} r(l − k) w(k) = r(l + P),  l ∈ [0, 1, 2, ..., L − 1]    (9)

An enhancement example is illustrated in Fig. 2. The first pane displays the magnitude spectrum of a clean, voiced speech frame. The second pane shows the spectrum of the same segment embedded in white noise (red curve³) and the spectrum of the enhanced signal (blue curve). As can be noticed, the broad-band white noise has been attenuated while the harmonics have been retained. The third pane shows the magnitude response of the LeSF filter used for enhancing this segment; it was estimated over the noisy segment itself. As can be
Fig. 2. The first pane displays the spectral magnitude of a clean speech segment. The second pane displays the spectral magnitude of the same segment corrupted by white noise (red curve), whereas the blue curve corresponds to the spectral magnitude of the enhanced signal. The third pane displays the frequency response of the LeSF filter that enhanced the noisy segment and was estimated over the noisy segment itself.
1532 V. Tyagi et al. / Speech Communication 48 (2006) 1528–1544
noticed, the LeSF filter automatically puts the pass-bands around the harmonics, thus enhancing the signal while rejecting the broad-band noise. In the sections to follow, we will present further examples and performance specifications of the LeSF filter for enhancing noisy speech signals.
The set of L linear equations described in (9) can be solved by elementary methods if the z-transform S_xx(z) of the symmetric autocorrelation sequence r(k) is a rational function of 'z' (Satorius et al., 1978). S_xx(z) is given by
S_xx(z) = Σ_{k=−∞}^{∞} r(k) z^{−k}    (10)
Consider then a real symmetric rational z-transform with M pairs of zeros and M pairs of poles
S_xx(z) = G [∏_{m=1}^{M} (z − e^{−b_m + jΨ_m})(z⁻¹ − e^{−b_m − jΨ_m})] / [∏_{m=1}^{M} (z − e^{−a_m + jω_m})(z⁻¹ − e^{−a_m − jω_m})]    (11)
If the signal x is real, then so is its autocorrelation sequence r(k). In this case the power spectrum, S_xx(z), has quadruplet sets of poles and zeros because of the presence of conjugate pairs at z = exp(±a_m ± jω_m) and z = exp(±b_m ± jΨ_m). Anderson et al. (1983) have derived the general form of the solution to (9) for input signals with rational power spectra such as that described by (11). In this case, the LeSF weights are given by
w(k) = Σ_{m=1}^{M} (B_m e^{−b_m k} cos(Ψ_m k) + C_m e^{+b_m k} cos(Ψ_m k))    (12)
As can be seen, the LeSF consists of an exponentially decaying term and an exponentially growing term, the latter attributed to reflection (Widrow et al., 1975) that occurs due to the finite filter length L. The values of the coefficients B_m and C_m can be determined by solving the set of coupled equations obtained by substituting the expression for w(k) given in (12) into (9).
To be able to use the general form of the solution of the LeSF filter as in (12), we need a pole–zero model of the input autocorrelation in the form described in (11). For sufficiently large frame length N, such that the filter length L ≪ N, we can make the following approximation:
(N − k) ≃ N e^{−k/N},  k ∈ [0, 1, 2, ..., L] and L ≪ N    (13)
The above can be verified by using the Taylor series expansion of N e^{−ak} and retaining only the linear term, as k ≪ N. We denote (a = 1/N) as a_voiced. Using this approximation in (8), we get
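The accuracy of the approximation (13) is easy to check numerically; the sketch below uses the illustrative values N = 500 and L = 100 (our own choices, matching the orders of magnitude used later in the paper):

```python
import numpy as np

# Check (N - k) ~= N * exp(-k/N) for k in [0, L] with L << N  (Eq. (13)).
# N = 500 and L = 100 are illustrative values only.
N, L = 500, 100
k = np.arange(L + 1)
exact = N - k
approx = N * np.exp(-k / N)
rel_err = np.max(np.abs(approx - exact) / exact)
# the worst case is k = L: 500*exp(-0.2) = 409.4 versus 400, about 2.3% off
```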
r(k) = N e^{−a_voiced k} Σ_{i=1}^{M} A_i² cos(ω_i k) + N σ² δ(k)    (14)
In this form, r(k) corresponds to a sum of multiple decaying exponential sequences and its z-transform takes the form
S_xx(z) = Σ_{m=1}^{M} [N A_m² (1 − e^{−2a}) / 2] × [ 1 / ((z − e^{−a_m + jω_m})(z⁻¹ − e^{−a_m − jω_m})) + 1 / ((z − e^{−a_m − jω_m})(z⁻¹ − e^{−a_m + jω_m})) ] + N σ²,
where a_m = a_voiced = 1/N ∀ m ∈ [1, M]    (15)
Fig. 3. An example of a two-formant vocal-tract frequency response which is excited by white noise to synthesize unvoiced speech.
V. Tyagi et al. / Speech Communication 48 (2006) 1528–1544 1533
3.2. Unvoiced speech
We model unvoiced speech s(n) as the output of an all-pole transfer function (whose poles are at z = e^{−a_i^{unvoiced} ± jω_i}) excited by a white noise signal e(n). Specifically,

S(z) = E(z) / [∏_{i=1}^{Q} (z − e^{−a_i^{unvoiced} + jω_i})(z − e^{−a_i^{unvoiced} − jω_i})]    (16)
where S(z) and E(z) are the z-transforms of the unvoiced speech signal s(n) and the white noise excitation signal e(n), respectively. Then it can be shown that the autocorrelation coefficients of the unvoiced speech are also decaying exponentials (Haykin, 1993, p. 118), i.e.,
r_unvoiced(k) = Σ_{i=1}^{Q} e^{−a_i^{unvoiced} k} cos(ω_i k),    (17)
where the decay factor a_i^{unvoiced} > a_voiced = 1/N (where N is the block length). This is due to the fact that voiced speech has sharper spectral peaks than unvoiced speech. Consequently the autocorrelation coefficients of the unvoiced speech decay much faster than those of the voiced speech. However, the functional form of the autocorrelation coefficients of the voiced and unvoiced speech is the same, except that a_voiced < a_unvoiced. In the presence of white noise, the power spectral density of the noisy unvoiced speech segment is given by
S_xx(z) = Σ_{i=1}^{Q} [N A_i² (1 − e^{−2a_i}) / 2] × [ 1 / ((z − e^{−a_i + jω_i})(z⁻¹ − e^{−a_i − jω_i})) + 1 / ((z − e^{−a_i − jω_i})(z⁻¹ − e^{−a_i + jω_i})) ] + N σ²    (18)
where a_i is the decay factor of the ith pole pair. We note that the functional forms of the power spectral densities in (18) and (15) are the same, except that a_i in (18) will in general be greater than a_voiced in (15). Therefore the functional form of the LeSF filter w in (12) remains the same for both voiced and unvoiced speech. It is just that for unvoiced speech the bandwidth of the pass-bands of the LeSF will be wider than for the voiced LeSF. In Fig. 3, we show a transfer function with two complex-pole pairs (at conjugate symmetric positions) that is used to synthesize unvoiced speech by exciting it with white noise. The first pane shows the pole–zero plot. The second pane shows the frequency response of this all-pole model. In the third pane, the blue, red and green curves are the FFT magnitudes of the clean speech, the noisy speech corrupted by white noise at −3 dB SNR and the LeSF enhanced speech, respectively. The fact that the green curve matches the blue curve closely shows that the LeSF has been able to filter out the noise component.
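The unvoiced model of Eq. (16) is straightforward to simulate. In the sketch below, the pole radii, pole angles and signal length are our own illustrative choices (not the values behind Fig. 3): a two-pole-pair all-pole filter is excited with white noise and the output spectrum peaks near the pole angles, as expected.

```python
import numpy as np

rng = np.random.default_rng(0)
Nsamp = 8192
e = rng.standard_normal(Nsamp)          # white-noise excitation e(n)

# two conjugate pole pairs at illustrative "formant" angles 2*pi*0.15 and
# 2*pi*0.3 (normalized frequency), with radii exp(-a_i) of 0.95 and 0.9
poles = []
for radius, f in [(0.95, 0.15), (0.9, 0.3)]:
    poles += [radius * np.exp(2j * np.pi * f), radius * np.exp(-2j * np.pi * f)]
a = np.real(np.poly(poles))             # denominator coefficients of 1/A(z)

# all-pole recursion: s(n) = e(n) - sum_{k>=1} a_k s(n - k)
s = np.zeros(Nsamp)
for n in range(Nsamp):
    acc = e[n]
    for k in range(1, len(a)):
        if n - k >= 0:
            acc -= a[k] * s[n - k]
    s[n] = acc

spec = np.abs(np.fft.rfft(s))
freqs = np.fft.rfftfreq(Nsamp)          # normalized frequency axis, 0..0.5
```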
3.3. Analytic form of LeSF
From now on we will not make any distinction between the exponential decay factors a_voiced and a_unvoiced, as the functional form of the equations remains the same. Therefore the following discussion is valid for both voiced and unvoiced speech.
To be able to use the general form of the solution of the LeSF filter as in (12), we need a pole–zero model of the input autocorrelation in the form described in (11). Under the approximation that the decaying exponentials are widely spaced along the unit circle, the power spectrum S_xx(z) in (15), which consists of a sum of terms, can be
⁴ Due to self-cancelling positive and negative half periods of a sinusoid.
approximated by a ratio of products of terms (of the form (z − e^{q + jθ})), leading to a rational z-transform. Specifically, as explained in (Satorius et al., 1978; Anderson et al., 1983), we make the following assumptions:
• The pole pairs in (15) lie sufficiently close to the unit circle (easily satisfied as a ≃ 0).
• All the frequency pairs (ω_i, ω_j) in (15) are sufficiently separated from each other such that their contributions to the total power spectrum do not overlap significantly.
The z-transform of the total input can then be expressed as
S_xx(z) = σ² ∏_{m=1}^{M} [(z − e^{−b_m + jω_m})(z − e^{+b_m + jω_m})(z − e^{+b_m − jω_m})(z − e^{−b_m − jω_m})] / [(z − e^{−a_m + jω_m})(z − e^{+a_m + jω_m})(z − e^{+a_m − jω_m})(z − e^{−a_m − jω_m})],
where a_m = 1/N    (19)
Corresponding to each of the sinusoidal components in the input signal there are four poles at locations z = e^{±a_m ± jω_m}, and there are four zeros on the same radial lines as the signal poles but at different distances from the unit circle. Using the general solution described in (12), which has been derived at length in (Anderson et al., 1983), the solution of the LeSF weight vector for the present problem is
w(n) = Σ_{m=1}^{M} (B_m e^{−b_m n} + C_m e^{+b_m n}) cos(ω_m (n + P))    (20)
The values of b_m, B_m and C_m can be determined by substituting (20) and (14) in (9). The lth equation in the linear system described in (9) has terms with coefficients exp(−b_m l), exp(+b_m l), exp(−al)cos(ω_m(l + P)) and exp(al)cos(ω_m(l + P)). Besides these, there are two other kinds of terms that can be neglected:

• "Non-stationary" terms that are modulated by a sinusoid at frequency 2ω_m, where m ∈ [1, M]. For ω_m ≠ 0 and ω_m ≠ π, their total contribution is approximately zero.⁴
• Interference terms that are modulated by a sinusoid at frequency Δω = (ω_i − ω_j), where (i, j) ∈ [1, ..., M]. If the filter length L ≫ 2π/Δω, these interference terms approximately sum up to zero and hence can be neglected.

The detailed analytic solution is presented in Appendix A. We note that the analytic solution for the filter weights w(n) has been developed only for the special case when the noise is white. Unfortunately, it is not possible to derive a closed-form (analytic) solution of the filter weights w(n) for non-white noises. However, as the filter weight w(n) is a continuous function of the noise autocorrelation coefficients (A.6), and white noise is the limiting case of broad-band noise as the bandwidth becomes infinite (or equal to the Nyquist frequency for discrete systems, as is the case here), we expect the filter weights w(n) for a non-white, broad-band noise to approximately follow a behavior similar to that of w(n) when the noise is white. However, if the noise is significantly narrow-band, then this discussion does not hold.
The coefficients of the terms exp(−b_m l) and exp(+b_m l) are the same for each of the L equations, and setting them to zero leads to just one equation which relates b_m to a and the SNR. Let q_i denote the "partial" SNR of the sinusoid at frequency ω_i, i.e., q_i = A_i²/σ², and let the complementary signal SNR be denoted as c_i = (Σ_{m=1, m≠i}^{M} A_m²)/σ². Then we have the following relation:

cosh b_i = cosh a + [q_i / (2c_i + q_i + 2)] sinh a    (21)
There are two interesting cases. The first is when the sinusoid at frequency ω_i is significantly stronger than the other sinusoids, such that c_i is quite low. This is illustrated in Fig. 4, where we plot the bandwidth b_i of the LeSF's pass-band centered around ω_i as a function of the partial SNR of the ith sinusoid, q_i. The complementary signal's SNR is quite low, at c_i = −6.99 dB. We plot curves for different "effective" input sinusoid bandwidths a. From (15), we note that a is the reciprocal of the frame length N. The vertical line in Fig. 4 corresponds to the case when q_i = c_i. We note that for a given partial SNR q_i, the LeSF bandwidth becomes narrower as the frame length N increases, indicating a better selectivity of the LeSF filter.
In Fig. 5, we plot the bandwidth b_i as a function of q_i for the cases when the complementary signal SNR is high, at c_i = 10 dB, and low, at c_i = −6.99 dB. The two dots correspond to the case when q_i = c_i. We note that c_i = 10 dB corresponds to a signal
Fig. 5. Plot of the filter bandwidth b_i centered around frequency ω_i as a function of partial sinusoid SNR q_i for given complementary signal SNRs c_i = −6.99 dB and 10 dB, respectively. The "effective" input bandwidth a = 0.01 for both curves. The two dots correspond to the cases when the partial SNR q_i is equal to the complementary signal SNR c_i.
Fig. 4. Plot of the filter bandwidth b_i centered around frequency ω_i as a function of partial sinusoid SNR q_i for a given complementary signal SNR c_i = −6.99 dB and "effective" input bandwidths a = 0.01, 0.005 and 0.001, respectively. The vertical line meets the three curves where q_i = c_i.
with high overall SNR.⁵ Therefore the cross-over point (c_i = q_i) for low c_i occurs at a narrower bandwidth as compared to the high c_i case. This is so because in the former case the overall signal SNR is low, and thus the LeSF filter has to have narrower pass-bands to reject as much of the noise as possible.
⁵ As the overall SNR of the signal = 10 log₁₀(10^{c_i/10} + 10^{q_i/10}).
B_i and C_i in (20) are determined by equating their respective coefficients. The "non-stationary" interference terms between all pairs of frequencies (ω_i, ω_j) can be neglected if (ω_i − ω_j) ≫ 2π/L. This requires that the LeSF's frequency resolution (2π/L) should be able to resolve the constituent sinusoids

B_i = 2 e^{−b_i} e^{−aP} (a + b_i)² (b_i − a) / ((a + b_i)² − e^{−2 b_i L} (b_i − a)²)
C_i = 2 e^{−b_i (2L+1)} e^{−aP} (a + b_i)(b_i − a)² / ((a + b_i)² − e^{−2 b_i L} (b_i − a)²)    (22)
We note from (21) that the various sinusoids are coupled with each other through the dependence of their bandwidths b_i on the complementary signal SNR c_i. As a consequence, B_i and C_i are also indirectly dependent on the powers of the other sinusoids through b_i.
In Fig. 6, the magnitude response of the LeSF filter is plotted for various SNRs. The input in this case consists of three sinusoids at normalized frequencies (0.1, 0.2, 0.4). The frame length is N = 500 and the filter length is L = 100. As the signal SNR decreases, the bandwidth of the LeSF filter starts to decrease in order to reject as much of the noise as possible. The LeSF filter's gain also decreases with decreasing SNR. Similar results were reported in (Anderson et al., 1983; Zeidler et al., 1978) for the case of stationary inputs.
In Fig. 7, we plot the spectrogram of a clean speech utterance. Figs. 8 and 9 display the same
Fig. 6. Plot of the magnitude response of the LeSF filter as a function of the input SNR. The input consists of three sinusoids at normalized frequencies (0.1, 0.2, 0.4) with relative strengths (1:0.6:0.4), respectively.
Fig. 7. Clean spectrogram of an utterance from the OGI Numbers95 database.
Fig. 8. Spectrogram of the utterance corrupted by white noise at 6 dB SNR.
Fig. 9. Spectrogram of the noisy utterance (white noise) enhanced by an (L = 100) tap LeSF filter that has been estimated over blocks of length (N = 500).
1536 V. Tyagi et al. / Speech Communication 48 (2006) 1528–1544
utterance embedded in white noise at 6 dB SNR and its LeSF enhanced version, respectively. As can be seen from these spectrograms, the LeSF has been able to reject the additive white noise to a large extent while retaining most of the speech signal. Figs. 10 and 11 display the same utterance embedded in F16-cockpit noise at 6 dB SNR and its LeSF enhanced version, respectively. As can be seen from the spectrograms, except for the narrow-band noise component centered around 2.5 kHz, the LeSF filter has been able to reject a significant amount of the additive F-16 cockpit noise (Varga et al., 1992) from the noisy speech signal. By design, the LeSF puts pass-bands around those regions of the input signal's spectrum that have high spectral energy density.
Fig. 11. Spectrogram of the noisy utterance (F16-cockpit noise) enhanced by an (L = 100) tap LeSF filter that has been estimated over blocks of length (N = 500).
Fig. 10. Spectrogram of the utterance corrupted by F16-cockpit noise at 6 dB SNR.
Fig. 12. LeSF gain plotted as a function of input SNR for a fixed block length N = 500 and various filter lengths L = 100, 80, 60.
As a consequence, the LeSF has not been able to attenuate the narrow-band noise well in the spectrograms of Figs. 10 and 11. However, it has attenuated the broad-band noise components quite well.
4. Gain of the LeSF filter
The LeSF filter output consists of the filtered sinusoids and the filtered noise signal. For the input signal described by (7), filtered by a LeSF filter with coefficients as in (20), the output filtered signal power P_signal and the output filtered noise power P_noise are approximately⁶ given by
P_signal = Σ_{i=0}^{L−1} Σ_{j=0}^{L−1} w(i) w(j) r(|i − j|)    (23)
         = Σ_{i=0}^{L−1} Σ_{j=0}^{L−1} w(i) w(j) [(N − |i − j|)/N] Σ_{m=1}^{M} A_m² cos(2π f_m (i − j))    (24)

P_noise = Σ_{n=0}^{L−1} σ² w²(n) = Σ_{i=1}^{M} { [B_i² e^{−b_i L} + C_i² e^{b_i L}] σ² sinh(b_i L)/(2 b_i) + σ² B_i C_i L }    (25)
The output SNR of the LeSF filter is given by (23) divided by (25). The LeSF gain is given by the ratio of the output SNR to the input SNR = Σ_{i=1}^{M} A_i² / (2σ²).
In Fig. 12, we plot the LeSF broadband gain as a function of input SNR, for a fixed block length of N = 500 and varying filter lengths (L = 100, 80, 60). As can be noted from Fig. 12, the LeSF broadband gain approaches a horizontal asymptote for decreasing input SNR. This is in agreement with the fact that the bandwidth b_i of the LeSF filter approaches a horizontal asymptote (Fig. 4) for decreasing SNR. We note from Fig. 12 that the LeSF broadband gain increases as the filter length L increases for a fixed block length N = 500. However, since the L-tap LeSF filter is estimated using the N samples from a given block, the filter length cannot be increased arbitrarily and is limited from above by the block length 'N'.
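The quadratic form in Eq. (23) can be checked numerically: for the full (zero-padded) convolution, the output energy equals the quadratic form in the frame autocorrelation exactly. The sketch below uses our own illustrative signal and filter (any L-tap filter works here; it need not be a LeSF):

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 4000, 16
n = np.arange(N)
x = np.cos(2 * np.pi * 0.1 * n) + 0.5 * rng.standard_normal(N)

w = rng.standard_normal(L) / L          # an arbitrary L-tap filter
# r(k) per Eq. (6): r(k) = sum_n x(n) x(n + k)
r = np.array([np.dot(x[:N - k], x[k:]) for k in range(L)])

# Eq. (23): output power as a quadratic form in the autocorrelation
P_quad = sum(w[i] * w[j] * r[abs(i - j)] for i in range(L) for j in range(L))

# energy of the zero-padded convolution y(n) = sum_i w(i) x(n - i)
P_direct = np.sum(np.convolve(x, w) ** 2)
```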
⁶ Assuming N ≫ L, such that the initial L samples can be used to initialize the filter.
In Fig. 13, we plot the LeSF broadband gain as a function of input SNR, for a fixed filter length L = 100 and varying block lengths (N = 300, 400, 500). We note that the LeSF broadband gain increases as the block length N increases. However, for a non-stationary signal such as speech, as the block length increases, the corresponding power spectrum becomes more broad-band. Therefore we will not be able to model the corresponding block as a sum of a small number of sinusoids M, as done in (7). As a result, the number of sinusoids M will be large and possibly closely spaced, leading to significant interference terms between the constituent sinusoids in (8) and (14).
Fig. 13. LeSF gain plotted as a function of input SNR for fixed filter length L = 100 and various block lengths N = 300, 400, 500. (Axes: input SNR in dB vs. LeSF gain in dB.)
Table 1
Word error rate results for factory noise using soft-decision spectral subtraction

SNR     LeSF MFCC   POST-FILT   POWER-FILT   PSIL
Clean   6.6         8.1         8.3          7.1
5. Experiments and results
In order to assess the effectiveness of the proposed algorithm, speech recognition experiments were conducted on the OGI Numbers (Cole et al., 1994) corpus. This database contains spontaneously spoken, free-format connected numbers over a telephone channel. The lexicon consists of 31 words.⁷
Hidden Markov Model and Gaussian Mixture Model (HMM–GMM) based speech recognition systems were trained using the public-domain software HTK (Young et al., 1995) on the clean training set from the original Numbers corpus. The system consisted of 80 tied-state triphone HMMs with 3 emitting states per triphone and 12 mixtures per state. To verify the robustness of the features to noise, the clean test utterances were corrupted using White, Factory and F16-cockpit noise from the Noisex92 (Varga et al., 1992) database.
5.1. Bulk delay P
Noting that the autocorrelation coefficients of a periodic signal are themselves periodic with the same period (hence they do not decay with increasing lag), Sambur (1978) used a bulk delay equal to the pitch period of the voiced speech for its enhancement. However, for unvoiced speech a high bulk delay would result in significant distortion by the LeSF filter, as its autocorrelation coefficients decay much more rapidly than those of voiced speech. Therefore, we kept the bulk delay at P = 1 as a good choice for enhancing both the voiced and the unvoiced speech frames.
5.2. Block length N and filter length L
Speech signals were blocked into frames of N = 500 samples (62.5 ms) each, and an L = 100 tap LeSF filter was derived using (4), through the Levinson–Durbin algorithm, for each frame, which could be either voiced or unvoiced. The relatively high order (L = 100) of the LeSF filter is required for a twofold reason. Firstly, it provides sufficiently high frequency resolution (2π/L) to resolve the constituent sinusoids in the case of voiced speech. Secondly, it causes de-correlation between the noise appearing at the kth filter tap (u(n − k − P + 1),
⁷ With confusable words such as nine, ninety and nineteen; eight, eighty and eighteen; and so forth.
k ≤ L) and the noise component of the reference signal u(n). Each speech frame was then filtered through its corresponding LeSF filter to derive an enhanced speech frame. Finally, an MFCC feature vector was computed from the enhanced speech frame. These enhanced LeSF-MFCC features were compared to the baseline MFCC features and the noise-robust CJ-RASTA-PLP (Hermansky and Morgan, 1994) features. The MFCC feature vector computation is the same for the baseline and the LeSF-MFCC features; the only difference is that the baseline MFCC features are computed directly from the noisy speech, while the LeSF-MFCC features are computed from the LeSF-enhanced speech signal. We also compared our technique with a soft-decision spectral subtraction based technique. In Lathoud et al. (2005), the authors used a speech presence probability in conjunction with spectral subtraction to achieve noise robustness. This can be seen as a soft-decision spectral subtraction, which has been shown to be superior to hard-decision spectral subtraction by McAulay and Malpass (1980). As the training set, test set and the factory noise environment in our experiments and those of Lathoud et al. (2005) are the same, we quote the ASR results for the factory noise directly from Lathoud et al. (2005). The authors in Lathoud et al. (2005) propose three features based on soft-decision spectral subtraction, which vary in their pre-processing steps and are termed POST-FILT, POWER-FILT and PSIL. In Table 1, we quote their ASR word error rates in the factory noise environment, directly from the results reported in Lathoud et al. (2005). We note that the proposed technique outperforms all three soft-decision spectral subtraction variants.
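The per-frame processing described above can be sketched as follows. This is a minimal illustration assuming NumPy/SciPy; the function name and its defaults are ours, not the paper's. SciPy's `solve_toeplitz` applies a Levinson-type recursion to the Toeplitz normal equations (the system written out in (A.6)).

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lesf_enhance_frame(frame, L=100, P=1):
    """Enhance one speech frame with an L-tap LeSF filter and bulk delay P.

    A minimal sketch (names and defaults are ours): the Toeplitz normal
    equations of (A.6) are solved via a Levinson-type recursion, then the
    frame is filtered by the resulting FIR filter applied to the
    P-sample-delayed input.
    """
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    # Biased block autocorrelation, lags 0 .. L+P-1
    r = np.array([frame[:N - k] @ frame[k:] for k in range(L + P)]) / N
    w = solve_toeplitz(r[:L], r[P:P + L])      # R w = [r(P), ..., r(P+L-1)]
    delayed = np.concatenate([np.zeros(P), frame])[:N]
    return lfilter(w, [1.0], delayed)          # enhanced frame
```

On a synthetic sinusoid-in-noise frame, the output tracks the clean sinusoid much more closely than the noisy input once the filter transient has passed.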
The speech recognition results for the baseline MFCC, CJ-RASTA-PLP and the proposed LeSF-MFCC, at various levels of noise, are given in Tables 2–4. All the reported features in this paper have cepstral mean subtraction (CMS). The proposed LeSF-processed MFCC performs significantly
12 dB   11.3        16.2        17.0         15.7
6 dB    20.0        30.7        31.2         28.7
0 dB    41.3        63.1        61.9         58.2

All features have cepstral mean subtraction.
Table 2
Word error rate results for factory noise

SNR     MFCC   CJ-RASTA-PLP   LeSF MFCC
Clean   5.7    7.8            6.6
12 dB   12.3   12.2           11.3
6 dB    27.1   23.8           20.0
0 dB    71.0   59.8           41.3

Parameters of the LeSF filter: L = 100 and N = 500. All features have cepstral mean subtraction.
Table 3
Word error rate results for white noise

SNR     MFCC   CJ-RASTA-PLP   LeSF MFCC
Clean   5.7    7.8            6.6
12 dB   16.4   14.7           17.3
6 dB    34.6   29.0           24.1
0 dB    80.3   66.0           40.4

Parameters of the LeSF filter: L = 100 and N = 500. All features have cepstral mean subtraction.
Table 4
Word error rate results for F16-cockpit noise

SNR     MFCC   CJ-RASTA-PLP   LeSF MFCC
Clean   5.7    7.8            6.6
12 dB   13.6   14.2           12.5
6 dB    28.4   25.3           21.0
0 dB    72.3   59.2           41.0

Parameters of the LeSF filter: L = 100 and N = 500. All features have cepstral mean subtraction.
better than the others in various kinds of heavy noise conditions (SNR of 6 and 0 dB). The slight performance degradation of the LeSF-MFCC in the clean condition is due to the fact that the LeSF filter, being an all-pole filter, does not model the valleys of the clean speech spectrum well. As a result, the LeSF filter sometimes amplifies the low spectral energy regions of the clean spectrum.
Table 5 shows the word error rate of the LeSFenhanced MFCC features for a fixed block length
Table 5
Word error rate results for factory noise for varying length, L = 100, 50, 20, of the LeSF filter

SNR     LeSF L = 20   LeSF L = 50   LeSF L = 100
Clean   9.3           7.3           6.6
12 dB   14.3          12.3          11.3
6 dB    24.4          22.0          20.0
0 dB    46.5          43.0          41.3

The block length N is 500 (62.5 ms).
of 500 samples (62.5 ms) and varying LeSF filter length L. We note that the word error rate decreases as the filter length increases. This is because a higher filter length results in a sharper frequency response of the LeSF filter (narrower passbands), thereby enabling it to reject as much as possible of the broad-band noise that lies away from the frequencies of the constituent sinusoids of the clean signal.
6. Conclusion
We consider a class of non-stationary input signals that are composed of either (a) multiple sinusoids (voiced speech) whose frequencies and amplitudes may vary from block to block or (b) the output of an all-pole filter excited by a white noise input (unvoiced speech segments), and which are embedded in white noise. We have derived analytical expressions for the impulse response of the L-weight least squares filter (LeSF) as a function of the input SNR (computed over the current frame), the effective bandwidth of the signal (due to the finite frame length), the filter length L and the frame length N. Recognizing that such a time-varying sinusoidal model (McAulay and Quatieri, 1986) and the source–filter model are reasonable approximations to voiced and unvoiced speech respectively, we have applied the block-estimated LeSF filter to de-noising speech signals embedded in realistic (Varga et al., 1992) broad-band noise as commonly encountered on a factory floor and in an aircraft cockpit. The proposed technique leads to a significant improvement in ASR performance as compared to the noise-robust CJ-RASTA-PLP (Hermansky and Morgan, 1994), speech presence probability based spectral subtraction (Lathoud et al., 2005) and the MFCC features computed from the unprocessed noisy signal.
Appendix A. Solution to the equation
A.1. Autocorrelation over a block of samples
We consider the signal x(n) as
x(n) = \sum_{i=1}^{M} A_i \cos(\omega_i n + \phi_i) + u(n)    (A.1)

where n ∈ [0, N − 1] and u(n) is a realization of white noise. Then the kth-lag autocorrelation is given by
r(k) = \sum_{n=0}^{N-k-1} x(n)\,x(n+k)
     = \sum_{n=0}^{N-k-1}\Big(\sum_{i=1}^{M} A_i\cos(\omega_i n+\phi_i)+u(n)\Big)\Big(\sum_{j=1}^{M} A_j\cos(\omega_j(n+k)+\phi_j)+u(n+k)\Big)

Expanding the product and applying \cos X\,\cos Y = \frac{1}{2}[\cos(X-Y)+\cos(X+Y)] to the sinusoidal terms gives

r(k) = \underbrace{(N-k)\sum_{i=1}^{M}\frac{A_i^2}{2}\cos(\omega_i k)}_{SIG}
 + \underbrace{\sum_{i=1}^{M}\frac{A_i^2}{2}\sum_{n=0}^{N-k-1}\cos(2\omega_i n+\omega_i k+2\phi_i)}_{A}
 + \underbrace{\sum_{i=1}^{M}\sum_{j=1,\,j\neq i}^{M}\frac{A_iA_j}{2}\sum_{n=0}^{N-k-1}\cos((\omega_i-\omega_j)n-\omega_j k+\phi_i-\phi_j)}_{B}
 + \underbrace{\sum_{i=1}^{M}\sum_{j=1,\,j\neq i}^{M}\frac{A_iA_j}{2}\sum_{n=0}^{N-k-1}\cos((\omega_i+\omega_j)n+\omega_j k+\phi_i+\phi_j)}_{C}
 + \underbrace{\sum_{j=1}^{M}\sum_{n=0}^{N-k-1}A_j\,u(n)\cos(\omega_j(n+k)+\phi_j)}_{D}
 + \underbrace{\sum_{i=1}^{M}\sum_{n=0}^{N-k-1}A_i\,u(n+k)\cos(\omega_i n+\phi_i)}_{E}
 + \underbrace{\sum_{n=0}^{N-k-1}u(n)\,u(n+k)}_{F}    (A.2)
Let us consider the under-braced terms A, B, C. They are the sums of the samples of a cosine wave at frequencies 2\omega_i, \omega_i-\omega_j and \omega_i+\omega_j, respectively. If (N − k) is much greater than 2\pi/\omega_i and 2\pi/(\omega_i-\omega_j) for all frequency pairs (i, j), then the sums A, B, C will contain several periods of their corresponding cosine waves. The sum over Q periods of the samples of a cosine wave at any non-zero frequency is zero. This can be seen from the following integral, which approximates the sums of the samples in A, B, C:

\int_{t=0}^{t=Q\,2\pi/\omega} A \cos(\omega t + \phi)\,dt = 0    (A.3)
This happens because the negative and positive swings of the cosine cancel each other. Therefore A, B, C are approximately zero. Let (N − k) = QT + Δ, where Q is an integer, T is the period of a certain sinusoid at frequency ω, and Δ is the left-over part, as (N − k) is not an exact multiple of T; hence Δ < T. Then we have

\sum_{n=0}^{N-k} A\cos(\omega n+\phi) = \sum_{n=0}^{QT} A\cos(\omega n+\phi) + \sum_{n=QT+1}^{N-k} A\cos(\omega n+\phi)
 = \sum_{n=QT+1}^{N-k} A\cos(\omega n+\phi) \le A\,\Delta \ll A\,(N-k)    (A.4)
This proves that A, B, C can safely be ignored in comparison to the SIG term, which is proportional to (N − k). Moreover, D, E are also approximately zero, as the noise u(n) is assumed uncorrelated with the signal s(n). The term F = Nσ²δ(k), as the noise is assumed to be white. Hence, ignoring A, B, C, D, E, we get

r(k) = \sum_{n=0}^{N-k-1} x(n)\,x(n+k)
 \simeq \sum_{i=1}^{M} (N-k)\,A_i^2\cos(2\pi f_i k) + N\sigma^2\delta(k)
 \simeq \sum_{i=1}^{M} N e^{-\alpha k} A_i^2\cos(2\pi f_i k) + N\sigma^2\delta(k)    (A.5)

where α = 1/N and hence α ≪ 1.
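The approximation (A.5) is easy to check numerically. The following sketch (ours, with arbitrary illustrative parameters) compares the lagged sum over a block against the SIG and F terms of (A.2) for a two-sinusoid signal in white noise:

```python
import numpy as np

# Illustrative two-sinusoid block (parameters are ours, not the paper's)
rng = np.random.default_rng(1)
N, sigma = 4096, 0.5
amps, freqs, phases = [1.0, 0.7], [0.07, 0.19], [0.3, 1.1]
n = np.arange(N)
x = sum(A * np.cos(2 * np.pi * f * n + ph)
        for A, f, ph in zip(amps, freqs, phases))
x = x + sigma * rng.standard_normal(N)

def r_empirical(k):
    # lagged sum over the block, as on the left-hand side of (A.5)
    return x[:N - k] @ x[k:]

def r_model(k):
    # SIG + F terms only; the oscillatory and cross terms average out
    val = (N - k) * sum(0.5 * A**2 * np.cos(2 * np.pi * f * k)
                        for A, f in zip(amps, freqs))
    if k == 0:
        val += N * sigma**2
    return val
```

For lags small compared to the block length, the two agree to within a small fraction of the block energy, as the analysis predicts.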
A.2. Solving the least squares matrix equation
In this section, we will analytically solve the LeSF equation for the form of the autocorrelation coefficients given above in (A.5). The (L × L) matrix LeSF equation is reproduced below:
\begin{pmatrix}
r(0) & r(1) & r(2) & \cdots & r(L-1)\\
r(1) & r(0) & r(1) & \cdots & r(L-2)\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
r(L-1) & r(L-2) & \cdots & \cdots & r(0)
\end{pmatrix}
\begin{pmatrix} w(0)\\ w(1)\\ \vdots\\ w(L-1) \end{pmatrix}
=
\begin{pmatrix} r(P)\\ r(P+1)\\ \vdots\\ r(P+L-1) \end{pmatrix}    (A.6)
where w(0), w(1), …, w(L − 1) are the LeSF filter tap weights and the autocorrelation coefficients r(k) are given by (A.5). In Anderson et al. (1983), it has been shown that, for r(k) of the form (A.5), the functional form of the filter tap weights in (A.6) is given by

w(k) = \sum_{i=1}^{M} \big(B_i e^{-b_i k} + C_i e^{b_i k}\big)\cos(\omega_i(k+P))    (A.7)
where P is the bulk delay. In (A.7), the quantities C_i, B_i, b_i are unknown. Our objective is to solve (A.6) in closed form for the unknown quantities in the filter tap weights w. Towards this end, let us consider the (p + 1)th equation in the system of Eq. (A.6), which is reproduced below:

\sum_{k=0}^{p-1} r(p-k)\,w(k) + r(0)\,w(p) + \sum_{k=p+1}^{L-1} r(k-p)\,w(k) = r(P+p)    (A.8)

Next, we substitute the functional forms of r(k) and w(k) into (A.8). We collect the terms that correspond to the ith sinusoid together and call them ''self-terms'', while the terms that have contributions from both the ith and the jth sinusoid are called ''cross-terms''. As we will show, some of these cross-terms can be ignored in comparison to the self-terms, thus facilitating an analytical solution. The self-terms due to the ith sinusoid on the left-hand side of (A.8) are

\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]\cos(\omega_i(k+P))\,\big[A_i^2 e^{-\alpha(p-k)}\cos(\omega_i(p-k))\big]
 + \big[B_i e^{-b_i p}+C_i e^{b_i p}\big]\cos(\omega_i(p+P))\Big[\sum_{i=1}^{M}A_i^2+\sigma^2\Big]
 + \sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]\cos(\omega_i(k+P))\,\big[A_i^2 e^{-\alpha(k-p)}\cos(\omega_i(k-p))\big]

= \underbrace{\frac{1}{2}A_i^2\cos(\omega_i(p+P))\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}}_{stationary}
 + \underbrace{\frac{1}{2}A_i^2\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}\cos(2\omega_i k-\omega_i(p-P))}_{non\text{-}stationary}
 + \underbrace{\big[B_i e^{-b_i p}+C_i e^{b_i p}\big]\cos(\omega_i(p+P))\Big[\sum_{i=1}^{M}A_i^2+\sigma^2\Big]}_{stationary}
 + \underbrace{\frac{1}{2}A_i^2\cos(\omega_i(p+P))\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}}_{stationary}
 + \underbrace{\frac{1}{2}A_i^2\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}\cos(2\omega_i k-\omega_i(p-P))}_{non\text{-}stationary}    (A.9)
The ''non-stationary'' terms approximately sum to zero, due to the self-cancelling positive and negative swings of the sinusoid at frequency 2\omega_i, and hence can be ignored in comparison to the ''stationary'' terms. Similarly, in the (p + 1)th equation there are ''cross-terms'' that get contributions from the ith and the jth sinusoid. These terms are given below:
\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]\cos(\omega_i(k+P))\,\big[A_j^2 e^{-\alpha(p-k)}\cos(\omega_j(p-k))\big]
 + \sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]\cos(\omega_i(k+P))\,\big[A_j^2 e^{-\alpha(k-p)}\cos(\omega_j(k-p))\big]

= \underbrace{\frac{1}{2}A_j^2\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}\cos((\omega_i-\omega_j)k+\omega_i P+\omega_j p)}_{non\text{-}stationary}
 + \underbrace{\frac{1}{2}A_j^2\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}\cos((\omega_i+\omega_j)k+\omega_i P-\omega_j p)}_{non\text{-}stationary}
 + \underbrace{\frac{1}{2}A_j^2\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}\cos((\omega_i-\omega_j)k+\omega_i P+\omega_j p)}_{non\text{-}stationary}
 + \underbrace{\frac{1}{2}A_j^2\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}\cos((\omega_i+\omega_j)k+\omega_i P-\omega_j p)}_{non\text{-}stationary}    (A.10)
If (\omega_i-\omega_j) \gg 2\pi/L and (\omega_i+\omega_j) \neq 2\pi, then these ''non-stationary'' terms also approximately sum to zero. Therefore, all the cross-terms noted above can be safely neglected on the left-hand side of Eq. (A.8). Neglecting all the non-stationary terms in (A.9) and (A.10), we get
\frac{1}{2}A_i^2\cos(\omega_i(p+P))\sum_{k=0}^{p-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(p-k)}
 + \big[B_i e^{-b_i p}+C_i e^{b_i p}\big]\cos(\omega_i(p+P))\Big[\sum_{i=1}^{M}A_i^2+\sigma^2\Big]
 + \frac{1}{2}A_i^2\cos(\omega_i(p+P))\sum_{k=p+1}^{L-1}\big[B_i e^{-b_i k}+C_i e^{b_i k}\big]e^{-\alpha(k-p)}
 = \sum_{i=1}^{M} e^{-\alpha(p+P)}A_i^2\cos(\omega_i(p+P))    (A.11)

Next, we collect all the terms in (A.11) with the coefficients e^{-b_i p} and e^{+b_i p} for each ith sinusoid and set them to zero, as there are no terms with these coefficients on the right-hand side of (A.11). Consider the terms with the coefficient e^{-b_i p}, which are given below and set to zero as explained above:

\sum_{i=1}^{M}\left[-\frac{A_i^2}{2}\cos(\omega_i(p+P))\,\frac{B_i e^{-b_i p}}{1-e^{-(b_i-\alpha)}}
 + B_i e^{-b_i p}\cos(\omega_i(p+P))\Big(\sum_{i=1}^{M}A_i^2+\sigma^2\Big)\right]
 + \sum_{i=1}^{M}\frac{A_i^2}{2}\,\frac{B_i e^{-b_i p}\,e^{-(\alpha+b_i)}\cos(\omega_i(p+P))}{1-e^{-(\alpha+b_i)}} = 0    (A.12)
V. Tyagi et al. / Speech Communication 48 (2006) 1528–1544 1543
Therefore, for each i, we get the relationship

\cosh b_i = \cosh\alpha + \frac{\rho_i}{2\gamma_i+\rho_i+2}\,\sinh\alpha    (A.13)

where \rho_i denotes the ''partial'' SNR of the sinusoid at frequency \omega_i, i.e., \rho_i = A_i^2/\sigma^2, and the complementary signal SNR is denoted \gamma_i = \big(\sum_{m=1,\,m\neq i}^{M}A_m^2\big)/\sigma^2. The coefficients of the terms e^{-b_i p} and e^{+b_i p} are the same for each of the L equations, and setting them to zero leads to just one equation, which relates b_i to \alpha, \rho_i and \gamma_i. The next step is to solve for B_i and C_i. Towards this end, we equate the coefficients of e^{-\alpha p} on both the left- and right-hand sides of (A.11). This leads to
\frac{A_i^2\cos(\omega_i(p+P))\,B_i e^{-\alpha p}}{1-e^{-(\alpha-b_i)}}
 + \frac{A_i^2\cos(\omega_i(p+P))\,C_i e^{-\alpha p}}{1-e^{-(\alpha+b_i)}}
 = A_i^2 e^{-\alpha p}\,e^{-\alpha P}\cos(\omega_i(p+P))    (A.14)
Similarly, we set the coefficient of e^{+\alpha p} to zero, as there is no term with this coefficient on the right-hand side of (A.11). This leads to

\frac{A_i^2\cos(\omega_i(p+P))\,B_i e^{-(\alpha+b_i)}\,e^{-\alpha L+\alpha p+\alpha-b_i L+b_i}}{1-e^{-(\alpha+b_i)}}
 + \frac{A_i^2\cos(\omega_i(p+P))\,C_i e^{-(\alpha-b_i)}\,e^{-\alpha L+\alpha p+\alpha+b_i L-b_i}}{1-e^{-(\alpha-b_i)}} = 0    (A.15)
There are two unknowns, B_i and C_i, in (A.14) and (A.15). Solving them simultaneously gives

B_i = \frac{2\,e^{-b_i}\,e^{-\alpha P}\,(\alpha+b_i)^2\,(b_i-\alpha)}{(\alpha+b_i)^2 - e^{-2 b_i L}\,(b_i-\alpha)^2},\qquad
C_i = \frac{2\,e^{-b_i(2L+1)+1}\,e^{-\alpha P}\,(\alpha+b_i)\,(b_i-\alpha)^2}{(\alpha+b_i)^2 - e^{-2 b_i L}\,(b_i-\alpha)^2}    (A.16)

This concludes the analytic solution of the LeSF equation.
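As a rough numerical check, the closed-form weights can be evaluated directly from (A.13), (A.16) and (A.7). The sketch below is our illustration, not part of the paper: the function and parameter names are ours, it treats the single-sinusoid case (so \gamma_i may be set to 0), and the constants follow the reconstruction of (A.16) above, so the absolute scaling should be treated as indicative only.

```python
import numpy as np

def lesf_weights(rho, gamma, alpha, L, P, omega):
    """Closed-form LeSF taps for one sinusoid, per (A.7), (A.13), (A.16).

    Illustrative sketch: names are ours, and the constants B, C follow
    the reconstructed (A.16), so treat the scaling as indicative.
    """
    # (A.13): bandwidth parameter b_i from the partial SNR rho = A^2/sigma^2
    b = np.arccosh(np.cosh(alpha) + rho / (2 * gamma + rho + 2) * np.sinh(alpha))
    # (A.16): the two exponential-envelope constants
    D = (alpha + b)**2 - np.exp(-2 * b * L) * (b - alpha)**2
    B = 2 * np.exp(-b) * np.exp(-alpha * P) * (alpha + b)**2 * (b - alpha) / D
    C = (2 * np.exp(-b * (2 * L + 1) + 1) * np.exp(-alpha * P)
         * (alpha + b) * (b - alpha)**2 / D)
    # (A.7): exponential envelope on a cosine at the sinusoid frequency
    k = np.arange(L)
    return (B * np.exp(-b * k) + C * np.exp(b * k)) * np.cos(omega * (k + P))
```

For a strong sinusoid (large ρ) the envelope decays slowly, giving the long, sharply tuned filter discussed in Section 5.2.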
References

Anderson, C.M., Satorius, E.H., Zeidler, J.R., 1983. Adaptive enhancement of finite bandwidth signals in white Gaussian noise. IEEE Trans. ASSP ASSP-31 (1).
Bershad, N., Feintuch, P., Reed, F., Fisher, B., 1980. Tracking characteristics of the LMS adaptive line enhancer – response to a linear chirp signal in noise. IEEE Trans. ASSP ASSP-28, 504–517.
Cole, R.A., Fanty, M., Lander, T., 1994. Telephone speech corpus at CSLU. In: Proc. ICSLP, Yokohama, Japan.
Compton, R., 1980. Pointing accuracy and dynamic range in a steered beam antenna array. IEEE Trans. Aerosp. Electron. Systems AES-16, 280–287.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean square error short-time spectral magnitude estimator. IEEE Trans. ASSP ASSP-32, 1109–1121.
Ephraim, Y., Malah, D., 1985. Speech enhancement using a minimum mean square error log spectral amplitude estimator. IEEE Trans. ASSP ASSP-33 (2).
Frost, O.L., 1972. An algorithm for linearly constrained adaptive array processing. Proc. IEEE 60, 926–935.
Gersho, A., 1969. Adaptive equalization of highly dispersive channels for data transmission. Bell Systems Tech. J. 48, 55–70.
Gibson, C., Haykin, S., 1980. Learning characteristics of adaptive lattice filtering algorithms. IEEE Trans. ASSP ASSP-28, 681–692.
Griffiths, L.J., 1969. A simple adaptive algorithm for real time processing in antenna arrays. Proc. IEEE 57, 1696–1704.
Haykin, S., 1993. Adaptive Filter Theory. Prentice-Hall Publishers, NJ, USA.
Hermansky, H., Morgan, N., 1994. RASTA processing of speech. IEEE Trans. SAP 2 (4).
Kim, H.K., Rose, R.C., 2003. Cepstrum domain acoustic feature compensation based on decomposition of speech and noise for ASR in noisy environments. IEEE Trans. SAP 11 (5).
Lathoud, G., Doss, M.M., Mesot, B., 2005. A spectrogram model for enhanced source localization and noise robust ASR. In: Proc. Eurospeech, Lisbon, Portugal.
Marple, L., 1981. Efficient least squares FIR system identification. IEEE Trans. ASSP ASSP-29, 62–73.
Martin, R., 2001. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. SAP 9 (5).
McAulay, R.J., Malpass, M.L., 1980. Speech enhancement using a soft-decision noise suppression filter. IEEE Trans. ASSP ASSP-28 (2).
McAulay, R.J., Quatieri, T.F., 1986. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. ASSP ASSP-34 (4).
Rabiner, L., Crochiere, R., Allen, J., 1978. FIR system modeling and identification in the presence of noise and band-limited inputs. IEEE Trans. ASSP ASSP-26, 319–333.
Sambur, M.R., 1978. Adaptive noise canceling for speech signals. IEEE Trans. ASSP ASSP-26 (5).
Satorius, E., Alexander, S.T., 1979. Channel equalization using adaptive lattice algorithms. IEEE Trans. Comm. 27, 899–905.
Satorius, E., Pack, J., 1981. Application of least squares lattice algorithms for adaptive equalization. IEEE Trans. Comm. COM-29, 136–142.
Satorius, E., Zeidler, J., Alexander, S., 1978. Linear predictive digital filtering of narrowband processes in additive broadband noise. Naval Ocean Systems Center, San Diego, CA, Technical Report 331, November 1978.
Sondhi, M., Berkley, D., 1980. Silencing echoes on the telephone network. Proc. IEEE 68, 948–963.
Varga, A., Steeneken, H., Tomlinson, M., Jones, D., 1992. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England.
Widrow, B. et al., 1975. Adaptive noise cancelling: principles and applications. Proc. IEEE 65, 1692–1716.
Young, S., Odell, J., Ollason, D., Valtchev, V., Woodland, P., 1995. The HTK Book. Cambridge University.
Zeidler, J.R., Satorius, E.H., Chabries, D.M., Wexler, H.T., 1978. Adaptive enhancement of multiple sinusoids in uncorrelated noise. IEEE Trans. ASSP ASSP-26 (3).