

Speech Signal Restoration Using an Optimal Neural Network Structure

Xiao-Ming Gao¹, Seppo J. Ovaska², and Iiro O. Hartimo¹

¹ Helsinki University of Technology, Laboratory of Signal Processing and Computer Technology, Otakaari 5A, FIN-02150 Espoo, FINLAND

² Helsinki University of Technology, Laboratory of Telecommunications Technology, Otakaari 5A, FIN-02150 Espoo, FINLAND

ABSTRACT

In this paper, we propose an optimal neural network-based method for noisy speech restoration. The method uses a feedforward neural network with one hidden layer as a nonlinear predictive filter. In order to select the optimal network structure, we apply the Predictive Minimum Description Length (PMDL) principle to determine the optimal number of input and hidden nodes. In this way, the possible over-fitting and under-fitting problems are penalized automatically. This results in a computationally efficient network structure with both excellent noise attenuation and generalization capabilities.

1. Introduction

Rapidly developing communications networks and multimedia require reliable transmission and reproduction of audio and speech signals. However, in many cases the received speech signals may be partly deteriorated. This is mainly because the transmission channels often introduce additive noise to the transmitted signals, or because the recorded speech to be sent is already distorted (for example, old interviews and news reports recorded in noisy environments). It is therefore of paramount importance to separate the speech component from the background noise and improve its quality before any coding or reproduction.

There exist some advanced nonlinear methods for speech signal enhancement and noise reduction. For example, a biologically motivated robust signal processing method was introduced in [4]. A noise reduction network (NRN) reduced the noise in a speech signal by analyzing its spectro-temporal structure [10]. This NRN method can remove most of the additive noise, but it also attenuates many high-frequency components of the actual speech. The main problem in applying neural network-based methods to filtering the noise from corrupted speech signals is the selection of the network complexity. The goal is to select the optimal complexity of the network structure so that the network can remove the noise without distorting the original speech signal. In this paper, we address this problem by using the PMDL principle to determine the optimal number of input and hidden nodes.

The paper is organized in the following manner. In Section 2, we give a brief summary of the efficient PMDL principle. Section 3 presents the application of the PMDL principle to the selection of the optimal network structure. In Section 4, we present a successful demonstration study of our new method in the restoration of a noisy speech signal. Section 5 concludes this paper with a few remarks.

2. The PMDL Principle

It is well known that a neural network with high complexity can maximize the mapping accuracy, giving more precise outputs for the training data, but it may also give worse outputs for unseen data. On the other hand, networks with too few parameters cannot discover the mechanism that governs the data. Therefore, we must have a trade-off between the complexity and the generalization capability of the network.

There exist several papers that discuss the selection of the number of hidden nodes [2], [6]. For example, the Akaike Information Criterion (AIC) has been used to determine the number of input and hidden nodes. However,



the AIC has been proven inconsistent and has a tendency to overfit a model [5]. In this paper, we use the efficient PMDL method presented by Rissanen to optimize the neural network structure.

Rissanen has proposed the shortest description code length and stochastic complexity for model selection [7], [8]. For a given distribution $P(x)$, the complexity of $x$ is defined, relative to the coding system $D$, to be

$$I(x \mid D) = -\log P(x). \tag{1}$$

For applications, the most important coding system is obtained from a class of parametric probability models:

$$M = \{ f(x \mid \theta), \pi(\theta) \mid \theta \in \Omega_k,\ k = 1, 2, \ldots \}, \tag{2}$$

where $\Omega_k$ is a subset of the $k$-dimensional Euclidean space with non-empty interior. Hence, there are $k$ 'free' parameters. The stochastic complexity of $x$ relative to the model class $M$ is now given by

$$I(x \mid M) = -\log \int_{\Omega_k} f(x \mid \theta)\, \pi(\theta)\, d\theta. \tag{3}$$

Although the model class $M$ includes the so-called 'prior' distribution $\pi$, its role is not the same as in Bayesian inference. In fact, we need not select it at all, since we can construct a distribution $\pi(\theta \mid x)$ proportional to $f(x \mid \theta)$ [9]. The stochastic complexity represents the shortest code length attainable with the given model class. In curve fitting and related problems, however, the models are not primarily represented in terms of a distribution. Rather, we can use a parametric predictor $\hat{x}_{t+1} = F(x \mid \theta)$, as in the case of neural networks, where $\hat{x}_{t+1}$ is the output of the network, $x = (x_t, x_{t-1}, \ldots, x_{t-p+1})$ is the input vector, and $\theta$ denotes the array of all the weights as parameters. In addition, there is a distance function $\delta(\varepsilon_{t+1})$ to measure the prediction error $\varepsilon_{t+1} = x_{t+1} - \hat{x}_{t+1}$, where $x_{t+1}$ is the target output. Such a prediction model can be immediately reduced to a probabilistic model in which the minimization of $\varepsilon_{t+1}$ causes the optimization of $\theta$. In this case, we define the conditional Gaussian distribution

$$f(x_{t+1} \mid x^t, \theta, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\varepsilon_{t+1}^2}{2\sigma^2} \right), \tag{4}$$

where $x^t = (x_1, \ldots, x_t)$. The density (4) is then extended to a sequence as the code length

$$-\ln f(x^n \mid \theta, \sigma^2) = \sum_{t=0}^{n-1} -\ln f(x_{t+1} \mid x^t, \theta, \sigma^2). \tag{5}$$

After having fixed the model class, we face the problem of estimating the shortest code length obtainable with this class of models. Let $\hat{\theta}(x^t)$ and $\hat{\sigma}(x^t)$ be written briefly as $\hat{\theta}_t$ and $\hat{\sigma}_t$. They are the maximum likelihood estimates, i.e., the parameter values that minimize the code length $-\ln f(x_{t+1} \mid x^t, \theta, \sigma^2)$ for the past data. In particular,

$$\hat{\sigma}_t^2 = \frac{1}{t} \sum_{\tau=1}^{t} \varepsilon_\tau^2. \tag{6}$$

Therefore, the predictive code length for the data is given by

$$-\ln \hat{f}(x^n) = \sum_{t=0}^{n-1} \left[ \frac{\varepsilon_{t+1}^2}{2\hat{\sigma}_t^2} + \ln \hat{\sigma}_t \right] + \frac{n}{2} \ln 2\pi. \tag{7}$$



In the PMDL method, the network parameters need not be encoded, as they can be calculated from the past string by an algorithm. The model costs are added to the prediction errors, and over-fitting and under-fitting are penalized automatically.
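As an illustration only (our sketch, not the authors' code), the following Python function computes the predictive code length of Eq. (7) from a stream of 'honest' one-step prediction errors, updating the maximum likelihood variance of Eq. (6) as it goes. The function name and the small variance floor are our assumptions, and the first error, which has no past variance estimate, is skipped here.

```python
import numpy as np

def predictive_code_length(errors, eps=1e-12):
    """Predictive MDL code length, Eq. (7), for a sequence of 'honest'
    one-step prediction errors eps_{t+1} = x_{t+1} - xhat_{t+1}.

    The variance used to encode errors[t] is the maximum likelihood
    estimate computed from the *past* errors only, Eq. (6); `eps` is a
    small floor that avoids division by zero (our addition).
    """
    errors = np.asarray(errors, dtype=float)
    n = len(errors)
    length = 0.5 * n * np.log(2.0 * np.pi)  # the (n/2) ln 2*pi term
    for t in range(1, n):
        var_t = max(np.mean(errors[:t] ** 2), eps)  # Eq. (6)
        length += errors[t] ** 2 / (2.0 * var_t) + 0.5 * np.log(var_t)
    return length
```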

3. PMDL Principle for Neural Network Optimization

Consider a multilayered feedforward neural network with one hidden layer, as shown in Fig. 1. Suppose also that the neural network predictor has $p$ input nodes $x(t), x(t-1), \ldots, x(t-p+1)$ and $q$ hidden nodes $z_1(t), z_2(t), \ldots, z_q(t)$, the numbers of which are to be optimized. Here, $x(t)$, $t = 1, 2, \ldots, N-1$, are the sample values of the time series to be predicted. Hyperbolic tangent functions are used as the nonlinear transfer functions of the hidden nodes, and the transfer function of the output layer node is linear. The single node in the output layer represents the one-step-ahead prediction. Therefore, we have

$$u_i(t) = \sum_{j=1}^{p} w_{ij}\, x(t-j+1) + w_{i0}, \quad i = 1, \ldots, q,$$

$$z_i(t) = \tanh[u_i(t)],$$

$$\hat{x}(t+1) = \sum_{r=1}^{q} v_r z_r(t) + v_0. \tag{8}$$
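A minimal Python sketch of the forward pass in Eq. (8); the array names W, w0, v, and v0 are our own labels for the weights $w_{ij}$, biases $w_{i0}$, output weights $v_r$, and output bias $v_0$.

```python
import numpy as np

def predict_one_step(x_window, W, w0, v, v0):
    """One-step-ahead prediction, Eq. (8).

    x_window : the p most recent samples (x(t), x(t-1), ..., x(t-p+1))
    W        : (q, p) array of input-to-hidden weights w_ij
    w0       : (q,) array of hidden-node biases w_i0
    v        : (q,) array of hidden-to-output weights v_r
    v0       : scalar output bias v_0
    """
    u = W @ x_window + w0   # hidden activations u_i(t)
    z = np.tanh(u)          # hyperbolic tangent hidden nodes z_i(t)
    return v @ z + v0       # linear output node gives xhat(t+1)
```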

In order to apply the PMDL principle, we define two parameters, $\tau$ and $d$: $\tau$ is the window size of the training data, while $d$ is the length of the prediction range. We divide $\{x(n)\}$ into $k_{\max} = \lceil N/d \rceil$ consecutive segments of length $d$. In case $d$ does not divide $N$, the last segment is shorter.

For each network with $p$ inputs and $q$ hidden nodes, we first train the network using a global minimization technique [1] to minimize the quadratic function

$$E_k = \sum_{t=kd-\tau}^{kd-1} \left[ x(t+1) - \hat{x}(t+1) \right]^2. \tag{9}$$

With the optimal weights and biases so obtained from (9), we use equation (8) to predict the points $x(t+1)$, $t = kd, kd+1, \ldots, (k+1)d-1$, in the subsequent $(k+1)$th segment and obtain the squared 'honest' prediction errors

$$\varepsilon^2(t+1) = \left[ x(t+1) - \hat{x}(t+1) \right]^2, \quad t = kd, \ldots, (k+1)d-1. \tag{10}$$

Here, by 'honest' we mean that the parameters of the predictor are determined by the past data only. The predictions $\hat{x}_0(t+1)$ of the data points in the very first segment are taken as zero. Hence the predictive code length $C_{(k+1)d}$ of the $(k+1)$th segment can be calculated by equation (7), where the lower and upper limits in the sum are replaced by $kd$ and $(k+1)d-1$, respectively, and $n$ in the last term is replaced by the segment length $d$. Adding the individual prediction code lengths together, we get the accumulated code length

$$C = \sum_{k=1}^{k_{\max}} C_{kd}. \tag{11}$$

In case $d$ does not divide $N$, the code length of the last segment should be added to the result of Eq. (11). This procedure is repeated until the code lengths of all the segments are found. Then, we calculate the per-symbol code length as

$$C_{\mathrm{sym}} = \frac{C}{N}. \tag{12}$$



The network with the minimum $C_{\mathrm{sym}}$ now indicates the best predictor.

There are four parameters in equation (12). For each model candidate $(p, q)$, we must optimize the training window size $\tau$ and the prediction region $d$. In general, for a nonstationary time series, $\tau$ should be long enough for the network to learn the mechanism that generates the data. On the other hand, due to the time-varying characteristics of the signal, the prediction length $d$ must be short enough so that the predictor can perform as accurately as possible. In our study, to simplify the problem, we set these two variables equal so that only one parameter needs to be optimized.
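The segment-by-segment procedure of Eqs. (9)-(12) can be sketched in Python as follows. This is our own illustrative reading, not the authors' implementation: `train_fn` is a hypothetical placeholder for the global minimization of Eq. (9), assumed to return a one-step predictor; the sketch assumes $d \ge p$ so a full input window always precedes each predicted sample, freezes the variance over each segment for brevity (Eq. (7) updates it at every step), and charges the first segment only its raw squared error since no variance estimate exists yet.

```python
import numpy as np

def pmdl_per_symbol(x, p, q, tau, d, train_fn):
    """Per-symbol predictive code length, Eqs. (9)-(12), of one (p, q)
    model candidate. train_fn(window, p, q) is assumed to return a
    callable mapping (x(t), ..., x(t-p+1)) to xhat(t+1)."""
    N = len(x)
    total, past = 0.0, []
    for k in range(int(np.ceil(N / d))):          # k_max segments
        lo, hi = k * d, min((k + 1) * d, N)       # segment boundaries
        if k == 0:
            preds = np.zeros(hi - lo)             # first segment: predict 0
        else:
            model = train_fn(x[max(0, lo - tau):lo], p, q)  # Eq. (9)
            preds = np.array([model(x[t - p:t][::-1])       # needs d >= p
                              for t in range(lo, hi)])
        errs = x[lo:hi] - preds                   # 'honest' errors, Eq. (10)
        if past:                                  # Eq. (7) for this segment
            var = np.mean(np.square(past))        # variance from past data
            total += (np.sum(errs ** 2) / (2 * var)
                      + (hi - lo) * 0.5 * np.log(2 * np.pi * var))
        else:
            total += np.sum(errs ** 2)            # simplification for k = 0
        past.extend(errs.tolist())                # accumulate, Eq. (11)
    return total / N                              # per-symbol length, Eq. (12)
```

The model with the smallest returned value would then be selected as the best predictor.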

4. Experimental Results

We performed our experiment in two steps. First, in a typical office room with moderate background noise, we acquired a recording of the environment noise. The recorded noise waveform is shown in Fig. 2.

Next, in the same environment, a sequence of voiced speech of the word 'wheel', spoken by a male, was recorded at a sampling rate of 8 kHz. The total utterance is about 4000 samples. We selected only a small segment of 350 samples that may fall into the nonlinear region of the speech, as shown in Fig. 3. Then, the noisy speech signal was processed by a three-layer neural network predictor with $p$ input and $q$ hidden nodes, as shown in Fig. 1.

The code length of a model candidate $(p = 10, q = 2)$ as a function of the window size $\tau$ is shown in Fig. 4. It is interesting to observe that the code length reaches its minimum at the point whose temporal position corresponds to the pitch period. Here we show only one tested model, but the situation is similar with the other models. In our experiment, the optimal $\tau$ (and hence also $d$) is 48 sample points.

In order to select the optimal neural network structure for successful noise attenuation and speech restoration, we confine our search to a region of reasonable size. We use a searching set in which the complexity of each model candidate does not exceed a certain constant. In general, a large number of hidden nodes is rarely used, because the computation increases drastically. In the experiment, only the models whose number of hidden nodes is less than or equal to 10 are evaluated.

Thus we define the searching set

$$S = \{\text{model } (p, q) \mid \text{the total number of parameters} \le 165\},$$

which can further be partitioned into two subsets

$$S_1 = \{\text{model } (p, q) \mid 2 \le p \le 38,\ 2 \le q \le 4\}$$

and

$$S_2 = \{\text{model } (p, q) \mid 5 \le q \le 10,\ \text{the total number of parameters} \le 165\}.$$
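For concreteness, the architecture of Eq. (8) implies $q(p+1)$ input-to-hidden weights and biases, $q$ output weights, and one output bias, i.e. $q(p+2)+1$ parameters in total; the short sketch below (ours, not from the paper) enumerates the candidate models under the 165-parameter budget, with the explicit $p \le 38$ cap of $S_1$ applied throughout as an assumption.

```python
def num_parameters(p, q):
    # q*(p+1) hidden weights and biases, q output weights, one output bias
    return q * (p + 1) + q + 1

# Searching set S: every (p, q) with 2 <= q <= 10, 2 <= p <= 38, and at
# most 165 parameters; this reproduces the partition into S1 and S2.
S = [(p, q) for q in range(2, 11) for p in range(2, 39)
     if num_parameters(p, q) <= 165]
```

As a consistency check, the largest $S_1$ member $(p, q) = (38, 4)$ has $4 \cdot 40 + 1 = 161 \le 165$ parameters.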

The code lengths of the different model candidates are shown in Fig. 5 and Fig. 6. Comparing the code lengths of the different models, we find that the optimal network for our speech restoration purpose has 18 input and 2 hidden nodes. The optimized neural network predictor was then used for predictive filtering of the noisy speech. The output of the predictive filter is given in Fig. 7. The corresponding prediction error is shown in Fig. 8. It is easy to see that the variance of the prediction error is nearly the same as that of the environment noise shown in Fig. 2. This demonstrates that our optimal neural network-based predictor indeed extracted the speech generation mechanism from the noisy speech. The entire optimization (MATLAB code) of the neural network structure took about two days on an SGI Power Challenge supercomputer. A more detailed description of the optimization of neural networks using the PMDL principle in modeling speech signals is given in [3].
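As a usage illustration (our sketch, reusing the hypothetical predict_one_step above via a generic `predictor` argument), predictive filtering simply runs the optimized predictor over the noisy signal: the one-step predictions form the restored speech of Fig. 7, and the residual of Fig. 8 estimates the background noise.

```python
import numpy as np

def predictive_filter(noisy, predictor, p):
    """Run a trained one-step predictor over the noisy speech.
    `predictor` maps a window (x(t), ..., x(t-p+1)) to xhat(t+1)."""
    restored = np.array([predictor(noisy[t - p + 1:t + 1][::-1])
                         for t in range(p - 1, len(noisy) - 1)])
    residual = noisy[p:] - restored   # should resemble the background noise
    return restored, residual
```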

Notice that in the above process we did not assume any characteristics of the environment noise. Therefore, our method provides an attractive approach to restoring deteriorated old interviews and other news reports contaminated by environment noise whose statistical characteristics are unknown. It is also a potential speech enhancement method for hands-free mobile phones operated in noisy car environments.



5. Conclusions

In this paper, we developed an optimal neural network-based method for speech signal restoration. The proposed method uses a multilayer perceptron as a predictive filter, and the structure of the filter was optimized by the PMDL principle. Preliminary experiments demonstrate the effectiveness of the optimal neural network-based speech restoration method. The performance of our speech restoration system evaluated in comprehensive listening tests will be reported in future papers.

References

[1] N. Baba, "A new approach for finding the global minimum of error function of neural networks," Neural Networks, vol. 2, pp. 367-373, 1989.
[2] D. B. Fogel, "An information criterion for optimal neural network selection," IEEE Trans. Neural Networks, vol. 2, pp. 490-497, Sept. 1991.
[3] X. M. Gao, S. J. Ovaska, M. Lehtokangas, and J. Saarinen, "A study on modeling of speech signals using an optimal neural network structure," Research Report 56, Department of Information Technology, Lappeenranta University of Technology, Lappeenranta, Finland, Aug. 1995.
[4] O. Ghitza, "Robustness against noise: the role of timing-synchrony measurement," in Proc. ICASSP, Dallas, TX, Apr. 1987, pp. 2375-2392.
[5] R. L. Kashyap, "Inconsistency of AIC rule for estimating the order of autoregressive models," IEEE Trans. Automat. Control, vol. AC-25, pp. 996-998, 1980.
[6] N. Murata, S. Yoshizawa, and S. I. Amari, "Network information criterion for determining the number of hidden units for an artificial neural network model," IEEE Trans. Neural Networks, vol. 5, pp. 865-872, Nov. 1994.
[7] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, pp. 465-471, 1978.
[8] J. Rissanen, "Stochastic complexity and modeling," Annals of Statistics, vol. 14, pp. 1080-1100, 1986.
[9] J. Rissanen, Stochastic Complexity in Statistical Inquiry. Series in Computer Science, vol. 15, Singapore: World Scientific Publishing Company, 1989.
[10] S. Tamura and A. Waibel, "Noise reduction using connectionist models," in Proc. ICASSP, New York, NY, Apr. 1988, pp. 553-556.

Figure 1. Architecture of the one-step-ahead neural network predictor.

Figure 2. The waveform of the background noise (variance = 5.76 × 10⁻⁴); horizontal axis: 0-350 samples.


Figure 3. A segment of the noisy speech signal; horizontal axis: 0-350 samples.

Figure 4. The code length of the model (p = 10, q = 2) vs. the window size of training samples (10-80).

Figure 5. Code lengths of different models in searching set S₁; horizontal axis: number of input nodes (0-40).

Figure 6. Code lengths of different models in searching set S₂.

Figure 7. The output of the optimal neural predictor; horizontal axis: 0-350 samples.

Figure 8. Prediction error of the optimal neural predictor (variance = 5.28 × 10⁻⁴).
