
Proceedings of the 2012 International Conference on Machine Learning and Cybernetics, Xian, 15-17 July, 2012

A STUDY OF APPLYING BPNN TO ROBOT SPEECH INTERFACE

WAN-CHEN HUANG

Associate Professor, Department of Mechanical Engineering, WuFeng University, ChiaYi, Taiwan E-MAIL:[email protected]

Abstract: Robot manipulation requires not only accuracy but also, where possible, a fast response. Neural networks offer high error tolerance and inherently parallel computation. Applied to a real-time speech recognition system, a single forward computation yields the recognition result immediately, unlike methods such as VQ, DTW, or HMM. Using a neural network for robot speech operation is therefore a good choice. However, when a neural network is used as the recognizer, the input vector dimension becomes large, which occupies more memory and degrades computational efficiency. In this paper we therefore propose combining HMM and BPNN: this reduces the input vector dimension, decreasing the memory burden while improving computational efficiency. To address the slow training convergence of a general BP network, we further propose using the recognition rate as the criterion for stopping training, which saves training time while still achieving the required recognition rate.

Keywords: HMM; BPNN; Viterbi Algorithm

1. Introduction

Robot applications have gradually evolved from early industrial uses into the educational, entertainment, security, and service areas, and they are receiving more and more attention. For robots to find wider application in our lives, they need not only various functions such as obstacle avoidance, security preservation, and fire control, but also a more human-oriented user interface for manipulating the robot; a speech interface affords a more convenient and natural interaction environment. Current user interfaces mostly rely on a keyboard or touch screen, and in some cases these techniques have limitations. For instance, people who are physically handicapped may find these interfaces difficult to operate.

978-1-4673-1487-9/12/$31.00 ©2012 IEEE

To apply a speech recognition system in real life, two major issues must be resolved. The first is that the voice signal contains psychological and physiological properties of the speaker, so the same utterance by different speakers (or even by the same speaker at different times) yields different recognition results. The second is that background noise and the transmission medium interfere with or distort the voice signal, affecting the recognition result [1].

Early studies of speech recognition adopted Dynamic Time Warping (DTW) [2,3] or Vector Quantization (VQ) [4] to overcome the inconsistent frame counts caused by variations in speaking rate and to compress the amount of speech data, improving classification efficiency. However, their noise robustness is not good enough, and these methods require considerable time during recognition, so they are unsuitable for real-time response. Recently, the Back-Propagation Neural Network (BPNN) has gradually been applied to speech recognition in place of these traditional methods, with many related applications [5]. The Hidden Markov Model (HMM) [6] adopts statistical concepts to describe the characteristics of speech and, through years of development, has become one of the main models for speech recognition, especially continuous speech recognition. In practical applications, however, it is less convenient than DTW or VQ due to its complicated computation. In this paper we combine BPNN and HMM to obtain a more efficient model for speech recognition: HMM affords a more accurate description of a speech signal, and BPNN affords a simple but powerful classifier, so the combination suits a real-time robot manipulation system.
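For background, the frame-alignment problem that DTW solves can be sketched in a few lines. This is an illustrative implementation of the classic algorithm, not code from this paper; the Euclidean frame distance and the toy sequences are our own assumptions.

```python
# Minimal dynamic time warping (DTW) distance between two sequences of
# feature vectors; an illustrative sketch of the classic algorithm.
import math

def dtw_distance(seq_a, seq_b):
    """Align two sequences of equal-dimension feature vectors and return
    the accumulated distance along the optimal warping path."""
    n, m = len(seq_a), len(seq_b)

    def dist(x, y):  # Euclidean distance between two frames
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    # cost[i][j]: best accumulated cost aligning seq_a[:i+1] with seq_b[:j+1]
    cost = [[math.inf] * m for _ in range(n)]
    cost[0][0] = dist(seq_a[0], seq_b[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cost[i - 1][j] if i > 0 else math.inf,                 # stretch a
                cost[i][j - 1] if j > 0 else math.inf,                 # stretch b
                cost[i - 1][j - 1] if i > 0 and j > 0 else math.inf,   # match
            )
            cost[i][j] = dist(seq_a[i], seq_b[j]) + best_prev
    return cost[n - 1][m - 1]
```

Note how a time-stretched copy of a sequence still aligns with zero cost, which is exactly the speaking-rate variation DTW is meant to absorb.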


2. Robot Manipulation System

Robots are meant to help people with tasks so that people can enjoy their lives. The robot's functions must therefore be varied, and its manipulation interface must be human-oriented. Currently, most robot manipulation interfaces use a keyboard, mouse, touch panel, or joystick. These are not very friendly, especially for disabled users, some of whom find such interfaces hard to operate. Speech is one of the most important media for communication and carries personal characteristic information. From a human-oriented point of view, applying speech as an interface between people and robots is a good choice.

A skeleton of the speech interface for robot manipulation is shown in Figure 1. The operator sends a command through a microphone; the identification process must then make a decision against the previously built speech database as fast as possible and send the result over WiFi to control the robot.

Figure 1. Skeleton of the speech interface to the robot (microphone → speech models and identification process → WiFi → robot).

Therefore, the speech recognition method for robot manipulation must have both advantages: it must overcome the drop in recognition rate caused by background noise and individual speaker factors, and it must have an efficient computation algorithm that affords a fast response to the robot. Given these demands, a neural network is a very good choice.

3. Study of the BP Neural Network

The Back-Propagation (BP) algorithm is an extension of the Least Mean Square (LMS) algorithm that can train multilayer neural networks. Both BP and LMS use the Mean Square Error (MSE) as the performance index, and both use the steepest descent method during training to find the direction of the optimal weight adjustment. One of their differences lies in the calculation of the gradient: to find the derivative of the error function with respect to the hidden-layer weights and biases, the BP algorithm uses the chain rule. The derivative process starts from the last layer (see Figure 2) and back-propagates through the whole network, applying the chain rule to obtain the derivatives in the hidden layers [7].
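The chain-rule pass described above can be made concrete with a minimal two-layer sigmoid network trained by back-propagation on MSE. This is a generic sketch under assumed sizes, learning rate, and a tiny toy task (logical AND), not the paper's network or configuration.

```python
# Minimal two-layer BPNN: the output sensitivity is computed first, then
# the chain rule propagates it back to the hidden layer (cf. Figure 2).
import math
import random

random.seed(0)  # reproducible illustration

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_net(data, epochs=3000, lr=0.5, hidden=3):
    """Train a 2-input, 1-output two-layer sigmoid network on
    (input, target) pairs by back-propagation; returns a predict function."""
    # weights with the bias folded in as a last entry
    w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    w2 = [random.uniform(-1, 1) for _ in range(hidden + 1)]
    for _ in range(epochs):
        for p, t in data:
            x = p + [1.0]                                     # input + bias
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
            hb = h + [1.0]                                    # hidden + bias
            a = sigmoid(sum(w * hi for w, hi in zip(w2, hb)))
            # backward pass: last layer first, then chain rule to layer 1
            d2 = (a - t) * a * (1 - a)
            d1 = [d2 * w2[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            w2 = [w - lr * d2 * hi for w, hi in zip(w2, hb)]
            w1 = [[w - lr * d1[j] * xi for w, xi in zip(w1[j], x)]
                  for j in range(hidden)]

    def predict(p):
        x = p + [1.0]
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)))
             for row in w1] + [1.0]
        return sigmoid(sum(w * hi for w, hi in zip(w2, h)))
    return predict

# toy task: logical AND of two inputs
predict = train_net([([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)])
```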

Figure 2. A sketch of a two-layer NN

As with LMS, the BP algorithm requires a long training time to converge. Equation (1) is the LMS performance function, where F is the performance index, w denotes the weights between neurons, and e is the error vector, i.e., the difference between the target vector t and the output vector a:

F(w) = E[e^T e] = E[(t - a)^T (t - a)]    (1)

In the LMS algorithm we use an estimate F̂(w) to approximate Equation (1), as shown in Equation (2):

F̂(w) = e^T(k) e(k) = (t(k) - a(k))^T (t(k) - a(k))    (2)

That is, the expected value of the mean square error is replaced by the mean square error at iteration k. The same substitution occurs when calculating the gradient. Theoretically, Eq. (3) should be used to find the gradient of the performance function with respect to the weights.

∂F(w)/∂w = 2 e^T(k) ∂e(k)/∂w    (3)

In practice, we apply the gradient estimate at step k to calculate the gradient for the next iteration, as shown in Eq. (4):

∂F̂(w)/∂w(k) = 2 e^T(k) ∂e(k)/∂w(k)    (4)

As discussed above, at every iteration step the estimated values of the error and gradient replace the actual values. Updating the weights at each step on this noisy basis is one of the reasons for slow convergence. In addition, unlike a single-layer NN, whose performance function is quadratic so that its Hessian matrix is constant and the curvature along a given direction does not vary, the performance surface of a multilayer NN is difficult to describe and has many saddle points. Training easily drops into local minima and cannot find the global minimum, so it tends to miss the optimal result. Although there are many improved methods, such as the Conjugate Gradient and Newton methods, applying them to speech recognition is of little use, because they spend a lot of computing time at each iteration step.
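The per-step update built from Eqs. (2) and (4) can be sketched for a single linear neuron. The toy data, learning rate, and epoch count here are illustrative assumptions, not the paper's setup.

```python
# LMS steepest-descent sketch for a single linear neuron a = w . p,
# using the instantaneous error e(k) = t(k) - a(k) as in Eqs. (2) and (4).
def lms_train(samples, lr=0.1, epochs=50):
    """samples: list of (input_vector, target) pairs; returns learned weights."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for p, t in samples:
            a = sum(wi * pi for wi, pi in zip(w, p))   # neuron output a(k)
            e = t - a                                  # error e(k) = t(k) - a(k)
            # steepest descent: w <- w - lr * dF/dw, with dF/dw = -2 e(k) p(k)
            w = [wi + 2 * lr * e * pi for wi, pi in zip(w, p)]
    return w

# toy problem: learn t = 2*x0 - x1 from four consistent samples
w = lms_train([([1, 0], 2), ([0, 1], -1), ([1, 1], 1), ([2, 1], 3)])
```

Because each update uses only the current sample's error, the trajectory is noisy, which is exactly the slow-convergence behavior discussed above.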

Although BPNN has these disadvantages, for speech recognition it can build many different speech signals into a single network model and obtain the recognition result after one computation pass. This differs from methods that must build models one by one and recognize them one after another, which makes them hard to apply where real-time response is required. BPNN therefore has practical value in the speech recognition field.

Usually, a BPNN decides whether the system has converged according to the value of the MSE. But this wastes considerable time [7,8], especially in a speech recognition system, which has a high-dimensional input vector. To overcome this disadvantage, in this paper we use the recognition rate to decide whether to stop training. Figure 3a shows the variation of the MSE with different features during training: the MSE decreases quickly at the beginning and then flattens, indicating that its decrease becomes slower, so a decision based on the MSE value costs a lot of time. Meanwhile, Figure 3b shows the variation of the system's recognition rate, which reaches a satisfactory and stable level much earlier. It is therefore acceptable to use the recognition rate to decide when to stop the training procedure. Simon [8] mentions that the generalization performance of the network should be checked after each iteration, and that when its value is suitable or has clearly reached a peak, learning can be stopped. Accordingly, in this paper we use the recognition rate as the basis for deciding whether to stop training. This avoids the disadvantage of slow convergence, saves much training time, and still achieves the required recognition rate.
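The stopping rule described above can be sketched as follows. The recognition-rate curve here is simulated (shaped like Figure 3b); in practice `recognition_rate` would train the BPNN for one epoch and then test it. The function name, target, and patience values are our own assumptions.

```python
# Sketch of the proposed stopping rule: instead of waiting for the MSE to
# fall below a threshold, stop once the recognition rate on a validation
# set is high enough and has stopped improving.
def train_until_recognized(recognition_rate, target=0.95, patience=3,
                           max_epochs=100):
    """Run epochs until the rate reaches `target` and has not improved for
    `patience` consecutive epochs; return (stop_epoch, best_rate)."""
    best, stale = 0.0, 0
    for epoch in range(1, max_epochs + 1):
        rate = recognition_rate(epoch)  # would train one epoch, then test
        if rate > best:
            best, stale = rate, 0
        else:
            stale += 1
        if best >= target and stale >= patience:
            return epoch, best
    return max_epochs, best

def simulated_rate(epoch):
    """Simulated curve: rises quickly, then plateaus (like Figure 3b)."""
    return min(0.97, 0.5 + 0.1 * epoch)

stop_epoch, best = train_until_recognized(simulated_rate)
```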

Figure 3a. The variation of MSE (sum of squared error) with different features (LPC & LPCC, LPC & DLPCC, LPCC & DLPCC) over training epochs.

Figure 3b. The variation of recognition (identification) rate with different features (LPC & LPCC, LPC & DLPCC, LPCC & DLPCC) over training epochs.

4. Combining BPNN with HMM

The Hidden Markov Model (HMM) uses an abstract probabilistic model to describe the characteristics of speech. Its distinguishing features are as follows. 1. By applying the concepts of statistics and state transition to build the speech model, it can handle the time-variant characteristics contained in the observed data. 2. It can describe phonic units at different levels, such as words, syllables, and phonemes, and can assemble a larger phonic unit model from many smaller ones, for example a sentence model from word models. 3. It can obtain the best state sequence through the Viterbi algorithm and thereby the best partition of the speech, so the method applies to both isolated words and continuous speech. It has also been a popular method in speech recognition systems in recent years.

Figure 4 shows a sketch of an HMM. S_1 ~ S_n denote the different speech states, and f_1 ~ f_m denote the frames obtained after the framing procedure. A state transition can move from left to right or stay in the same state, because the sound-producing organs, such as the vocal tract, lips, tongue, and nasal cavity, vary in a fixed order that cannot be altered arbitrarily. Taking features of the speech after framing, such as Linear Prediction Coefficients (LPC) or LPC Cepstral Coefficients (LPCC), we build a feature vector for each frame. Through the Viterbi algorithm we distribute the sequence of observed feature vectors among the speech states, so each state possesses some consecutive frames.
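As an aside, one common convention for deriving LPCC from LPC coefficients is the cepstral recursion sketched below; it is shown only to make the per-frame feature vector concrete and is not necessarily the exact formulation used in this paper.

```python
# One common convention for converting LPC coefficients a_1..a_p into
# LPC cepstral coefficients (LPCC) c_1..c_p:
#   c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k}
def lpc_to_lpcc(a):
    """a: LPC coefficients a_1..a_p; returns cepstral coefficients c_1..c_p."""
    p = len(a)
    c = []
    for n in range(1, p + 1):
        acc = a[n - 1]
        for k in range(1, n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c
```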

Figure 4. A sketch of an HMM (left-to-right state transitions over the frame sequence).
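The frame-to-state assignment described above can be sketched with a minimal Viterbi pass specialized to the left-to-right structure of Figure 4, where each frame either stays in its state or advances to the next. The emission log-likelihood table and the uniform (un-weighted) transitions are simplifying assumptions for illustration.

```python
# Minimal Viterbi sketch for a left-to-right HMM: each frame either stays
# in the current state or advances to the next one, as in Figure 4.
import math

def viterbi_left_to_right(log_emit, n_states):
    """log_emit: T x n_states emission log-likelihoods; returns the most
    likely state index for each frame, starting in state 0 and ending in
    the last state."""
    T = len(log_emit)
    NEG = -math.inf
    delta = [[NEG] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    delta[0][0] = log_emit[0][0]          # must start in the first state
    for t in range(1, T):
        for s in range(n_states):
            stay = delta[t - 1][s]
            move = delta[t - 1][s - 1] if s > 0 else NEG
            best = max(stay, move)
            if best > NEG:
                delta[t][s] = best + log_emit[t][s]
                back[t][s] = s if stay >= move else s - 1
    # backtrack from the final state to recover the frame-to-state path
    path = [n_states - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# four frames, two states: frames 0-1 favor state 0, frames 2-3 state 1
path = viterbi_left_to_right([[0, -5], [0, -5], [-5, 0], [-5, 0]], 2)
```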

With a speech signal modeled by HMM, we obtain the frames associated with each state. We then take the mean and variance of the features over each state, as shown in Eq. (5) and Eq. (6):

M(s)_i = ( Σ_{k=1}^{n1} F_ik ) / n1    (5)

V(s)_i = ( Σ_{k=1}^{n1} (F_ik - M(s)_i)^2 ) / n1    (6)

where n1 is the number of frames belonging to the state, F_ik is the i-th feature of the k-th frame in the state, i = 1 ~ dimension of the feature vector, and s = 1 ~ number of speech states.

From these means and variances we compose a representative feature vector for each state. Using these state feature vectors as the network input significantly reduces the input dimension and the required memory space, and thus improves computational efficiency.

An example illustrates the merit of this method. Suppose framing a speech signal yields 35 frames and we take 12th-order LPCC and DLPCC (the first-order differential of LPCC) as features, giving a 24-dimensional feature vector per frame. Feeding these frame vectors directly into the network gives an 840 x 1 input vector. If instead we combine HMM and separate the frames accurately into speech states, the dimension can be decreased effectively. Assume we take 4 states per speech signal and use the same features; calculating the mean and variance for each state yields a 48-dimensional feature vector per state, and hence a 192 x 1 input vector, less than a quarter of the original obtained by using the NN as usual. The recognition result remains satisfactory even though the input data has been cut significantly. Figure 5 and Figure 6 show the recognition ability of the combined HMM and BPNN for the speaker-dependent and speaker-independent cases, respectively. The new method's recognition ability is as good as that of the plain BPNN, which shows that combining HMM and BPNN is an acceptable method.
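The arithmetic of this example can be checked with a short sketch that pools frames into per-state means and variances per Eqs. (5) and (6). The random frame values and the contiguous state assignment are stand-ins for real features and a real Viterbi segmentation.

```python
# Dimension reduction sketch: 35 frames x 24 features flattened would give
# an 840-dim input, but pooling frames into 4 states (mean and variance per
# state, per Eqs. (5) and (6)) gives 4 x 48 = 192.
import random

def state_feature_vector(frames, assignment, n_states):
    """frames: list of feature vectors; assignment: state index per frame.
    Returns the concatenated [means..., variances...] of every state."""
    dim = len(frames[0])
    out = []
    for s in range(n_states):
        members = [f for f, a in zip(frames, assignment) if a == s]
        n1 = len(members)
        mean = [sum(f[i] for f in members) / n1 for i in range(dim)]   # Eq. (5)
        var = [sum((f[i] - mean[i]) ** 2 for f in members) / n1
               for i in range(dim)]                                    # Eq. (6)
        out.extend(mean + var)
    return out

frames = [[random.random() for _ in range(24)] for _ in range(35)]
# contiguous left-to-right segmentation into 4 states (roughly equal here)
assignment = [min(3, k // 9) for k in range(35)]
vec = state_feature_vector(frames, assignment, 4)
```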

Figure 5. The variation of recognition rate (speaker dependent)


Figure 6. The variation of recognition rate (speaker independent)

5. Results and Discussion

The effectiveness of the system was verified in a general laboratory environment, with no deliberate restriction on the students' activities. To obtain authentic results and assess the system's availability, the test program randomly selected one of the commands, each command appearing five times; the user then spoke the command, the computer made the final decision, and statistics were collected. The best command reached a 100% recognition rate, the worst was 86%, and the rest were about 90%; in a general laboratory environment the implementation performed well, with an average recognition rate of up to 93%. This paper uses HMM as the basis for modeling, applying the Viterbi algorithm to find the most probable alignment between frames and states. The identification process is also based on the Viterbi algorithm, calculating the probability of the spoken command under each speech model in order to make the final decision on the command's meaning. The records show that the choice of vocabulary for the speech commands is very important, particularly for commands sharing the same word segment (in Chinese pronunciation), such as turn right, turn left, acceleration, and deceleration. Because we modeled and identified two-word commands as a unit, variations in personal characteristics sometimes caused incomplete data capture during testing and thus wrong recognition. To obtain a faster response, we did not apply any noise reduction or robustness-enhancing algorithm, yet the overall identification rate still reached 93%. Thus, simply combining BPNN with HMM for speech recognition as a way to design the human-machine interface is acceptable in a general environment such as the home.

Acknowledgements

Thanks to my colleagues Mrs. Huang and Mrs. Thu, and to the students Liou, Tsai, and Shyu, for helping me record and collect the speech data. Their efforts made this work progress much more smoothly.

References

[1] Sung-Lin Chen, "The Speech Recognition System Using Neural Networks", Thesis, Department of Electrical Engineering, National Sun Yat-Sen University, July 2002.

[2] H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Trans. on ASSP, Vol. 26, pp. 43-49, Feb. 1978.

[3] C. Myers and L.R. Rabiner, "Performance Tradeoffs in Dynamic Time Warping Algorithms for Isolated Word Recognition," IEEE Trans. on ASSP, Vol. 28, No. 6, pp. 623-635, Dec. 1980.

[4] Hu-Han, "Speech Signal Processing", Haotushu Publisher, May 2002.

[5] D.P. Morgan and C.L. Scofield, "Neural Networks and Speech Processing", Kluwer Academic, 1991.

[6] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, Feb. 1989.

[7] Martin T. Hagan, "Neural Network Design", PWS, 1999.

[8] Simon, "Principle of Neural Network", PWS, 2002.
