

Speech enhancement based on FLANN using both bone- and air-conducted measurements

Boyan Huang∗, Yegui Xiao†, Jinwei Sun∗, Guo Wei∗, and Hongyun Wei‡

∗Department of Automatic Testing and Control, Harbin Institute of Technology, Harbin, China. E-mail: [email protected] Tel: +86-15104655965

†Prefectural University of Hiroshima, Hiroshima, Japan.

‡Akita International University, Akita, Japan.

Abstract: Bone conduction has been used for speech enhancement in very noisy environments. It usually involves nonlinearity in transmission, small non-stationary noise due to body movements (friction, collision, wind noise, crosstalk and so on), serious attenuation of high-frequency components, etc. These problems usually lead to poor intelligibility of bone-conducted (BC) speech. Recovering air-conducted (AC) speech from a bone-conducted recording alone has been considered a very difficult task. In this paper, we propose a nonlinear adaptive noise canceller (ANC) that uses both bone- and air-conducted measurements for speech recovery. In this ANC, the bone-conducted measurement is used as the reference signal while the air-conducted one is adopted as the primary signal, and a functional link artificial neural network (FLANN) is introduced as the nonlinear adaptive filter. The system is applied to real speech signals to confirm its effectiveness.

I. INTRODUCTION

Strong background noise seriously degrades the quality of speech recognition as well as voice communications. It is usually difficult to isolate a speech signal from the background noise. So far, a vast number of techniques have been developed for the enhancement of speech signals, such as the Kalman filter, spectral subtraction, and their variants [1],[2]. Recently, some advanced signal processing techniques such as ICA and multi-sensory algorithms have been used to perform the recovery task; see, e.g., [3], [4] and references therein. However, when the background noise is very harsh, all of them lose their enhancement capability. More powerful techniques and algorithms are needed to handle such difficult situations.

On the other hand, with the development of electronic technology, bone-conduction (BC) based technology has entered people’s lives. Unlike traditional air-conducted (AC) equipment, the received speech signal is not input through the external ear canal but is delivered directly to the cochlear duct of the inner ear, thus stimulating the auditory nerve and helping people with hearing loss [5]. Communication equipment picks up the sound source from vibrations of the skull using a highly sensitive vibration sensor rather than a sound transducer, and the vibrations are then converted into an audio signal [6]. Thus, useful signals can be transmitted clearly even in communication under intensely noisy environments [6]. In addition, bone-conducted communication equipment features good water resistance and portability, and can greatly improve speech recognition in noisy environments [7]. So far, this technology has been applied to fire control, military, forestry, oil exploration and exploitation, mining, public transportation, emergency rescue, secret service, environmental protection, engineering construction and more. However, due to the use of highly sensitive sensors, several kinds of noise are introduced when using a bone-conducted microphone, such as noise caused by friction between the helmet lace and the sensor, and wind noise on the sensor when riding a motorcycle. These noises all degrade communication quality. Employing bone-conducted communication products in high-SNR environments is less effective than using ordinary communication equipment, which limits the application and development of bone-conducted technology.

1 This work was supported in part by the National Natural Science Foundation of China (61171183), the 2012 Aerospace support fund (01320214), and Interdisciplinary Basic Research of Science-Engineering-Medicine in Harbin Institute of Technology (01508293).

The speech recording obtained by bone conduction depends only on the human body, not on the ambient background noise. However, the high-frequency components of bone-conducted speech suffer severe transmission attenuation [7]. Air conduction refers to the acquisition of signals through a traditional voice microphone, whose low-frequency signals are easily buried in strong background noise. Bone conduction and air conduction thus each have their own advantages and disadvantages. If the two are combined, namely by using the low-frequency portion of the signals acquired via bone conduction and the high-frequency portion of the signals acquired via air conduction, a more comprehensive speech signal can be obtained, so that voice messages can be extracted and recovered more clearly and accurately [8],[9]. In [8], the bone-conducted measurement and the air-conducted speech are used as the reference signal and the primary signal, respectively, for an adaptive noise canceller (ANC) updated by the LMS algorithm. The adaptive FIR filter can realize the linear transmission path perfectly, but the nonlinearity of the transmission path from the vibration pickup (bone-conducted speech) to the microphone (air-conducted sound pressure) significantly affects the system performance. In [9], a neural network (NN) is introduced to cope with the nonlinear part, and a backpropagation algorithm is used to update the ANC.

In this paper, we propose a nonlinear ANC that uses a functional link artificial neural network (FLANN) to implement the nonlinear transmission path and thereby improve its ability to recover the high-frequency components of the BC signal. Simulations using real bone- and air-conducted speech measurements are performed to demonstrate the effectiveness and superiority of the proposed system over the previous one using only the FIR filter.

978-616-361-823-8 © 2014 APSIPA APSIPA 2014

Fig. 1. Block diagram of conventional ANC for speech enhancement. [Diagram labels: bone-conducted speech x(n), FIR filter, LMS, filter output y(n), air-conducted speech d(n), error e(n).]

II. A NONLINEAR ANC FOR SPEECH ENHANCEMENT

The conventional FIR filter based ANC for speech enhancement is depicted in Fig. 1 [8],[9]. The bone-conducted speech x(n) and the air-conducted speech d(n) are used as the reference and primary signals, respectively. The primary signal consists of a clean speech signal and an additive noise v(n) with very large power. The output of the FIR filter y(n) is calculated as

$$y(n) = \sum_{i=0}^{L-1} c_i(n)\, x(n-i) \tag{1}$$

where $\{c_i(n)\}_{i=0}^{L-1}$ are the coefficients of the FIR filter and L is the length of the filter. The update equation for $c_i(n)$ is given as

$$c_i(n+1) = c_i(n) + \mu\, e(n)\, x(n-i) \tag{2}$$

where µ is a small positive value called the step size, whose selection directly influences the performance of the system, and e(n) is the error signal

$$e(n) = d(n) - y(n). \tag{3}$$
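As a minimal sketch of the linear canceller in Eqs. (1)-(3), the recursion can be written in plain Python as follows. The function name `lms_anc`, the zero initialization of the coefficients, and the treatment of samples before n = 0 as zero are assumptions made for illustration; the paper does not state them.

```python
import random


def lms_anc(x, d, L, mu):
    """Sample-by-sample LMS canceller following Eqs. (1)-(3).

    x: reference signal (bone-conducted), d: primary signal (air-conducted).
    Returns the filter output y(n) and the error e(n) as two lists.
    """
    c = [0.0] * L        # c_i(n), zero-initialised (assumption)
    buf = [0.0] * L      # buf[i] holds x(n - i); x(n) = 0 for n < 0 (assumption)
    y_out, e_out = [], []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]                              # shift in the newest sample
        y = sum(ci * xi for ci, xi in zip(c, buf))         # Eq. (1)
        e = dn - y                                         # Eq. (3)
        c = [ci + mu * e * xi for ci, xi in zip(c, buf)]   # Eq. (2)
        y_out.append(y)
        e_out.append(e)
    return y_out, e_out


# Toy usage on synthetic data (not the paper's recordings): the primary
# signal is a two-tap linear function of the reference, so the error
# should shrink as the coefficients adapt.
rng = random.Random(0)
x_demo = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
d_demo = [0.5 * x_demo[n] - 0.3 * (x_demo[n - 1] if n > 0 else 0.0)
          for n in range(2000)]
y_demo, e_demo = lms_anc(x_demo, d_demo, L=2, mu=0.1)
print(max(abs(v) for v in e_demo[-100:]))
```

The toy step size of 0.1 was chosen for the synthetic input only and is unrelated to the value used in the paper's simulations.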

This linear ANC can deal with the linear part of the transmission path effectively. However, the high-frequency components of BC speech are severely attenuated owing to the nonlinearity of the transmission path. In order to recover this high-frequency portion and improve the intelligibility, an FLANN is introduced to approximate the nonlinear transmission path. Both the linear and the nonlinear filters in the ANC are updated by the LMS algorithm, as shown in Fig. 2. The output of the FLANN is calculated by

$$y_f(n) = \sum_{p=1}^{P} \sum_{i=0}^{N-1} \left\{ W_{s,p,i}(n) \sin[p\pi x(n-i)] + W_{c,p,i}(n) \cos[p\pi x(n-i)] \right\} \tag{4}$$

where P is the nonlinear extension order of the FLANN and N is the length of the control filter.

Fig. 2. Proposed speech enhancement combined with nonlinear and linear ANC. [Diagram labels: x(n), y(n), y_f(n), d(n), e_f(n).]

The total output of the proposed speech enhancement system is y(n) + y_f(n). The coefficient update equations of the nonlinear part are given by

$$W_{s,p,i}(n+1) = W_{s,p,i}(n) + \mu_f\, e(n) \sin[p\pi x(n-i)] \tag{5}$$

$$W_{c,p,i}(n+1) = W_{c,p,i}(n) + \mu_f\, e(n) \cos[p\pi x(n-i)] \tag{6}$$

where µ_f is the step-size parameter for the nonlinear ANC, which takes a small positive value. The residual error e_f(n) is expressed as

$$e_f(n) = d(n) - y(n) - y_f(n) \tag{7}$$

FLANN-based active noise control using trigonometric expansions has been studied in [10], and was confirmed to perform well on nonlinear active noise control with reasonable complexity. This paper adopts the FLANN filter as the nonlinear part of the ANC, and the conventional FIR filter as the linear part. The combined system may therefore be able to recover both the low- and high-frequency components of the speech signal.
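The combined structure of Eqs. (1), (4) and (7) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: the names `flann_features` and `fir_flann_anc` are hypothetical, all weights start at zero, samples before n = 0 are taken as zero, and every update (linear and nonlinear) is driven by the joint residual e_f(n) of Eq. (7), which is one reading of Fig. 2, whereas Eqs. (5)-(6) as printed use e(n).

```python
import math
import random


def flann_features(buf, P):
    """Trigonometric expansion used in Eq. (4) for p = 1..P."""
    s = [[math.sin(p * math.pi * xi) for xi in buf] for p in range(1, P + 1)]
    c = [[math.cos(p * math.pi * xi) for xi in buf] for p in range(1, P + 1)]
    return s, c


def fir_flann_anc(x, d, L, N, P, mu, mu_f):
    """FIR + FLANN canceller; returns total output y + y_f and residual e_f."""
    c = [0.0] * L                          # linear coefficients (zero init, assumption)
    Ws = [[0.0] * N for _ in range(P)]     # sine weights W_{s,p,i}
    Wc = [[0.0] * N for _ in range(P)]     # cosine weights W_{c,p,i}
    buf = [0.0] * max(L, N)                # buf[i] holds x(n - i)
    out, err = [], []
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        lin = buf[:L]
        s, co = flann_features(buf[:N], P)
        y = sum(ci * xi for ci, xi in zip(c, lin))                  # Eq. (1)
        yf = sum(Ws[p][i] * s[p][i] + Wc[p][i] * co[p][i]
                 for p in range(P) for i in range(N))              # Eq. (4)
        ef = dn - y - yf                                            # Eq. (7)
        # Assumption: the joint residual e_f(n) drives all three updates.
        c = [ci + mu * ef * xi for ci, xi in zip(c, lin)]
        for p in range(P):
            for i in range(N):
                Ws[p][i] += mu_f * ef * s[p][i]    # cf. Eq. (5)
                Wc[p][i] += mu_f * ef * co[p][i]   # cf. Eq. (6)
        out.append(y + yf)
        err.append(ef)
    return out, err


# Toy usage on a synthetic memoryless nonlinearity (not the paper's data).
rng = random.Random(1)
x_demo = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
d_demo = [0.5 * v + 0.2 * math.sin(math.pi * v) for v in x_demo]
_, e_lin = fir_flann_anc(x_demo, d_demo, L=1, N=1, P=0, mu=0.2, mu_f=0.2)
_, e_nl = fir_flann_anc(x_demo, d_demo, L=1, N=1, P=1, mu=0.2, mu_f=0.2)
print(sum(v * v for v in e_lin[-500:]) / 500, sum(v * v for v in e_nl[-500:]) / 500)
```

Setting P = 0 removes the nonlinear branch, so the same function doubles as the FIR-only baseline in the toy comparison. The step sizes here suit the synthetic input only and are unrelated to the values used in the paper's simulations.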

III. SIMULATION

In order to demonstrate the effectiveness of the proposed speech enhancement system, various simulations are carried out on real speech signal recovery. The same Japanese sentence was spoken by a male and a female speaker in an anechoic chamber, while a vibration pickup and a microphone were used to measure the bone- and air-conducted speech signals, respectively, at a sampling frequency of 10 kHz and a measurement duration of 4.2 seconds. Additive noise of different power levels is added to the air-conducted measurements in our simulations. The MSE between the recovered and the original speech signals is used to evaluate the performance of the conventional and proposed systems. Ten (10) independent runs were performed to obtain a reasonably fair comparison.
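One way to add noise of a prescribed power to a clean recording is to scale a noise sequence so that the resulting SNR hits a target value. The sketch below assumes white Gaussian noise and uses the hypothetical helper name `add_noise_at_snr`; the paper does not specify the noise type or the procedure it used.

```python
import math
import random


def add_noise_at_snr(clean, snr_db, rng):
    """Return clean + scaled white Gaussian noise at the requested SNR (dB).

    The Gaussian noise model is an assumption made for illustration.
    """
    if not clean:
        raise ValueError("clean signal must not be empty")
    p_signal = sum(s * s for s in clean) / len(clean)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    p_noise = sum(v * v for v in noise) / len(noise)
    # Choose the scale so that p_signal / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = math.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * v for s, v in zip(clean, noise)]


# Toy usage: a sinusoid stands in for the clean air-conducted speech.
rng = random.Random(0)
clean_demo = [math.sin(0.05 * n) for n in range(1000)]
noisy_demo = add_noise_at_snr(clean_demo, -1.0, rng)
```

Because the scale is computed from the realized noise power, the SNR of the returned signal matches the target up to floating-point rounding.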

Fig. 3. Related speech signals of the male subject (top: clean speech, 2nd: noisy speech, 3rd: bone-conducted speech, 4th: recovered speech (FIR), bottom: recovered speech (FLANN), SNR = −1 dB). [Five waveform panels labelled Clean, Noisy, Bone, FIR, FLANN; horizontal axis: iteration number n, 0 to 4 ×10^4; vertical axis: −0.2 to 0.2.]

Fig. 4. Related speech signals of the female subject (top: clean speech, 2nd: noisy speech, 3rd: bone-conducted speech, 4th: recovered speech (FIR), bottom: recovered speech (FLANN), SNR = −1 dB). [Five waveform panels labelled Clean, Noisy, Bone, FIR, FLANN; horizontal axis: iteration number n, 0 to 4 ×10^4; vertical axis: −0.2 to 0.2.]

The original (air-conducted) clean speech, the noisy (air-conducted) speech, the reference signal (bone-conducted speech), and the speech recovered by the conventional and proposed systems are given in Fig. 3 and Fig. 4 for the male and female subjects, respectively. The length of the FIR filter is set to L = 128. The order of the FLANN is set to P = 60, and the length of the control filter is N = 128. Note that the setting of P and N has a great influence on the system performance. Larger P and N provide better steady-state performance, but the complexity of the system increases significantly. The step size of the FIR filter in the conventional system was µ = 0.25. The step sizes of the linear and nonlinear parts in the proposed system were µ = µ_FIR = 0.25 and µ_f = 0.000000001 (10^−9), respectively.
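As a rough worked count behind the complexity remark on P and N (an illustration derived from Eq. (4) and the stated settings, not a figure reported in the paper): each pair (p, i) contributes one sine weight and one cosine weight, so the FLANN holds 2PN adaptive weights, against L for the FIR filter.

$$2PN = 2 \times 60 \times 128 = 15360, \qquad L = 128, \qquad \frac{2PN}{L} = 120$$

Under this count the nonlinear part carries 120 times as many weights as the linear part, which suggests why raising P or N increases the per-sample cost quickly.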

From Figs. 3-4, the two systems show a similar ability to reconstruct the clean speech despite the poor SNR (−1 dB) setting. However, the proposed system has a smaller MSE between the clean and the recovered speech signals, as shown in Table I. Here, the MSE was calculated over the latter half of the signals. The proposed system, which combines FLANN and FIR filters, outperforms the conventional one with an FIR filter only in all simulated scenarios.

TABLE I
COMPARISON OF MSE BETWEEN FIR AND FLANN FILTERS (MSE ×10^−4)

Subject  Filter   SNR = 100 dB   10 dB    1 dB     0 dB     −1 dB    −3 dB
Male     FIR      1.4898         1.5143   1.6871   1.7336   1.7980   1.9754
Male     FLANN    1.2936         1.3142   1.4962   1.5883   1.6774   1.7806
Female   FIR      1.8895         1.8922   1.9063   1.9128   1.9287   1.9481
Female   FLANN    1.7528         1.7604   1.7702   1.7801   1.8023   1.8103
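The evaluation metric described above, the MSE taken over the latter half of the signals, can be sketched as follows. The helper name `mse_latter_half` is hypothetical, and for odd lengths the middle sample is counted in the latter half, which is a choice made here rather than something the paper specifies.

```python
def mse_latter_half(clean, recovered):
    """Mean squared error over the second half of two equal-length signals."""
    if len(clean) != len(recovered):
        raise ValueError("signals must have the same length")
    if not clean:
        raise ValueError("signals must not be empty")
    start = len(clean) // 2          # skip the first half (adaptation transient)
    n = len(clean) - start
    return sum((c - r) ** 2
               for c, r in zip(clean[start:], recovered[start:])) / n


# Toy usage: only the last two samples are scored, giving (0^2 + 2^2) / 2.
print(mse_latter_half([0.0, 0.0, 1.0, 1.0], [5.0, 5.0, 1.0, 3.0]))  # prints 2.0
```

Scoring only the latter half keeps the initial adaptation transient out of the comparison. Read this way, the male-subject entries at −1 dB in Table I correspond to a relative MSE reduction of (1.7980 − 1.6774) / 1.7980, or about 6.7%.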

Fig. 5. Time-frequency analysis of related speech signals of the male subject (SNR = 100 dB). Panels: (a) clean (air-conducted) speech; (b) noisy air-conducted speech (SNR = 100 dB); (c) bone-conducted speech; (d) recovered speech by FIR filter; (e) recovered speech by FLANN. [Axes: time (sec), 1 to 4; frequency (Hz), 0 to 5000.]

The time-frequency analysis results of the related speech signals of the male subject with an SNR of 100 dB are provided in Fig. 5. It can be seen clearly that the proposed system, with both linear and nonlinear ANC subsystems, exhibits a better ability to reconstruct the high-frequency components from the bone-conducted speech than the conventional FIR-based system does.

Fig. 6. Time-frequency analysis of related speech signals of the female subject (SNR = −1 dB). Panels: (a) clean (air-conducted) speech; (b) noisy air-conducted speech (SNR = −1 dB); (c) bone-conducted speech; (d) recovered speech by FIR filter; (e) recovered speech by FLANN. [Axes: time (sec), 1 to 4; frequency (Hz), 0 to 5000.]

Fig. 6 provides the time-frequency analysis results of the related speech signals from the female subject, where the SNR was set to −1 dB. The air-conducted speech is strongly polluted by the background noise, and the bone-conducted speech suffers a great loss in the high-frequency region during transmission. Both the conventional and the proposed systems can provide a proper recovery of the speech, but the latter shows a somewhat higher recovery capability than the former. However, the performance difference is not significant, and further research on more powerful nonlinear filters is an interesting future topic.

IV. CONCLUSION

In this paper, an FLANN filter using trigonometric expansions and an FIR filter are combined into a nonlinear ANC for speech enhancement. The proposed system takes the bone-conducted speech as its reference signal and the heavily polluted air-conducted speech as its primary signal. Numerous simulations using a real dataset of bone- and air-conducted speech recordings have shown that the proposed system performs better than the previous one equipped with an FIR filter alone. Developing more efficient and powerful speech enhancement systems using bone- and air-conducted signals is an open topic for further research.

REFERENCES

[1] S. Gannot, D. Burshtein, and E. Weinstein, “Iterative and sequential Kalman filter-based speech enhancement algorithms,” IEEE Trans. Speech, Audio Process., vol. 6, no. 4, pp. 373-385, Jul. 1998.

[2] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust. Speech Signal Process., vol. 27, no. 2, pp. 113-120, 1979.

[3] X. Zou, P. Jancovic, J. Liu, and M. Kokuer, “Speech signal enhancement based on MAP algorithm in the ICA space,” IEEE Trans. Signal Process., vol. 56, no. 5, pp. 1812-1820, May 2008.

[4] N. Yousefian and P. C. Loizou, “A dual-microphone speech enhancement algorithm based on the coherence function,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 2, pp. 599-609, Feb. 2012.

[5] S. Suge, D. Koizumi, and M. Fukumi, “Speaker verification method using bone-conduction and air-conduction speech,” Intelligent Signal Processing and Communication Systems, pp. 449-552, 2009.

[6] Srinivasan, Sriram, and Kechichian, “Robustness analysis of speech enhancement using a bone conduction microphone preliminary results,” Acoustic Signal Enhancement, Proceedings of IWAENC 2012, pp. 1-4, 2012.

[7] H. S. Shin, H. G. Kang, and T. Fingscheidt, “Survey of speech enhancement supported by a bone conduction microphone,” Speech Communication, pp. 1-4, 2012.

[8] J. Yu, L. Zhang, and Z. Zhou, “A novel voice collection scheme based on bone-conduction,” Proc. ISCIT2005, pp. 1126-1129, 2005.

[9] T. Shimamura and T. Tamiya, “Learning for bone-conducted speech via adaptive and neural filters,” Proc. Intl. Conf. Signals and Electronic Systems, Sep. 2006.

[10] G. L. Sicuranza and A. Carini, “A generalized FLANN filter for nonlinear active noise control,” IEEE Trans. Speech, Audio Process., vol. 19, no. 8, pp. 2412-2418, 2011.