Generalized Cross-Correlation based Noise Robust Abnormal Acoustic Event Localization utilizing Non-negative Matrix Factorization

Sungkyu Moon
Department of Visual Information Processing
Korea University, Seoul, Korea
[email protected]

Suwon Shon
School of Electrical Engineering
Korea University, Seoul, Korea
swshon@ispl.korea.ac.kr

Wooil Kim
School of Computer Science & Engineering
Incheon National University, Incheon, Korea
[email protected]

David K. Han
Office of Naval Research
Arlington, VA, USA
[email protected]

2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)

978-1-4799-4871-0/14/$31.00 ©2014 IEEE

Abstract

This paper presents a robust sound source localization method for surveillance systems. In particular, we propose an algorithm for abnormal acoustic event localization using frequency bin weighting based on non-negative matrix factorization. In abnormal acoustic event localization experiments in a real acoustic environment, the proposed algorithm is validated against the conventional method in terms of representative performance measures.

1. Introduction

Abnormal acoustic event localization has been an active research area with applications in surveillance systems [1-3]. In recent years, it has become an important topic in surveillance systems involving acoustic processing, such as sudden noise detection and localization in cars, and gunshot or scream detection [1-2]. Among the various algorithms, Generalized Cross Correlation (GCC) based algorithms are popular because of their simple implementation [4-6].

In a real environment, an abnormal acoustic event localization algorithm has to be robust to various types of noise sources and to low SNR. In a GCC based algorithm, reliable estimation requires using only the frequency bins dominated by the desired signal while avoiding noise-dominant bins.

Previous investigations use speech characteristics and noise estimation methods for noise robust weighting. Denda et al. proposed weighted Cross-power Spectrum Phase (CSP) coefficients based on an average speech spectrum [5]. Ichikawa et al. proposed a harmonic structure based weighting method for robust speech source localization [6]. Both methods have shown success in localizing speech sources. Since they are designed for speech sources, however, they are not suitable for non-speech acoustic events such as breaking glass or alarm sounds.

Shon proposed a weighting function based on the Selective Frequency Bin (SFB) method [3]. The pre-trained selected frequency bins represent the properties of each abnormal acoustic event well. However, because the weights are fixed, high noise levels can be present in the selected frequency bins.

To address this issue, that is, to select (or weight highly) the desired signal frequency bins in non-stationary noise, we propose a novel weighting function using Non-negative Matrix Factorization (NMF). The first step of the proposed process is to construct the bases of each abnormal event and their weights, as an NMF matrix pair, by training on a labeled set from a clean abnormal event database. Once converged, the basis matrix can then be applied to a noisy input spectrum to extract the desired abnormal event components. The proposed approach is based on the assumption that noise bases cannot sparsely represent the spectrum of the desired abnormal event signal.

Figure 1: Block diagram of the proposed abnormal acoustic event localization system

2. Abnormal Acoustic Event Localization

An overview of the proposed approach is sketched in Figure 1. First, an abnormal acoustic event recognition step is needed. The hierarchical structure based abnormal acoustic event recognition approach developed by Choi is adopted in this study [7]. After the acoustic event is recognized as one of the pre-defined abnormal events, the weighting function is determined based on the reconstructed abnormal event spectrogram. Finally, source localization is performed based on the weighting function, which yields the direction of the abnormal acoustic event.

The GCC of the l-th and q-th microphone signals is

$$R_{lq}(\theta) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Psi_{lq}(t,\omega)\, X_l(t,\omega)\, X_q^{*}(t,\omega)\, e^{j\omega\tau_{lq,\theta}}\, d\omega \tag{1}$$

where $\tau_{lq,\theta}$ is the Time Delay Of Arrival (TDOA) between the l-th and q-th microphones when the source is at θ degrees. The θ that maximizes the amplitude of the GCC determines the direction of the sound source. $X_l(t,\omega)$ is the Short Time Fourier Transform (STFT) of the l-th microphone input signal in the t-th frame at frequency ω, and $\Psi_{lq}$ denotes a weight function. The PHAT function is defined as $\Psi_{lq}(t,\omega) = 1/|X_l(t,\omega) X_q^{*}(t,\omega)|$. This approach, however, is insufficient since some of the frequency bins may not contain any desired abnormal event signal. To improve the algorithm, equation (2) can be developed from the PHAT function by applying a weight $W_{lq}(t,\omega)$:

$$\Psi_{lq}(t,\omega) = \frac{W_{lq}(t,\omega)}{|X_l(t,\omega)\, X_q^{*}(t,\omega)|} \tag{2}$$

As mentioned above, in previous methods the weight $W_{lq}(t,\omega)$ was obtained by selecting the frequency bins whose power exceeds the average power of the corresponding frame in the frequency domain [3]. However, because the weights are fixed, high noise levels can be present in the selected frequency bins.

To address this issue, we propose an NMF based weighting function. The procedure of the proposed system is shown in Figure 2.

Figure 2: The procedure of the proposed system
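As an illustration of equations (1) and (2), the following NumPy sketch evaluates a weighted GCC-PHAT score for one microphone pair over a grid of candidate angles and returns the angle with the largest score. Several details are assumptions not stated in the paper: the function name `gcc_direction`, a far-field model with $\tau_{lq,\theta} = d \sin\theta / c$, the sign convention that $\tau$ is the arrival time at microphone l minus that at microphone q, a 1-degree angle grid, and a plain sum over STFT bins standing in for the integral (the constant $1/2\pi$ is dropped because it does not affect the maximum).

```python
import numpy as np

def gcc_direction(X_l, X_q, freqs_hz, mic_distance_m, weight=None,
                  angles_deg=None, c=343.0, eps=1e-12):
    """Weighted GCC-PHAT direction estimate for one frame and one mic pair.

    X_l, X_q : complex STFT coefficients of the frame, shape (n_bins,)
    weight   : W_lq(t, w) of equation (2); None means plain PHAT (W = 1)
    """
    if angles_deg is None:
        angles_deg = np.arange(-90, 91)          # assumed 1-degree search grid
    omega = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float)
    if weight is None:
        weight = np.ones_like(omega)

    cross = X_l * np.conj(X_q)                   # X_l X_q^*
    psi = weight / (np.abs(cross) + eps)         # equation (2)

    # Assumed far-field TDOA model: tau = d * sin(theta) / c.
    tau = mic_distance_m * np.sin(np.deg2rad(angles_deg)) / c

    # Discrete stand-in for the integral of equation (1), one row per angle.
    steering = np.exp(1j * np.outer(tau, omega))
    R = np.real(steering @ (psi * cross))
    return angles_deg[np.argmax(R)], R
```

Passing a `weight` array that is close to zero in noise-dominated bins suppresses their contribution to the score, which is the role the paper assigns to $W_{lq}(t,\omega)$.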

3. Proposed algorithm

The idea in NMF is to factorize the data matrix (in this work, the magnitude spectrogram of the microphone input, $\mathbf{X}^{mag}$) as

$$\mathbf{X}^{mag} \approx \mathbf{D}\mathbf{H}, \qquad X^{mag}(t,\omega) \approx (\mathbf{D}\mathbf{H})_{\omega t} = \sum_{k=1}^{r} D_{\omega k} H_{kt}. \tag{3}$$

A matrix $\mathbf{X}$ can be approximately factorized into two matrices $\mathbf{D}$ and $\mathbf{H}$ with dimensions $w \times r$ and $r \times t$, respectively. $\mathbf{D}$ and $\mathbf{H}$ are referred to as the basis vectors and the weights of the bases, respectively. $r$ denotes the number of basis vectors and $A_{ij}$ denotes the i-j element of a matrix $\mathbf{A}$. Each column of $\mathbf{D}$ constitutes a source specific basis, and $\mathbf{H}$ contains the corresponding weights, determined by the amplitude of each basis used in each time frame. It is well known that this method is quite useful for sparse representation of a database [8-9]. By assuming that the sources are additive, we have equation (4), where the subscripts S and N indicate desired signal and noise.

$$\mathbf{X}^{mag} = \mathbf{X}_S^{mag} + \mathbf{X}_N^{mag} \approx \begin{bmatrix} \mathbf{D}_S & \mathbf{D}_N \end{bmatrix} \begin{bmatrix} \mathbf{H}_S \\ \mathbf{H}_N \end{bmatrix} = \mathbf{D}\mathbf{H} \tag{4}$$

There are various approaches to the factorization in equation (3). In this work, we apply the Euclidean distance measure, which is popularly used. The update rules for $\mathbf{D}$ and $\mathbf{H}$ are as follows.

$$H_{kt} \leftarrow H_{kt} \frac{(\mathbf{D}^T \mathbf{X}^{mag})_{kt}}{(\mathbf{D}^T \mathbf{D}\mathbf{H})_{kt}}, \qquad D_{\omega k} \leftarrow D_{\omega k} \frac{(\mathbf{X}^{mag} \mathbf{H}^T)_{\omega k}}{(\mathbf{D}\mathbf{H}\mathbf{H}^T)_{\omega k}} \tag{5}$$

Following equation (5), we pre-trained only the desired signal bases $\mathbf{D}_S$ using a clean abnormal event training database. When applying NMF to a noisy input signal after pre-training, we modify parts of equation (5) into equation (6) so that only the noise bases are updated.

$$H_{kt} \leftarrow H_{kt} \frac{(\mathbf{D}^T \mathbf{X}^{mag})_{kt}}{(\mathbf{D}^T \mathbf{D}\mathbf{H})_{kt}}, \qquad (D_N)_{\omega k} \leftarrow (D_N)_{\omega k} \frac{(\mathbf{X}^{mag} \mathbf{H}_N^T)_{\omega k}}{(\mathbf{D}_N \mathbf{H}_N \mathbf{H}_N^T)_{\omega k}} \tag{6}$$

The noise bases are initialized randomly (values between 0 and 1) so that they are independent of the data of any particular noise source. By iterating equation (6), the randomly initialized noise bases are updated to represent the noise components of the noisy input spectrogram. We assumed that the noise components of the input spectrogram cannot be sparsely represented by the desired signal bases, and confirmed experimentally that this assumption is appropriate. Consequently, we can reconstruct the desired signal dominant spectrogram $\mathbf{X}_S^{mag}$ as follows:

$$\hat{\mathbf{X}}_S^{mag} = \mathbf{D}_S \mathbf{H}_S$$

(the matrix product of the desired signal bases $\mathbf{D}_S$ and $\mathbf{H}_S$, which denotes the amplitude of each desired signal basis used in each time frame).
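The two-stage procedure can be sketched in NumPy as follows: pre-train $\mathbf{D}_S$ on a clean magnitude spectrogram with the multiplicative updates of equation (5), then keep $\mathbf{D}_S$ fixed on a noisy spectrogram while updating all weights and the noise bases in the manner of equation (6), and finally form $\mathbf{D}_S \mathbf{H}_S$. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names, the iteration count, the random seed, and the small constant `eps` guarding the divisions are all hypothetical choices. The last helper applies the per-frame peak normalization that the paper uses to turn the reconstruction into a GCC weight (equation (7)), assuming the maximum is taken over the frequency bins of each frame.

```python
import numpy as np

def train_signal_bases(X_clean, r_s, n_iter=200, seed=0, eps=1e-12):
    """Pre-train desired-signal bases D_S with the updates of equation (5)."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = X_clean.shape
    D = rng.random((n_bins, r_s)) + eps
    H = rng.random((r_s, n_frames)) + eps
    for _ in range(n_iter):
        H *= (D.T @ X_clean) / (D.T @ D @ H + eps)
        D *= (X_clean @ H.T) / (D @ H @ H.T + eps)
    return D

def separate_desired(X_noisy, D_s, r_n, n_iter=200, seed=0, eps=1e-12):
    """Keep D_S fixed, adapt H and the noise bases D_N, return D_S @ H_S."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = X_noisy.shape
    r_s = D_s.shape[1]
    D_n = rng.random((n_bins, r_n))              # random init in [0, 1)
    H = rng.random((r_s + r_n, n_frames))
    for _ in range(n_iter):
        D = np.hstack([D_s, D_n])                # D = [D_S  D_N], equation (4)
        H *= (D.T @ X_noisy) / (D.T @ D @ H + eps)
        H_n = H[r_s:]
        # Noise-basis update; the denominator uses only the noise factors,
        # following the form of equation (6).
        D_n *= (X_noisy @ H_n.T) / (D_n @ H_n @ H_n.T + eps)
    X_s_hat = D_s @ H[:r_s]
    return X_s_hat, D_n, H

def gcc_weight(X_s_hat, eps=1e-12):
    """Per-frame peak normalization (assumed reading of equation (7))."""
    peak = X_s_hat.max(axis=0, keepdims=True)    # max over frequency bins
    return X_s_hat / np.maximum(peak, eps)
```

Spectrograms are laid out here as (frequency bins, frames), so each column of the returned weight is one frame scaled to a peak of 1.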

In previous research [9], equation (6) was modified to update $\mathbf{H}_S$ and $\mathbf{H}_N$ separately, but in this work we observed experimentally that equation (6) performs better. There has also been previous work on multi-microphone factorization methods [8]. In this work, however, we apply simple NMF to the average magnitude of the spectrograms from the microphones, since the number of sources and the transfer function between each microphone and source have little significance for finding the frequency bins dominated by the desired signal. Finally, we derive the GCC weight $W_{lq}(t,\omega)$ by normalizing the elements of the reconstructed spectrogram $\hat{X}_S^{mag}(t,\omega)$:

$$W_{lq}(t,\omega) = \frac{\hat{X}_S^{mag}(t,\omega)}{\max\big((\hat{\mathbf{X}}_S^{mag})_t\big)} \tag{7}$$

4. Experiments

For the purpose of evaluation, we used street environmental noise, as encountered by a surveillance camera system. We take the direction of the abnormal event to be the one corresponding to the highest GCC value. The room for the experiment is 10 m x 10 m x 10 m, and a Uniform Linear Array (ULA) of 4 microphones with an inter-microphone spacing of 20 cm is located at the center of the room. The database was created with a desired abnormal event source at a DOA of 30˚ and 1 m from the ULA, stationary street noise sources [10] at 60˚, and non-stationary noise sources (car horn [11]) at 0˚. We used the same two abnormal events and the same database as Choi [7], as shown in Table 1. The test database was generated at noise levels corresponding to SNRs of 0, 5, and 10 dB, at a 16 kHz sampling rate. We used the abnormal events of the database for pre-training, and the train/test split ratio was 3:1.

The performance was evaluated with two metrics, namely Root Mean Squared Error (RMSE) and Probability Of Success (POS). Each utterance was considered a success if the estimated DOA was within an angular tolerance of 10˚. Figures 3-6 show the direction estimation performance in terms of RMSE and POS under each experimental environment. They show that the proposed method performs well under the non-stationary noise environment, while under the stationary noise condition its performance is comparable to that of the conventional method. The conventional method performs well at high SNR, but its performance drops as the SNR is lowered, especially in the non-stationary noise environment. This is because the conventional method uses not only the frequency bands dominated by the desired signal but also, inadvertently, frequency bands in which noise is concentrated.

Figure 3: Direction estimation performance in terms of RMSE and POS (Scream in Street [10] environment)

Figure 4: Direction estimation performance in terms of RMSE and POS (Glass breakage in Street [10] environment)

Event                    Total number    Total duration [sec]
Scream (Male, Female)    414             888
Breakage of Glass        127             182

Table 1. Abnormal acoustic event DB information.

Figure 5: Direction estimation performance in terms of RMSE and POS (Scream in Street [10] + Car horn [11])

Figure 6: Direction estimation performance in terms of RMSE and POS (Glass breakage in Street [10] + Car horn [11])
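The two evaluation metrics can be computed from a list of per-sample DOA estimates as in the sketch below. The function name `doa_metrics` and the example estimates are hypothetical, and treating an error of exactly 10˚ as a success is an assumption, since the text only says "within an angular tolerance of 10˚".

```python
import numpy as np

def doa_metrics(estimated_deg, true_deg, tolerance_deg=10.0):
    """Return (RMSE in degrees, probability of success) for DOA estimates."""
    err = np.asarray(estimated_deg, dtype=float) - float(true_deg)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    # A sample counts as a success when its absolute error is within tolerance.
    pos = float(np.mean(np.abs(err) <= tolerance_deg))
    return rmse, pos

# Made-up estimates for a source at 30 degrees (not data from the paper).
rmse, pos = doa_metrics([30, 35, 20, 45], true_deg=30)
print(round(rmse, 2), pos)  # 9.35 0.75
```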

5. Conclusions

This paper proposes a noise robust sound source localization method using an NMF based weighting function. A high level of DOA accuracy was achieved by weighting the frequency bins of the desired abnormal event above the bins dominated by noise. The direction estimation experiments confirm that the proposed method is robust in a surveillance camera environment, and it is expected to perform well in non-stationary noise environments.

6. Acknowledgements

This research was supported by Seoul R&BD Program (WR080951).

References

[1] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, "Scream and gunshot detection and localization for audio-surveillance systems," in Advanced Video and Signal Based Surveillance (AVSS) 2007, IEEE Conference on, pp. 21–26, 2007.

[2] S. Shon, E. Kim, J. Yoon, and H. Ko, "Sudden noise source localization system for intelligent automobile application with acoustic sensors," in Consumer Electronics (ICCE), 2012 IEEE International Conference on, pp. 233–234, 2012.

[3] S. Shon, D. K. Han, and H. Ko, "Abnormal acoustic event localization based on selective frequency bin in high noise environment for audio surveillance," in Advanced Video and Signal Based Surveillance (AVSS) 2013, IEEE Conference on, pp. 87–92, 2013.

[4] S. Shon, D. K. Han, J. Beh, and H. Ko, "Full Azimuth Multiple Sound Source Localization with 3-Channel Microphone Array," IEICE Trans. on Fundamentals, vol. E95-A, no. 4, pp. 745–750, 2012.

[5] Y. Denda, T. Nishiura, and Y. Yamashita, "Robust Talker Direction Estimation Based on Weighted CSP Analysis and Maximum Likelihood Estimation," IEICE Trans. on Information and Systems, vol. E89-D, no. 3, pp. 1050–1057, 2006.

[6] O. Ichikawa, T. Fukuda, and M. Nishimura, "DOA Estimation with Local-Peak-Weighted CSP," EURASIP Journal on Advances in Signal Processing, vol. 2010, pp. 1–10, 2010.

[7] W. Choi, J. Rho, D. K. Han, and H. Ko, "Selective Background Adaptation Based Abnormal Acoustic Event Recognition for Audio Surveillance," 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, pp. 118–123, Sep. 2012.

[8] A. Ozerov, et al., "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Trans. Audio, Speech, and Language Process., vol. 18, no. 3, pp. 550–563, 2010.

[9] M. Schmidt, et al., "Wind noise reduction using non-negative sparse coding," Proc. Workshop on MLSP, pp. 431–436, August 2007.

[10] ETSI: EG 202 396-1 V1.2.2, September 2008.

[11] Sound Ideas, http://www.sound-ideas.com/
