


JOURNAL OF LATEX CLASS FILES, VOL. XX, NO. XX, MONTH YEAR

Multi-source DOA Estimation In A Reverberant Environment Using A Single Acoustic Vector Sensor

Kai Wu, Vaninirappuputhenpurayil Gopalan Reju, Member, IEEE, Andy W. H. Khong, Member, IEEE

Abstract—We address the problem of direction-of-arrival (DOA) estimation for multiple speech sources in an enclosed environment using a single acoustic vector sensor (AVS). The challenges in such a scenario include reverberation and overlapping of the source signals. In this work, we exploit low-reverberant-single-source (LRSS) points in the time-frequency (TF) domain, where a particular source is dominant with a high signal-to-reverberation ratio. Unlike conventional algorithms, which have the limitation that such potential points need to be detected at the "TF-zone" level, the proposed algorithm performs LRSS detection at the "TF-point" level. Therefore, for the proposed algorithm, the potential LRSS points need not be neighbors of each other within a TF zone to be detected, resulting in an increased number of detected LRSS points. The detected LRSS points are further screened by an outlier removal step such that only reliable LRSS points are used for DOA estimation. Simulations and experiments were conducted to demonstrate the effectiveness of the proposed algorithm in multi-source reverberant environments.

Index Terms—DOA estimation, multiple sources, acoustic vector sensor, reverberation

I. INTRODUCTION

Direction-of-arrival (DOA) estimation of acoustic sources is an important topic in signal processing due to its widespread applications including automatic camera steering, beamforming, robotics and surveillance [1]–[3]. The presence of reverberation and background noise poses challenges that need to be addressed in a realistic environment. Furthermore, DOA estimation of multiple and simultaneously active sources in an adverse environment is still an open problem. Conventional DOA estimation approaches often employ an array of omni-directional microphones where the inter-microphone time-delay information is exploited [4]. These systems often require a large aperture size, which limits their use in space-constrained applications. An acoustic vector sensor (AVS) [5], on the other hand, consists of a single monopole pressure sensor element co-located with three orthogonally oriented dipole elements, as shown in Fig. 1 [6]. Unlike conventional arrays, which require multiple microphones with inter-element spacing, a single AVS can achieve spatial filtering with a compact configuration due to its spatial response [5], [6].

Kai Wu is with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore (e-mail: [email protected]).

V. G. Reju¹ and Andy W. H. Khong are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (e-mail: [email protected], [email protected]).

¹V. G. Reju recently moved to the Department of Instrumentation, Cochin University of Science and Technology, India (e-mail: [email protected]).

Fig. 1: An example of an acoustic vector sensor, which consists of an omni-directional microphone and three orthogonal directional microphones in the x, y and z directions, respectively (fabricated by Microflown; source: [6]).

Consider an ideally constructed AVS with frequency-independent array response positioned at the origin of a Cartesian coordinate system. The spatial response (manifold) of the sensor can be expressed as

q = [1, cosψ cosφ, cosψ sinφ, sinψ]T, (1)

where the first element denotes the response of the omni-directional microphone, which is invariant to the source DOA. The remaining three elements denote the directional responses of the orthogonal microphones in the x, y and z directions, respectively. The variables φ ∈ (−π, π] and ψ ∈ [−π/2, π/2] are defined, respectively, as the azimuth and elevation angles pointing towards the source direction. Unlike a linear microphone array, in which the array response is frequency dependent, the response of an AVS is a function of the source direction only [7]. This implies that signals can be processed across all frequency bins for an AVS, making it attractive for wideband signals such as speech.
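As an illustration of (1), the manifold can be evaluated for any azimuth/elevation pair; a minimal sketch in Python (the helper name `avs_manifold` is ours, not from the paper), assuming angles in radians:

```python
import numpy as np

def avs_manifold(phi, psi):
    """AVS spatial response q of (1) for azimuth phi and elevation psi (radians).

    First element: omni response (invariant to the source DOA).
    Remaining three: dipole responses along the x, y and z directions.
    """
    return np.array([
        1.0,
        np.cos(psi) * np.cos(phi),
        np.cos(psi) * np.sin(phi),
        np.sin(psi),
    ])

# The directional part u is a unit vector for any (phi, psi)
q = avs_manifold(np.deg2rad(110.0), np.deg2rad(-10.0))
print(np.allclose(np.linalg.norm(q[1:]), 1.0))  # True
```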

In the context of single-source DOA estimation using an AVS, the Cramér-Rao lower bound (CRLB) for a free-space scenario with additive Gaussian noise has been derived [5]. Intensity-vector and velocity-covariance DOA estimators have been proposed in [5]. In [8], a steered response power (SRP) estimator was proposed for single-source DOA estimation, and this estimator has been shown to be a generalized form of the intensity-vector and velocity-covariance estimators. In [9], a maximum likelihood estimator was proposed and the CRLB was attained through the use of Gaussian noise statistics. In addition, beamforming and subspace methods have been


proposed for an AVS array [10]–[13].

In this work, we consider the problem of using a single AVS for multi-source DOA estimation in a reverberant environment. This is still an open problem since the conventional multiple signal classification (MUSIC) algorithm [14] requires more sensors than the number of sources. Therefore, with only three sensor elements having a directional response in an AVS, the maximum number of sources is restricted to two. In addition, the aforementioned algorithms assume a free-space model [5], [8]–[14] and their performance is limited in a reverberant environment [15]. To address the limitation of the MUSIC algorithm, algorithms based on single-source points (SSPs) have become popular in recent studies [16]–[21]. Given the fact that speech signals are sparse in the time-frequency (TF) domain, these algorithms first detect TF points corresponding to a single dominant source. The intensity vectors that contain source-direction cues can then be computed only for these detected single-source points [17]–[20], [22]. These TF points are usually clustered based on the direction cues corresponding to different sources, and a single-source DOA estimator can then be applied on each cluster [18], [21], [23]. To improve clustering performance, the use of mixing-vector cues along with expectation-maximization algorithms has been proposed [24]. It has also been reported in [25] that real-time localization can be achieved when detection and clustering are performed within a signal-block length of one second.

To improve performance in a reverberant environment, the idea of SSPs has further been extended to the detection of SSPs that are less affected by reverberation [18], [20], [23]. Such low-reverberant-single-source (LRSS) points can be detected by exploiting the frequency-independence property of the array response which, in general, can be found in spherical microphone arrays [18] and the AVS [23]. Due to this property, as shown in Fig. 2, time- and frequency-smoothing can be applied across neighboring TF points within a TF zone, and a TF zone with low covariance rank will be considered as a zone of LRSS points. It was shown in [23] that DOA estimation was improved compared to the MUSIC [14] and SSP-based [21] algorithms in noisy and reverberant environments. This approach, however, still suffers from performance degradation with increasing reverberation. This is because the LRSS detection is performed at the "zone" level: a detection is successful only if all neighboring points within a TF zone satisfy the LRSS condition. The number of such TF zones decreases with increasing reverberation [17], [23].

To address the above problem, in this work we propose to perform LRSS detection at the "point" level instead of the "zone" level, as shown in Fig. 2. This implies that TF points that satisfy the LRSS condition need not be neighbours of each other to be detected. As a result, the number of detectable LRSS points is increased, resulting in higher DOA estimation accuracy. To achieve LRSS detection at the point level, a new algorithm has to be formulated as opposed to a direct extension of [23]. This is because [23] is based on the covariance of multiple neighbouring TF points and is therefore not applicable to individually separated TF points as in the current case. The proposed LRSS-Point algorithm exploits the fact that the real and imaginary parts of an LRSS point have the same absolute direction, while non-LRSS points have different absolute directions due to multi-path and/or multi-source effects. The identified LRSS points are then clustered based on the intensity vector that contains the direction cues. DOA estimation is finally applied on each cluster. Compared to [26], [27], where TF sparsity has also been exploited for the AVS, the proposed algorithm has its advantages: compared to [26], where free space is assumed, the proposed algorithm is derived based on the reverberant signal model; as opposed to [27], where a deep neural network is trained to estimate the TF mask, the proposed algorithm does not require a training process.

Fig. 2: Illustration of the difference between the "zone" level [23] versus the proposed "point" level for LRSS point detection. For zone-level detection, a detection is successful only if all the neighboring points within a TF zone satisfy the LRSS condition. For the proposed point-level detection, the TF points satisfying the LRSS condition need not be neighbours of each other to be detected.

The organization of this paper is as follows: the AVS received-signal model is formulated in Section II. The proposed LRSS point detection is described in Section III. Section IV describes the remaining procedures of the proposed algorithm including clustering, source-number estimation, outlier removal and DOA estimation. In Section V, the algorithm is validated through both simulations and experiments, and Section VI concludes the paper.

II. RECEIVED SIGNAL MODEL

Assuming I active sources in a reverberant environment, the signal received by an AVS with one omni-directional and three directional elements co-located at the origin can be modeled as [15]

[xo[n]; xd[n]] = ∑_{i=1}^{I} si[n] ∗ [ho,i[n]; hd,i[n]] + [vo[n]; vd[n]], (2)

where xo[n] and xd[n] = [xdx[n], xdy[n], xdz[n]]T are the omni-directional and the three directional element outputs, respectively. The variable n is the discrete-time index, si[n] is the ith source signal, ∗ is the convolution operator, ho,i[n] is the impulse response from the ith source to the omni-directional pressure element and hd,i[n] = [hdx,i[n], hdy,i[n], hdz,i[n]]T is a 3 × 1 impulse-response vector from the ith source to the directional elements. The variables vo[n] and vd[n]


denote the additive noise associated with the monopole and dipole elements, respectively. For ease of representation, (2) can be re-written as

x[n] = ∑_{i=1}^{I} si[n] ∗ hi[n] + v[n], (3)

where x[n] = [xo[n], xdT[n]]T, hi[n] = [ho,i[n], hd,iT[n]]T and v[n] = [vo[n], vdT[n]]T. Since we are processing the signals in the TF domain, (3) can be expressed as [28]

x(k, l) = ∑_{i=1}^{I} ∑_{l′=0}^{L} hi(k, l′) si(k, l − l′) + e(k, l) + v(k, l), (4)

where x(k, l) denotes the 4 × 1 short-time Fourier transform (STFT) coefficient vector of the received signal. The indices k and l denote the frequency-bin index and the frame index, respectively. The vector hi(k, l′), l′ = 0, . . . , L, models the acoustic transfer function (ATF) between the ith source and the sensor elements, where L denotes the index of the last frame of the impulse response. The variable si(k, l) is the STFT coefficient of the ith source signal, e(k, l) denotes the modelling error between the approximated TF-domain multiplication in (4) and the time-domain convolution in (3) [28], and v(k, l) denotes the 4 × 1 STFT coefficient vector of the noise signal. In this paper, variables written with an underline denote quantities in the TF domain. In (4), the time-domain convolution is approximated by a recursive multiplication across 0 ≤ l′ ≤ L for each frequency bin k in the TF domain. In contrast to [23], this approximation is based on the setting that the STFT frame length is much shorter than the length of the impulse response, and hence reverberant components of the previous signal frames are added onto the current signal frame [28].
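To make the signal model concrete, the time-domain mixing in (3), which (4) approximates in the TF domain, can be simulated with toy data; the signal lengths, sources and impulse responses below are purely illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_taps, I = 1600, 200, 2   # signal length, RIR length, no. of sources

# Toy source signals and 4-channel impulse responses h_i[n] (omni + x, y, z)
s = rng.standard_normal((I, n_samples))
h = 0.1 * rng.standard_normal((I, 4, n_taps))
v = 1e-3 * rng.standard_normal((4, n_samples))  # sensor noise v[n]

# x[n] = sum_i s_i[n] * h_i[n] + v[n], evaluated channel by channel
x = v.copy()
for i in range(I):
    for c in range(4):
        x[c] += np.convolve(s[i], h[i, c])[:n_samples]

print(x.shape)  # (4, 1600)
```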

With reference to (4), if we consider a short frame length (typically 8 ms) such that the first frame of the ATF vector contains mainly the direct-path component, the ATF vector at time frame l′ = 0 and frequency bin k is given by

hi(k, 0) ≈ e−ωkτi qi, (5)

where the 4 × 1 real vector qi = [1, uiT]T, with ui = [cosψi cosφi, cosψi sinφi, sinψi]T, defines the sensor manifold pointing towards the ith source, and φi and ψi denote the azimuth and elevation angles of the direct-path component, respectively. The variable τi denotes the sample delay of the direct-path impulse with respect to the first sample of the 0th frame, and ωk = 2πk/K is the discrete angular frequency with K being the fast Fourier transform (FFT) size. Note that, in practice, due to variation of room acoustics and the difficulty of choosing an optimal frame length, the assumption that the first frame contains only the direct-path component may not always hold; a solution to this problem for the proposed algorithm will be discussed in Section IV-C.

As opposed to the direct-path component, subsequent frames hi(k, l′), l′ ≥ 1, contain only reflected components. Denoting the indices of such reflections within a frame by r = 1, . . . , Nl′, where Nl′ is the total number of reflections in the l′th frame, the ATF vector can be written as

hi(k, l′) = ∑_{r=1}^{Nl′} αi^(l′,r) e−ωkτi^(l′,r) qi^(l′,r). (6)

The 4 × 1 real vector qi^(l′,r) = [1, ui^(l′,r)T]T, where ui^(l′,r) = [cosψi^(l′,r) cosφi^(l′,r), cosψi^(l′,r) sinφi^(l′,r), sinψi^(l′,r)]T defines the manifold pointing towards the rth reflection in the l′th frame, φi^(l′,r) and ψi^(l′,r) are the corresponding incident angles, τi^(l′,r) is the sample delay of the corresponding impulse with respect to the first sample of the frame and αi^(l′,r) is the attenuation due to absorption during reflections. Given the AVS received-signal model in (4)-(6), the objective is to estimate ui, i = 1, . . . , I, in (5) which, in turn, provides the DOA estimates of the sources.

Fig. 3: Block diagram of the proposed LRSS-Point DOA estimation algorithm (blocks: STFT; LRSS point detection, Section III; clustering, Section IV-A; source-number estimation, Section IV-B; outlier removal, Section IV-C; DOA estimation, Section IV-D).
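A numerical sketch of the reflected-frame ATF model (6); the attenuation, delay and incident direction of each reflection below are arbitrary placeholders:

```python
import numpy as np

def manifold(phi, psi):
    # AVS manifold of (1) for azimuth phi and elevation psi (radians)
    return np.array([1.0, np.cos(psi) * np.cos(phi),
                     np.cos(psi) * np.sin(phi), np.sin(psi)])

omega_k = 2 * np.pi * 10 / 256          # omega_k = 2*pi*k/K with k=10, K=256

# Frame l' >= 1: sum over N_l' reflections, each with attenuation alpha,
# sample delay tau and incident direction (phi, psi), per model (6)
reflections = [(0.6, 12.0, 2.1, 0.1), (0.3, 19.0, -0.7, -0.3)]
h = sum(alpha * np.exp(-1j * omega_k * tau) * manifold(phi, psi)
        for alpha, tau, phi, psi in reflections)

print(h.shape)  # (4,)
```

Each reflection contributes a differently scaled and phase-shifted manifold, so h is a complex combination of several directions rather than a single scaled manifold as in (5).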

III. PROPOSED LRSS POINT DETECTION

As shown in Fig. 3, the proposed algorithm consists of the following steps: LRSS point detection, source-number estimation, clustering, outlier removal and final DOA estimation. The key to success lies in the LRSS point detection and outlier removal steps, which will be discussed in this section and Section IV-C, respectively.

Given a reverberant environment¹, we will first discuss, in Sec. III-A, which type of TF points satisfies the single-source condition with a high signal-to-reverberation ratio. This type of TF point corresponds to the desired LRSS points that the algorithm is designed to detect. Section III-B discusses the undesired TF points contaminated by reverberation, and Section III-C discusses the undesired TF points containing

¹If the environment is anechoic/low-reverberant, hi(k, l′) ≈ 0, l′ ≥ 1, in (7), and therefore all the single-source points will satisfy the low-reverberant condition and will be detected by the LRSS detection rule in (21).


multiple sources. Through this analysis, we will explore the differences between the desired LRSS points and the remaining points, based on which the LRSS point detection rule will be developed in Section III-D.

A. LRSS points

Based on the sparsity of speech [21], consider a TF point where the energy of the ith source is much more significant than that of the other sources, i.e., ‖si(k, l)‖ ≫ ‖sj(k, l)‖, ∀j ≠ i. For such an SSP, by neglecting the modelling error and noise, (4) can be approximated as

x(k, l) ≈ ∑_{l′=0}^{L} hi(k, l′) si(k, l − l′) = hi(k, 0) si(k, l) + ∑_{l′=1}^{L} hi(k, l′) si(k, l − l′), (7)

where hi(k, 0) and hi(k, l′), l′ ≥ 1, have been defined in (5) and (6), respectively. The first term of (7) corresponds to the direct-path component, and the second term corresponds to the reflected component for this SSP. Furthermore, if the energy of the preceding TF points is less significant than that of this single-source point, i.e., ‖si(k, l − l′)‖ ≈ 0, ∀1 ≤ l′ ≤ L, the second term of (7) will approach zero and

x(k, l) ≈ hi(k, 0) si(k, l) = si(k, l) e−ωkτi qi, (8)

implying that, for this TF point, the effect of reflections can be neglected and x(k, l) contains the contribution of a single source with a high signal-to-reverberation ratio.

For such an LRSS point, by noting that x(k, l) is a complex vector, we equate the real and imaginary parts of (8) to give

R{x(k, l)} = R{si(k, l) e−ωkτi} qi, (9)
I{x(k, l)} = I{si(k, l) e−ωkτi} qi, (10)

where R(·) and I(·) denote the real and imaginary parts of a vector/variable, respectively. From (9) and (10), it is important to note that R{x(k, l)} and I{x(k, l)} contain the same manifold vector qi scaled by different real numbers. This implies that, for an LRSS point, the absolute directions of R{x(k, l)} and I{x(k, l)} are the same and that they correspond to the manifold qi towards the ith source.
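The collinearity stated by (9) and (10) is easy to verify numerically; the source coefficient, phase delay and direction below are hypothetical:

```python
import numpy as np

# Manifold (1) for an arbitrary direction (azimuth 1.92 rad, elevation -0.17 rad)
q_i = np.array([1.0, np.cos(-0.17) * np.cos(1.92),
                np.cos(-0.17) * np.sin(1.92), np.sin(-0.17)])

s = 0.8 - 1.3j                      # hypothetical STFT coefficient s_i(k, l)
x = s * np.exp(-1j * 0.4) * q_i     # LRSS model (8), with omega_k * tau_i = 0.4

# Angle between real and imaginary parts, as in (19)
re, im = x.real, x.imag
cos_theta = re @ im / (np.linalg.norm(re) * np.linalg.norm(im))
print(np.isclose(abs(cos_theta), 1.0))  # True: same absolute direction
```

Since x is a complex scalar times the real vector q_i, its real and imaginary parts are both real multiples of q_i, hence |cos θ| = 1.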

B. TF points contaminated by reverberation

We next consider a TF point whose preceding points are non-zero, i.e., ‖si(k, l − l′)‖ ≠ 0, ∀1 ≤ l′ ≤ L. Different from the previous case, the second term of (7) will now not approach zero, i.e., ∑_{l′=1}^{L} hi(k, l′) si(k, l − l′) ≠ 0. This implies that this TF point is affected by reverberation. Substituting (5) and (6) into (7), we have

x(k, l) = hi(k, 0) si(k, l) + ∑_{l′=1}^{L} hi(k, l′) si(k, l − l′)
        = si(k, l) e−ωkτi qi + ∑_{l′=1}^{L} ∑_{r=1}^{Nl′} si(k, l − l′) αi^(l′,r) e−ωkτi^(l′,r) qi^(l′,r). (11)

Equating the real and imaginary parts, we obtain

R{x(k, l)} = R{si(k, l) e−ωkτi} qi + ∑_{l′=1}^{L} ∑_{r=1}^{Nl′} R{si(k, l − l′) αi^(l′,r) e−ωkτi^(l′,r)} qi^(l′,r), (12)
I{x(k, l)} = I{si(k, l) e−ωkτi} qi + ∑_{l′=1}^{L} ∑_{r=1}^{Nl′} I{si(k, l − l′) αi^(l′,r) e−ωkτi^(l′,r)} qi^(l′,r), (13)

where the manifolds qi^(l′,r) for the reflections generally vary with respect to each other and differ from the manifold qi for the direct incidence.

From (12) and (13), it can be observed that the absolute directions of R{x(k, l)} and I{x(k, l)} will be the same if and only if the condition

R{si(k, l) e−ωkτi} / I{si(k, l) e−ωkτi} = R{si(k, l − 1) e−ωkτi^(1,1)} / I{si(k, l − 1) e−ωkτi^(1,1)} = . . . = R{si(k, l − L) e−ωkτi^(L,NL)} / I{si(k, l − L) e−ωkτi^(L,NL)} (14)

is satisfied, i.e., the ratios of the real to imaginary parts of the direct-path component si(k, l) e−ωkτi and of the reflected components si(k, l − l′) e−ωkτi^(l′,r), where r = 1, . . . , Nl′, l′ = 1, . . . , L, must all be the same. In practice, since si(k, l − l′) ≠ si(k, l) and τi^(l′,r) ≠ τi, this condition is not expected to be satisfied, particularly for a large number of acoustic reflections NL. Therefore, different from LRSS points, the absolute direction of R{x(k, l)} in general differs from that of I{x(k, l)} for a TF point contaminated by reverberation.

C. TF points containing multiple sources

We finally consider a TF point containing multiple sources. With the number of sources I ≥ 2 and, to simplify the analysis, assuming ‖si(k, l − l′)‖ ≈ 0, ∀1 ≤ l′ ≤ L, for this TF point, (4) can be rewritten as

x(k, l) = ∑_{i=1}^{I} si(k, l) hi(k, 0) = ∑_{i=1}^{I} si(k, l) e−ωkτi qi, (15)


where the qi generally differ across the I sources. Equating the real and imaginary parts of (15), we obtain

R{x(k, l)} = ∑_{i=1}^{I} R{si(k, l) e−ωkτi} qi, (16)
I{x(k, l)} = ∑_{i=1}^{I} I{si(k, l) e−ωkτi} qi. (17)

From (16) and (17), it can be observed that the absolute directions of R{x(k, l)} and I{x(k, l)} will be the same if and only if the condition

R{s1(k, l) e−ωkτ1} / I{s1(k, l) e−ωkτ1} = R{s2(k, l) e−ωkτ2} / I{s2(k, l) e−ωkτ2} = . . . = R{sI(k, l) e−ωkτI} / I{sI(k, l) e−ωkτI} (18)

is satisfied. It has been shown in [29] that this condition is unlikely to be satisfied with increasing I since, in general, s1(k, l) ≠ s2(k, l) ≠ . . . ≠ sI(k, l) and τ1 ≠ τ2 ≠ . . . ≠ τI. This implies that, unlike LRSS points, the absolute direction of R{x(k, l)} in general differs from that of I{x(k, l)} for a TF point corresponding to multiple sources.
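By contrast with the single-source check above, a two-source point following (15) generally violates (18), so the real and imaginary parts point in different directions; a numerical check with arbitrary coefficients, delays and DOAs (all values illustrative):

```python
import numpy as np

def manifold(phi, psi):
    # AVS manifold (1) for azimuth phi and elevation psi (radians)
    return np.array([1.0, np.cos(psi) * np.cos(phi),
                     np.cos(psi) * np.sin(phi), np.sin(psi)])

# Two sources with arbitrary DOAs, coefficients and phase delays: model (15)
x = (0.9 - 0.4j) * np.exp(-1j * 0.3) * manifold(1.9, -0.17) \
  + (0.5 + 1.1j) * np.exp(-1j * 1.2) * manifold(3.0, 0.26)

re, im = x.real, x.imag
cos_theta = re @ im / (np.linalg.norm(re) * np.linalg.norm(im))
print(abs(cos_theta) < 0.999)  # the two parts are no longer collinear
```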

D. LRSS point detection rule

Based on the three cases described above, we can summarize that the absolute directions of R{x(k, l)} and I{x(k, l)} will be the same for LRSS points, while different for non-LRSS points. Therefore, to determine whether a TF point satisfies the LRSS condition, the angular spacing between R{x(k, l)} and I{x(k, l)} can be computed as

θ(k, l) = cos−1 { R{x(k, l)}T I{x(k, l)} / (‖R{x(k, l)}‖ ‖I{x(k, l)}‖) } (19)

and we have θ(k, l) = 0◦ or 180◦ for an LRSS point based on (9) and (10), while θ(k, l) ≠ 0◦ and θ(k, l) ≠ 180◦ for a non-LRSS point due to (12)-(13) and (16)-(17).

In practice, due to the presence of v(k, l) and e(k, l) in (4), only θ(k, l) ≈ 0◦ or 180◦ can be achieved in (19) for an LRSS point. In addition, without a priori information pertaining to the room dimensions or the existence of any reflectors between the source and sensor, the frame length cannot be chosen optimally to ensure that hi(k, 0) contains only the direct-path component and none of the early reflections. We therefore introduce a pre-defined threshold θthr such that any TF point satisfying the condition

θ(k, l) ≤ θthr or θ(k, l) ≥ 180◦ − θthr (20)

will be considered as an LRSS point for DOA estimation. Combining (19) and (20), we have the following:

LRSS Point Detection Rule: A TF point satisfying

|R{x(k, l)}T I{x(k, l)}| / (‖R{x(k, l)}‖ ‖I{x(k, l)}‖) ≥ cos(θthr) (21)

will be detected as an LRSS point.
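A vectorized sketch of the detection rule (21); the array shapes, threshold and synthetic test data below are illustrative:

```python
import numpy as np

def detect_lrss(X, theta_thr_deg=7.0):
    """Detection rule (21): X has shape (4, K, L) of complex STFT coefficients.

    Returns a boolean (K, L) mask, True where the angle between the real and
    imaginary parts of x(k, l) is within theta_thr of 0 or 180 degrees.
    """
    re, im = X.real, X.imag
    num = np.abs(np.sum(re * im, axis=0))                       # |R^T I|
    den = np.linalg.norm(re, axis=0) * np.linalg.norm(im, axis=0) + 1e-12
    return num / den >= np.cos(np.deg2rad(theta_thr_deg))

# Synthetic check: an ideal LRSS-like point per (8) must pass the rule
rng = np.random.default_rng(1)
q = np.array([1.0, 0.3, -0.8, 0.52])                            # some manifold
X = rng.standard_normal((4, 2, 2)) + 1j * rng.standard_normal((4, 2, 2))
X[:, 0, 0] = (0.7 - 0.2j) * np.exp(-1j * 0.5) * q               # model (8)
mask = detect_lrss(X)
print(bool(mask[0, 0]))  # True
```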

Figure 4 shows the detected LRSS points for an example case of two speech sources. The original speech signals are shown in Figs. 4 (a) and (b), while Fig. 4 (c) illustrates the received signal xo[n]. In this example, a synthetic impulse response was used [30] with T60 = 300 ms, and the actual source directions were set as φ1 = 110◦, ψ1 = −10◦ and φ2 = 170◦, ψ2 = 15◦. The other configurations were kept the same as in Section V. Figure 4 (d) shows the detected LRSS points using (21) with θthr = 7◦. Given the fact that a correctly detected LRSS point would have an intensity vector aligned with the actual source direction (as will be shown in (23)), we mark the points in Fig. 4 (d) as correctly or wrongly detected in the following way: if a point has an intensity vector (computed using the first line of (23)) aligned with an actual source direction within a tolerable angular difference, this point is marked as correctly detected and is denoted by a dark- or lightly-shaded circle; otherwise, this point is marked as wrongly detected (an outlier) and is denoted by a cross. For a similar environment shown in Section V, since the DOA estimator has a maximum angular error of 5◦, the aforementioned tolerable angular difference was set as 5◦ for this plot.

It can be observed in Fig. 4 (d) that most of the points are correctly detected LRSS points. Comparing with Figs. 4 (a) and (b), these correctly detected LRSS points generally correspond to TF points whose preceding TF points are zero, as the algorithm has been designed to achieve. The outliers, denoted by crosses, are TF points wrongly detected due to the use of the empirically determined threshold θthr in (21). These wrongly detected points will be addressed by the outlier removal step in Section IV-C.

E. Further discussions for the LRSS point detection

The proposed LRSS detection rule is inspired by the intrinsic structure of the AVS, where orthogonal sensor elements are co-located. As shown in (9), the direct-path component of the received signal has the same phase delay τi across all four sensor elements, giving rise to a common scaling factor R{si(k, l) e−ωkτi} across all four elements within the vector R{si(k, l) e−ωkτi} qi. This holds similarly for (10), and consequently R{x(k, l)} and I{x(k, l)} adopt the same absolute direction for an ideal LRSS point.

As a comparison, for a conventional linear array consisting of P microphones, the direct-path component of the received signal is described by xlin(k, l) = hilin(k, 0) si(k, l), where the superscript "lin" denotes the linear array, hilin(k, 0) = [e−ωkτi,1, e−ωkτi,2, · · · , e−ωkτi,P]T is the direct-path component of the impulse response and τi,p is the time delay from the ith source to the pth microphone. We note that the time delay τi,p varies across the microphones, resulting in a different phase for each element of hilin(k, 0). As a result, even for an LRSS point, the real and imaginary parts of xlin(k, l) adopt different absolute directions. The co-location of the sensor elements within the AVS is therefore exploited by the proposed LRSS detection rule which, in turn, makes the detection particularly suitable for the AVS.


Fig. 4: (a) Original signal of the first speech source. (b) Original signal of the second speech source. (c) Received signal at the omni-directional sensor element. The received signal was generated using a simulated impulse response with T60 = 300 ms and SNR = 20 dB. The two speech sources in the upper figures were simultaneously present for the whole duration. (d) The detected LRSS points using (21). The dark- and lightly-shaded circles denote the correctly detected LRSS points corresponding to the first and second sources, respectively. The crosses denote the wrongly detected LRSS points (outliers).

IV. THE PROPOSED ALGORITHM

A. Clustering based on intensity vector

Given a set of detected LRSS points {x(k, l)}, clustering is to be performed such that the points can be grouped into clusters corresponding to different sources. In this work, clustering of the LRSS points is achieved based on the intensity vector, which contains the source DOA cue. More specifically, the vector form of an LRSS point in (8) can be rewritten as

[x_o(k, l); x_d(k, l)] ≈ s_i(k, l) e^{−jω_k τ} [1; u_i].   (22)

The intensity vector for this point can be computed as

λ(k, l) = x_o^*(k, l) x_d(k, l) = |s_i(k, l)|² u_i,   (23)

where (·)^* denotes the conjugate operator. It can be observed that λ(k, l) contains the source direction cue u_i with its magnitude being the source energy. Therefore, λ(k, l) will be used as the feature for clustering the LRSS points.

To cluster the LRSS points having similar λ(k, l) into the same group, we propose to use the angular spacing between intensity vectors as the similarity measure. Given two intensity vectors λ(k1, l1) and λ(k2, l2), this distance is defined as

D = 1 − cos ϕλ,   (24)

where

cos ϕλ = λ^T(k1, l1) λ(k2, l2) / (‖λ(k1, l1)‖ ‖λ(k2, l2)‖).   (25)

Therefore, for any two LRSS points with λ(k, l) pointing towards a similar direction, D → 0, implying that these two points belong to the same cluster. With D, any well-established clustering algorithm [31], [32], such as k-means [33] or fuzzy c-means (FCM) [34], can be used. Here, we employ the FCM algorithm, which gives a membership function that is inversely related to the distance from each cluster centroid. Therefore, defining i as the cluster (and hence the source) index, the obtained membership function M_i(k, l) serves as a soft mask for the point.
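The distance (24)-(25) can be sketched as below (our own illustrative code; the FCM step itself is omitted, as any clustering routine accepting a pairwise distance could consume this function):

```python
import numpy as np

def angular_distance(lam1, lam2):
    """Distance D = 1 - cos(phi) between two intensity vectors, cf. (24)-(25).

    The measure depends only on direction, not on magnitude, so points from
    the same source with different energies still map to D ~ 0.
    """
    cos_phi = np.dot(lam1, lam2) / (np.linalg.norm(lam1) * np.linalg.norm(lam2))
    return 1.0 - cos_phi

# Two points along the same direction but with different energies,
# and one point in an orthogonal direction
a = np.array([1.0, 0.0, 0.0])
b = np.array([5.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])
d_same = angular_distance(a, b)   # same direction: D -> 0
d_orth = angular_distance(a, c)   # orthogonal direction: D -> 1
```

Magnitude invariance is the design point here: cluster membership should not be biased towards louder TF points.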

B. Number of sources estimation

Clustering algorithms usually require one to specify the number of clusters (sources). To estimate the number of sources, and similar to [25], the method of counting significant peaks in a histogram can be used. By converting each λ(k, l) into azimuth-elevation form (φλ, ψλ), a histogram over direction can be formed. Fig. 5 (a) shows the normalized histogram of (φλ, ψλ) for all detected LRSS points in a simulated room environment with T60 = 550 ms and SNR = 10 dB. For this illustrative example, the bin width is chosen as 5° across both azimuth and elevation. The circles indicate the actual source directions. It can be observed that the intensity vectors exhibit high densities at both the source


Fig. 5: (a) Normalized histogram and (b) normalized smoothed distribution of the intensity vectors (φλ, ψλ) of all the detected LRSS points. The room environment is simulated with T60 = 550 ms and SNR = 10 dB. The solid dots indicate the actual source directions positioned at φ1 = 110°, ψ1 = −10° and φ2 = 170°, ψ2 = 15°. For (a), the bin width is chosen as 5° across both azimuth and elevation.

directions, although the density for the first source spreads due to errors introduced by reverberation and noise. Estimating the number of clusters from this raw histogram is therefore unreliable.

To address this problem, similar to [25], we propose to apply smoothing over the histogram. Denoting the count at a particular bin (φd, ψd) as C(φd, ψd), the smoothed density can be obtained by

C_smth(φd, ψd) = Σ_{a=−N}^{N} Σ_{b=−N}^{N} h(a, b) C(φd − a, ψd − b),   (26)

where h(a, b) = (1/(2πσ²)) exp(−(a² + b²)/(2σ²)) is a 2D Gaussian filter with standard deviation σ = 1.5 and number of steps N = 9. Figure 5 (b) shows the corresponding C_smth(φd, ψd) for Fig. 5 (a). It can be observed that the density for the first source is comparatively higher than in the original histogram due to the summation of counts in neighboring bins. This makes the contributions of both sources comparable in the smoothed distribution. The number of sources can then be estimated by counting the number of significant local maxima in the smoothed distribution. In this work, the number of local maxima higher than 20% of the maximum of the smoothed distribution is used as the source-number estimate.

C. Outlier removal

An example clustering result is shown in Fig. 6 (a) for the detected LRSS points in Fig. 4. Each circle corresponds to the intensity vector (source DOA cue) associated with a detected LRSS point. The distance of each circle from the origin is proportional to the source energy. The dark solid lines represent the actual DOAs of the two sources. It can be observed that the LRSS points are grouped into two clusters, denoted by the dark- and lightly-shaded circles. Although a number of outliers exist, the majority of the LRSS points lie along the actual source DOA of each cluster.

The existence of outliers is due to the use of an empirically determined threshold θthr in (21). To reduce these outliers, points that lie away from their cluster centroid are removed from the LRSS point set {x(k, l)}. More specifically, given that the centroid of the ith cluster points towards (φi, ψi), any point within that cluster that points away from that direction, i.e.,

∠{(φλ, ψλ), (φi, ψi)} > ϕfltr,   (27)

will be removed. In (27), (φλ, ψλ) denotes the intensity vector of the point in azimuth-elevation form, ∠{a, b} denotes the inter-angle between directions a and b, and ϕfltr is a pre-defined filtering threshold. In this work, ϕfltr = εσi is used, where σi is the standard deviation of the angular spacing between points within the ith cluster and ε is a pre-defined constant.

Fig. 6: Scatter diagram of the intensity vectors (source DOA cues) associated with the detected LRSS points for two speech sources. The room environment is simulated with T60 = 300 ms and SNR = 20 dB. The dark lines represent the actual source DOAs. The dark- and lightly-shaded circles represent the detected LRSS points corresponding to the first and second clusters, respectively. (a) Before outlier removal (30.5% outliers); (b) after outlier removal (3.1% outliers).
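The filtering rule (27) can be sketched as below. This is our own illustrative code: we interpret σi as the standard deviation of the point-to-centroid angular spacings, which is one plausible reading of the definition in the text:

```python
import numpy as np

def inter_angle(a, b):
    """Inter-angle (radians) between two direction vectors."""
    cos_t = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cos_t, -1.0, 1.0)))

def remove_outliers(lams, centroid, eps=0.5):
    """Discard cluster points deviating from the centroid direction by more
    than phi_fltr = eps * sigma_i, cf. (27).

    sigma_i is taken here as the std of the point-to-centroid angles
    (our reading of the definition in the text).
    """
    ang = np.array([inter_angle(lam, centroid) for lam in lams])
    phi_fltr = eps * np.std(ang)
    return [lam for lam, a in zip(lams, ang) if a <= phi_fltr]

# Four points tightly around the centroid direction plus one far-off outlier
centroid = np.array([1.0, 0.0, 0.0])
points = [np.array([np.cos(t), np.sin(t), 0.0])
          for t in (0.03, -0.05, 0.04, -0.02, 1.0)]
kept = remove_outliers(points, centroid)   # the 1.0-rad point is removed
```

The single distant point inflates the standard deviation enough that ε = 0.5 places the threshold between the tight cluster and the outlier.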

Table I shows the relationship between the choice of ε and the percentage of remaining outliers. For this illustrative example, 31% of the detected points are outliers, as can be seen from the case where no outlier removal is invoked. In general, a smaller value of ε results in more outliers being removed which, in turn, results in a lower DOA estimation error. However, an extremely small value of ε (e.g., 0.1) results in an increase in DOA estimation error. This is caused by the aggressive removal process removing "good" LRSS points as well. This example suggests that a choice of ε between 0.5 and 1.0 is desirable and, hence, ε = 0.5 is used for the subsequent simulations and experiments. Figure 6 (b) shows the result of outlier removal for the points shown in Fig. 6 (a). It can be observed that the outliers, which appeared between the actual source DOAs in Fig. 6 (a), have been removed, and the remaining points exhibit clearer directional patterns in Fig. 6 (b). The percentage of outliers has been reduced from 30.5% to 3.1% for this case.

D. DOA estimation

TABLE I: Choice of the constant ε with the resultant percentage of remaining outliers and the DOA estimation error e for a simulated room environment with T60 = 300 ms and SNR = 20 dB. Two sources are placed at φ1 = 110°, ψ1 = −10° and φ2 = 170°, ψ2 = 15°.

Constant ε                     0.1    0.5    1.0    1.5    2.0    3.0    *
Remaining outliers (%)         0.1    3.1    5.8    10     19     27     31
DOA estimation error (deg)     2.3    1.6    1.6    2.5    2.9    4.0    4.1

* No outlier removal

Given a set of clustered LRSS points {x(k, l)} with the outliers removed, single-source DOA estimators such as the velocity-covariance based [5] or the SRP based [8] estimators can now be applied to each cluster. We employ a MUSIC estimator [14] due to its simplicity and the minimal parameter tuning required. With the soft mask M_i(k, l) obtained in Section IV-A, the covariance of the ith cluster is given by

C_i = Σ_{(k,l)∈P′} M_i²(k, l) x(k, l) x^T(k, l)   (28)

and the spatial spectrum can be computed as

J_i(u) = 1 / |q^H U_i U_i^H q|,   (29)

where the 4 × 1 vector q = [1, u^T]^T, with u = [cos ψ cos φ, cos ψ sin φ, sin ψ]^T being the steering vector. The variable U_i is a matrix consisting of the three (column) eigenvectors corresponding to the smallest eigenvalues of C_i. The direction vector of the ith source is estimated using

û_i = arg max_u J_i(u),  s.t. u^T u = 1.   (30)

Finally, (28)-(30) are computed iteratively for each of the clusters to obtain the DOA of each source.
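The per-cluster MUSIC search of (28)-(30) can be sketched as follows. This is our own illustrative code, not the authors' implementation: we form the weighted covariance with the Hermitian outer product x x^H (the standard MUSIC form) and replace the continuous maximization by a coarse 5° grid search over azimuth and elevation:

```python
import numpy as np

def steering(phi, psi):
    """AVS manifold q = [1, u^T]^T of eq. (1); phi, psi in radians."""
    u = np.array([np.cos(psi) * np.cos(phi),
                  np.cos(psi) * np.sin(phi),
                  np.sin(psi)])
    return np.concatenate(([1.0], u)), u

def music_doa(X, M, n_az=72, n_el=37):
    """Soft-masked MUSIC for one cluster of LRSS points.

    X : (4, L) complex array of AVS TF points x(k, l) in the cluster
    M : (L,) soft-mask values M_i(k, l)
    Returns the unit direction vector u maximising J_i(u) on the grid.
    """
    C = (M**2 * X) @ X.conj().T          # weighted covariance, cf. (28)
    _, V = np.linalg.eigh(C)             # eigenvalues in ascending order
    U = V[:, :3]                         # noise subspace: 3 smallest eigenvalues
    best_J, best_u = -np.inf, None
    for phi in np.linspace(0.0, 2 * np.pi, n_az, endpoint=False):
        for psi in np.linspace(-np.pi / 2, np.pi / 2, n_el):
            q, u = steering(phi, psi)
            J = 1.0 / np.abs(q @ U @ U.conj().T @ q)   # cf. (29)
            if J > best_J:
                best_J, best_u = J, u
    return best_u
```

For a noiseless single-source cluster generated from the model (22), the grid point matching the true direction drives the denominator of (29) towards zero, so the search recovers the true u.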

V. SIMULATION AND EXPERIMENT

Simulations and experiments were conducted to evaluate the performance of the proposed algorithm. For the simulation setup, similar to [15], room impulse responses were generated using [30] for an 8 m × 6 m × 4 m room with an AVS located at (4 m, 3 m, 1.3 m). The pressure element was set as omni-directional and each vector-sensor element was set as bi-directional with orthogonal orientation. Speech signals sampled at 16 kHz from the TIMIT database [35] were used as source signals; they were chosen from twelve different (male and female) speakers. These twelve speakers were randomly combined to form two or three concurrent speakers and a total of six combinations was used. The sources were placed 1.5 m away from the AVS. For additive noise, white Gaussian noise and a recorded noise were used separately for different simulation setups; each was added to the four AVS channels at different signal-to-noise ratios (SNRs). The recorded noise was obtained using the Microflown AVS [6] in the meeting room where the experiments were conducted (see the penultimate paragraph of Section V). The recorded noise mainly consists of air-conditioner noise with an approximately constant energy level, and the SNR was computed based on the average noise power across the

Fig. 7: Comparison of power spectra for a simulated room environment with white Gaussian noise, T60 = 450 ms and SNR = 20 dB. The solid dots indicate the actual source directions positioned at φ1 = 110°, ψ1 = −10° and φ2 = 170°, ψ2 = 15°. (a) SRP algorithm [8]. (b) MUSIC algorithm [14], e = 34.1°. (c) SSP algorithm [21], e = 18.1°. (d) LRSS-Zone algorithm [23], e = 15.9°. (e) Proposed LRSS-Point algorithm without outlier removal, e = 9.28°. (f) Proposed LRSS-Point algorithm with outlier removal, e = 3.67°.

recording. For the proposed algorithm, LRSS points are detected within a two-second signal block. The STFT was conducted within each signal block with a frame length of 128 samples and 75% overlap between frames. Each frame was then transformed to the frequency domain with a fast Fourier transform (FFT) size of 256 samples using zero padding. The threshold for LRSS point detection was θthr = 7° and ε = 0.5 was used for outlier removal.

The proposed LRSS-Point algorithm is compared with the steered response power (SRP) [8], MUSIC [14], single-source point (SSP) [21] and LRSS-Zone [23] algorithms. The SRP algorithm was derived for a single-source scenario in [8]; for performance comparison, we extended it to the multi-source case by searching for local maxima in its spatial power spectrum. The MUSIC algorithm, although originally proposed for a linear array [14], was implemented for the AVS in this work for a fair comparison. Its covariance is estimated by (1/N_{(k,l)}) Σ_{(k,l)} x(k, l) x^T(k, l), where x(k, l) in this case denotes any TF point of the received signal (without LRSS detection) and N_{(k,l)} is the number of TF points in a signal block. The subspace decomposition is then performed similar to (29), where the number of smallest eigenvalues is set as 4 − I. The SSP algorithm [21] detects single-source points. However, it does not detect whether a single-source point satisfies the low-reverberant condition, since only time-smoothing is applied across neighboring TF points (see Section III of [23] for an explanation). The LRSS-Zone algorithm [23] detects LRSS points at the "zone" level, since frequency- and time-smoothing is applied across neighboring TF points.

TABLE II: Percentage of correct number-of-sources estimates in different simulated environments with SNR = 10 dB. The percentage is evaluated over 100 trials of signal blocks.

Reverberation time (s)                                   0.15   0.25   0.35   0.45   0.55   0.65
2 sources (φ1 = 110°, ψ1 = −10°; φ2 = 170°, ψ2 = 15°)    100%   100%   100%   100%   98%    96%
2 sources (φ1 = 130°, ψ1 = −10°; φ2 = 165°, ψ2 = 15°)    100%   98%    96%    95%    90%    86%
3 sources (φ1 = 110°, ψ1 = −10°; φ2 = 210°, ψ2 = 15°;
           φ3 = 330°, ψ3 = 20°)                          100%   90%    88%    88%    80%    66%

For I sources, accuracy is evaluated using the average angular error

e = (1/I) Σ_{i=1}^{I} 2 sin^{−1}(‖û_i − u_i‖/2),   (31)

where the term 2 sin^{−1}(‖û_i − u_i‖/2) quantifies the angle by which û_i deviates from u_i [9]. Due to the lack of mapping information between the set of estimates {û_i | i = 1, ..., I} and the set of actual DOAs {u_i | i = 1, ..., I}, the average error in (31) is evaluated across all permutations and the minimum error is assigned as e. We then quantify the performance across all signal blocks using the root-mean-square angular error (RMSAE) defined as RMSAE = √E{e²}.
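The permutation-minimised metric can be sketched as below (our own illustrative code; a brute-force search over assignments, which is adequate for the small source counts considered here):

```python
import numpy as np
from itertools import permutations

def angular_error(u_hat, u_true):
    """Average angular error e of (31), minimised over all assignments of
    estimated directions to true sources (brute-force permutation search).

    u_hat, u_true : (I, 3) arrays of unit direction vectors.
    """
    I = len(u_true)
    best = np.inf
    for perm in permutations(range(I)):
        e = np.mean([2.0 * np.arcsin(np.linalg.norm(u_hat[p] - u_true[i]) / 2.0)
                     for i, p in enumerate(perm)])
        best = min(best, e)
    return best

def rmsae(errors):
    """Root-mean-square angular error over signal blocks: sqrt(E{e^2})."""
    return float(np.sqrt(np.mean(np.square(errors))))

# One estimate rotated by 0.2 rad, the other exact, listed in swapped order;
# the permutation search fixes the ordering, giving a mean error of 0.1 rad
u_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
u_hat = np.array([[0.0, 1.0, 0.0],
                  [np.cos(0.2), np.sin(0.2), 0.0]])
e = angular_error(u_hat, u_true)
```

The chord-to-angle mapping 2 sin^{−1}(‖û − u‖/2) recovers the exact great-circle angle between unit vectors, which is why a 0.2 rad rotation of one source contributes exactly 0.2 before averaging.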

Table II shows the percentage of correct number-of-sources estimates using the proposed method of Sec. IV-B for different simulated environments. Each case is evaluated over 100 trials of different signal blocks. We observe that the proposed method, in general, achieves an accuracy greater than 85%, except for the cases where three sources are present with T60 ≥ 550 ms. To simplify the evaluation of DOA estimation performance in the subsequent simulations, the RMSAE is computed over the signal blocks in which the number of sources is correctly estimated.

Figure 7 compares the computed power spectra for a simulated room environment with T60 = 450 ms and SNR = 20 dB. Two sources are positioned at φ1 = 110°, ψ1 = −10° and φ2 = 170°, ψ2 = 15°, as indicated by the solid dots (elevated to the top for better visibility). It can be observed that the SRP algorithm, shown in Fig. 7 (a), fails to exhibit two maxima for the two closely-separated sources. This implies that the angular resolution of a single-source DOA estimator is low in a reverberant environment. The MUSIC algorithm, shown in Fig. 7 (b), exhibits two maxima that deviate from the

Fig. 8: Variation of RMSAE with T60 for a simulated environment with white Gaussian noise, for SNR = 15 dB and SNR = 5 dB. Two sources are positioned at φ1 = 110°, ψ1 = −10° and φ2 = 170°, ψ2 = 15°. The RMSAE is computed over 100 trials of signal blocks.

actual source directions due to the effects of reverberation. The SSP and LRSS-Zone algorithms, shown in Figs. 7 (c) and (d), respectively, result in two maxima closer to the ground truth. The proposed LRSS-Point algorithm is shown in Figs. 7 (e) and (f), which correspond to without and with outlier removal, respectively. It can be observed that outlier removal reduces the error from 9.28° to 3.67° for this example, and the proposed algorithm (with outlier removal) achieves the lowest estimation error among all the algorithms.

Figure 8 shows how the RMSAE varies with T60 for two active sources. White Gaussian noise was used and the setup was otherwise kept the same as in the previous simulation. The RMSAE is computed over 100 trials of signal blocks. Due to its low angular resolution, the SRP algorithm has been omitted. These results show that the performance of the remaining four algorithms degrades with increasing T60 for both tested SNR conditions, as expected. While MUSIC and SSP achieve an error of less than 5° when T60 = 150 ms, their performance deteriorates significantly with increasing T60. This is because these algorithms were derived using the free-space model; they are unable to mitigate the detrimental effects caused by reverberation. The LRSS-Zone algorithm achieves a lower error than MUSIC and SSP due to its exploitation of the LRSS condition. The proposed LRSS-Point algorithm achieves a lower error than the LRSS-Zone algorithm. This is because the former performs LRSS detection at the "point" level, which results in a higher probability of detecting LRSS points. It is also noted that the RMSAE of the proposed algorithm increases when T60 = 650 ms and SNR = 5 dB. This is because the number of points satisfying (21) decreases with increasing T60.

Figure 9 shows the results when recorded noise is used. The configuration is otherwise kept the same as in the previous simulation.


Fig. 9: Variation of RMSAE with T60 for a simulated environment with recorded noise, for SNR = 15 dB and SNR = 5 dB. Two sources are positioned at φ1 = 110°, ψ1 = −10° and φ2 = 170°, ψ2 = 15°. The RMSAE is computed over 100 trials of signal blocks.

Fig. 10: Variation of RMSAE with T60 for a simulated environment with recorded noise at SNR = 20 dB. Three sources are positioned at φ1 = 110°, ψ1 = −10°, φ2 = 210°, ψ2 = 15° and φ3 = 330°, ψ3 = 20°. The RMSAE is computed over 100 trials of signal blocks.

Since the recorded noise is more correlated across the AVS channels, the DOA estimation error increases compared to the previous simulation. Although a high RMSAE is observed for the proposed algorithm when T60 = 650 ms and SNR = 5 dB, it generally achieves a lower error than the other algorithms. Figure 10 shows the results for the three-source case with φ1 = 110°, ψ1 = −10°, φ2 = 210°, ψ2 = 15° and φ3 = 330°, ψ3 = 20°. Recorded noise was added with SNR = 20 dB. Results for the MUSIC algorithm are not included since it requires the number of sources to be less than the number of dipole elements in the AVS. Similar to Figs. 8 and 9, the proposed LRSS-Point algorithm achieves the lowest error.

Figure 11 shows the variation of RMSAE with the angular separation between two active sources. The room is simulated with T60 = 300 ms and recorded noise is added at SNR = 15 dB. Five different realizations of the source-position configuration are used for each source-separation angle and

Fig. 11: Variation of RMSAE with the angular separation between two active sources for T60 = 300 ms and SNR = 15 dB. Five different realizations of the source-position configuration are used for each source-separation angle and the RMSAE is computed over 100 trials of signal blocks.

Fig. 12: Experiment setup in an actual meeting room of dimension 6.2 m × 4.7 m with a room height of 2.7 m. The AVS was placed at a height of 1.5 m and the loudspeakers at a height of 1.3 m.

the RMSAE is averaged over these five realizations. The SRP algorithm achieves a low error when the source angular separation is 130°. However, its RMSAE increases significantly with decreasing angular separation. A similar trend can be observed for the MUSIC algorithm, whose angular resolution is affected by reverberation. The SSP, LRSS-Zone and proposed LRSS-Point algorithms are, in general, less sensitive to the separation since these algorithms identify and utilize only single-source points in which one dominant source is present. The proposed LRSS-Point algorithm achieves the lowest error among these algorithms, including the special case of sources being 180° apart. It is worth noting that for the proposed algorithm, when two sources are close (q1 ≈ q2), and based on (16) and (17), the absolute directions of R{x(k, l)} and I{x(k, l)} will be close. This implies that θthr in (21) should be further reduced in order to discriminate between two closely-separated sources. However, as reverberation increases, the absolute directions of R{x(k, l)} and I{x(k, l)} become noisy. This implies that a small threshold may also result in an insufficient number of TF points being detected as LRSS points.

The proposed LRSS-Point algorithm is further verified using signals recorded in an actual meeting room, shown in Fig. 12, with an estimated T60 = 450 ms. The room is of dimension 6.2 m × 4.7 m × 2.7 m. The Microflown AVS [6], as shown in Fig. 1, was placed approximately at the center of the room at a height of 1.5 m. Speech signals from the TIMIT database were played through loudspeakers to simulate speech sources. The loudspeakers were located 1.6 m away from the AVS. The


Fig. 13: Variation of RMSAE with the angular separation between two active sources in an actual room with estimated T60 = 450 ms and estimated SNR = 11 dB. The RMSAE is computed over 100 trials of signal blocks.

average SNR at the AVS was estimated to be 11 dB.

Figure 13 shows the RMSAE of the four algorithms for different source-angular separations. As can be observed, the MUSIC and SSP algorithms achieved an error of less than 20° when the two source directions were well separated at 117°. However, these two algorithms are unable to discriminate between the sources as they become closer. In this experiment, the proposed LRSS-Point algorithm outperformed the other three methods. Comparing the experimental result at 63° separation in Fig. 13 with the simulation shown in Fig. 9 for T60 = 450 ms and SNR = 15 dB, the proposed LRSS-Point algorithm exhibits a higher error (≥ 10°) than in simulation (≈ 5°). This is because the simulation synthesizes reflections only from the room boundaries, while in practice reflections may also originate from objects within the recording room (e.g., tables and chairs). In the latter case, waves reflected from such objects may arrive earlier than the boundary reflections. As a result, the assumed direct-path component of the impulse response h_i(k, 0) in (5) may contain early reflections and more outliers will be introduced.

VI. CONCLUSION

A multi-source DOA estimation algorithm using a single AVS has been proposed for reverberant environments. The proposed algorithm conducts LRSS detection at the "point" level. The detected LRSS points are then clustered, after which outliers are removed. DOA estimation is achieved by operating only on the remaining reliable LRSS points. Simulation and experiment results show that the proposed algorithm outperforms the SRP, MUSIC, SSP and LRSS-Zone algorithms.

REFERENCES

[1] C. Zhang, D. Florencio, D. E. Ba, and Z. Zhang, “Maximum likelihoodsound source localization and beamforming for directional microphonearrays in distributed meetings,” IEEE Trans. Multimedia, vol. 10, no. 3,pp. 538–548, 2008.

[2] J. DiBiase, H. Silverman, and M. Brandstein, “Robust localization inreverberant rooms,” Microphone Arrays: Signal Processing Techniquesand Applications., pp. 157–180, 2001.

[3] Y. Huang, J. Chen, and J. Benesty, “Immersive audio schemes,” IEEESignal Process. Magazine, vol. 28, pp. 20–32, Jan. 2011.

[4] J. Benesty, J. Chen, and Y. Huang, Microphone array signal processing,vol. 1, Springer, 2008.

[5] A. Nehorai and E. Paldi, “Acoustic vector-sensor array processing,”IEEE Trans. Signal Process., vol. 42, no. 9, pp. 2481–2491, Sep. 1994.

[6] Microflown, “Acoustic vector sensor,” http://microflown-maritime.com/avs-technology/, (Accessed: 03/10/2017).

[7] X. Zhong and A. B. Premkumar, “Particle filtering approaches formultiple acoustic source detection and 2-D direction of arrival estimationusing a single acoustic vector sensor,” IEEE Trans. Signal Process., vol.60, no. 9, pp. 4719–4733, 2012.

[8] D. Levin, S. Gannot, and E. A. P. Habets, “Direction-of-arrivalestimation using acoustic vector sensors in the presence of noise,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’11),2011, pp. 105–108.

[9] D. Levin, E. A. P. Habets, and S. Gannot, “Maximum likelihoodestimation of direction of arrival using an acoustic vector-sensor,” J.Acoust. Soc. Amer., vol. 131, no. 2, pp. 1240–1248, 2012.

[10] M. Hawkes and A. Nehorai, “Acoustic vector-sensor beamforming andCapon direction estimation,” IEEE Trans. Signal Process., vol. 46, no.9, pp. 2291–2304, 1998.

[11] M. Hawkes and A. Nehorai, “Wideband source localization using adistributed acoustic vector-sensor array,” IEEE Trans. Signal Process.,vol. 51, no. 6, pp. 1479–1491, 2003.

[12] H. Chen and J. Zhao, “Coherent signal-subspace processing of acousticvector sensor array for DOA estimation of wideband sources,” SignalProcessing, vol. 85, no. 4, pp. 837–847, 2005.

[13] S. Zhao, S. Ahmed, Y. Liang, K. Rupnow, D. Chen, and D. L. Jones, “Areal-time 3D sound localization system with miniature microphone arrayfor virtual reality,” in Proc. 7th IEEE Int. Conf. Industrial Electronicsand Applications (ICIEA), 2012, pp. 1853–1857.

[14] D. H. Johnson and D. E. Dudgeon, Array signal processing: conceptsand techniques, Simon & Schuster, 1992.

[15] D. Levin, E. A. P. Habets, and S. Gannot, “On the angular error ofintensity vector based direction of arrival estimation in reverberant soundfields,” J. Acoust. Soc. Amer., vol. 128, no. 4, pp. 1800–1811, 2010.

[16] W. Zhang and B. D. Rao, “A two microphone-based approach for sourcelocalization of multiple speech sources,” IEEE Trans. Audio, Speech,Lang. Process., vol. 18, no. 8, pp. 1913–1928, 2010.

[17] A. H. Moore, C. Evers, and P. Naylor, “Direction of arrival estimation inthe spherical harmonic domain using subspace pseudointensity vectors,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 1, pp.178–192, 2017.

[18] O. Nadiri and B. Rafaely, “Localization of multiple speakers underhigh reverberation using a spherical microphone array and the direct-path dominance test,” IEEE/ACM Trans. Audio, Speech, Lang. Process.,vol. 22, no. 10, pp. 1494–1505, 2014.

[19] D. Pavlidi, S. Delikaris-Manias, V. Pulkki, and A. Mouchtaris, "3D DOA estimation of multiple sound sources based on spatially constrained beamforming driven by intensity vectors," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'16), 2016, pp. 96–100.

[20] S. Hafezi, A. H. Moore, and P. A. Naylor, “Multi-source estimation con-sistency for improved multiple direction-of-arrival estimation,” in 2017Hands-free Speech Communications and Microphone Arrays (HSCMA),March 2017, pp. 81–85.

[21] S. Mohan, M. E. Lockwood, M. L. Kramer, and D. L. Jones, “Local-ization of multiple acoustic sources with small arrays using a coherencetest,” J. Acoust. Soc. Amer., vol. 123, no. 4, pp. 2136–2147, 2008.

[22] S. Tervo, “Direction estimation based on sound intensity vectors,” in17th European Signal Process. Conference (EUSIPCO 2009). IEEE,2009, pp. 700–704.

[23] K. Wu, V. G. Reju, and A. W. H. Khong, "Multi-source direction-of-arrival estimation in a reverberant environment using single acoustic vector sensor," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP'15), 2015, pp. 444–448.

[24] X. Chen, W. Wang, Y. Wang, X. Zhong, and A. Alinaghi, “Reverberantspeech separation with probabilistic time-frequency masking for b-format recordings,” Speech Communication, vol. 68, pp. 41 – 54, 2015.

[25] D. Pavlidi, A. Griffin, M. Puigt, and A. Mouchtaris, “Real-time multiplesound source localization and counting using a circular microphonearray,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10,pp. 2193–2206, 2013.

[26] Y. X. Zou, W. Shi, B. Li, C. H. Ritz, M. Shujau, and J. Xi, “Multisourcedoa estimation based on time-frequency sparsity and joint inter-sensordata ratio with single acoustic vector sensor,” in Proc. IEEE Int. Conf.Acoust., Speech, Signal Process. (ICASSP’13). IEEE, 2013, pp. 4011–4015.

[27] D. Wang, Y. Zou, and W. Wang, “Learning soft mask with dnn and dnn-svm for multi-speaker doa estimation using an acoustic vector sensor,”Journal of the Franklin Institute, 2017.


[28] A. Jukić, T. van Waterschoot, T. Gerkmann, and S. Doclo, "Multi-channel linear prediction-based speech dereverberation with sparse priors," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 9, pp. 1509–1520, 2015.

[29] V. G. Reju, S. N. Koh, and I. Y. Soon, “An algorithm for mixing matrixestimation in instantaneous blind source separation,” Signal Processing,vol. 89, no. 9, pp. 1762–1773, 2009.

[30] E. A. P. Habets, “Room impulse response (RIR) generator,” https://www.audiolabs-erlangen.de/fau/professor/habets/software/rir-generator, (Ac-cessed: 21/10/2017).

[31] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Comput. Surveys, vol. 31, no. 3, pp. 264–323, 1999.

[32] R. Xu and D. Wunsch II, “Survey of clustering algorithms,” IEEE.Trans. Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.

[33] J. MacQueen, “Some methods for classification and analysis of multi-variate observations,” in Proc. the 5th Berkeley Symp. on Math. Statist.and Prob., 1967, vol. 1, pp. 281–297.

[34] J. C. Bezdek, Pattern recognition with fuzzy objective function algo-rithms, Kluwer Academic Publishers, 1981.

[35] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, TIMIT Acoustic-Phonetic Continuous Speech Corpus, Philadelphia, PA, 1993.

Kai Wu received his B.Eng. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2010, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2017. His Ph.D. research topic was algorithm development for talker localization and tracking in room environments. From 2015 to 2017, he was an R&D Engineer for audio signal processing applications at Panasonic R&D Center Singapore. He is currently a Research Scientist at the Agency for Science, Technology and Research, Singapore.

His research interests include Bayesian filtering and its applications, array signal processing, data analytics and machine learning.

Vaninirappuputhenpurayil Gopalan Reju (M'10) is currently an Associate Professor in the Department of Instrumentation, Cochin University of Science and Technology, India. Prior to that, he obtained his Ph.D. ('02-'09) from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, where he worked as a Research Associate ('07-'10), Research Fellow ('10-'15) and Senior Research Fellow ('15-'18). He obtained his B.Tech. ('88-'92) from the University of Kerala, India. In 1994 he obtained his M.Tech. from Cochin University of Science and Technology, after which he served as a lecturer ('95-'02) in the same university. His research interests include blind source separation, speech enhancement and array signal processing.

Andy W. H. Khong (M'06) is currently an Associate Professor in the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. He obtained his B.Eng. ('98-'02) from Nanyang Technological University, Singapore, and his Ph.D. ('02-'05) from the Department of Electrical and Electronic Engineering, Imperial College London, after which he served as a research associate ('05-'08) in the same department. His research interests include adaptive filtering, speech enhancement, and acoustic source localization and tracking. More recently, his research includes the use of machine learning and data mining for education data. Andy was a visiting professor at UIUC in 2012 under the Tan Chin Tuan Fellowship and is the author/co-author of three "Best Student Paper Award" papers. Andy is currently serving as an Associate Editor of the IEEE/ACM Trans. Audio, Speech and Language Processing.