
Implementation of an acoustic localization

algorithm for video camera steering

Paolo Minero, Mustafa Ergen

{minero,ergen}@eecs.berkeley.edu

University of California at Berkeley

1 Introduction

Advances in wireless networking and sensor network design open new prospects for performing demanding collaborative sensing and signal processing. This project was motivated by the ambitious plan of designing and implementing acoustic applications in low-cost distributed wireless sensor networks. As a first step, we considered the problem of acoustic localization through a set of distributed sensors. Acoustic localization has been extensively studied for microphone arrays, since it has several practical applications, e.g., video surveillance, video conferencing, and home and military uses.

In this report, we first introduce a classic mathematical model for the propagation of wideband acoustic sources in a reverberant environment. Then we review two traditional approaches to acoustic localization, based on the maximum-likelihood and cross-correlation methods. Finally, we describe the implementation of a Windows-based wireless networked acoustic sensor testbed used for video camera steering.


2 Channel model

In the following we consider wideband signals (i.e., 30 Hz to 15 kHz) such as voice, vibrations, or movements. We assume operation in the near-field regime and that the propagation speed is known (typically 345 m/s for sound in air). The acoustic source is located in a reverberant space. Consider the situation in which 2R sensors are randomly distributed in space and a single acoustic source generates the wavefront x(t) in a multipath environment. The channel between the source and each sensor is modeled as an LTI system, and the signal received at each sensor is sampled at the rate f_c of one sample per unit of time. Assume that all sensors sample the received signal synchronously. We assume that the channel response has a finite number of taps L, and that the channel taps do not vary over N sample times. Starting at time 1, a sequence of samples y_i[1], y_i[2], . . . is received at the i-th sensor, i = 1, . . . , 2R. The resulting single-input multiple-output (SIMO) channel is:

$$y_i[n] = \sum_{k=0}^{L-1} h_i[k]\, x[n-k] + w_i[n], \qquad i = 1, 2, \ldots, 2R \tag{1}$$

where h_i[n] is the sampled impulse response of the channel between the source and the i-th sensor, and w_i[n] is additive background noise, modeled as a white Gaussian process with zero mean and variance σ_n². The noise is assumed uncorrelated from sensor to sensor. In vector form, the SIMO system is equivalent to

$$\mathbf{y}_i = H_i \mathbf{x} + \mathbf{w}_i, \qquad i = 1, 2, \ldots, 2R \tag{2}$$

2

Page 3: Acoustic Positioning

where

$$\mathbf{y}_i = \big[\, y_i[1] \;\; y_i[2] \;\; \ldots \;\; y_i[N] \,\big]^T, \qquad \mathbf{x} = \big[\, x[1] \;\; x[2] \;\; \ldots \;\; x[N] \,\big]^T$$

and H_i is the N × N Toeplitz convolution matrix

$$H_i = \begin{bmatrix}
h_i[0] & 0 & \cdots & \cdots & 0 \\
h_i[1] & h_i[0] & \ddots & & \vdots \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
h_i[L-1] & \cdots & h_i[0] & \ddots & \vdots \\
\vdots & \ddots & & \ddots & 0 \\
0 & \cdots & h_i[L-1] & \cdots & h_i[0]
\end{bmatrix}$$

i.e., the matrix whose first column is [h_i[0], . . . , h_i[L−1], 0, . . . , 0]^T, with each subsequent column shifted down by one sample.
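To make the model concrete, the following Matlab sketch simulates (1)-(2) for one sensor. The block length, tap values, and noise level are illustrative assumptions, not measured channels.

% Minimal sketch of the SIMO model (1)-(2) for one sensor; all sizes
% and tap values are illustrative assumptions.
N = 1000;                                % samples per block
L = 8;                                   % number of channel taps
x = randn(N, 1);                         % wideband source (white, for illustration)
h = [1; 0.3*randn(L-1, 1)];              % strong direct path plus weaker reflections
% N x N Toeplitz convolution matrix H_i built from the taps
Hi = toeplitz([h; zeros(N-L, 1)], [h(1), zeros(1, N-1)]);
sigma_n = 0.1;                           % noise standard deviation
yi = Hi*x + sigma_n*randn(N, 1);         % received samples at sensor i, eq. (2)
% equivalently: yi = filter(h, 1, x) + sigma_n*randn(N, 1);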

3 Problem formulation

We are interested in estimating the location of the source r_s = [r_{s_x}, r_{s_y}, r_{s_z}]^T, given the received samples y_i[n] and the location of each sensor r_i = [r_{i_x}, r_{i_y}, r_{i_z}]^T, i = 1, . . . , 2R. Let us partition the sensors into pairs, and let us denote by j_1 and j_2 the two sensors in the j-th pair. In a multipath channel such as (1), an observable signal characteristic is the time difference of arrival (TDOA) of the source relative to each pair of sensors. In fact, the signal component propagating through the direct path is usually stronger than the components due to reflections. Hence, the channel tap due to the direct path is most of the time the largest in magnitude. As such, the propagation delay in samples τ_i between the i-th sensor and the source can be determined as

$$\tau_i = \arg\max_{j} |h_i[j]|, \qquad j = 1, 2, \ldots, L \tag{3}$$

The TDOA between the two channels in the j-th pair of sensors is then obtained as

$$\tau_j = \tau_{j_1} - \tau_{j_2}, \qquad j = 1, 2, \ldots, R \tag{4}$$


Given the TDOA and the sampling rate f_c, one can easily compute the relative distance of the acoustic source from each pair of microphones as

$$D_j = \frac{\tau_j\, v}{f_c}, \qquad j = 1, 2, \ldots, R \tag{5}$$

where v is the propagation speed of sound. The TDOA estimates of each pair of sensors, along with the knowledge of the sensor positions, are used to arrive at a source location estimate.
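As a numerical illustration with assumed values: with f_c = 8 kHz and v = 345 m/s, a TDOA of τ_j = 10 samples corresponds to D_j = 10 · 345/8000 ≈ 0.43 m.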

Given D_j, the sound source r_s must lie on a three-dimensional hyperboloid with equation d(r_{j_1}, r_s) − d(r_{j_2}, r_s) = D_j, where d(x, y) denotes the Euclidean distance between the vectors x and y. Let us denote by D = [D_1 . . . D_R]^T the set of distances from the source to every pair of sensors. The vector D identifies a set of hyperboloids and, as a consequence, the location of the source must be the unique intersection point of this set of hyperboloids. However, the channel impulse response h_i is usually not known at the i-th sensor, and it must be estimated from the received samples y_i. The channel estimation problem is particularly challenging since we do not know the transmitted signal x: the channel must be blindly estimated. In the next section we derive the optimum ML estimator for the joint estimation of the source signal x and the channel h_i given y_i. Errors in the channel estimates may introduce errors in D, in which case there might not exist a point lying on all the hyperboloids. So, given a set of R pairs of sensors, we estimate the source location as the vector r_s that minimizes the least-squares (LS) error with respect to the vector D. Thus, the source location is estimated as

$$\hat{r}_s = \arg\min_{r} \sum_{j=1}^{R} \Big| d(r_{j_1}, r) - d(r_{j_2}, r) - D_j \Big|^2 \tag{6}$$
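The package solves (6) in myfun.m and LMS.m; as an independent minimal sketch, the fit can be done with a simplex search, assuming the sensor coordinates are stored in 3 × R matrices rj1 and rj2 and the path differences in an R × 1 vector D (all names and the use of fminsearch are our illustrative choices):

% Minimal sketch of the LS fit (6); rj1, rj2, D, and R are assumed
% to be defined as described in the lead-in.
d   = @(a, b) norm(a - b);                       % Euclidean distance
err = @(r) sum(arrayfun(@(j) ...
        (d(rj1(:,j), r) - d(rj2(:,j), r) - D(j)).^2, 1:R));
r0     = mean([rj1 rj2], 2);                     % start at the array centroid
rs_hat = fminsearch(err, r0);                    % unconstrained simplex search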

In the next section, two methods for channel and TDOA estimation are introduced.


4 TDOA estimation

4.1 ML estimation

Let us consider the i-th sensor. We are interested in jointly estimating the source signal x and the channel h_i given the received y_i. The block of samples y_i can be transformed to the frequency domain by a DFT of length N_c. In order to approximate the linear convolution in (1), appropriate zero padding can be applied, so in general N_c > N. In the frequency domain, the channel model is given by

$$\mathbf{Y}_i = H_i \mathbf{X} + \mathbf{W}_i, \qquad i = 1, 2, \ldots, 2R \tag{7}$$

where Y_i = [Y_i[1] Y_i[2] . . . Y_i[N_c]]^T represents the N_c-point DFT of the sampled data received at the i-th sensor. The ML estimation problem can be expressed as:

$$\max_{H_i, \mathbf{X}} f\left( \mathbf{Y} \mid H_i, \mathbf{X} \right) \tag{8}$$

It can be shown that the solution to the ML problem is the following:

$$\hat{H}_i = \arg\min_{H_i} \left\| \left( I - H_i \left( H_i^T H_i \right)^{-1} H_i^T \right) \mathbf{Y} \right\|^2 \tag{9}$$

$$\hat{\mathbf{X}} = \left( H_i^T H_i \right)^{-1} H_i^T \mathbf{Y} \tag{10}$$

Given the channel estimate (9), one can compute the propagation delay in samples as

$$\hat{\tau}_i = \arg\max_{j} \left| \mathrm{IDFT}\big( \hat{H}_i \big)[j] \right| \tag{11}$$
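Given such an estimate, (11) reduces to an inverse DFT followed by a peak search. A minimal Matlab sketch, where Hhat_i is an assumed N_c-point frequency-response estimate (e.g., obtained by solving (9)):

% Minimal sketch of (11); Hhat_i is an assumed channel estimate.
h_time = abs(ifft(Hhat_i));     % back to the tap domain
[~, idx] = max(h_time);         % strongest tap = direct path
tau_hat = idx - 1;              % Matlab indices are 1-based; delay in samples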

Complexity prohibits the use of (9) in real-time applications. More generally, all the optimal strategies proposed in the literature are computationally intense, although they tend to be robust in realistic environments. The majority of practical localization systems are based on less complex but suboptimal estimators.

4.2 GCC method

The most commonly used method for TDOA estimation is the generalized cross-correlation (GCC) method, introduced in [2]. The TDOA estimate of each pair of sensors is obtained as the time lag that maximizes the cross-correlation function between filtered versions of the received signals. The GCC function is in fact defined as the cross-correlation function of the filtered signals. In [2], the signals at the two sensors are modeled as

$$y_1(t) = x(t) + n_1(t)$$
$$y_2(t) = \alpha\, x(t - D) + n_2(t)$$

where n_1(t), n_2(t), and x(t) are assumed jointly wide-sense stationary. A voice signal is usually assumed stationary over 20-30 ms frames. A property of the autocorrelation function is that R(τ) ≤ R(0). The cross-correlation of y_1(t) and y_2(t) is

$$R_{y_1 y_2}(\tau) = \alpha\, R_{xx}(\tau) * \delta(\tau - D) \tag{12}$$

which has a peak at τ = D. So, the delay estimate is

$$\hat{D} = \arg\max_{\tau}\, R_{y_1 y_2}(\tau) \tag{13}$$

If the input signal is ergodic, the cross-correlation function can be approximated by the time-average correlation

$$R_{y_1 y_2}(\tau) = \frac{1}{T - \tau} \int_{\tau}^{T} y_1(t)\, y_2(t - \tau)\, dt \tag{14}$$


The cross-correlation function is related to the cross-spectral density function through an inverse Fourier transform:

$$R_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} G_{y_1 y_2}(f)\, e^{j 2 \pi f \tau}\, df \tag{15}$$

In order to eliminate the influence of the possible autocorrelation of the source signal, it is desirable to pre-filter y_1 and y_2 before their cross-correlation is computed. Various filter designs have been proposed (whitening filters, Wiener-Hopf, ML for Gaussian sources). When y_1 and y_2 are filtered, the cross-correlation function becomes

$$R_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} \Psi(f)\, G_{y_1 y_2}(f)\, e^{j 2 \pi f \tau}\, df \tag{16}$$

The GCC technique does not perform well in a reverberant environment (the mathematical model assumes free-space propagation). A basic approach to dealing with multipath channels is to deemphasize unreliable frequency components through a frequency-dependent weighting. The Phase Transform (PHAT) method places equal emphasis on each component of the cross-spectrum phase. The corresponding filter design is

$$\Psi(f) = \frac{1}{\left| G_{y_1 y_2}(f) \right|} \tag{17}$$

The resulting peak corresponds to the dominant delay in the reverberant signal. However, the PHAT filter accentuates components with poor SNR, and it is known to perform badly under high reverberation and high noise conditions. We can apply the GCC method to the channel model in (1). To begin, the model can be written as

$$\mathbf{y}_i = H_{d_i} \mathbf{x} + \mathbf{z}_i, \qquad i = 1, 2, \ldots, 2R \tag{18}$$

where H_{d_i} is the component of the channel response due to the direct path, and z_i is colored noise, given by the sum of w_i and the component of the signal due to reverberation. As an example, let us consider the j-th pair of microphones. A possible representation of the two channel matrices is


$$H_{d_{j_1}} = \begin{bmatrix}
h_{j_1}[0] & 0 & \cdots & 0 \\
0 & h_{j_1}[0] & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & h_{j_1}[0] \\
\vdots & & & \vdots \\
0 & \cdots & \cdots & 0
\end{bmatrix}, \qquad
H_{d_{j_2}} = \begin{bmatrix}
0 & \cdots & \cdots & 0 \\
\vdots & & & \vdots \\
h_{j_2}[m] & 0 & \cdots & 0 \\
0 & h_{j_2}[m] & \ddots & \vdots \\
\vdots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & h_{j_2}[m]
\end{bmatrix} \tag{19}$$

i.e., H_{d_{j_1}} carries the direct-path tap h_{j_1}[0] on its main diagonal, while H_{d_{j_2}} carries the direct-path tap h_{j_2}[m] on its m-th subdiagonal, reflecting a relative delay of m samples.

In this specific case we have that

$$y_{j_1}[n]\, y_{j_2}[n + \tau] \le y_{j_1}[n]\, y_{j_2}[n + m] = h_{j_1}[0]\, h_{j_2}[m]\, |x[n]|^2 + z_{j_1}[n]\, z_{j_2}[n + m] \tag{20}$$

More generally, the TDOA for a pair of sensors can be directly computed as the index that maximizes the correlation of the two received data sequences. Thus, the TDOA can be formally expressed in the following way:

$$\hat{\tau}_j = \arg\max_{m} \sum_{k=0}^{N} y_{j_1}[k]\, y_{j_2}[k + m] \tag{21}$$

The advantage of this approach compared to ML estimation is that the GCC method is computationally less demanding. On the other hand, the noise is now correlated and, in the presence of multipath, the global maximum of (21) may not correspond to the true delay. In fact, reverberation causes spurious peaks which may have greater amplitude than the peak due to the true source, so that choosing the maximum peak may not give accurate results. The frequency of these spurious peaks increases dramatically as the reverberation time becomes larger than 200 ms. Also, the method cannot accommodate multi-source scenarios. Although suboptimal, the method performs well in rooms with low reverberation, and its simplicity suggests its use in real-time applications.
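A minimal Matlab sketch of the GCC/PHAT computation (16)-(17) for one pair of frames, including the conversion (5) to a path-length difference; the frame names, padding length, and lag convention are our assumptions:

% Minimal GCC-PHAT sketch for one pair; y1, y2 are equal-length frames
% sampled at rate fc (names and framing are illustrative assumptions).
Nc = 2*length(y1);                        % zero padding avoids circular wrap-around
Y1 = fft(y1, Nc);
Y2 = fft(y2, Nc);
G  = Y1 .* conj(Y2);                      % cross-spectrum estimate
R  = real(ifft(G ./ (abs(G) + eps)));     % PHAT weighting, eq. (17)
R  = fftshift(R);                         % move zero lag to the center
[~, idx] = max(R);
tau_hat = idx - Nc/2 - 1;                 % TDOA in samples
D_hat   = tau_hat * 345 / fc;             % path-length difference, eq. (5)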


5 Implementation

5.1 The synchronization problem

The GCC/PHAT method (17) has been implemented in an acoustic tracking application. This method assumes that each pair of sensors in the network performs synchronous sensing. By synchronous sensing we mean that the audio signals must be sampled synchronously and the sampled data must be shared within each pair to perform the signal processing. On the other hand, different pairs need not be synchronized, and sampled data can be collected with time tags at a central node; thus, the synchronization requirement is limited to each pair. Synchronous sensing implies limited timing errors: when sampling at rates higher than 8 kHz, the timing error cannot exceed a few microseconds. Besides clock synchronization of the different nodes, perfect synchronous sampling requires controlling the latencies of the sampling system. However, achieving clock synchronization on the order of microseconds in a distributed sensor network is a challenging task, and many audio cards have large and non-deterministic latencies when asked to start recording. In [3], clock synchronization is achieved by implementing the RBS algorithm on Linux machines, and the accuracy of the sampling subsystem is controlled using an "audio server". RBS is a protocol for high-precision synchronization which, on the other hand, demands a significant communication load. In our implementation, we avoided the synchronization problem by attaching two microphones to the same board and sampling the received signal through a common processor. This choice also solves the problem of the accuracy of the sampling subsystem, since the non-deterministic latency of the audio card is equal for the two microphones. The resulting network topology is drawn in Figure 1, where each pair of microphones is connected to a processing board, and the boards are wirelessly linked to a central node.



Figure 1: Network structure

5.2 Hardware and software requirements

The hardware system consists of eight microphones, five laptops, and one video camera. The number of microphones is arbitrary, as long as it is a multiple of two. The microphones are divided into pairs, and each pair is connected to a laptop. Since the microphone input line is mono, the microphones must be connected to the laptop through the line-in. However, the line-in is unpowered, so the microphones must be powered with a battery. We tested two types of microphones: the Radio Shack Omnidirectional Boundary Microphone and the Sony ECM-F8. We used five IBM ThinkPads running Windows XP, and the camera we tested is a Sony EVI-D71. The microphones cannot be plugged into the line-in through a simple mini-to-mini Y adaptor. Instead, each microphone plugs into a mini-to-RCA connector, and the two female RCA ends fit into an RCA-to-mini Y adaptor. We used standard connectors and adaptors available in stores. The remaining laptop is used as the information sink and to send commands to the camera.


In the following, we will refer to the central laptop as the server and to the laptops connected to the microphones as the clients. Clients and server communicate through a TCP connection, so the

laptops must be connected via Wireless LAN or Ethernet. We connected the laptops through

an 802.11a access point. The operating system used in each laptop is Windows XP, and Matlab

(with Data Acquisition Toolbox) must be installed on each machine. The package we developed

consists of the following files:

• server.m: the main application run at the server. Based on the distance estimates, it performs the LS algorithm for estimating the source location (6) and sends commands to the camera.

• client.m: the main application run at the client. It estimates the source distance through the PHAT method (17) and sends the information to the server.

• myfun.m and LMS.m: code run at the server while performing the LS algorithm (6).

• detect_voice_mex.dll: silence/speech detector run at the client.

• MCreateFigure.m and Updater.m: they create and update the main figure displayed at

the server.

• pnet.m, pnet_getvar.m, pnet_putvar.m, pnet_remote.m: they manage the TCP/IP connection (a usage sketch follows this list).

• camcmd.exe and Devserv.exe: these applications send commands to the camera through UDP packets.
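As an example of how a client might ship an estimate to the server with the bundled TCP helpers (a hypothetical sketch: the host name and port are placeholders, and pnet.m documents the exact calling conventions):

% Hypothetical client-side sketch using the bundled pnet utilities;
% 'server-host' and port 3000 are placeholder values.
con = pnet('tcpconnect', 'server-host', 3000);   % open a TCP connection
if con >= 0
    pnet_putvar(con, D_hat);                     % send the distance estimate
    pnet(con, 'close');
end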

5.3 Setup

The core of the code consists of two files, server.m and client.m. Before setting up the parameters in these files, we need to define a Cartesian coordinate system for the room, then fix the camera and microphone locations and measure these positions with respect to the chosen coordinates. In this demo it is assumed that the server knows exactly the positions of the microphones and the camera, so the coordinates must be stored in server.m. Then, we need to connect server and clients in the same sub-network. Each client should be able to ping the server, and the port chosen for the TCP application should not conflict with other applications. Next, we need to configure the operating system of each client to use Line-In for recording; since we are using Windows, this can be done through the Volume Control panel. Finally, some specific parameters may need to be set in server.m and client.m; refer to the code documentation for details. The video camera must be connected to the server.

5.4 Running the demo

Once server.m is run, a window is shown on the server screen (an example is shown in Figure 2). The figure shows the geometry of the room and the relative positions of the microphones. A green circle represents the final estimate of the acoustic source location; the size of the circle is proportional to the estimation error, i.e., the smaller the dot the more accurate the estimate. After that, the server waits for connections from the clients. Every time a packet is received from a client, the distance estimates are updated with the new information stored in the packet. If at least one distance estimate has changed, the LS algorithm (6) is run to re-estimate the source location. When the LS algorithm produces an estimate with a sufficiently small error, the source location is updated on the figure and a command is sent to steer the camera. The user can stop the server by clicking on the 'Stop' button in the figure. Once client.m is run, Matlab samples the sound signal received at the two microphones and stores a vector of samples corresponding to one second of sound. The two data vectors are sampled synchronously and ready to be processed. A simple silence/speech detector analyses


the sampled data to identify frames of voice. If the presence of voice is detected, the GCC/PHAT algorithm is run to provide an estimate of the source distance. If the information is reliable, i.e., the estimated path difference is not larger than the distance between the two microphones, the estimate is sent to the server through the TCP connection. After that, Matlab stores a new sound sample to process, until a timer set by the user expires.

Figure 2: Main figure displayed at the server
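A minimal sketch of the client's acquisition step described above, using the Data Acquisition Toolbox winsound interface; the sampling rate and one-second block length are illustrative assumptions (client.m contains the settings actually used):

% Minimal sketch of the client acquisition step; fs and the block
% length are illustrative assumptions.
fs = 8000;                                            % sampling rate in Hz
ai = analoginput('winsound');                         % stereo line-in device
addchannel(ai, 1:2);                                  % one channel per microphone
set(ai, 'SampleRate', fs, 'SamplesPerTrigger', fs);   % one second of sound
start(ai);
data = getdata(ai);                                   % fs x 2 matrix, synchronously sampled
y1 = data(:, 1);  y2 = data(:, 2);                    % one column per microphone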

6 Conclusion and future work

In this project, we implemented a Windows-based wireless sensor network testbed for acoustic localization. The testbed has been used for a camera steering application. Future work includes the implementation of the algorithm on ZigBee boards; the major challenge there is the design of an efficient synchronization protocol with minimal communication load. We also aim to use sensor networks for speech enhancement applications, e.g., acoustic beamforming and noise suppression.

References

[1] J. C. Chen, K. Yao, and R. E. Hudson, "Source localization and beamforming," IEEE Signal Processing Magazine, vol. 19, no. 2, pp. 30-39, March 2002.

[2] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.

[3] J. C. Chen, R. E. Hudson, and K. Yao, "Maximum likelihood source localization and unknown sensor location estimation for wideband signals in the near-field," IEEE Transactions on Signal Processing, vol. 50, pp. 1843-1853, Aug. 2002.

[4] S. Hirsch, Acoustic Tracker documentation, www.mathworks.com.
