

GLIMPSING INDEPENDENT VECTOR ANALYSIS: SEPARATING MORE SOURCES THAN SENSORS USING ACTIVE AND INACTIVE STATES

Alireza Masnadi-Shirazi, Wenyi Zhang and Bhaskar D. Rao

Department of Electrical and Computer Engineering, University of California, San Diego
{amasnadi, w3zhang, brao}@ucsd.edu

ABSTRACT

In this paper, we explore the problem of separating convolutedly mixed signals in the overcomplete (degenerate) case of having more sources than sensors. We exploit a common form of nonstationarity, especially present in speech, wherein the signals have intermittent silence periods, hence varying the set of active sources with time. A novel approach is proposed that takes advantage of the different combinations of silence gaps in the source signals at each time period. This enables the algorithm to "glimpse" or listen in the gaps, compensating for the global degeneracy by allowing it to learn the mixing matrices during periods where the mixture is locally less degenerate. Experiments using simulated and real room recordings were carried out, yielding good separation results.

Index Terms— Overcomplete systems, independent component analysis, convolutive mixtures, blind source separation

1. INTRODUCTION

Frequency-domain blind source separation (BSS) methods have been extensively studied to separate convolutedly mixed signals. By computing the short-time Fourier transform (STFT), convolution in the time domain translates to linear mixing in the frequency domain, enabling the use of independent component analysis (ICA) on each frequency bin. However, because ICA is indeterminate with respect to permutation, further post-processing methods have to be used to resolve the permutation problem. Independent vector analysis (IVA) is a method that avoids such permutation issues by utilizing the inner dependencies between the frequency bins. IVA models each individual source as a dependent multivariate symmetric distribution while still maintaining the fundamental assumption of BSS that each source is independent of the others. Therefore, one can say that IVA is the multi-dimensional extension of ICA [1].

For standard ICA-based methods, when the number of sources M becomes greater than the number of sensors L (M > L), estimating the mixing matrix and the sources is no longer straightforward. Various methods with different underlying assumptions have been proposed to deal with overcompleteness (degeneracy) in ICA linear mixing. One method uses a maximum likelihood approximation framework for learning the overcomplete mixing matrix and a Laplacian prior for inferring the sources [2]. Other methods incorporate geometric/probabilistic clustering approaches while relying heavily on sparsity assumptions (at each time instant mainly one source is active) [3, 4, 5]. All such methods, however, do not take into consideration the temporal dynamic structure of the signals.

(This research was supported by UC MICRO grants 07-034 and 08-65, sponsored by Qualcomm Inc.)

Most signals of interest in BSS, like speech, music and EEG, are nonstationary. One common type of nonstationarity, especially present in speech, is that the signals can have intermittent silence periods, hence varying the set of active sources with time. Such a feature can be used to deal with degenerate BSS. As the set of active sources for each time period decreases, the degree of degeneracy (M − L) decreases locally. Hence, by exploiting silence gaps, one is actually compensating for the global degeneracy by making use of segments where the mixture is locally less degenerate. An approach to model active and inactive intervals for the overcomplete case of instantaneous linear mixing has been proposed. This method models the sources as a mixture of two Gaussians with zero means and unknown variances, similar to independent factor analysis (IFA) [6], and incorporates a Markov model on a hidden variable that controls the state of activity or inactivity of each source. A complicated and inefficient three-layered hidden-variable estimation algorithm (one layer for the Markov state of activity and two as in normal IFA) based on variational Bayes is implemented [7]. Extending this to IVA for convolutive mixtures proves to be even more complicated. In our previous work we proposed a simple and efficient algorithm to model the states of activity and inactivity in the presence of noise for the non-degenerate case of convolutive mixing using a simple mixture model [8]. Unlike the method in [7], where the on/off states were embedded in the sources themselves, they were modeled more naturally as controllers turning on and off the columns of the mixing matrices. In this paper we build upon our previous work to handle the degenerate case in convolutive mixing. Moreover, various studies have confirmed that human listeners use similar strategies of exploiting silence gaps by "glimpsing" or listening in the gaps to identify target speech in adverse conditions of multiple competing speakers (see §4.3.2 of [9]). Consequently, we name our algorithm "glimpsing IVA".

2. GENERATIVE MODEL

In this section we define the generative model after it has been transformed into the frequency domain. We assume there are L observations and M sources, where M > L. After taking the STFT (with d frequency bins) of the convolutedly mixed signals corrupted with white Gaussian noise, the observations follow a linear mixture in the frequency domain described as

$$Y^{(1:d)}(n) = H^{(1:d)} S^{(1:d)}(n) + W^{(1:d)}(n) \qquad (1)$$

where $Y^{(1:d)} = \big[Y^{(1)}_1 \ldots Y^{(1)}_L \,\big|\, \ldots \,\big|\, Y^{(d)}_1 \ldots Y^{(d)}_L\big]^T$, $S^{(1:d)} = \big[S^{(1)}_1 \ldots S^{(1)}_M \,\big|\, \ldots \,\big|\, S^{(d)}_1 \ldots S^{(d)}_M\big]^T$, $W^{(1:d)} = \big[W^{(1)}_1 \ldots W^{(1)}_L \,\big|\, \ldots \,\big|\, W^{(d)}_1 \ldots W^{(d)}_L\big]^T$, and $H^{(1:d)} = \mathrm{blkdiag}\big(H^{(1)}, \ldots, H^{(d)}\big)$. $H^{(k)}$ is the $L \times M$ mixing matrix for the $k$th frequency bin. Since the noise is white, it has the same energy in all frequency bins. Hence the covariance of the noise can be written as $\Sigma_W = \mathrm{blkdiag}(\sigma_W, \ldots, \sigma_W)$ (one block per bin), where $\sigma_W = \mathrm{diag}(\sigma_{w_1}, \ldots, \sigma_{w_L})$.
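To make the stacked notation in (1) concrete, the following minimal numpy sketch (not the authors' code) generates synthetic frequency-domain observations; the dimensions L, M, d, N and all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, d, N = 2, 3, 4, 1000      # sensors, sources, frequency bins, STFT frames (illustrative)

# One L x M complex mixing matrix per frequency bin: H[k] plays the role of H^(k)
H = rng.standard_normal((d, L, M)) + 1j * rng.standard_normal((d, L, M))

# Complex source and white-noise STFT coefficients per bin and frame
S = rng.standard_normal((d, M, N)) + 1j * rng.standard_normal((d, M, N))
sigma_w = 0.1 * np.ones(L)      # per-sensor noise std, identical across bins (white noise)
W = sigma_w[None, :, None] * (rng.standard_normal((d, L, N))
                              + 1j * rng.standard_normal((d, L, N)))

# Y^(k)(n) = H^(k) S^(k)(n) + W^(k)(n) for every bin k, i.e. the stacked model (1)
Y = np.einsum('klm,kmn->kln', H, S) + W
print(Y.shape)                  # (d, L, N)
```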

Each source is modeled as a multivariate GMM with $C$ mixture components. Based on independence, the joint density of the sources is the product of the marginal densities. Hence, we have

$$P_S\big(S^{(1:d)}\big) = \prod_{j=1}^{M} \sum_{c_j=1}^{C} \alpha_{jc_j}\, G\Big(S^{(1:d)}_j,\, 0,\, \mathrm{diag}\big(\sigma^{(1)}_{jc_j}, \ldots, \sigma^{(d)}_{jc_j}\big)\Big) = \sum_{q=1}^{C^M} w_q\, G\big(S^{(1:d)},\, 0,\, V_q\big) \qquad (2)$$

where $\sum_{q=1}^{C^M} = \sum_{c_1=1}^{C} \cdots \sum_{c_M=1}^{C}$, $w_q = \prod_{j=1}^{M} \alpha_{jc_j}$, $V_q = \mathrm{blkdiag}\big(v^{(1)}_q, \ldots, v^{(d)}_q\big)$ and $v^{(k)}_q = \mathrm{diag}\big(\sigma^{(k)}_{1c_1}, \ldots, \sigma^{(k)}_{Mc_M}\big)$. The parameters $\alpha_{jc_j}$ and $\sigma^{(k)}_{jc_j}$ for $k = 1, \ldots, d$ and $j = 1, \ldots, M$ are fixed beforehand, corresponding to a Gaussian mixture model (GMM) with zero means and varying variances (Gaussian scale mixtures), hence having the shape of a symmetric multivariate super-Gaussian density. Since each source can take on two states, either active or inactive, for $M$ sources there will be a total of $2^M$ states. As a convention throughout this paper we encode the states by a circled number between 1 and $I = 2^M$. These states are the same for all frequency bins and indicate which column vectors of the mixing matrix are present or absent.
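As an illustration of the per-source prior in (2), the sketch below evaluates the log-density of one source under a zero-mean Gaussian scale mixture, assuming circularly symmetric complex Gaussian components with diagonal covariance; the function name and the placeholder values of alpha and sigma are hypothetical and are not the parameters fitted in Section 4.

```python
import numpy as np

def source_log_density(s, alpha, sigma):
    """Log of the per-source prior in (2): a zero-mean GMM over the d-dim
    vector of complex STFT coefficients of one source.
    s     : (d,) complex source vector S_j^(1:d)
    alpha : (C,) mixture weights alpha_{j c}
    sigma : (C, d) per-component, per-bin variances sigma_{j c}^(k)
    """
    # Circular complex Gaussian log-density for each component c (assumption)
    log_comp = (np.log(alpha)
                - np.sum(np.log(np.pi * sigma), axis=1)
                - np.sum(np.abs(s) ** 2 / sigma, axis=1))
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())   # log-sum-exp over components

# Example: C = 3 scale components for a d = 4 bin source (placeholder values)
alpha = np.array([0.6, 0.3, 0.1])
sigma = np.array([[0.1] * 4, [1.0] * 4, [10.0] * 4])
print(source_log_density(np.zeros(4, dtype=complex), alpha, sigma))
```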

In the degenerate case, since the distribution of the data in the sensor space is lower in dimension than the source space, data points belonging to different states can overlap. To compensate for this, a Markovian state structure is incorporated using a hidden Markov model (HMM), allowing the states to be estimated more accurately than with the state mixture model in [8]. To ensure smooth transitions between the states, a non-ergodic HMM is used which assumes that at each new time instant only one source can appear or disappear. The HMM transition diagram is depicted in Fig. 1 for $M = 3$.
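The state space and the non-ergodic transition constraint can be made explicit with a short sketch: it enumerates the $I = 2^M$ binary activity patterns and marks a transition as allowed only when at most one source flips, as in Fig. 1. The uniform initialization of the transition matrix is an assumption for illustration; in the algorithm the transition probabilities are re-estimated by EM.

```python
import numpy as np
from itertools import product

M = 3
# Each state is a binary activity pattern over the M sources; I = 2^M states
# (the paper numbers them 1..2^M; here 0-based indexing is used for illustration)
states = list(product([0, 1], repeat=M))          # (0,0,0), (1,0,0), ...

I = len(states)
allowed = np.zeros((I, I), dtype=bool)
for i, si in enumerate(states):
    for j, sj in enumerate(states):
        # stay in the same state, or flip exactly one source on/off
        allowed[i, j] = sum(a != b for a, b in zip(si, sj)) <= 1

# Illustrative initialization: for each previous state j, spread probability
# uniformly over the allowed next states i, so that a_ij = P(x(n)=i | x(n-1)=j)
A_init = allowed / allowed.sum(axis=0, keepdims=True)
print(states)
print(A_init.round(2))
```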

Let the source indices form a set $\Omega = \{1, \ldots, M\}$; then any subset of $\Omega$ can correspond to a set of active source indices. For state $i$, we denote the subset of active indices, in ascending order, by $\Omega_i = \{\Omega_i(1), \ldots, \Omega_i(M_i)\} \subseteq \Omega$, where $M_i \le M$ is the cardinality of $\Omega_i$. It can easily be shown that the observation density function for state $i$ is

$$P_i\big(Y^{(1:d)}(n)\big) = \sum_{q_i} w_{q_i}\, G\big(Y^{(1:d)}(n),\, 0,\, A^{(1:d)}_{q_i}\big) \qquad (3)$$

where $A^{(1:d)}_{q_i} = \Sigma_W + H^{(1:d)}_i V_{q_i} H^{(1:d)H}_i$, $\sum_{q_i} = \sum_{c_{\Omega_i(1)}=1}^{C} \cdots \sum_{c_{\Omega_i(M_i)}=1}^{C}$, $w_{q_i} = \prod_{j=1}^{M_i} \alpha_{\Omega_i(j)\, c_{\Omega_i(j)}}$, $H^{(1:d)}_i = \mathrm{blkdiag}\big(H^{(1)}_i, \ldots, H^{(d)}_i\big)$ with $H^{(k)}_i = \big[H^{(k)}_{\Omega_i(1)} \ldots H^{(k)}_{\Omega_i(M_i)}\big]$ being the submatrix of the full mixing matrix containing only the $\Omega_i(1)$th to $\Omega_i(M_i)$th columns, and $V_{q_i} = \mathrm{blkdiag}\big(v^{(1)}_{q_i}, \ldots, v^{(d)}_{q_i}\big)$ with $v^{(k)}_{q_i} = \mathrm{diag}\big(\sigma^{(k)}_{\Omega_i(1)\, c_{\Omega_i(1)}}, \ldots, \sigma^{(k)}_{\Omega_i(M_i)\, c_{\Omega_i(M_i)}}\big)$.

When all the sources are active, the observation density in (3) uses the full mixing matrix, and when none of the sources are active, the observation density reduces to that of white Gaussian noise.
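A hedged sketch of how the state-conditional density (3) could be evaluated is given below. It exploits the block-diagonal structure of $A^{(1:d)}_{q_i}$, so the Gaussian factorizes over frequency bins, and sums over all $C^{M_i}$ component combinations. Circularly symmetric complex Gaussians, 0-based indexing, and all names are assumptions of this example, not a statement of the authors' implementation.

```python
import numpy as np
from itertools import product

def state_log_likelihood(Y_n, H, Sigma_W, active, alpha, sigma):
    """Sketch of the state-conditional density in (3) for one frame.
    Y_n    : (d, L) complex observation, row k = Y^(k)(n)
    H      : (d, L, M) full mixing matrices H^(k)
    Sigma_W: (L, L) noise covariance sigma_W (same in every bin)
    active : list of active source indices Omega_i (0-based here)
    alpha  : (M, C) component weights; sigma: (M, C, d) component variances
    """
    d, L, _ = H.shape
    C = alpha.shape[1]
    log_terms = []
    # Sum over all C^{M_i} combinations q_i of per-source mixture components
    for combo in product(range(C), repeat=len(active)):
        log_w = sum(np.log(alpha[j, c]) for j, c in zip(active, combo))
        log_g = 0.0
        for k in range(d):
            Hk = H[k][:, active]                                  # active columns of H^(k)
            vk = np.diag([sigma[j, c, k] for j, c in zip(active, combo)])
            A = Sigma_W + Hk @ vk @ Hk.conj().T                   # A_{q_i}^(k)
            sign, logdet = np.linalg.slogdet(A)
            # circular complex Gaussian log-density of Y^(k)(n) (assumption)
            log_g += (-L * np.log(np.pi) - logdet
                      - np.real(Y_n[k].conj() @ np.linalg.solve(A, Y_n[k])))
        log_terms.append(log_w + log_g)
    log_terms = np.array(log_terms)
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())
```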

3. PARAMETER ESTIMATION AND SOURCE RECONSTRUCTION

Expectation-Maximization (EM) is used to learn the HMM initial probabilities $\pi_i$, the HMM transition probabilities $a_{ij} = P\big(x(n) = i \mid x(n-1) = j\big)$, the mixing matrices, and the noise covariance. The E-step consists of finding the probability $\gamma_n(i) = P\big(x(n) = i \mid Y^{(1:d)}(1), \ldots, Y^{(1:d)}(N)\big)$ from the forward/backward probabilities $\alpha_n(i) = P\big(Y^{(1:d)}(1), \ldots, Y^{(1:d)}(n),\, x(n) = i\big)$ and $\beta_n(i) = P\big(Y^{(1:d)}(n+1), \ldots, Y^{(1:d)}(N) \mid x(n) = i\big)$.


Fig. 1. State transition diagram for M = 3, assuming that only one source can appear or disappear at a time.

The posterior is obtained from the relation $\gamma_n(i) = \alpha_n(i)\beta_n(i) \big/ \sum_{j=1}^{I} \alpha_n(j)\beta_n(j)$ and the forward/backward recursions

$$\alpha_n(i) = P_i\big(Y^{(1:d)}(n)\big) \sum_{j=1}^{I} a_{ij}\, \alpha_{n-1}(j) \qquad (4)$$

$$\beta_n(i) = \sum_{j=1}^{I} P_j\big(Y^{(1:d)}(n+1)\big)\, a_{ji}\, \beta_{n+1}(j) \qquad (5)$$
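The E-step recursions (4)-(5) and the posterior $\gamma_n(i)$ can be implemented as a standard scaled forward-backward pass. The sketch below is a generic illustration under the paper's convention $a_{ij} = P(x(n)=i \mid x(n-1)=j)$; the per-frame scaling to avoid numerical underflow is an implementation detail assumed here, not discussed in the paper, and it leaves $\gamma$ unchanged.

```python
import numpy as np

def forward_backward(log_lik, A, pi):
    """Scaled forward-backward pass.
    log_lik: (N, I) log P_i(Y^(1:d)(n)) for each frame n and state i
    A      : (I, I) transition matrix, A[i, j] = a_ij = P(x(n)=i | x(n-1)=j)
    pi     : (I,) initial state probabilities
    Returns the state posteriors gamma, shape (N, I).
    """
    N, I = log_lik.shape
    lik = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))  # per-frame rescaled likelihoods
    alpha = np.zeros((N, I))
    beta = np.ones((N, I))
    c = np.zeros(N)                           # per-frame scaling factors

    alpha[0] = pi * lik[0]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for n in range(1, N):                     # forward recursion, cf. (4)
        alpha[n] = lik[n] * (A @ alpha[n - 1])
        c[n] = alpha[n].sum(); alpha[n] /= c[n]
    for n in range(N - 2, -1, -1):            # backward recursion, cf. (5)
        beta[n] = A.T @ (lik[n + 1] * beta[n + 1])
        beta[n] /= c[n + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```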

The M-step consists of updating the initial and transition probabilities as

$$\pi_i^{+} = \alpha_1(i)\beta_1(i) \Big/ \sum_{j=1}^{I} \alpha_1(j)\beta_1(j) \qquad (6)$$

$$a_{ij}^{+} = \sum_{n=2}^{N} a_{ij}\, \alpha_{n-1}(j)\, \beta_n(i)\, P_i\big(Y^{(1:d)}(n)\big) \Big/ \sum_{n=2}^{N} \alpha_{n-1}(j)\, \beta_{n-1}(j) \qquad (7)$$

and taking a couple of steps along the gradient of the auxiliary $Q$ function with respect to the mixing matrices and the noise covariance,

$$\nabla_{H^{(k)}} Q(\theta, \theta') = \sum_{n=1}^{N} \sum_{i=1}^{I} \gamma_n(i)\, \frac{\Big(\frac{\partial}{\partial H^{(k)}_i} P_i\big(Y^{(1:d)}(n)\big)\Big)^{*}}{P_i\big(Y^{(1:d)}(n)\big)} \qquad (8)$$

$$\nabla_{\sigma_W} Q(\theta, \theta') = \sum_{n=1}^{N} \sum_{i=1}^{I} \gamma_n(i)\, \frac{\Big(\frac{\partial}{\partial \sigma_W} P_i\big(Y^{(1:d)}(n)\big)\Big)^{*}}{P_i\big(Y^{(1:d)}(n)\big)} \qquad (9)$$

The entries in the numerators of (8) and (9) are similar to those derived in [8]. Finally, once the parameters have been estimated, the sources are reconstructed using the minimum mean squared error (MMSE) estimator of [8], with the last update of $\gamma_n(i)$ serving as the soft indicator function.
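The exact estimator is the one derived in [8]. As a rough illustration only, the sketch below shows one plausible simplified form: for each state, a Wiener estimate of the active sources is computed from the active columns of $H^{(k)}$ (collapsing the source GMM to a single effective variance per source, an assumption made here for brevity), inactive sources are set to zero, and the per-state estimates are soft-combined with the posteriors $\gamma_n(i)$. All names are hypothetical.

```python
import numpy as np

def mmse_reconstruct_bin(Y_k, H_k, Sigma_W, v_k, states, gamma_n):
    """Simplified soft MMSE combination for one bin k and frame n
    (not necessarily the exact estimator of [8]).
    Y_k     : (L,) observation in bin k at frame n
    H_k     : (L, M) full mixing matrix H^(k)
    Sigma_W : (L, L) noise covariance
    v_k     : (M,) effective source variance per source in bin k (assumption)
    states  : list of tuples of active source indices (0-based)
    gamma_n : (I,) state posteriors for this frame
    """
    L, M = H_k.shape
    S_hat = np.zeros(M, dtype=complex)
    for i, active in enumerate(states):
        if len(active) == 0:
            continue                                  # all-silent state contributes zeros
        idx = list(active)
        Hk = H_k[:, idx]
        Vk = np.diag(v_k[idx])
        A = Sigma_W + Hk @ Vk @ Hk.conj().T
        wiener = Vk @ Hk.conj().T @ np.linalg.solve(A, Y_k)   # E[S_active | Y, state i]
        S_i = np.zeros(M, dtype=complex)
        S_i[idx] = wiener
        S_hat += gamma_n[i] * S_i
    return S_hat
```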

4. COMPUTER SIMULATIONS

The GMM parameters used to model the sources were learned by fitting a multivariate GMM with 3 mixture components to a 30-min-long continuous speech recording with no prolonged silence periods, normalized to unit variance. As a preprocessing step, the Fourier coefficients for each frequency bin were whitened. Hence, some minor modifications need to be made to the gradients in the M-step to ensure that the noise covariance is scaled properly. The proposed algorithm was put to the test in both simulated and real room settings. The image method was used to simulate the convolutive mixing of a room.

We assumed a room of size 8 x 5 x 3.5 m with a reverberation time of around 200 ms. We chose the case of M = 3 speakers and L = 2 microphones placed 10 cm apart. The three sources were located 1 m away, at the same height, at angles of -40°, -5° and 20° with respect to the microphones. The data were corrupted with white Gaussian noise, with the noise level resulting in an input signal-to-noise ratio (SNR_in) of 11 dB. Fig. 2 shows the true and recovered sources along with the probability of each source being active. This figure illustrates that the algorithm is able to separate the sources while suppressing interference during the silence periods. The signal-to-disturbance ratio (SDR) is used as the performance measure. SDR is the ratio of the total signal power of the direct channels to the signal power stemming from cross interference and noise combined, therefore giving a reasonable performance measure for noisy situations. For this degenerate setting, we obtain SDR = 4.5 dB. We compare this with the SDR achieved when adding an extra microphone (M = L = 3) so that the problem is no longer degenerate. Using the method in [8] we obtain SDR = 6.6 dB. This shows that even though the former (degenerate) setting has one microphone fewer than the latter (non-degenerate) setting, its performance is comparable. Naturally, as the degree of degeneracy (M − L) increases and/or the silence periods diminish, the performance of the degenerate overcomplete case deteriorates to a greater extent with respect to its upgraded complete setting.
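Given a decomposition of each separated output into its direct, cross-interference and noise parts (obtainable, for instance, by passing each source and the noise separately through the estimated system), the SDR figure quoted above is simply a power ratio; a minimal sketch, with hypothetical argument names:

```python
import numpy as np

def sdr_db(direct, interference, noise):
    """Signal-to-disturbance ratio in dB: power of the direct components
    versus the combined power of cross interference and noise.
    direct, interference, noise: arrays holding the corresponding time-domain
    components of the separated outputs (same shape)."""
    p_direct = np.sum(np.abs(direct) ** 2)
    p_disturb = np.sum(np.abs(interference) ** 2) + np.sum(np.abs(noise) ** 2)
    return 10.0 * np.log10(p_direct / p_disturb)
```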

Finally, we recorded real data in an ordinary room setting. The sources consisted of three loudspeakers positioned on a table 1-2 m away from the two microphones. The sources were also recorded separately by one of the microphones when played one at a time, and synchronized with the original recording, in order to create a perceptual comparison reference. The separation results, which yield good perceptual separation, are presented in Fig. 3. Furthermore, the estimated state probabilities $\gamma_n(i)$ are shown next to the sources in Fig. 4. The audio files are available on our website¹.

5. CONCLUSIONS

We have proposed a novel approach to noisy overcomplete BSS for convolutive mixing that makes use of the temporal structure of the silent gaps present in many nonstationary signals, especially speech, thereby mimicking the separation strategy of the human hearing system.

¹ http://dsp.ucsd.edu/~ali/glimpsing/


Fig. 2. Experimental results of simulated room mixing using 2 microphones. Top: true sources. Middle: separated sources. Bottom: probability of source activity.

Fig. 3. Experimental results of real room mixing using 2 microphones. Top: true sources (recorded). Middle: separated sources. Bottom: probability of source activity.

Fig. 4. Estimated state probabilities (top; legend: {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}, None) along with the recorded true sources for real room mixing.

The algorithm exploits the local decrease of degeneracy during the different combinations of silent gaps of the sources, which allows it to cover all possible states, from all sources being active down to only one being active at each instant, thereby doing its best to compensate for the apparent global degeneracy. It works naturally by learning the columns of the mixing matrices in a specialized and combinatorial fashion based on the probability of being in each state. Good separation results, comparable even to non-degenerate BSS, were achieved for two convolutive mixtures of three speech signals using simulated and real recordings.

6. REFERENCES

[1] T. Kim, H. Attias, and T.-W. Lee, "Blind source separation exploiting higher-order frequency dependencies," IEEE Trans. Speech, Audio and Language Processing, vol. 15, 2007.

[2] T. Lee, M. Lewicki, M. Girolami, and T. Sejnowski, "Blind source separation of more sources than mixtures using overcomplete representations," IEEE Signal Process. Lett., vol. 6, 1999.

[3] N. Mitianoudis and T. Stathaki, "Batch and online underdetermined source separation using Laplacian mixture models," IEEE Trans. Speech, Audio and Language Processing, vol. 15, pp. 1818-1832, 2007.

[4] P. Bofill and M. Zibulevsky, "Underdetermined blind source separation using sparse representations," Signal Processing, vol. 81, pp. 2353-2362, 2001.

[5] P. D. O'Grady and B. A. Pearlmutter, "Soft-LOST: EM on a mixture of oriented lines," in Proc. ICA, 2004, pp. 430-436.

[6] H. Attias, "Independent factor analysis," Neural Computation, vol. 11, 1999.

[7] J.-I. Hirayama, S.-I. Maeda, and S. Ishii, "Markov and semi-Markov switching of source appearances for nonstationary independent component analysis," IEEE Trans. Neural Networks, vol. 18, 2007.

[8] A. Masnadi-Shirazi and B. Rao, "Independent vector analysis incorporating active and inactive states," in Proc. IEEE ICASSP, 2009.

[9] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007.
