
LONG-TERM REVERBERATION MODELING FOR UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH APPLICATION TO VOCAL MELODY EXTRACTION

Romain Hennequin
Deezer R&D
10 rue d’Athènes, 75009 Paris, France
[email protected]

François Rigaud
Audionamix R&D
171 quai de Valmy, 75010 Paris, France
[email protected]

ABSTRACT

In this paper, we present a way to model long-term reverberation effects in under-determined source separation algorithms based on a non-negative decomposition framework. A general model for sources affected by reverberation is introduced, and update rules for the estimation of its parameters are presented. Combined with a well-known source-filter model for singing voice, an application to the extraction of reverberated vocal tracks from polyphonic music signals is proposed. Finally, an objective evaluation of this application is described. Performance improvements are obtained compared to the same model without reverberation modeling, in particular by significantly reducing the amount of interference between sources.

1. INTRODUCTION

Under-determined audio source separation has been a key topic in audio signal processing for the last two decades. It consists in isolating different meaningful ‘parts’ of the sound, such as, for instance, the lead vocal from the accompaniment in a song, or the dialog from the background music and effects in a movie soundtrack. Non-negative decompositions such as Non-negative Matrix Factorization [5] and its derivatives have been very popular in this research area over the last decade and have achieved state-of-the-art performance [3, 9, 12].

In music recordings, the vocal track generally contains reverberation that is either naturally present due to the recording environment or artificially added during the mixing process. For source separation algorithms, the effects of reverberation are usually not explicitly modeled and thus not properly extracted with the corresponding sources. Some studies [1, 2] introduce a model for the effect of spatial diffusion caused by reverberation in a multichannel source separation application. In [7], a model for the dereverberation of spectrograms is presented for the case of long reverberations, i.e. when the reverberation time is longer than the length of the analysis window.

© Romain Hennequin, François Rigaud. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Romain Hennequin, François Rigaud. “Long-term reverberation modeling for under-determined audio source separation with application to vocal melody extraction.”, 17th International Society for Music Information Retrieval Conference, 2016. This work was done while the first author was working at Audionamix.

We propose in this paper to extend the reverberation model proposed in [7] to a source separation application that allows extracting the reverberation of a specific source together with its dry signal. The reverberation model is first introduced in a general framework in which no assumption is made about the spectrogram of the dry sources. At this stage, and as often in demixing applications, the estimation problem is ill-posed (optimization of a non-convex cost function with local minima, results highly dependent on the initialization, etc.) and requires the incorporation of some knowledge about the source signals. In [7], this issue is dealt with using a sparsity prior on the unreverberated spectrogram model. Alternatively, the source spectrograms can be structured by using non-negative decomposition models with constraints (e.g. harmonic structure of the source’s tones, sparsity of the activations) and/or by guiding the estimation process with prior information (e.g. source activation, multi-pitch transcription). Thus, in this paper, we propose to combine the generic reverberation model with a well-known source/filter model of singing voice [3]. A modified version of the original voice extraction algorithm is described and evaluated on an application to the extraction of reverberated vocal melodies from polyphonic music signals.

Note that, unlike usual applications of reverberation modeling, we do not aim at extracting dereverberated sources; rather, we try to accurately extract both the dry signal and the reverberation within the same track. Thus, the designation source separation is not completely in accordance with our application, which more precisely targets stem separation.

The rest of the paper is organized as follows: Section 2 presents the general model for a reverberated source and Section 3 introduces the update rule used for its estimation. In Section 4, a practical implementation in which the reverberation model is combined with a source/filter model is presented. Then, Section 5 presents experimental results that demonstrate the ability of our algorithm to properly extract vocals affected by reverberation. Finally, conclusions are drawn in Section 6.



Notation

• Matrices are denoted by bold capital letters: M. The coefficient at row f and column t of matrix M is denoted by M_{f,t}.

• Vectors are denoted by bold lower-case letters: v.

• Matrix or vector sizes are denoted by capital letters: T, whereas indexes are denoted by lower-case letters: t.

• Scalars are denoted by italic lower-case letters: s.

• ⊙ stands for element-wise matrix multiplication (Hadamard product) and M^{⊙λ} stands for element-wise exponentiation of matrix M with exponent λ.
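As a concrete illustration of this notation, the Hadamard product and element-wise exponentiation map directly onto NumPy's default array operators (a minimal sketch; the matrices below are arbitrary examples, not values from the paper):

```python
import numpy as np

# Two small non-negative matrices standing in for M1 and M2.
M1 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
M2 = np.array([[5.0, 6.0],
               [7.0, 8.0]])

hadamard = M1 * M2   # element-wise (Hadamard) product M1 ⊙ M2
powered = M1 ** 2.0  # element-wise exponentiation M1^{⊙λ} with λ = 2
```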

2. GENERAL MODEL

For the sake of clarity, we will present the signal model for mono signals only, although it can easily be generalized to multichannel signals as in [6]. In the experimental Section 5, a stereo signal model is actually used.

2.1 Non-negative Decomposition

Most source separation algorithms based on a non-negative decomposition assume that the non-negative mixture spectrogram V (usually the modulus or the squared modulus of a time-frequency representation such as the Short-Time Fourier Transform (STFT)), which is an F × T non-negative matrix, can be approximated as the sum of K source model spectrograms V̂_k, which are also non-negative:

$$\mathbf{V} \approx \hat{\mathbf{V}} = \sum_{k=1}^{K} \hat{\mathbf{V}}_k \qquad (1)$$

Various structured matrix decompositions have been proposed for the source models V̂_k, such as, to name a few, standard NMF [8], source/filter modeling [3] or harmonic source modeling [10].

2.2 Reverberation Model

In the time domain, a time-invariant reverberation can be accurately modeled as a convolution with a filter and thus written as:

$$y = h * x, \qquad (2)$$

where x is the dry signal, h is the impulse response of the reverberation filter and y is the reverberated signal.

For short-term convolution, this expression can be approximated by a multiplication in the frequency domain, as proposed in [6]:

$$\mathbf{y}_t = \mathbf{h} \odot \mathbf{x}_t, \qquad (3)$$

where x_t (respectively y_t) is the modulus of the t-th frame of the STFT of x (respectively y), and h is the modulus of the Fourier transform of h.

For long-term convolution, this approximation does not hold. The support of typical reverberation filters is generally greater than half a second, which is far too long for an STFT analysis window in this kind of application. In this case, as suggested in [7], we can use another approximation, which is a convolution in each frequency channel:

$$\mathbf{y}^f = \mathbf{h}^f * \mathbf{x}^f, \qquad (4)$$

where y^f, h^f and x^f are the f-th frequency channels of the STFT of y, h and x, respectively.

Then, starting from a dry spectrogram model V̂^{dry,k} of a source with index k, the reverberated model of the same source is obtained using the following non-negative approximation:

$$\hat{V}^{rev,k}_{f,t} = \sum_{\tau=1}^{T_k} \hat{V}^{dry,k}_{f,t-\tau+1} R^{k}_{f,\tau} \qquad (5)$$

where R^k is the F × T_k non-negative reverberation matrix of model k, to be estimated.

The model of Equation (5) makes it possible to take long-term effects of reverberation into account and generalizes short-term convolution models such as the one proposed in [6], since when T_k = 1 the model reduces to the short-term convolution approximation.
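The reverberated model of Equation (5) can be sketched in a few lines of NumPy (a sketch with our own illustrative function and variable names, not code from the paper):

```python
import numpy as np

def reverberate(V_dry, R):
    """Apply the long-term reverberation model of Eq. (5):
    V_rev[f, t] = sum_tau V_dry[f, t - tau] * R[f, tau]  (0-based tau),
    i.e. a causal convolution along time in each frequency channel."""
    F, T = V_dry.shape
    _, Tk = R.shape
    V_rev = np.zeros((F, T))
    for tau in range(Tk):  # tau = 0 here corresponds to tau = 1 in the paper
        # each lag tau adds a delayed copy of the dry spectrogram,
        # weighted per frequency channel by R[:, tau]
        V_rev[:, tau:] += V_dry[:, :T - tau] * R[:, tau:tau + 1]
    return V_rev
```

Note that with Tk = 1 the loop runs once with no delay, so the model reduces to a frame-wise multiplication by a fixed spectral envelope, i.e. the short-term approximation of Equation (3).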

3. ALGORITHM

3.1 Non-negative decomposition algorithms

The approximation of Equation (1) is generally quantified using a divergence (a measure of dissimilarity) between V and V̂, to be minimized with respect to the set of parameters Λ of all the models:

$$C(\Lambda) = D(\mathbf{V} \,|\, \hat{\mathbf{V}}(\Lambda)) \qquad (6)$$

A commonly used class of divergences is the element-wise β-divergence, which encompasses the Itakura-Saito divergence (β = 0), the Kullback-Leibler divergence (β = 1) and the squared Frobenius distance (β = 2) [4]. The global cost then writes:

$$C(\Lambda) = \sum_{f,t} d_\beta(V_{f,t} \,|\, \hat{V}_{f,t}(\Lambda)). \qquad (7)$$

The problem being non-convex, the minimization is generally performed using alternating update rules on each parameter of Λ. The update rule for a parameter Θ is commonly obtained using a heuristic that consists in decomposing the gradient of the cost function with respect to this parameter as a difference of two non-negative terms,

$$\nabla_\Theta C = P_\Theta - M_\Theta, \quad P_\Theta \geq 0, \quad M_\Theta \geq 0, \qquad (8)$$

and then updating the parameter according to:

$$\Theta \leftarrow \Theta \odot \frac{M_\Theta}{P_\Theta}. \qquad (9)$$

This kind of update rule ensures that the parameter remains non-negative. Moreover, the parameter is updated in a descent direction, or remains constant if the partial derivative is zero. In some cases (including the update rules we will present), it is possible to prove, using a Majorize-Minimization (MM) approach [4], that the multiplicative update rules actually lead to a decrease of the cost function.

Using such an approach, the update rules for a standard NMF model V̂ = WH can be expressed as:

$$\mathbf{H} \leftarrow \mathbf{H} \odot \frac{\mathbf{W}^T\left(\hat{\mathbf{V}}^{\odot\beta-2} \odot \mathbf{V}\right)}{\mathbf{W}^T \hat{\mathbf{V}}^{\odot\beta-1}}, \qquad (10)$$

$$\mathbf{W} \leftarrow \mathbf{W} \odot \frac{\left(\hat{\mathbf{V}}^{\odot\beta-2} \odot \mathbf{V}\right)\mathbf{H}^T}{\hat{\mathbf{V}}^{\odot\beta-1}\mathbf{H}^T}. \qquad (11)$$
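For concreteness, these multiplicative updates for the standard NMF model can be sketched in NumPy as follows (our own illustrative code, not the authors'; a small constant `eps` guards the divisions and the initialization):

```python
import numpy as np

def nmf_beta(V, K, beta=1.0, n_iter=200, eps=1e-12, rng=None):
    """Multiplicative-update NMF V ≈ WH under the beta-divergence,
    following Eqs. (10)-(11). beta=0: Itakura-Saito, beta=1: KL,
    beta=2: squared Frobenius."""
    rng = np.random.default_rng(rng)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        Vh = W @ H
        # Eq. (10): update activations H
        H *= (W.T @ (Vh ** (beta - 2) * V)) / (W.T @ Vh ** (beta - 1) + eps)
        Vh = W @ H
        # Eq. (11): update dictionary W
        W *= ((Vh ** (beta - 2) * V) @ H.T) / (Vh ** (beta - 1) @ H.T + eps)
    return W, H
```

Both updates keep W and H non-negative by construction, since they only multiply non-negative quantities.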

3.2 Estimation of the reverberation matrix

When the dry models V̂^{dry,k} are fixed, the reverberation matrices can be estimated using the following update rule, applied successively to each reverberation matrix:

$$\mathbf{R}^k \leftarrow \mathbf{R}^k \odot \frac{\left(\hat{\mathbf{V}}^{\odot\beta-2} \odot \mathbf{V}\right) *_t \hat{\mathbf{V}}^{dry,k}}{\hat{\mathbf{V}}^{\odot\beta-1} *_t \hat{\mathbf{V}}^{dry,k}} \qquad (12)$$

where $*_t$ stands for time-convolution:

$$\left[\hat{\mathbf{V}}^{\odot\beta-1} *_t \hat{\mathbf{V}}^{dry,k}\right]_{f,t} = \sum_{\tau=t}^{T} \hat{V}^{\beta-1}_{f,\tau} \hat{V}^{dry,k}_{f,\tau-t+1}. \qquad (13)$$

The update rule (12), obtained using the procedure described in Section 3.1, ensures the non-negativity of R^k. This update can be obtained using the MM approach, which ensures that the cost function will not increase.
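A sketch of this update in NumPy (illustrative names, not the authors' code): the $*_t$ operation of Equation (13) amounts to a per-frequency-channel correlation of the operand with the dry model, evaluated at each lag of the reverberation matrix.

```python
import numpy as np

def update_R(R, V, V_hat, V_dry, beta=0.0, eps=1e-12):
    """One multiplicative update of the reverberation matrix R (Eq. (12)).
    V is the mixture spectrogram, V_hat the current full model, V_dry the
    dry source model. Eq. (13) is computed per frequency channel as
    out[f, tau] = sum_{t >= tau} A[f, t] * B[f, t - tau]  (0-based tau)."""
    F, T = V.shape
    _, Tk = R.shape

    def corr(A, B):
        # correlate each row of A with the corresponding row of B, per lag
        out = np.zeros((F, Tk))
        for tau in range(Tk):
            out[:, tau] = np.sum(A[:, tau:] * B[:, :T - tau], axis=1)
        return out

    num = corr(V_hat ** (beta - 2) * V, V_dry)
    den = corr(V_hat ** (beta - 1), V_dry) + eps
    return R * num / den
```

As a sanity check, when the model already matches the data (V_hat = V), the numerator and denominator coincide and the update leaves R essentially unchanged, consistent with the fixed-point behavior of multiplicative rules.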

3.3 Estimation with other free models

A general drawback of source separation models that do not explicitly account for reverberation effects is that the reverberation affecting a given source usually gets spread among all separated sources. However, this issue can still arise with a proper model of reverberation if the dry model of the reverberated source is not constrained enough. Indeed, using the generic reverberation model of Equation (5), the reverberation of a source can still be incorrectly captured during the optimization by other source models having more degrees of freedom. A possible solution for enforcing a correct optimization of the reverberated model is to further constrain the structure of the dry model spectrogram, e.g. through the inclusion of sparsity or pitch activation constraints, and potentially to adopt a sequential estimation scheme. For instance, a first rough estimate of the dry source may be produced while discarding the reverberation model; then, considering the reverberation model, the previously estimated dry model can be refined while estimating the reverberation matrix at the same time. Such an approach is described in the following section for a practical implementation of the algorithm on the problem of lead vocal extraction from polyphonic music signals.

4. APPLICATION TO VOICE EXTRACTION

In this section we propose an implementation of our reverberation model in a practical case: we use Durrieu’s algorithm [3] for lead vocal isolation and add the reverberation model over the voice model.

4.1 Base voice extraction algorithm

Durrieu’s algorithm for lead vocal isolation in a song is based on a source/filter model for the voice.

4.1.1 Model

The non-negative mixture spectrogram model consists in the sum of a voice spectrogram model, based on a source/filter model, and a music spectrogram model, based on a standard NMF:

$$\mathbf{V} \approx \hat{\mathbf{V}} = \hat{\mathbf{V}}^{voice} + \hat{\mathbf{V}}^{music}. \qquad (14)$$

The voice model is based on a source/filter speech production model:

$$\hat{\mathbf{V}}^{voice} = (\mathbf{W}^{F0}\mathbf{H}^{F0}) \odot (\mathbf{W}^{K}\mathbf{H}^{K}). \qquad (15)$$

The first factor (W^{F0}H^{F0}) is the source part, corresponding to the excitation of the vocal folds: W^{F0} is a matrix of fixed harmonic atoms and H^{F0} contains the activations of these atoms over time. The second factor (W^{K}H^{K}) is the filter part, corresponding to the resonance of the vocal tract: W^{K} is a matrix of smooth filter atoms and H^{K} contains the activations of these atoms over time.

The background music model is a generic NMF:

$$\hat{\mathbf{V}}^{music} = \mathbf{W}^{R}\mathbf{H}^{R}. \qquad (16)$$
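The full mixture model of Equations (14)-(16) assembles as follows in NumPy (a sketch; the dimensions and random matrices here are arbitrary placeholders, not values from the paper):

```python
import numpy as np

# Illustrative sizes: F frequency bins, T frames, NF0 pitch atoms,
# NK filter atoms, NR music components (all placeholders).
F, T, NF0, NK, NR = 513, 100, 60, 10, 20
rng = np.random.default_rng(0)

W_F0 = rng.random((F, NF0))  # fixed harmonic atoms (glottal source part)
H_F0 = rng.random((NF0, T))  # pitch activations over time
W_K = rng.random((F, NK))    # smooth filter atoms (vocal tract)
H_K = rng.random((NK, T))    # filter activations over time
W_R = rng.random((F, NR))    # music dictionary
H_R = rng.random((NR, T))    # music activations

V_voice = (W_F0 @ H_F0) * (W_K @ H_K)  # Eq. (15): source × filter
V_music = W_R @ H_R                    # Eq. (16): generic NMF
V_hat = V_voice + V_music              # Eq. (14): mixture model
```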

4.1.2 Algorithm

Matrices H^{F0}, W^{K}, H^{K}, W^{R} and H^{R} are estimated by minimizing the element-wise Itakura-Saito divergence between the original mixture power spectrogram and the mixture model:

$$C(\mathbf{H}^{F0}, \mathbf{W}^{K}, \mathbf{H}^{K}, \mathbf{W}^{R}, \mathbf{H}^{R}) = \sum_{f,t} d_{IS}(V_{f,t} \,|\, \hat{V}_{f,t}), \qquad (17)$$

where $d_{IS}(x, y) = \frac{x}{y} - \log\frac{x}{y} - 1$. The minimization is achieved using multiplicative update rules.
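The divergence itself is a one-liner; note that, unlike a distance, it is not symmetric in its arguments (a small illustrative helper, not from the paper):

```python
import numpy as np

def d_is(x, y):
    """Itakura-Saito divergence d_IS(x, y) = x/y - log(x/y) - 1,
    applied element-wise; zero iff x == y, positive otherwise."""
    r = np.asarray(x, dtype=float) / y
    return r - np.log(r) - 1.0
```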

The estimation is done in three steps:

1. A first round of parameter estimation is performed by iterating the multiplicative update rules.

2. The matrix H^{F0} is processed using Viterbi decoding to track the main melody, and is then thresholded so that coefficients too far from the melody are set to zero.

3. Parameters are re-estimated as in the first step, but using the thresholded version of H^{F0} for the initialization.



4.2 Inclusion of the reverberation model

As stated in Section 3, the dry spectrogram model (i.e. the spectrogram model for the source without reverberation) has to be sufficiently constrained in order to accurately estimate the reverberation part. This constraint is obtained here through the use of a fixed harmonic dictionary W^{F0} and, mostly, through the thresholding of the matrix H^{F0}, which enforces the sparsity of the activations.

We thus introduce the reverberation model after the thresholding of the matrix H^{F0}. The first two steps then remain the same as presented in Section 4.1.2. In the third step, the dry voice model of Equation (15) is replaced by a reverberated voice model following Equation (5):

$$\hat{V}^{rev.\,voice}_{f,t} = \sum_{\tau=1}^{T} \hat{V}^{voice}_{f,t-\tau+1} R_{f,\tau}. \qquad (18)$$

For the parameter re-estimation of step 3, the multiplicative update rule of R is given by Equation (12). For the other parameters of the voice model, the update rules from [3] are modified to take the reverberation model into account:

$$\mathbf{H}^{F0} \leftarrow \mathbf{H}^{F0} \odot \frac{{\mathbf{W}^{F0}}^T\left((\mathbf{W}^{K}\mathbf{H}^{K}) \odot \left(\mathbf{R} *_t (\hat{\mathbf{V}}^{\odot\beta-2} \odot \mathbf{V})\right)\right)}{{\mathbf{W}^{F0}}^T\left((\mathbf{W}^{K}\mathbf{H}^{K}) \odot \left(\mathbf{R} *_t \hat{\mathbf{V}}^{\odot\beta-1}\right)\right)} \qquad (19)$$

$$\mathbf{H}^{K} \leftarrow \mathbf{H}^{K} \odot \frac{{\mathbf{W}^{K}}^T\left((\mathbf{W}^{F0}\mathbf{H}^{F0}) \odot \left(\mathbf{R} *_t (\hat{\mathbf{V}}^{\odot\beta-2} \odot \mathbf{V})\right)\right)}{{\mathbf{W}^{K}}^T\left((\mathbf{W}^{F0}\mathbf{H}^{F0}) \odot \left(\mathbf{R} *_t \hat{\mathbf{V}}^{\odot\beta-1}\right)\right)} \qquad (20)$$

$$\mathbf{W}^{K} \leftarrow \mathbf{W}^{K} \odot \frac{\left((\mathbf{W}^{F0}\mathbf{H}^{F0}) \odot \left(\mathbf{R} *_t (\hat{\mathbf{V}}^{\odot\beta-2} \odot \mathbf{V})\right)\right){\mathbf{H}^{K}}^T}{\left((\mathbf{W}^{F0}\mathbf{H}^{F0}) \odot \left(\mathbf{R} *_t \hat{\mathbf{V}}^{\odot\beta-1}\right)\right){\mathbf{H}^{K}}^T} \qquad (21)$$

The update rules for the parameters of the music model (H^{R} and W^{R}) are unchanged and thus identical to those given in Equations (10) and (11).

5. EXPERIMENTAL RESULTS

5.1 Experimental setup

We tested the proposed reverberation model with the algorithm presented in Section 4 on a task of lead vocal extraction from songs. In order to assess the improvement of our model over the existing one, we ran the separation with and without reverberation modeling.

We used a database composed of 9 song excerpts of professionally produced music. The total duration of all excerpts was about 10 minutes. As the use of reverberation modeling only makes sense if a significant amount of reverberation is present, all the selected excerpts contain a fair amount of it. This reverberation was already present in the separated tracks and was not added artificially by ourselves. On some excerpts, the reverberation is time-variant: active on some parts and inactive on others, ducking echo effects, etc. Some short excerpts, as well as the separation results, can be played on the companion website 1.

Spectrograms were computed as the squared modulus of the STFT of the signal sampled at 44100 Hz, with a 4096-sample (92.9 ms) Hamming window and 75% overlap. The length T of the reverberation matrix was arbitrarily fixed to 52 frames (corresponding to about 1.2 s) in order to be sufficient for long reverberations.
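Under these settings, the spectrogram computation can be reproduced with SciPy (a sketch on a synthetic signal; `scipy.signal.stft` is our choice of implementation, not necessarily the authors'):

```python
import numpy as np
from scipy.signal import stft

# Paper's analysis settings: 44.1 kHz audio, 4096-sample Hamming window
# (92.9 ms), 75% overlap (hop of 1024 samples).
fs = 44100
x = np.random.randn(fs * 2)  # stand-in for a 2-second audio signal

f, t, X = stft(x, fs=fs, window="hamming", nperseg=4096, noverlap=3072)
V = np.abs(X) ** 2  # squared-modulus (power) spectrogram, shape (F, T)
```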

5.2 Results

In order to quantify the results, we use standard source separation metrics as described in [11]: Signal to Distortion Ratio (SDR), Signal to Artefact Ratio (SAR) and Signal to Interference Ratio (SIR).

The results are presented in Figure 1 for the evaluation of the extracted voice signals and in Figure 2 for the extracted music signals. The oracle performance, obtained using the actual spectrograms of the sources to compute the separation masks, is also reported. As we can see, adding the reverberation modeling increases all these metrics. The SIR is particularly increased in Figure 1 (by more than 5 dB): this is mainly because, without the reverberation model, a large part of the reverberation of the voice leaks into the music model. This phenomenon is also clearly audible in excerpts with strong reverberation: using the reverberation model, the long reverberation tail is mainly heard within the separated voice and is almost inaudible within the separated music. In return, vocals extracted with the reverberation model tend to have more audible interferences. This result is partly due to the fact that the pre-estimation of the dry model (steps 1 and 2 of the base algorithm) is not interference-free, so that applying the reverberation model increases the energy of these interferences.

Figure 1. Experimental separation results for the voice stem.

1 http://romain-hennequin.fr/En/demo/reverb_separation/reverb.html



Figure 2. Experimental separation results for the music stem.

6. CONCLUSION

In this paper we proposed a method to model long-term effects of reverberation in a source separation application for which a constrained model of the dry source is available. Future work should focus on speeding up the algorithm, since the multiple convolutions at each iteration can be time-consuming. Developing methods to estimate the reverberation duration (of a specific source within a mix) would also make it possible to automate the whole process. It could also be interesting to add spatial modeling for multichannel processing, using full-rank spatial covariance matrices and multichannel reverberation matrices.

7. REFERENCES

[1] Simon Arberet, Alexey Ozerov, Ngoc Q. K. Duong, Emmanuel Vincent, Frédéric Bimbot, and Pierre Vandergheynst. Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation. In International Conference on Information Sciences, Signal Processing and their Applications, pages 1–4, May 2010.

[2] Ngoc Q. K. Duong, Emmanuel Vincent, and Rémi Gribonval. Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Transactions on Audio, Speech, and Language Processing, 18(7):1830–1840, Sept 2010.

[3] Jean-Louis Durrieu, Gaël Richard, and Bertrand David. An iterative approach to monaural musical mixture de-soloing. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 105–108, Taipei, Taiwan, April 2009.

[4] Cédric Févotte and Jérôme Idier. Algorithms for nonnegative matrix factorization with the beta-divergence. Neural Computation, 23(9):2421–2456, September 2011.

[5] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999.

[6] Alexey Ozerov and Cédric Févotte. Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 18(3):550–563, 2010.

[7] Rita Singh, Bhiksha Raj, and Paris Smaragdis. Latent-variable decomposition based dereverberation of monaural and multi-channel signals. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Texas, USA, March 2010.

[8] Paris Smaragdis and Judith C. Brown. Non-negative matrix factorization for polyphonic music transcription. In Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177–180, New Paltz, NY, USA, October 2003.

[9] Paris Smaragdis, Bhiksha Raj, and Madhusudana Shashanka. Supervised and semi-supervised separation of sounds from single-channel mixtures. In 7th International Conference on Independent Component Analysis and Signal Separation, London, UK, September 2007.

[10] Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Transactions on Audio, Speech and Language Processing, 18(3):528–537, March 2010.

[11] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 14(4):1462–1469, July 2006.

[12] Tuomas Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1066–1074, March 2007.

Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016