
Monaural Audio Source Separation using Variational Autoencoders

Laxmi Pandey∗1, Anurendra Kumar∗1, Vinay Namboodiri1

1Indian Institute of Technology Kanpur [email protected], [email protected], [email protected]

Abstract

We introduce a monaural audio source separation framework using a latent generative model. Traditionally, discriminative training for source separation has been proposed using deep neural networks or non-negative matrix factorization. In this paper, we propose a principled generative approach using variational autoencoders (VAE) for audio source separation. A VAE performs efficient approximate Bayesian inference, which leads to a continuous latent representation of the input data (spectrogram). It contains a probabilistic encoder, which projects the input data to a latent space, and a probabilistic decoder, which projects data from the latent space back to the input space. This allows us to learn a robust latent representation of sources corrupted with noise and other sources. The latent representation is then fed to the decoder to yield the separated source. Both encoder and decoder are implemented via multilayer perceptrons (MLP). In contrast to prevalent techniques, we argue that VAE is a more principled approach to source separation. Experimentally, we find that the proposed framework yields reasonable improvements over baseline methods available in the literature, i.e. DNN and RNN with different masking functions, and autoencoders. We show that our method outperforms the best of the relevant methods, with an improvement of about 2 dB in the source-to-distortion ratio.

Index Terms - Autoencoder, Variational inference, Latent variable, Source separation, Generative models, Deep learning


1. Introduction

The objective of Monaural Audio Source Separation (MASS) is to extract independent audio sources from an audio mixture in a single channel. Source separation is a classic problem and has wide applications in automatic speech recognition, biomedical imaging, and music editing. The problem is very challenging because it is ill-posed, i.e. many combinations of sources can explain the same mixture, and the objective is to estimate the best possible solution. Traditionally, the problem has been addressed by non-negative matrix factorization (NMF) [1] and PLCA [2]. These models learn latent bases which are specific to a source from clean training data. These latent bases are later utilized for separating the source from the mixture signal [3]. NMF and PLCA are generative models which work under the assumption that the data can be represented as a linear composition of low-rank latent bases. Several extensions of NMF and LVM have been employed in the literature along with temporal and sparseness constraints [4, 1, 5]. Though NMF and PLCA are scalable, these techniques do not learn discriminative bases and therefore yield worse results than models where bases are learned on mixtures. Discriminative NMF [6] has been proposed in order to learn mixture-specific bases, which in turn has shown some improvement over NMF.

NMF-based approaches assume that data is a linear combination of latent bases, which may be a limiting factor for real-world data. To model the non-linearity, deep neural networks (DNN) in various configurations have been used in source separation [7, 8, 9]. The denoising autoencoder (DAE) is a special type of fully connected feedforward neural network which can efficiently de-noise a signal [10]. DAEs are used to learn robust low-dimensional features even when the inputs are perturbed with some noise [11]. DAEs have been used for source separation with the mixed signal as input and the target source as output, both in the form of spectral frames [12]. Though DAEs have many advantages, they come at the cost of high complexity and a loss of spatial information. Fully connected DAEs cannot capture the 2D (spectral-temporal) structure of the spectrograms of the input and output signals, and they have a large number of parameters to be optimized, which makes the system highly complex. Fully convolutional denoising autoencoders [13] map a distorted speech signal to its clean speech signal, with an application to speech enhancement. Recently, deep (stacked) fully convolutional DAEs (CDAEs) have been used for audio single channel source separation (SCSS) [14]. However, current deep learning approaches for source separation are still computationally expensive, have many parameters to tune, and are not scalable. NMF-based approaches, on the other hand, rely on the simplistic assumption of linearity and cannot learn discriminative bases effectively.

*The first two authors contributed equally

In this paper, our goal is to have the best of both worlds: i) to learn a set of bases effectively (which is done by the encoder and decoder in a VAE) and ii) inexpensive computation. Moreover, unlike other methods, a VAE can also yield confidence scores indicating how good or bad the separated sources are, based on the average posterior variance estimates. VAEs have shown state-of-the-art results in image generation, text generation and reinforcement learning [15, 16, 17, 18, 19]. In this paper, we show the effectiveness of VAE for audio source separation. We compare the performance of VAE with DNN/RNN architectures and autoencoders. VAE performs better than all of these methods in terms of source-to-distortion ratio (SDR), with an improvement of about 2 dB.

2. Variational Autoencoder

The variational autoencoder [15] is a generative model which assumes that an observed variable x is generated by an underlying random process involving a latent random variable z. In this paper, we aim to learn a robust latent representation of a noisy signal, i.e. P(z|x) ≈ P(z|x + n), where x and n denote the signal and the noise respectively. While estimating z for a source, we consider the other sources as noise. The latent variable z is further used to estimate the clean (separated) source. Fig. 1 shows the graphical model of VAE.

Figure 1: Graphical model of VAE. T is the total number of spectral frames. Dotted lines denote the inference of the latent variable, while solid lines denote the generative model of the observed variable.

Mathematically, the model can be represented as:

$$P_\theta(x, z) = P_\theta(x \mid z)\, P_\theta(z) \tag{1}$$

$$P_\theta(x) = \int P_\theta(x \mid z)\, P_\theta(z)\, dz \tag{2}$$

VAE assumes that the likelihood function Pθ(x|z) and the prior distribution Pθ(z) come from a parametric family of distributions with parameters θ. The prior distribution is assumed to be a Gaussian with zero mean and unit variance:

$$P(z) = \mathcal{N}(z; 0, I) \tag{3}$$

The likelihood is often modeled as an independent Gaussian distribution whose parameters depend on z,

$$P_\theta(x \mid z) = \mathcal{N}\big(x; \mu_\theta(z), \sigma^2_\theta(z) I\big) \tag{4}$$

where µθ(z) and σ²θ(z) are non-linear functions of z, modeled using a neural network. The posterior distribution Pθ(z|x) is given by Bayes' formula,

$$P_\theta(z \mid x) = \frac{P_\theta(x \mid z)\, P(z)}{\int P_\theta(x, z)\, dz} \tag{5}$$
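As an illustration of Eq. (5), the following minimal sketch evaluates the posterior numerically for a one-dimensional toy model in which the integral in the denominator can be approximated on a grid. The model (a linear Gaussian likelihood with slope `a` and variance `noise_var`), the grid bounds, and the function names are assumptions made for this example and are not taken from the paper.

```python
import math


def normal_pdf(v, mean, var):
    """Density of a univariate Gaussian N(v; mean, var)."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def grid_posterior(x, a=2.0, noise_var=0.5, lo=-8.0, hi=8.0, n=4001):
    """Approximate the evidence P(x) and the posterior mean E[z | x] on a grid.

    Assumed toy model: z ~ N(0, 1) and x | z ~ N(a * z, noise_var).
    """
    step = (hi - lo) / (n - 1)
    zs = [lo + i * step for i in range(n)]
    # Numerator of Bayes' formula: P(x | z) P(z) at every grid point.
    joint = [normal_pdf(x, a * z, noise_var) * normal_pdf(z, 0.0, 1.0) for z in zs]
    # Denominator: trapezoidal approximation of the integral over z.
    evidence = step * (sum(joint) - 0.5 * (joint[0] + joint[-1]))
    posterior = [j / evidence for j in joint]
    post_mean = step * sum(z * p for z, p in zip(zs, posterior))
    return evidence, post_mean


evidence, post_mean = grid_posterior(1.0)
# For this linear Gaussian toy model both quantities are known in closed form.
print(evidence, normal_pdf(1.0, 0.0, 2.0 ** 2 + 0.5))
print(post_mean, 2.0 * 1.0 / (2.0 ** 2 + 0.5))
```

Grid integration of this kind is only practical for very low-dimensional z; its cost grows exponentially with the latent dimension.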

However, the denominator is often intractable. Sampling methods like MCMC can be employed, but these are often too slow and computationally expensive. Variational Bayesian methods solve this problem by approximating the intractable true posterior Pθ(z|x) with some tractable parametric distribution qφ(z|x). The marginal likelihood can be written as [15]

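$$\log P_\theta(x) = D_{KL}\big[q_\phi(z \mid x) \,\|\, P_\theta(z \mid x)\big] + \mathcal{L}(\theta, \phi; x) \tag{6}$$

where

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log P_\theta(x, z) - \log q_\phi(z \mid x)\big] \tag{7}$$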

Here E and DKL denote the expectation and the KL divergence respectively. The marginal likelihood in this form is still intractable, because the KL divergence between the approximate and the true posterior involves the unknown true posterior. Since DKL ≥ 0, L(θ, φ; x) is a (variational) lower bound on log Pθ(x) and acts as a surrogate for optimizing the marginal likelihood. Re-parameterizing the random variable z and optimizing with respect to θ and φ yields [15],

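$$\theta, \phi = \arg\max_{\theta, \phi} \mathcal{L}(\theta, \phi; x) \approx \arg\max_{\theta, \phi} \; \frac{1}{L} \sum_{l=1}^{L} \log P_\theta(x \mid z^{l}) - D_{KL}\big[q_\phi(z \mid x) \,\|\, P(z)\big] \tag{8}$$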

Code and data: github.com/anurendra/vae_sep

In Eq. (8), φ and θ are the parameters of the multilayer perceptrons (MLP) implementing the encoder qφ(z|x) and the decoder Pθ(x|z) respectively, and L denotes the total number of samples z^l drawn from qφ(z|x). Often a single sample is enough for learning θ and φ, provided there is enough training data [15]. Normally, a one-layer neural network is used for the encoder and the decoder in a VAE. However, the number of layers can be increased to increase the non-linearity. We call such models deep-VAE in this paper and show that deep-VAE performs better than VAE.
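The sampled objective can be sketched in a few lines. Substituting Pθ(x, z) = Pθ(x|z)P(z) into Eq. (7) splits the bound into an expected reconstruction term minus the KL divergence from qφ(z|x) to the prior; the sketch below estimates the first term with reparameterized samples and evaluates the second in closed form. It assumes a diagonal Gaussian qφ(z|x) parameterized by a mean and a log-variance, and the functions `toy_encode` and `toy_decode` are hypothetical linear stand-ins for the MLPs, not the networks used in the paper.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def gaussian_log_likelihood(x, mean, var):
    """log N(x; mean, diag(var)) for an independent Gaussian likelihood."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v)
        for xi, mi, v in zip(x, mean, var)
    )


def elbo_estimate(x, encode, decode, rng, n_samples=1):
    """Monte Carlo estimate of the lower bound: reconstruction term minus KL."""
    mu, log_var = encode(x)
    recon = 0.0
    for _ in range(n_samples):
        z = reparameterize(mu, log_var, rng)
        mean, var = decode(z)
        recon += gaussian_log_likelihood(x, mean, var)
    return recon / n_samples - kl_to_standard_normal(mu, log_var)


def toy_encode(x):
    # Hypothetical encoder: returns (mean, log-variance) of q(z | x).
    return [0.5 * v for v in x], [-1.0 for _ in x]


def toy_decode(z):
    # Hypothetical decoder: returns (mean, variance) of P(x | z).
    return [2.0 * v for v in z], [1.0 for _ in z]


print(elbo_estimate([0.3, -1.2], toy_encode, toy_decode, random.Random(0)))
```

In a real implementation the encoder and decoder would be trainable networks, and the negative of this estimate would serve as the loss minimized by gradient descent.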

3. Source Separation

Audio single channel source separation (SCSS) aims to estimate the sources sᵢ(t), ∀i, from a mixed signal y(t) made up of I sources,

$$y(t) = \sum_{i=1}^{I} s_i(t)$$

We perform computations in the short-time Fourier transform (STFT) domain. Given the STFT of the mixed signal y(t), the primary goal is to estimate the STFT of each source ŝᵢ(t) in the mixture. Each source is modeled using a single VAE, i.e. a specific encoder and decoder are learned for each source. Fig. 2 shows the architecture of the VAE used.
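The mixing model and the move to the STFT domain can be illustrated with a small sketch. It builds a mixture of two synthetic sources and checks that the complex STFT of the mixture equals the sum of the source STFTs, since the STFT is linear. The naive DFT-based `stft`, the frame length, the hop size, and the test signals are assumptions chosen for brevity and do not reflect the analysis settings of the paper.

```python
import cmath
import math


def mix(sources):
    """Sum equally long source signals sample by sample: y(t) = sum_i s_i(t)."""
    return [sum(samples) for samples in zip(*sources)]


def stft(signal, frame_len=8, hop=4):
    """Naive STFT: Hann-windowed frames followed by a direct DFT per frame."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Keep only the non-negative frequency bins of the real-valued frame.
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


s1 = [math.sin(2.0 * math.pi * 0.1 * n) for n in range(32)]
s2 = [0.5 * math.cos(2.0 * math.pi * 0.3 * n) for n in range(32)]
y = mix([s1, s2])

Y, S1, S2 = stft(y), stft(s1), stft(s2)
max_err = max(
    abs(Y[t][k] - (S1[t][k] + S2[t][k]))
    for t in range(len(Y))
    for k in range(len(Y[0]))
)
print(max_err)  # numerically negligible: the complex STFT is additive
```

Additivity holds for the complex STFT only. Magnitude spectrograms merely satisfy |Y| ≤ |S1| + |S2|, which is one reason separation from magnitudes is not a simple subtraction.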

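Figure 2: Architecture of VAE for audio source separation. The encoder f maps the input signal (time-frequency spectrogram) to µ and σ²; the latent sample µ + σZ, with Z ∼ N(0, I), is passed to the decoder g, which outputs the target source (time-frequency spectrogram).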

We propose to use as many VAEs as the number of sources to be separated from the mixed signal. Each VAE deals with t