




Singing Voice Separation using Generative Adversarial Networks

Hyeong-seok Choi, Kyogu Lee
Music and Audio Research Group
Graduate School of Convergence Science and Technology
Seoul National University
{kekepa15, kglee}@snu.ac.kr

Ju-heon Lee
College of Liberal Studies
Seoul National University
[email protected]

Abstract

In this paper, we propose a novel approach that extends Wasserstein generative adversarial networks (GANs) [3] to separate the singing voice from a mixture signal. We used the mixture signal as a condition to generate singing voices and applied a U-net style network for stable training of the model. Experiments on the DSD100 dataset show promising results and the potential of using GANs for music source separation.

1 Introduction

Music source separation is the process of separating a specific source from a music signal. Separating a source from the mixture signal can be interpreted as maximizing the likelihood of the source given the mixture. We perform this task using GANs [1], which can be classified as a method that maximizes the likelihood through an implicit density. GANs are usually used to produce samples from noise, but in recent years, research [7, 8] has explored conditioning the generation process so that the desired samples can be produced under a given constraint. In this paper, we aim to generate singing voice signals using mixture signals as the condition.

2 Background

GANs are generative models that learn a generator function Gθ mapping noise samples z ∼ p(z) into the real data space. The training of GANs is often described as a mini-max game between two players, the discriminator (D) and the generator (G) [1]. The input of D is either a real sample x ∼ Pr or a fake sample x̃ ∼ Pg, and the task of D is to classify x̃ as fake and x as real. Many improved GAN models have been proposed [2, 3, 5], and one of the notable GAN studies that provides both a theoretical background and practical results is the Wasserstein GAN. It is a model that tries to reduce the Wasserstein distance between the data distribution (Pr) and the generated sample distribution (Pg). Using the Wasserstein distance, GAN training can be formulated as follows. Note that x̃ = G(z) with z ∼ p(z), and D in Eq. (1) ranges over the set 𝒟 of functions satisfying the 1-Lipschitz condition.

$$\min_{G} \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] \tag{1}$$

In order to enforce D to be a function satisfying the 1-Lipschitz condition, [4] suggests regularizing the objective function by adding a gradient penalty term. Note that Px̂ is a sampling distribution that samples from the straight line between x ∼ Pr and x̃ ∼ Pg, that is, x̂ = ε · x + (1 − ε) · x̃ with 0 ≤ ε ≤ 1, and λg is the gradient penalty coefficient.

$$L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda_g \cdot \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big] \tag{2}$$
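As a concrete illustration of Eqs. (1)-(2), the following is a minimal sketch of the critic objective with a gradient penalty, assuming PyTorch; the function and variable names are ours, not from the paper.

```python
# Minimal sketch of the WGAN-GP critic objective (Eqs. 1-2), assuming PyTorch.
# `D` is a critic returning one score per example; `real` and `fake` are batches
# of samples with identical shapes (e.g., spectrograms of shape (B, C, H, W)).
import torch

def critic_objective(D, real, fake, lambda_gp=10.0):
    # Wasserstein term: E[D(fake)] - E[D(real)]
    w_term = D(fake).mean() - D(real).mean()

    # Gradient penalty on points sampled along the line between real and fake
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

    return w_term + lambda_gp * penalty
```

Minimizing this quantity over the critic parameters corresponds to the penalized objective of Eq. (2), i.e., the inner maximization of Eq. (1).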

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.


3 Model setup

3.1 Objective function

We define xm, xs, and x̃s as the mixture, the real source paired with the mixture, and the fake (generated) source paired with the mixture, respectively. In our setting, the goal of G is to transform xm into an x̃s that is as similar as possible to xs, and the goal of D is to distinguish the real source xs from the fake source x̃s conditioned on xm. To formulate this, we changed the aforementioned objective (2) into the conditional GAN fashion [7, 8]. Thus, the input of D becomes the concatenation of either (xm, x̃s) or (xm, xs). For the gradient penalty term, we uniformly sampled x̂ms ∼ Px̂ms from the straight line between the concatenation (xm, x̃s) and the concatenation (xm, xs) [4].

$$
\begin{aligned}
L = {} & \mathbb{E}_{x_m \sim P_{data},\, \tilde{x}_s \sim P_g}[D(x_m, \tilde{x}_s)] - \mathbb{E}_{(x_m, x_s) \sim P_{data}}[D(x_m, x_s)] \\
& + \lambda_g \cdot \mathbb{E}_{(x_m, x_s) \sim P_{data},\, \tilde{x}_s \sim P_g,\, \hat{x}_{ms} \sim P_{\hat{x}_{ms}}}\big[(\|\nabla_{\hat{x}_{ms}} D(x_m, \hat{x}_{ms})\|_2 - 1)^2\big]
\end{aligned} \tag{3}
$$

As a final objective for the generator, we added an l1 loss term to examine the effect of a more conventional loss, and experimented with three cases: an objective containing only the l1 loss, only the generative adversarial loss, and both terms together. Therefore, our final objectives for the generator (LG) and the discriminator (LD) are as follows. The coefficients for the adversarial loss, the gradient penalty, and the l1 loss are denoted as λD, λg, and λl1, respectively.

$$
L_G = -\lambda_D \cdot \mathbb{E}_{x_m \sim P_{data},\, \tilde{x}_s \sim P_g}[D(x_m, \tilde{x}_s)] + \lambda_{l1} \cdot \mathbb{E}_{x_s \sim P_r,\, \tilde{x}_s \sim P_g}[\|x_s - \tilde{x}_s\|_1] \tag{4}
$$

$$
\begin{aligned}
L_D = {} & \lambda_D \cdot \big(\mathbb{E}_{x_m \sim P_{data},\, \tilde{x}_s \sim P_g}[D(x_m, \tilde{x}_s)] - \mathbb{E}_{(x_m, x_s) \sim P_{data}}[D(x_m, x_s)]\big) \\
& + \lambda_g \cdot \mathbb{E}_{(x_m, x_s) \sim P_{data},\, \tilde{x}_s \sim P_g,\, \hat{x}_{ms} \sim P_{\hat{x}_{ms}}}\big[(\|\nabla_{\hat{x}_{ms}} D(x_m, \hat{x}_{ms})\|_2 - 1)^2\big]
\end{aligned} \tag{5}
$$
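A hedged sketch of how the objectives in Eqs. (3)-(5) could be implemented is given below, again assuming PyTorch. The critic D here takes the mixture and a (real or generated) vocal spectrogram concatenated along the channel axis, and lambda_D, lambda_gp, and lambda_l1 stand in for λD, λg, and λl1; all names and default values are illustrative rather than the paper's.

```python
# Sketch of the conditional objectives in Eqs. (3)-(5), assuming PyTorch.
# x_m, x_s, x_s_fake: mixture, real vocal, and generated vocal spectrograms
# of shape (B, 1, H, W); D takes their channel-wise concatenation (B, 2, H, W).
import torch
import torch.nn.functional as F

def discriminator_loss(D, x_m, x_s, x_s_fake, lambda_D=1.0, lambda_gp=10.0):
    real_pair = torch.cat([x_m, x_s], dim=1)
    fake_pair = torch.cat([x_m, x_s_fake], dim=1)
    w_term = D(fake_pair).mean() - D(real_pair).mean()

    # Gradient penalty on the straight line between the real and fake pairs
    eps = torch.rand(x_s.size(0), 1, 1, 1, device=x_s.device)
    pair_hat = (eps * real_pair + (1.0 - eps) * fake_pair).requires_grad_(True)
    grad = torch.autograd.grad(D(pair_hat).sum(), pair_hat, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

    return lambda_D * w_term + lambda_gp * penalty   # Eq. (5)

def generator_loss(D, x_m, x_s, x_s_fake, lambda_D=1.0, lambda_l1=1.0):
    fake_pair = torch.cat([x_m, x_s_fake], dim=1)
    adv = -D(fake_pair).mean()                       # adversarial term of Eq. (4)
    return lambda_D * adv + lambda_l1 * F.l1_loss(x_s_fake, x_s)
```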

3.2 Network structure for generator

Our generator model is constructed as follows. As the deep neural network, we adopt the U-net structure [6]. The U-net consists of an encoding and a decoding stage, and the layers in each stage are convolutional layers. In the encoding stage, inputs are encoded by convolutional layers followed by batch normalization until the representation becomes a vector of length 2048. Then, a fully connected (FC) layer encodes it into a vector of length 512. In the decoding stage, the input of each layer is concatenated along the channel axis with the corresponding encoding layer through a skip connection. The concatenated features are then decoded by deconvolutional layers followed by batch normalization. As the non-linearity of each convolutional and deconvolutional layer, we use leaky ReLU, except for the last layer, which uses ReLU. More details are given in Figure 1.
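Below is a simplified sketch of such a U-net style generator in PyTorch. The depth, filter sizes, strides, and channel widths here are placeholders, and the 2048-to-512 fully connected bottleneck is omitted; the actual configuration is given in Figure 1.

```python
# Simplified U-net style generator sketch (PyTorch). The real model is deeper
# and includes an FC bottleneck; this sketch only illustrates the skip-connected
# encoder/decoder structure described in the text.
import torch
import torch.nn as nn

def enc_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2))

def dec_block(in_ch, out_ch, last=False):
    layers = [nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=2,
                                 padding=2, output_padding=1)]
    if last:
        layers.append(nn.ReLU())          # the last layer uses ReLU
    else:
        layers += [nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)

class UNetGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.e1, self.e2, self.e3 = enc_block(1, 64), enc_block(64, 128), enc_block(128, 256)
        self.d3 = dec_block(256, 128)
        self.d2 = dec_block(128 + 128, 64)           # skip connection from e2
        self.d1 = dec_block(64 + 64, 1, last=True)   # skip connection from e1

    def forward(self, x):                  # x: (B, 1, 512, 128) mixture spectrogram
        h1 = self.e1(x)                    # (B, 64, 256, 64)
        h2 = self.e2(h1)                   # (B, 128, 128, 32)
        h3 = self.e3(h2)                   # (B, 256, 64, 16)
        d3 = self.d3(h3)                   # (B, 128, 128, 32)
        d2 = self.d2(torch.cat([d3, h2], dim=1))     # (B, 64, 256, 64)
        return self.d1(torch.cat([d2, h1], dim=1))   # (B, 1, 512, 128)
```

For example, G = UNetGenerator() maps a (B, 1, 512, 128) mixture spectrogram to a vocal spectrogram of the same shape.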

3.3 Network structure for discriminator

Our discriminator model is constructed as follows. The input, either (xm, x̃s) or (xm, xs), is concatenated along the channel axis. We use 5 convolutional layers without batch normalization, since batch normalization is not valid in the gradient penalty setting [4]. After each convolutional layer, we use leaky ReLU as the non-linearity, except for the last layer, where we do not use any non-linearity. More details are given in Figure 2.

One noticeable aspect of our discriminator is that we intentionally make the output have a size of 64×16 [7]. This allows each pixel value of the output to contribute equally to the Wasserstein distance, which we compute by simply taking the mean of the output pixel values. In this way, each output pixel corresponds to a different receptive region of the input, each with the same receptive field size of 115×31. Intuitively, we assume that this is a better choice than having a full receptive field of the input size (512×128): since the receptive field is roughly a quarter of the input size, each pixel can make a decision over a different time-frequency region of the input. In practice, we also found that training is not only time-consuming but also fails when the receptive field grows as the discriminator becomes deeper.
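A hedged sketch of such a patch-style critic in PyTorch is shown below; the filter sizes and strides approximate those reported in Figure 2, no batch normalization is used, and the mean of the 64×16 output map serves as the critic score.

```python
# Patch-style critic sketch (PyTorch): 5 convolution layers, no batch norm,
# leaky ReLU except after the last layer, 64x16 output map averaged to a score.
import torch
import torch.nn as nn

class PatchCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 64, kernel_size=(15, 3), stride=2, padding=(7, 1)),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=(15, 3), stride=2, padding=(7, 1)),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, kernel_size=(15, 3), stride=2, padding=(7, 1)),
            nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1, kernel_size=3, stride=1, padding=1))  # no non-linearity

    def forward(self, pair):               # pair: (B, 2, 512, 128), mixture + vocal
        patch_scores = self.net(pair)      # (B, 1, 64, 16) patch score map
        return patch_scores.mean(dim=(1, 2, 3))   # one critic score per example
```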



[Figure 1 diagram omitted: a U-net generator with an encoding panel (convolution layers) and a decoding panel (deconvolution layers) connected by skip connections, operating on a spectrogram of height 512, width 128, and 1 channel.]

Figure 1: Network structure for the generator. It consists of two stages, encoding and decoding, with skip connections from the encoding layers. F denotes the filter size, S_h the stride over height, S_w the stride over width, and C the output channel for the next layer.

[Figure 2 diagram omitted: the discriminator takes an input of height 512, width 128, and 2 channels and applies convolution layers (F: 15×3, S_h: 2, S_w: 2, C: 64; F: 15×3, S_h: 2, S_w: 2, C: 128; F: 15×3, S_h: 2, S_w: 2, C: 256; F: 3×3, S_h: 1, S_w: 1, C: 512; F: 3×3, S_h: 1, S_w: 1, C: 1) to produce a 64×16 output map.]

Figure 2: Network structure for the discriminator.

4 Preliminary experiments

4.1 Dataset

The DSD100 dataset was used for model training. DSD100 consists of 50 songs as a development set and 50 songs as a test set, each consisting of a mixture and four sources (vocals, bass, drums, and others). All recordings are digitized with a sampling frequency of 44,100 Hz.

4.2 Mini-batch composition

To train our conditional GAN model, we composed each mini-batch of two parts, a condition part and a target source part. It might seem natural to include only mixtures in the condition part, but we composed the condition part to include some proportion of singing voice sources as well as mixtures. We did this because common popular music includes intros, interludes, and outros composed only of accompaniment, so the term "mixture" by itself does not guarantee a signal containing both singing voice and accompaniment; as a result, the target source often turns out to be a zero matrix, which is not good for training the model. Moreover, in real-world music, the singing voice can also appear by itself (e.g., a cappella), so we thought there was a need to prepare for this situation as well. In most experiments, the ratio between mixtures and singing voices in the condition part was set to 7:1. This is illustrated in Figure 3.
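The following is an illustrative sketch of this mini-batch composition, assuming NumPy arrays of pre-computed mixture and vocal spectrogram segments; the 7:1 ratio follows the text, while the function and argument names are placeholders.

```python
# Sketch of the mini-batch composition: most condition examples are mixtures,
# but roughly one in eight uses the vocal itself as the condition.
import numpy as np

def compose_batch(mixtures, vocals, batch_size=16, vocal_ratio=1 / 8, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.integers(0, len(mixtures), size=batch_size)
    use_vocal = rng.random(batch_size) < vocal_ratio       # ~7:1 mixture:vocal
    condition = np.where(use_vocal[:, None, None], vocals[idx], mixtures[idx])
    target = vocals[idx]          # the target source is always the singing voice
    return condition, target
```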

[Figure 3 diagram omitted: condition and target-source parts of the mini-batch, built from mixture and vocal spectrograms.]

Figure 3: Composition of the mini-batch used in training. The mixture, true vocal, and fake vocal from the generator are denoted as xm, xs, and x̃s, respectively.

4.3 Pre- & post-processing

As preprocessing, the songs in the dataset are split into audio segments of 2 seconds with an overlap of 1 second between consecutive segments. We then convert each stereo segment to mono by taking the mean of the two channels. Next, we down-sample each segment to 16,200 Hz and perform the short-time Fourier transform (STFT) on the waveform with a window size of 1024 samples and a hop length of 256 samples. This setting turns a segment of audio into a matrix of size 512×128. As post-processing, to convert the final extracted vocal spectrogram back into a waveform, we simply apply the inverse STFT using the phase of the input mixture spectrogram.
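A hedged sketch of these pre- and post-processing steps, assuming the librosa library, is given below; the sample rate, window, and hop sizes follow the text, and the function names and file paths are placeholders.

```python
# Pre-/post-processing sketch: mono conversion, resampling, 2-second segments
# with 1-second overlap, STFT magnitudes, and ISTFT using the mixture phase.
import librosa
import numpy as np

SR, N_FFT, HOP = 16200, 1024, 256

def preprocess(path):
    y, _ = librosa.load(path, sr=SR, mono=True)   # resample and average channels
    seg_len, seg_hop = 2 * SR, SR                 # 2 s segments, 1 s overlap
    segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len + 1, seg_hop)]
    # Keep the first 512 of the 513 frequency bins, matching the 512x128 input
    mags = [np.abs(librosa.stft(s, n_fft=N_FFT, hop_length=HOP))[:512] for s in segments]
    return segments, mags

def postprocess(est_vocal_mag, mixture_segment):
    mix_stft = librosa.stft(mixture_segment, n_fft=N_FFT, hop_length=HOP)
    phase = np.angle(mix_stft)
    mag = np.pad(est_vocal_mag, ((0, 1), (0, 0)))  # restore the dropped top bin
    return librosa.istft(mag * np.exp(1j * phase), hop_length=HOP)
```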

4.4 Results

In Figure 4, we show log magnitude spectrograms to compare the effect of the generative adversarial loss and the l1 loss. We found that when the generative adversarial loss is used, the network removes the accompaniment more aggressively than when only the l1 loss is used. Thus, we assume that one of the keys to training this model is adjusting the coefficients of the l1 loss (λl1) and the generative adversarial loss (λD). We have not yet evaluated our algorithm with the common metrics in the music source separation task, namely SDR (Source to Distortion Ratio), SIR (Source to Interference Ratio), and SAR (Source to Artifact Ratio). For a fair quantitative evaluation, we plan to compare our model against the algorithm evaluation results of the Signal Separation Evaluation Campaign (SiSEC) 2016. Generated vocal samples from our model are available on the demo website¹.
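For reference, such an SDR/SIR/SAR evaluation could be run with the mir_eval package as sketched below; this is not part of the experiments reported here, and all variable names are placeholders.

```python
# Sketch of a BSS-Eval style evaluation with mir_eval (not reported in the paper).
import numpy as np
import mir_eval

def evaluate(true_vocal, true_accomp, est_vocal, est_accomp):
    reference = np.stack([true_vocal, true_accomp])    # (n_sources, n_samples)
    estimated = np.stack([est_vocal, est_accomp])
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(reference, estimated)
    return sdr, sir, sar
```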


Figure 4: Log magnitude spectrograms of (a) the mixture, (b) the true vocal, (c) the estimated vocal using the generative adversarial loss only, (d) the estimated vocal using the l1 loss only, and (e) the estimated vocal using both the generative adversarial loss and the l1 loss.

¹ Demo audio samples for our model are available at https://kekepa15.github.io/



References

[1] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 2672-2680.

[2] Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., & Smolley, S. P. (2016). Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076.

[3] Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International Conference on Machine Learning, 214-223.

[4] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.

[5] Kodali, N., Abernethy, J., Hays, J., & Kira, Z. (2017). How to train your DRAGAN. arXiv preprint arXiv:1705.07215.

[6] Ronneberger, O. (2017). Invited talk: U-Net convolutional networks for biomedical image segmentation. Informatik aktuell, Bildverarbeitung für die Medizin 2017, 3-3.

[7] Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2016). Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004.

[8] Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
