VOCAL SEPARATION FROM SONGS USING DEEP CONVOLUTIONAL ENCODER-DECODER NETWORKS

Duyeon Kim∗ Jaehoon Oh∗

∗Both authors contributed equally to this work.
Graduate School of Knowledge Service Engineering, KAIST
[email protected], [email protected]

ABSTRACT

Nowadays, Sound Source Separation (SSS) is an interesting task for Music Information Retrieval (MIR) researchers because it is related to many other MIR tasks such as singer identification, singer verification, and voice conversion. In addition, if we can separate a clean vocal from a song, we can also obtain a clean instrumental track by reversing the phase. However, it is very difficult to extract a clean source because a song contains many overlapping sounds. In this paper, we propose Deep Convolutional Encoder-Decoder Networks (DCEDNs) based on the U-Net model, compare their vocal separation performance against this baseline model, and analyze further considerations for better performance.

1. INTRODUCTION

Vocal separation is an important basic task in Music Information Retrieval (MIR) because many MIR tasks depend on whether the source sound is mixed or not. Through vocal separation, we can separate the vocal and the instrumental sound, which forms the basis of a variety of music applications. For instance, automatic lyric analysis, singer identification, music similarity extraction, genre classification, and recommendation quality can all be improved by using separated vocal and instrumental sounds.

Since separation is such an important problem, many studies have approached it, mainly using non-negative matrix factorization [8]. In recent years, more and more attempts have been made to apply various deep learning techniques. In this project, we attempted to perform vocal separation by applying a convolutional encoder-decoder structure. The results are not perfect, but it was possible to extract only the vocal part reasonably well.

2. RELATED WORK

U-Net was originally introduced in the biomedical imaging field [7]. However, Andreas Jansson et al. [3] showed that U-Net can also perform well in the audio domain by transforming waveforms into spectrograms. U-Net is composed of a fully convolutional encoder network and a fully convolutional decoder network, with skip connections from the encoder to the decoder network.

© Duyeon Kim, Jaehoon Oh. "Vocal Separation from Songs Using Deep Convolutional Encoder-Decoder Networks", GCT634, KAIST, Korea, 2018.

Another approach uses Generative Adversarial Networks (GANs) [2], which are composed of a generator network that produces fake inputs and a discriminator network that determines whether an input is fake or real. Santiago Pascual et al. [5] proposed the Speech Enhancement GAN (SEGAN), an end-to-end GAN model that filters voice out of a noisy environment. SVSGAN [1] is a GAN model that separates the voice from music using spectrograms.

Our model is based on U-Net because it shows better performance than the GAN approaches. We experimented with many modifications of the architecture and with different parameter and hyperparameter settings.

3. MODEL ARCHITECTURE

Figure 1 represents our overall architecture. It is composed of three parts: Feature Extraction, DCEDNs, and Audio Reconstruction.

3.1 Feature Extraction

We transform the waveform into a spectrogram to make the input and targets of the network by the following procedure:

• Step 1: We sampled the waveforms of the songs at 44,100 Hz and did not downsample. Then, we cut the songs into 8-second segments.

• Step 2: We transformed each waveform into an STFT matrix using the Short-Time Fourier Transform (STFT) with a window size of 1024 and a hop length of 512 samples.

• Step 3: We decomposed the STFT matrix into magnitude spectra and phase information and normalized the magnitude to zero mean and unit variance. As a result, we extract patches of 680 frames that we use as input and targets to the network (a sketch of this procedure is given below).
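As an illustration, the following is a minimal sketch of this feature-extraction procedure, assuming librosa is used for loading and the STFT (the paper does not name a library; the file path, mono loading, and function name are assumptions).

```python
# Sketch of the feature extraction in Section 3.1; names and paths are illustrative.
import numpy as np
import librosa

SR = 44100          # sampling rate (no downsampling)
SEGMENT_SEC = 8     # cut songs into 8-second segments
N_FFT = 1024        # STFT window size
HOP = 512           # STFT hop length

def extract_patches(path):
    """Return (normalized magnitude, phase) pairs for 8-second segments."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    seg_len = SR * SEGMENT_SEC
    patches = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        stft = librosa.stft(y[start:start + seg_len], n_fft=N_FFT, hop_length=HOP)
        magnitude, phase = np.abs(stft), np.angle(stft)
        # normalize the magnitude to zero mean and unit variance
        magnitude = (magnitude - magnitude.mean()) / (magnitude.std() + 1e-8)
        patches.append((magnitude, phase))
    return patches
```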

3.2 Deep Convolutional Encoder-Decoder Networks

Figure 1. Overall architecture

Figure 2 shows the details of the DCEDN architecture. The network consists of convolutional encoder and decoder layers whose overall shape looks like a bottleneck. Each encoder stage repeats a 2D convolution that keeps the image size and doubles the number of channels, followed by a 2D convolution that halves the image size via max-pooling while keeping the number of channels. All encoder layers use a 3x3 kernel, a stride of 1, and padding of 1, with batch normalization and leaky ReLU with a negative slope of 0.2.

Each decoder stage repeats a 2D deconvolution that doubles the image size and halves the number of channels, with a kernel size of 5x5, a stride of 2, and padding of 2, followed by a 2D convolution that keeps both the image size and the number of channels, with a kernel size of 3x3, a stride of 1, and padding of 1. All decoder layers use batch normalization and ReLU, and the first three decoder layers use dropout with a probability of 0.5. In addition, each encoder layer is concatenated with the decoder layer that has the same image size to form a skip connection. The difference between U-Net and our model is the presence of the layers that keep the number of channels.
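The following is a minimal PyTorch sketch of one encoder stage and one decoder stage as described above; the paper does not publish code, so the module names, the exact layer ordering, and the output_padding needed for exact size doubling are assumptions.

```python
# Sketch of one encoder stage and one decoder stage of the DCEDN.
import torch.nn as nn

class EncoderStage(nn.Module):
    """Conv keeps size and doubles channels, then conv + max-pool halves size."""
    def __init__(self, in_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * 2, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(in_ch * 2),
            nn.LeakyReLU(0.2),
            nn.Conv2d(in_ch * 2, in_ch * 2, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(in_ch * 2),
            nn.LeakyReLU(0.2),
            nn.MaxPool2d(2),             # halve the image size
        )

    def forward(self, x):
        return self.block(x)

class DecoderStage(nn.Module):
    """Deconv doubles size and halves channels, then conv keeps both."""
    def __init__(self, in_ch, use_dropout=False):
        super().__init__()
        layers = [
            nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=5, stride=2,
                               padding=2, output_padding=1),  # double the size
            nn.BatchNorm2d(in_ch // 2),
            nn.ReLU(),
            nn.Conv2d(in_ch // 2, in_ch // 2, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(in_ch // 2),
            nn.ReLU(),
        ]
        if use_dropout:                  # first three decoder stages only
            layers.append(nn.Dropout(0.5))
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```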

Figure 2. Deep Convolutional Encoder-Decoder Networks

The input of the model is the magnitude spectrogram of the Mixture, and the output of the model is called the mask, which filters the instrument sound out of the mixture through element-wise multiplication with the input. For training, we use the L1,1 loss between the masked magnitude spectrogram of the Mixture and the magnitude spectrogram of the Vocal:

$$L(X, Y; M) = \|X \odot M - Y\|_{1,1}$$

where X is the magnitude of the Mixture, M is the mask, ⊙ denotes the element-wise product, X ⊙ M is the masked magnitude of the Mixture, and Y is the magnitude of the Vocal. We trained the network for 100 epochs using the ADAM optimizer with a weight decay of 1e-4, a batch size of 5, and a learning rate of 0.0001.
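A minimal sketch of this masked L1,1 loss and training setup, assuming a PyTorch model `net` that outputs the mask M; the function and loader names are placeholders.

```python
# Sketch of the L1,1 masked loss and the training configuration above.
import torch

def l1_mask_loss(mixture_mag, vocal_mag, mask):
    """L(X, Y; M) = || X (element-wise) M - Y ||_{1,1}."""
    return torch.sum(torch.abs(mixture_mag * mask - vocal_mag))

def train(net, loader, epochs=100):
    """ADAM with learning rate 1e-4 and weight decay 1e-4, as in the paper;
    `loader` is assumed to yield batches of size 5."""
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-4, weight_decay=1e-4)
    for _ in range(epochs):
        for mixture_mag, vocal_mag in loader:
            mask = net(mixture_mag)
            loss = l1_mask_loss(mixture_mag, vocal_mag, mask)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```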

3.3 Audio Reconstruction

We re-transform the spectrogram into a waveform to listen to the reconstructed sound. To re-transform, we take the element-wise product of the phase of the Mixture and the masked magnitude of the Mixture. After that, we can obtain the waveform of the separated vocal using the inverse STFT.
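A minimal sketch of this reconstruction step, assuming librosa's inverse STFT with the same parameters as in Section 3.1; undoing the magnitude normalization is omitted for brevity.

```python
# Sketch of audio reconstruction; magnitude de-normalization is assumed
# to be handled elsewhere.
import numpy as np
import librosa

def reconstruct_vocal(mixture_mag, mixture_phase, mask, hop_length=512):
    """Apply the mask to the mixture magnitude, re-attach the mixture phase,
    and invert the STFT to get the separated vocal waveform."""
    masked_mag = mixture_mag * mask                       # element-wise product
    stft_vocal = masked_mag * np.exp(1j * mixture_phase)  # complex spectrogram
    return librosa.istft(stft_vocal, hop_length=hop_length)
```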

4. EXPERIMENTS

4.1 Dataset

We used the musdb18 dataset, which consists of 150 full-length music tracks of different styles along with their isolated drums, bass, vocals, and other stems [6]. Figure 3 shows the components of the dataset. All signals are stereophonic and encoded at 44.1 kHz. There are 100 songs for training and 50 songs for testing. We used the Mixture as input data and the Vocal as target data.

Figure 3. The Components of the Dataset
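As an illustration, the following sketch reads the dataset with the `musdb` Python package; the paper does not state how the data were loaded, and the root path is a placeholder.

```python
# Sketch of loading musdb18 with the `musdb` package.
import musdb

mus_train = musdb.DB(root="path/to/musdb18", subsets="train")  # 100 songs
mus_test = musdb.DB(root="path/to/musdb18", subsets="test")    # 50 songs

for track in mus_train:
    mixture = track.audio                   # stereo mixture, 44.1 kHz
    vocals = track.targets["vocals"].audio  # isolated vocal stem
    # The feature extraction of Section 3.1 would be applied to these arrays.
```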


4.2 Analyzing spectrogram

During training, we looked closely at how the spectrogram changes over the epochs and found that the values in the high-frequency band are recovered gradually (Figure 4). It is more difficult to recover the detail in the high-frequency band than to recover the fundamental tone in the low-frequency band.

Figure 4. The change of the spectrogram every 20 epochs (top to fifth panels) and the spectrogram of the vocal (bottom)

We also observed that when we trained the baseline model and our model for the same number of epochs, the baseline model removes content more drastically than ours (Figure 5). Therefore, although the baseline model has less noise than ours, it also removes more of the vocal source.

Figure 5. The spectrogram of the baseline model (top) and of our model (bottom) after 100 epochs of training

        Baseline        Ours
SDR     0.91 (12.36)    2.40 (10.20)
SIR     Inf             Inf
SAR     0.91 (12.36)    2.40 (10.20)

Table 1. The results of objective evaluation

            Baseline    Ours
Question 1  2.6         3.6
Question 2  4.4         3.0
Question 3  3.2         4.0

Table 2. The results of the subjective evaluation

4.3 Evaluation with objective measure

To measure source separation performance objectively [9], we used three measures: the Signal-to-Distortion Ratio (SDR), the Signal-to-Interference Ratio (SIR), and the Signal-to-Artifact Ratio (SAR). Table 1 shows the results; the values in the table are medians with the standard deviation in parentheses.

As we can see in Table 1, the values do not seem proper. To see what the problem was, we analyzed the formula for each measure:

$$\mathrm{SDR} := 10 \log_{10} \frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}} + e_{\mathrm{noise}} + e_{\mathrm{artif}}\|^2}$$

$$\mathrm{SIR} := 10 \log_{10} \frac{\|s_{\mathrm{target}}\|^2}{\|e_{\mathrm{interf}}\|^2}$$

$$\mathrm{SAR} := 10 \log_{10} \frac{\|s_{\mathrm{target}} + e_{\mathrm{interf}}\|^2}{\|e_{\mathrm{noise}}\|^2}$$

where s_target is a version of a source signal s_j modified by an allowed distortion, and e_interf, e_noise, and e_artif are, respectively, the interference, noise, and artifact error terms. According to the above equations, e_interf and e_artif might be 0. We added a small ε to the audio waveforms to work around this numerical problem, but we could not solve it.
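For reference, a minimal sketch of computing these measures, assuming the mir_eval implementation of BSS-Eval [9] (the paper does not name the toolkit that was used).

```python
# Sketch of the objective evaluation with mir_eval's BSS-Eval implementation.
import numpy as np
import mir_eval

def evaluate(reference_vocal, estimated_vocal):
    """Return SDR, SIR, and SAR for a single estimated vocal track."""
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
        np.atleast_2d(reference_vocal),
        np.atleast_2d(estimated_vocal),
    )
    return sdr[0], sir[0], sar[0]
```

With a single target source, the interference term is zero by construction, which would explain the infinite SIR values in Table 1.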

4.4 Evaluation with subjective measure

To complement the objective evaluation, we also evaluated subjectively [3]. We asked 5 people three questions on a 5-point Likert scale. Table 2 shows the results; the values in the table are means, and higher is better. The questions are as follows:

• Question 1: What is the quality of the vocal?

• Question 2: How well have the instruments in the sound been removed?

• Question 3: How disturbing is the noise?

As we can see in Table 2, the quality of the vocal reconstructed by our model is rated considerably better than that of the baseline model. Although the baseline model removes the instruments better, the noise in our model was not felt to be disturbing.


5. CONCLUSION AND DISCUSSION

We proposed a deeper model than U-Net and obtained better results in both the objective and subjective evaluations. In particular, the subjective results showed that our model's vocal quality is rated much better than the baseline model's. Also, although our model leaves more noise in the vocal sound, listeners rated the vocals as louder and fuller, which makes the noise relatively less disturbing. However, for an even higher quality of separated vocal, we should solve the noise problem and find a more efficient model structure.

The noise problem might be solved by adding another network as proposed in the SEGAN paper [5]. In addition, the attention mechanism is an interesting technique for many deep learning researchers, and Oktay et al. [4] applied an attention mechanism to U-Net for image segmentation. Therefore, we think we can try to apply this technique to the sound source separation task.

We not only evaluated the separated vocals but also observed how the model learned as training progressed. We found that the low-frequency bands are extracted first and that vocal tones in increasingly higher frequency bands are learned gradually as the epochs progress.

6. FUTURE WORK

If we can extract clean sources from songs, there are many fun applications. A mash-up is one of them: a technique to naturally mix the vocal and the instruments from different songs. After separating the vocal, we can also separate the instruments by reversing the phase. Then, we might mix two songs by using beat detection and pitch shifting together.
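A minimal sketch of the phase-reversal idea mentioned above: adding the phase-inverted separated vocal to the mixture cancels the vocal and leaves the instruments (the waveforms are assumed to be aligned and of equal length).

```python
# Sketch of recovering the accompaniment by phase-reversing the separated
# vocal and adding it to the mixture (equivalent to subtraction).
import numpy as np

def accompaniment(mixture_wave: np.ndarray, vocal_wave: np.ndarray) -> np.ndarray:
    """Cancel the separated vocal out of the mixture waveform."""
    return mixture_wave + (-vocal_wave)
```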

Figure 6. An Example of a Mash-up Website: raveDJ

7. AUTHORS’ CONTRIBUTION

We contributed to almost all processes equally. Duyeon Kim, in particular, was in charge of the 'Audio Reconstruction' part and gave the proposal presentation. Jaehoon Oh, in particular, was in charge of the 'Feature Extraction' part and gave the spotlight presentation.

8. REFERENCES

[1] Zhe-Cheng Fan, Yen-Lin Lai, and Jyh-Shing Roger Jang. SVSGAN: Singing voice separation via generative adversarial network. arXiv preprint arXiv:1710.11428, 2017.

[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[3] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. 2017.

[4] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y. Hammerla, Bernhard Kainz, et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.

[5] Santiago Pascual, Antonio Bonafonte, and Joan Serrà. SEGAN: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452, 2017.

[6] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation, December 2017.

[7] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[8] Paris Smaragdis, Cédric Févotte, Gautham J. Mysore, Nasser Mohammadiha, and Matthew Hoffman. Static and dynamic source separation using nonnegative factorizations: A unified view. IEEE Signal Processing Magazine, 31(3):66–75, 2014.

[9] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.