Final Presentation of Project, Gyeong-Tae Lee & Jae-Hak Jeong @ N25 #3229, Jun 13, 2019 1
GCT634 Spring 2019 - Musical Applications of Machine Learning
Property of KAIST, All Rights Reserved.
Gyeong-Tae Lee & Jae-Hak Jeong
[email protected] / [email protected]
Human-Machine Interaction Lab NOVIC+, Department of Mechanical Engineering, KAIST
Singing Voice Separation from Music with U-Net and Performance Evaluation by Genre
Final Presentation of GCT634 Project #4 @ N25 #3229 Paik Nam June Hall, Jun 13, 2019
GCT634 Musical Applications of ML: Briefing of Final Project
- Papers reviewed: 7 (traditional methods: 3 / deep learning methods: 4)
- Datasets collected: 6 (ccMixter / MedleyDB / DSD100 / MUSDB18 / GTZAN / MIR-1K)
- Code implemented: 1 (U-Net)
- Output obtained: training with ccMixter + MUSDB18 / test with MUSDB18
Singing Voice Separation
GCT634 Musical Applications of ML: Introduction
• Singing voice separation estimates what the vocals and the accompaniment would sound like in isolation.
• A clean vocal track is helpful for MIR tasks such as singer identification and lyrics transcription.
• As a commercial application, the karaoke industry would clearly benefit from such technology.
• Traditional Methods- Bayesian Method [1]- REpeating Pattern Extraction Technique (REPET) [2]- Non-negative Matrix Factorization (NMF) [3]
• Deep Learning Methods- Deep Neural Network (DNN) [4]- Convolutional encoder-decoder architectures [5]- Deep U-Net Convolutional Networks (U-Net) [6]
[1] Alexey Ozerov, Pierrick Philippe, Frédéric Bimbot, and Rémi Gribonval. Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs. IEEE Transactions on Audio, Speech, and Language Processing, 15(5):1564-1578, 2007.
[2] Zafar Rafii and Bryan Pardo. REpeating Pattern Extraction Technique (REPET): A simple method for music/voice separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(1):73-84, 2013.
[3] Paris Smaragdis, Cédric Févotte, Gautham J. Mysore, Nasser Mohammadiha, and Matthew Hoffman. Static and dynamic source separation using nonnegative factorizations: A unified view. IEEE Signal Processing Magazine, 31(3):66-75, 2014.
[4] Andrew J. R. Simpson, Gerard Roma, and Mark D. Plumbley. Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network. In International Conference on Latent Variable Analysis and Signal Separation, pages 429-436. Springer, 2015.
[5] Pritish Chandna, Marius Miron, Jordi Janer, and Emilia Gómez. Monoaural audio source separation using deep convolutional neural networks. In International Conference on Latent Variable Analysis and Signal Separation, pages 258-266. Springer, 2017.
[6] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017.
(Figure: estimated masks - U-Net vs. convolutional encoder-decoder)
GCT634 Musical Applications of ML: Vocal Separation with U-Net
• Encoder: encodes the image into a small and deep representation
• Decoder: upsampling layers decode it back to the original image size
• Skip connections: concatenate the layers at the same level, enabling information flow between the high-resolution input and output
Decoder layer: deconv (transposed conv) → batch normalization → plain ReLU
Encoder layer: conv2D → batch normalization → leaky ReLU
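The encoder/decoder stack can be sketched as simple shape bookkeeping. This is a minimal sketch: the 512x128 patch size and the channel progression are illustrative assumptions (the slide does not state them), chosen only to show how stride-2 convolutions shrink the spectrogram and how skip connections widen the decoder's channel dimension.

```python
# Shape walk for a U-Net-style encoder/decoder on a spectrogram patch.
# Patch size and channel counts below are assumed for illustration.
h, w = 512, 128                      # (freq bins, time frames) of one input patch
channels = (16, 32, 64, 128, 256, 512)

encoder_shapes = []
for ch in channels:
    h, w = h // 2, w // 2            # each stride-2 conv halves freq and time
    encoder_shapes.append((ch, h, w))

bottleneck = encoder_shapes[-1]      # small and deep representation

# Decoding mirrors the encoder: each deconv doubles (h, w), and the skip
# connection concatenates the same-level encoder activation channel-wise,
# so each decoder layer convolves over deconv-output + skip channels.
decoder_in_channels = [up + skip for up, skip in
                       zip(channels[:0:-1], channels[-2::-1])]
```

The channel concatenation is why the skip connections let high-resolution detail flow past the bottleneck: the decoder sees both the upsampled deep features and the matching encoder activations.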
Pre-processing
• Downsample to 8192 Hz
• STFT with window size 1024, hop 768; extract patches of 128 frames (~11 s)
• Normalize the spectrogram
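A rough numpy sketch of this pre-processing chain. The Hann window and the max-normalization are assumptions; the slide only specifies the sample rate, window, hop, and patch sizes.

```python
import numpy as np

def stft_mag(signal, win=1024, hop=768):
    """Magnitude spectrogram via a Hann-windowed STFT (simplified sketch)."""
    window = np.hanning(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq_bins, n_frames)

sr = 8192                                  # slide: audio downsampled to 8192 Hz
audio = np.random.default_rng(0).standard_normal(sr * 30)  # stand-in 30 s track
spec = stft_mag(audio)
spec = spec / spec.max()                   # assumed normalization scheme

# Cut fixed-size patches of 128 frames, the network's input size
patch_len = 128
patches = [spec[:, i:i + patch_len]
           for i in range(0, spec.shape[1] - patch_len + 1, patch_len)]
```

With a 1024-sample window, each patch covers 513 frequency bins by 128 frames.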
Training
• Input: vocal & mixture spectrograms / output: soft mask
• Loss function: L1,1 norm of the difference between the target spectrogram and the masked input spectrogram
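The training objective can be sketched as follows, on toy data; the network that actually produces the soft mask is omitted, and the "true ratio" mask here is only a stand-in to show that a perfect mask drives the loss to zero.

```python
import numpy as np

def masked_l1_loss(mask, mix_mag, target_mag):
    """Sum of absolute differences between the mask-applied mixture
    spectrogram and the target vocal spectrogram (the L1,1 norm)."""
    return np.abs(mask * mix_mag - target_mag).sum()

rng = np.random.default_rng(0)
mix = rng.random((513, 128)) + 0.1       # toy mixture magnitude patch
ratio = rng.random((513, 128))           # toy vocal-to-mixture ratio in [0, 1)
vocal = ratio * mix                      # toy target vocal magnitudes

ideal = masked_l1_loss(ratio, mix, vocal)            # perfect soft mask
allpass = masked_l1_loss(np.ones_like(mix), mix, vocal)  # mask that passes everything
```

The all-pass mask leaves the full accompaniment in the estimate, so its loss is strictly larger than the ideal mask's.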
GCT634 Musical Applications of ML: Evaluation
• Quantitative evaluation of singing voice separation performance
• Signal to Distortion / Interference / Artifact Ratio
• Meaning
• SDR: how similar s_est.vocal is to s_ref.vocal
• SIR: how well s_est.vocal is separated from the accompaniment
• SAR: how well the signal is reconstructed from the spectrogram (freedom from artifacts)
Signal decomposition:

s_est.vocal = s_ref.vocal + e_interf + e_noise + e_artif

SDR = 10 log10( ||s_ref.vocal||^2 / ||e_interf + e_noise + e_artif||^2 )
SIR = 10 log10( ||s_ref.vocal||^2 / ||e_interf||^2 )
SAR = 10 log10( ||s_ref.vocal + e_interf + e_noise||^2 / ||e_artif||^2 )
SNR = 10 log10( ||s_ref.vocal + e_interf||^2 / ||e_noise||^2 )

NSDR = SDR(s_est.vocal, s_ref.vocal) - SDR(s_mixture, s_ref.vocal)

where
e_interf = interference: residual of the background music (instruments)
e_artif = artifact: noise due to the reconstruction (e.g., glitches from reusing the STFT phase)
NSDR = normalized SDR: the estimate's SDR minus the mixture's reference SDR
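A hedged numpy sketch of the idea behind these metrics. This simplified version collapses all three error terms into a single projection residual; real BSS Eval (e.g., the `mir_eval.separation` module) decomposes the error into interference, noise, and artifact components.

```python
import numpy as np

def simple_sdr(est, ref):
    """Simplified SDR: project the estimate onto the reference and treat the
    orthogonal remainder as the total error (e_interf + e_noise + e_artif)."""
    s_target = (est @ ref) / (ref @ ref) * ref   # part of est explained by ref
    e_total = est - s_target
    return 10 * np.log10((s_target @ s_target) / (e_total @ e_total))

rng = np.random.default_rng(0)
ref = rng.standard_normal(8192 * 5)              # 5 s reference vocal at 8192 Hz
est = ref + 0.1 * rng.standard_normal(ref.size)  # good estimate (~20 dB SDR)
mix = ref + 1.0 * rng.standard_normal(ref.size)  # noisy "mixture" baseline

nsdr = simple_sdr(est, ref) - simple_sdr(mix, ref)  # improvement over the mixture
```

NSDR subtracts the mixture's SDR so that a positive number means the separator actually improved on doing nothing at all.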
GCT634 Musical Applications of ML: SDR / SAR / SIR
• Conditions
• Training: ccMixter + MUSDB18 (200 songs) / Test: MUSDB18
• Sources: downsampled to 8192 Hz, 47 s long
• SDR computed over 0:05-0:10 (a 5 s window starting at 5 s)
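The evaluation window above amounts to a simple sample-index slice, sketched here (the zero-filled track is a stand-in for an actual test excerpt):

```python
import numpy as np

sr = 8192                       # evaluation sample rate from the slides
start_s, dur_s = 5, 5           # SDR window: 0:05-0:10
track = np.zeros(sr * 47)       # stand-in for a 47 s test source
segment = track[start_s * sr : (start_s + dur_s) * sr]
```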
GCT634 Musical Applications of ML: Best / Worst Case

         Best case                         Worst case
Genre    Singer / Song Writing             Rock
NSDR     20.27 dB                          -6.959 dB
Title    Alexander Ross - Velvet Curtain   Music Delta - Hendrix

(Audio examples on the slide: Mix / Est. Vocal)
GCT634 Musical Applications of ML: Discussion
• Dependency on Genre
• Best genre: singer/song-writing, with low-level accompaniment
→ even here, because the mixture's phase is reused, some jittering remains
• Worst genre: rock, with full accompaniment
• Genre outliers
• Heavy metal: high score because the evaluated segment contains only accompaniment
• Electronic: most audio files consist of accompaniment only
• Reggae: appears to perform well, but the beat remains in the separated vocal
• Minor notes
• MUSDB18: wrong label (from 17 to 118)
• Future research
• U-Net with phase recovery, like Wave-U-Net
• Joint training with vocal detection