Music Processing Applications of Music Processing · 2017. 2. 14.

Music Processing

Christian Dittmar

Lecture

Applications of Music Processing

International Audio Laboratories

Singing Voice Detection

Important prerequisite for:
  Music segmentation
  Music thumbnailing (preview version)
  Singing voice transcription
  Singing voice separation
  Lyrics alignment
  Lyrics recognition


Detect singing voice activity over the course of a recording

Assumptions:
  Real-world, polyphonic music recordings are analyzed
  Singing voice performs dominant melody above accompaniment

[Figure: x-axis: time in seconds, 10 to 45]


Challenges:
  Complex characteristics of singing voice
  Large diversity of accompaniment music
  Accompaniment may play same melody as singing
  Pitch-fluctuating instruments may be similar to singing

[Figure: panels "Stable pitch" and "Fluctuating pitch"]


Common approach:
  Frame-wise extraction of audio features
  Classification via machine learning

[Figure: x-axis: time in seconds, 10 to 45]

Audio Feature Extraction

Frame-wise processing:
  Hopsize Q
  Blocksize K
  Window function w(n)
  Signal frame x(n)

Compute for each analysis frame:
  Time-domain features
  Spectral features
  Cepstral features
  others …
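The frame-wise processing above can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function names and the choice of a Hann window for w(n) are assumptions, and frames that would run past the end of the signal are simply dropped.

```python
import math

def hann_window(K):
    """Periodic Hann window of length K (one common choice for w(n); an assumption here)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / K) for n in range(K)]

def frame_signal(x, K, Q, window=None):
    """Cut x into windowed frames of blocksize K, taken every Q samples (hopsize)."""
    if window is None:
        window = hann_window(K)
    frames = []
    # Only full frames are kept; the last start index is len(x) - K.
    for start in range(0, len(x) - K + 1, Q):
        frames.append([x[start + n] * window[n] for n in range(K)])
    return frames

# Hypothetical test signal: 64 samples of a sinusoid.
signal = [math.sin(2.0 * math.pi * 5.0 * n / 64.0) for n in range(64)]
frames = frame_signal(signal, K=16, Q=8)
print(len(frames), "frames of length", len(frames[0]))
```

With Q smaller than K, consecutive frames overlap, which is the usual setting when features are computed per analysis frame.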


Time-domain features:
  Zero Crossing Rate (ZCR): high-pitched vs. low-pitched
  Linear Prediction Coefficients (LPC): encode spectral envelope
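To illustrate why the zero crossing rate separates high-pitched from low-pitched content, here is a small sketch. The normalization (crossings per adjacent sample pair) and the two test sinusoids are assumptions chosen for the example; other normalizations are in use.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = 0
    for a, b in zip(frame, frame[1:]):
        if (a >= 0.0) != (b >= 0.0):
            crossings += 1
    return crossings / (len(frame) - 1)

# Hypothetical frames: 2 cycles vs. 16 cycles of a sinusoid within 64 samples.
# The small phase offset keeps samples away from exact zeros.
low = [math.sin(2.0 * math.pi * 2.0 * n / 64.0 + 0.3) for n in range(64)]
high = [math.sin(2.0 * math.pi * 16.0 * n / 64.0 + 0.3) for n in range(64)]
print(zero_crossing_rate(low), zero_crossing_rate(high))
```

The higher-frequency frame changes sign more often, so its ZCR is larger.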


Spectral features:
  Spectrogram, linear vs. logarithmic frequency spacing
  Spectral Flatness (SF), Spectral Centroid (SC), and many others …

[Figure: STFT Spectrogram [dB]; x-axis: Time [Sec], y-axis: Frequency [Hz], linearly spaced from 1000 to 11000]

[Figure: Gabor Wavelet Spectrogram [dB]; x-axis: Time [Sec], y-axis: Frequency [Hz], logarithmically spaced from 277 to 8870]
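The difference between linear and logarithmic frequency spacing can be made concrete with a short sketch. The sampling rate, the lowest frequency, and the resolution of 12 bins per octave are hypothetical values chosen for illustration; they are not taken from the figures.

```python
def linear_frequencies(K, fs):
    """Center frequencies in Hz of the K // 2 + 1 non-negative DFT bins."""
    return [k * fs / K for k in range(K // 2 + 1)]

def log_frequencies(f_min, num_bins, bins_per_octave=12):
    """Geometrically spaced center frequencies: each octave holds the same number of bins."""
    return [f_min * 2.0 ** (p / bins_per_octave) for p in range(num_bins)]

lin = linear_frequencies(K=8, fs=8000.0)
log = log_frequencies(f_min=277.0, num_bins=13)
print(lin)
print([round(f, 1) for f in log])
```

Linear spacing keeps a constant difference between neighbouring bins, while logarithmic spacing keeps a constant ratio, which matches musical pitch more closely.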


Cepstral features: Singing voice as an example
  Time-domain signal, convolutive: excitation * filter
    Excitation: vibration of vocal folds
    Filter: resonance of the vocal tract
  Magnitude spectrum, multiplicative: excitation · filter
  Log-magnitude spectrum, additive: excitation + filter
  "Liftering": separation into smooth spectral envelope and fine-structured excitation

[Figure: Extraction of spectral envelope via cepstral liftering; panels: magnitude spectrum and logarithmic magnitude, both over frequency (Hz) from 0 to 2.5 x 10^4; curves: observed spectrum, spectral envelope, excitation spectrum]
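The liftering idea can be sketched with the real cepstrum: take the log-magnitude spectrum, transform it back, keep only the low-quefrency coefficients, and transform forward again to obtain a smooth envelope. The sketch below makes several assumptions: a naive DFT is used only to stay dependency-free (an FFT routine would be used in practice), `num_coeffs` is a hypothetical parameter name, and the test frame is a synthetic impulse train passed through a one-pole filter.

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def spectral_envelope(frame, num_coeffs, floor=1e-10):
    """Return (log-magnitude spectrum, liftered log-envelope), both of length len(frame)."""
    log_mag = [math.log(max(abs(X), floor)) for X in dft(frame)]
    cepstrum = [c.real for c in idft(log_mag)]
    N = len(cepstrum)
    # Keep the low quefrencies and their mirrored counterparts so the result stays real.
    liftered = [c if (q < num_coeffs or q > N - num_coeffs) else 0.0
                for q, c in enumerate(cepstrum)]
    envelope = [X.real for X in dft(liftered)]
    return log_mag, envelope

# Synthetic "voiced" frame: impulse train (excitation) through a one-pole filter.
N = 64
frame, state = [], 0.0
for n in range(N):
    state = (1.0 if n % 8 == 0 else 0.0) + 0.9 * state
    frame.append(state)

log_spectrum, log_envelope = spectral_envelope(frame, num_coeffs=6)
```

Subtracting `log_envelope` from `log_spectrum` leaves the fine-structured part, which corresponds to the excitation in the additive log-magnitude model.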

Machine Learning

Application to audio signals:
  Speech recognition
  Speaker recognition
  Singing voice detection
  Genre classification
  Instrument recognition
  Chord recognition
  etc. …

Machine Learning

Learning principles:
  Unsupervised learning: find structures in data
  Supervised learning: human observer provides "ground truth"
  Semi-supervised learning: combination of above principles
  Reinforcement learning: feedback of "confident" classifications to the training

The Feature Space

Geometric and algebraic interpretation of ML problems
  Features contain numerical values
  Concatenation of several features: dimensionality M
  The data set contains N observations: cardinality N

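Illustrative Example: SFM & SCF of 6 complex tones

Spectral Centroid:

$$\mathrm{SC} = \frac{\sum_{k=0}^{K-1} f(k)\, s(k)}{\sum_{k=0}^{K-1} s(k)}$$

Spectral Flatness:

$$\mathrm{SF} = \frac{\sqrt[K]{\prod_{k=0}^{K-1} s(k)}}{\frac{1}{K} \sum_{k=0}^{K-1} s(k)}$$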
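Both features can be computed directly from a spectrum: the spectral centroid is the frequency average weighted by the spectral values, and the spectral flatness is the ratio of their geometric mean to their arithmetic mean. The sketch below assumes `s` holds non-negative spectral values s(k) and `f` the corresponding bin frequencies f(k); the small `floor` that avoids log(0) is an added assumption.

```python
import math

def spectral_centroid(s, f):
    """Weighted mean of the bin frequencies f(k) with weights s(k)."""
    return sum(fk * sk for fk, sk in zip(f, s)) / sum(s)

def spectral_flatness(s, floor=1e-12):
    """Geometric mean of s(k) divided by its arithmetic mean (between 0 and 1)."""
    K = len(s)
    # The geometric mean is computed in the log domain for numerical robustness.
    geometric_mean = math.exp(sum(math.log(max(sk, floor)) for sk in s) / K)
    arithmetic_mean = sum(s) / K
    return geometric_mean / arithmetic_mean

freqs = [100.0 * k for k in range(8)]          # hypothetical bin frequencies
noise_like = [1.0] * 8                         # flat spectrum
tone_like = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # single spectral peak

print(spectral_centroid(noise_like, freqs), spectral_flatness(noise_like))
print(spectral_centroid(tone_like, freqs), spectral_flatness(tone_like))
```

A flat, noise-like spectrum gives a flatness near 1, while a spectrum dominated by a single peak gives a flatness near 0.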

The Feature Space

Each feature has one value: M = 2
Number of observations: N = 6

Feature matrix (N rows, M columns):

File                SpectralCentroid   SpectralFlatness
lpNoiseTone.wav               258.62               0.59
noiseTone.wav                 512.73               0.99
hpNoiseTone.wav               550.13               0.92
harmonicNoise.wav             146.50               0.27
pianoTone.wav                  47.93               0.01
harmonicTone.wav               43.95               0.01

[Figure: Scatter plot of Spectral Flatness vs. Spectral Centroid; x-axis: Spectral Flatness (0 to 1), y-axis: Spectral Centroid (0 to 600); legend: lpNoiseTone.wav, noiseTone.wav, hpNoiseTone.wav, harmonicNoiseTone.wav, pianoTone.wav, harmonicTone.wav]

The Feature Space

Mapping of features:
  SC to y-axis
  SF to x-axis
  Scatter plot with unnormalized axes

Target class labels: provided by manual annotation

SpectralCentroid   SpectralFlatness   TargetLabels
          258.62               0.59              0
          512.73               0.99              0
          550.13               0.92              0
          146.50               0.27              1
           47.93               0.01              1
           43.95               0.01              1
               ⋮                  ⋮

Classification methods

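k-Nearest Neighbours (kNN)

[Figure legend: Singing Voice, Accompaniment, Unknown data]

Distance measures between two feature vectors x and y of dimensionality M:

L1-Dist. (Manhattan):

$$d_1(x, y) = \sum_{m=1}^{M} |x_m - y_m|$$

L2-Dist. (Euclidean):

$$d_2(x, y) = \sqrt{\sum_{m=1}^{M} (x_m - y_m)^2}$$

L∞-Dist. (Maximum):

$$d_\infty(x, y) = \max\left(|x_1 - y_1|, \ldots, |x_M - y_M|\right)$$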
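A kNN classifier with the three distance measures can be sketched in a few lines. The training data are the six observations (Spectral Centroid, Spectral Flatness) and target labels from the feature-space example; the query points, the value k = 3, and the function names are assumptions made for illustration.

```python
from collections import Counter

def l1_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def l2_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def linf_distance(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def knn_classify(query, features, labels, k=3, distance=l2_distance):
    """Majority vote among the k training observations closest to the query."""
    ranked = sorted(zip(features, labels), key=lambda pair: distance(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

features = [(258.62, 0.59), (512.73, 0.99), (550.13, 0.92),
            (146.50, 0.27), (47.93, 0.01), (43.95, 0.01)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_classify((500.0, 0.90), features, labels))
print(knn_classify((50.0, 0.05), features, labels))
```

Because the features are used unnormalized here, the Spectral Centroid (values in the hundreds) dominates every distance; normalizing each feature dimension before classification is a common remedy.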



Decision Trees (DT)

Random Forests (RF)

Gaussian Mixture Models (GMM)
  Weighted sum of Gauss components

Support Vector Machines (SVM)
  Decision via the sign function sgn(·)

Deep Neural Networks (DNN)
  Training via a loss function


Further methods:
  Hidden Markov Models: transition probabilities between GMMs
  Sparse Representation Classifier: sparse linear combination of training data
  Boosting: combine many weak classifiers
  Convolutional Neural Networks
  Recurrent Neural Networks
  Multiple Kernel Learning
  others …

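Mel-scale Frequency Cepstral Coefficients

[Figure: block diagram with the blocks "Frame" and "Filter Bank", yielding the feature vector $x_t$]

Gaussian Mixture Model (GMM)

$$p(x \mid \lambda) = \sum_{g=1}^{G} w_g \, \mathcal{N}(x; \mu_g, \Sigma_g)$$

with weights $w_1, w_2, \ldots, w_G$ and Gauss components $(\mu_1, \Sigma_1), (\mu_2, \Sigma_2), \ldots, (\mu_G, \Sigma_G)$; $\lambda$ denotes the set of all model parameters.

[Figure: sequence of frame-wise labels "V" and "N"]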
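Evaluating a GMM for one feature vector amounts to a weighted sum of Gaussian densities. The sketch below assumes diagonal covariance matrices (each Σ is given as a list of variances), which is a simplification not stated in the slides, and works in the log domain with the log-sum-exp trick to avoid numerical underflow.

```python
import math

def gaussian_log_density(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (variances in `var`)."""
    log_det = sum(math.log(v) for v in var)
    mahalanobis = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (len(x) * math.log(2.0 * math.pi) + log_det + mahalanobis)

def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of Gauss components, evaluated at x."""
    logs = [math.log(w) + gaussian_log_density(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(value - peak) for value in logs))

# Hypothetical two-component model in a two-dimensional feature space.
weights = [0.4, 0.6]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]
print(gmm_log_likelihood([2.8, 3.1], weights, means, variances))
```

For classification, one such model is typically trained per class and the class whose model gives the higher log-likelihood is chosen.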

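Segment-by-Segment Classification

$$\sum_{i=0}^{W-1} \log p(x_{tW+i} \mid \lambda_S) \;\gtrless\; \sum_{i=0}^{W-1} \log p(x_{tW+i} \mid \lambda_M)$$

where $\lambda_S$ and $\lambda_M$ denote the singing and accompaniment models. Decision for segment $t$ of $W$ frames: Singing if the left-hand side is larger, Accompaniment otherwise.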
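Segment-wise classification sums the frame-wise log-likelihoods of each model over a block of W frames and compares the two sums. The sketch below assumes non-overlapping segments and takes precomputed frame log-likelihoods as plain lists; the numbers are made up to show how a single outlier frame is smoothed out.

```python
def classify_segments(loglik_singing, loglik_accomp, W):
    """Compare summed frame log-likelihoods over non-overlapping segments of W frames."""
    num_segments = min(len(loglik_singing), len(loglik_accomp)) // W
    decisions = []
    for t in range(num_segments):
        sum_s = sum(loglik_singing[t * W + i] for i in range(W))
        sum_m = sum(loglik_accomp[t * W + i] for i in range(W))
        decisions.append("Singing" if sum_s > sum_m else "Accompaniment")
    return decisions

# Hypothetical frame-wise log-likelihoods under the two models (8 frames).
loglik_singing = [-1.0, -1.0, -4.0, -1.0, -4.0, -4.0, -4.0, -4.0]
loglik_accomp = [-2.0, -2.0, -2.0, -2.0, -1.0, -1.0, -1.0, -1.0]
print(classify_segments(loglik_singing, loglik_accomp, W=4))
```

The third frame on its own favours the accompaniment model, but the segment-level sum still decides for singing, which illustrates the smoothing effect of the segment length W.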


Audio Mosaicing
  Source signal: Bees
  Target signal: Beatles–Let it be
  Mosaic signal: Let it Bee

NMF-Inspired Audio Mosaicing

Non-negative matrix factorization (NMF):
  Non-negative matrix (fixed) ≈ Components (learned) · Activations (learned)

Proposed audio mosaicing approach:
  Target's spectrogram (fixed) ≈ Source's spectrogram (fixed) · Activations (learned) = Mosaic's spectrogram

[Driedger et al. ISMIR 2015]

[Figure: target spectrogram (frequency over time target), source spectrogram (frequency over time source), activation matrix (time source over time target)]

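Basic NMF-Inspired Audio Mosaicing

Spectrogram target ≈ Spectrogram source · Activation matrix = Spectrogram mosaic

[Figure: spectrogram target and spectrogram mosaic (frequency over time target), spectrogram source (frequency over time source), activation matrix (time source over time target)]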
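The basic variant keeps the source spectrogram fixed and learns only the activation matrix. The sketch below assumes the standard multiplicative update for the Euclidean distance (Lee and Seung style); the slides do not state which cost function or update rule is used, so this is one plausible choice rather than the published procedure. Matrices are plain nested lists, with `W` holding source frames as columns and `V` the target spectrogram.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def learn_activations(V, W, num_iter=200, eps=1e-12):
    """Keep W fixed and learn non-negative H such that W * H approximates V."""
    num_source_frames = len(W[0])
    num_target_frames = len(V[0])
    H = [[1.0] * num_target_frames for _ in range(num_source_frames)]
    Wt = transpose(W)
    WtV = matmul(Wt, V)   # constant numerator of the update
    WtW = matmul(Wt, W)
    for _ in range(num_iter):
        WtWH = matmul(WtW, H)
        # Multiplicative update: entries stay non-negative by construction.
        H = [[h * n / (d + eps) for h, n, d in zip(h_row, n_row, d_row)]
             for h_row, n_row, d_row in zip(H, WtV, WtWH)]
    return H

# Toy example: 3 frequency bins, 2 source frames, 3 target frames.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H_true = [[2.0, 1.0, 1.0], [1.0, 3.0, 1.0]]
V = matmul(W, H_true)

H = learn_activations(V, W)
mosaic = matmul(W, H)     # spectrogram of the mosaic
```

In this toy case the target can be represented exactly by the source frames, so the mosaic spectrogram matches the target; with real recordings the product is only an approximation.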


Core idea: support the development of sparse diagonal activation structures
  Iterative updates of the activation matrix
  Preserve temporal context

[Figure: spectrogram target, spectrogram source, and activation matrix (time source over time target), shown in several successive panels]



Audio Mosaicing
  Source signal: Whales
  Target signal: Chic–Good times
  Mosaic signal

Audio Mosaicing
  Source signal: Race car
  Target signal: Adele–Rolling in the Deep
  Mosaic signal

https://www.audiolabs-erlangen.de/resources/MIR/2015-ISMIR-LetItBee

Drum Source Separation

[Figure: waveform (relative amplitude over time in seconds, 2 to 3.8) and log-frequency spectrogram V, linked by STFT and iSTFT]

Drum Source Separation: Signal Model

Drum Sound Separation: Decomposition via NMFD
  Score-based information (drum notation)
  Audio-based information (training drum sounds)

[Figure: log-frequency spectrograms over time (seconds, 0 to 6.5), rows of H, and lateral slices from W]
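The role of the rows of H and the lateral slices of W can be illustrated with the forward model of non-negative matrix factor deconvolution: every drum has a short spectrogram template that is placed at each onset given by its activation row. The sketch below shows only this reconstruction step, under the assumption that `templates[r]` is the template of drum r (frequency bins by template frames) and `H[r]` its activation row; it does not cover the update rules or how score-based and audio-based information enter.

```python
def nmfd_model(templates, H):
    """V_hat[k][n] = sum over drums r and lags tau of templates[r][k][tau] * H[r][n - tau]."""
    num_bins = len(templates[0])
    num_frames = len(H[0])
    V_hat = [[0.0] * num_frames for _ in range(num_bins)]
    for template, activation in zip(templates, H):
        for k, pattern_row in enumerate(template):
            for tau, w in enumerate(pattern_row):
                # An activation at frame n - tau contributes template column tau at frame n.
                for n in range(tau, num_frames):
                    V_hat[k][n] += w * activation[n - tau]
    return V_hat

# Toy example: one drum, 2 frequency bins, template length 2, onsets at frames 0 and 3.
templates = [[[1.0, 0.5],
              [0.0, 2.0]]]
H = [[1.0, 0.0, 0.0, 1.0]]
V_hat = nmfd_model(templates, H)
print(V_hat)
```

Restricting the sum to a single drum r gives that drum's share of the spectrogram, which is the basis for separating the individual drum sounds.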

Drum Sound Separation

https://www.audiolabs-erlangen.de/resources/MIR/2016-IEEE-TASLP-DrumSeparation

[Figure: waveforms; x-axis: time (seconds), y-axis: relative amplitude]
