Music Processing Applications of Music Processing · 2017. 2. 14. · Gabor Wavelet Spectrogram...

Music Processing

Christian Dittmar

Lecture

Applications of Music Processing

International Audio Laboratories Erlangen

christian.dittmar@audiolabs-erlangen.de

Singing Voice Detection

Important prerequisite for: music segmentation, music thumbnailing (preview version), singing voice transcription, singing voice separation, lyrics alignment, lyrics recognition.

Singing Voice Detection

Goal: detect singing voice activity over the course of a recording.

Assumptions: real-world, polyphonic music recordings are analyzed; the singing voice performs the dominant melody above the accompaniment.

[Figure: example recording; horizontal axis: time in seconds (10 to 45)]

Singing Voice Detection

Challenges: complex characteristics of the singing voice; large diversity of accompaniment music; the accompaniment may play the same melody as the singing voice; pitch-fluctuating instruments may be similar to singing.

[Figure: two panels, stable pitch vs. fluctuating pitch]

Singing Voice Detection

Common approach: frame-wise extraction of audio features, followed by classification via machine learning.

[Figure: example recording; horizontal axis: time in seconds (10 to 45)]

Audio Feature Extraction

Frame-wise processing: hopsize Q, blocksize K, window function w(n), signal frame x(n).

Compute for each analysis frame: time-domain features, spectral features, cepstral features, others …
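The frame-wise processing described above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the helper names `hann` and `frames` and the choice of a Hann window for w(n) are assumptions, and only the symbols K (blocksize) and Q (hopsize) come from the slide.

```python
import math

def hann(K):
    """Periodic Hann window of length K (one common choice for w(n); an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / K) for n in range(K)]

def frames(x, K, Q, w=None):
    """Yield windowed signal frames of blocksize K, advancing by hopsize Q samples."""
    w = w if w is not None else hann(K)
    for start in range(0, len(x) - K + 1, Q):
        yield [x[start + n] * w[n] for n in range(K)]

# Toy signal: 64 samples of a sinusoid, cut into frames of K=16 with hopsize Q=8.
signal = [math.sin(2.0 * math.pi * 5.0 * n / 64.0) for n in range(64)]
framed = list(frames(signal, K=16, Q=8))
print(len(framed), "frames of length", len(framed[0]))
```

With Q smaller than K the frames overlap, which is the usual setting; each frame would then be passed to the feature extractors.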

Audio Feature Extraction

Time-domain features:

Zero Crossing Rate (ZCR): distinguishes high-pitched vs. low-pitched signals.

Linear Prediction Coefficients (LPC): encode the spectral envelope.
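As a hedged illustration of why the ZCR separates high-pitched from low-pitched content, the sketch below counts sign changes between neighbouring samples of a frame. The function name `zero_crossing_rate`, the sampling rate, and the two test tones are illustrative assumptions.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)

fs = 8000  # assumed sampling rate in Hz
# Small phase offset avoids samples that fall exactly on zero.
low = [math.sin(2.0 * math.pi * 100.0 * n / fs + 0.1) for n in range(800)]
high = [math.sin(2.0 * math.pi * 2000.0 * n / fs + 0.1) for n in range(800)]

zcr_low = zero_crossing_rate(low)
zcr_high = zero_crossing_rate(high)
print(zcr_low, zcr_high)
```

The 2000 Hz tone changes sign far more often per sample than the 100 Hz tone, so its ZCR is much larger.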

Audio Feature Extraction

Spectral features: spectrogram with linear vs. logarithmic frequency spacing; Spectral Flatness (SF), Spectral Centroid (SC), and many others …

[Figure: STFT spectrogram in dB; time in seconds (0.5 to 2) vs. frequency in Hz on a linear axis (1000 to 11000)]

[Figure: Gabor wavelet spectrogram in dB; time in seconds (0.5 to 2) vs. frequency in Hz on a logarithmic axis (277 to 8870)]
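The difference between linear and logarithmic frequency spacing can be made concrete with a small sketch. The helper names are hypothetical; the end points are taken from the axis labels of the two spectrograms, and the assumption is that the logarithmic axis uses a constant ratio between neighbouring frequencies.

```python
def linear_frequencies(f_min, f_max, count):
    """Equally spaced frequencies: constant difference between neighbours."""
    step = (f_max - f_min) / (count - 1)
    return [f_min + i * step for i in range(count)]

def log_frequencies(f_min, f_max, count):
    """Geometrically spaced frequencies: constant ratio between neighbours."""
    ratio = (f_max / f_min) ** (1.0 / (count - 1))
    return [f_min * ratio ** i for i in range(count)]

lin_spaced = linear_frequencies(1000.0, 11000.0, 11)
log_spaced = log_frequencies(277.0, 8870.0, 10)
print([round(f) for f in lin_spaced])
print([round(f) for f in log_spaced])
```

Under this assumption the geometric series lands within a few hertz of the tick labels on the Gabor wavelet spectrogram axis.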

Audio Feature Extraction

Cepstral features, with the singing voice as an example:

Time domain, convolutive: excitation * filter. Excitation: vibration of the vocal folds. Filter: resonance of the vocal tract.

Magnitude spectrum, multiplicative: excitation · filter.

Log-magnitude spectrum, additive: excitation + filter.

“Liftering”: separation into a smooth spectral envelope and a fine-structured excitation.

[Figure: extraction of the spectral envelope via cepstral liftering. Top panel: magnitude spectrum; bottom panel: logarithmic magnitude over frequency (Hz, 0 to 2.5 x 10^4), showing the observed spectrum, the spectral envelope, and the excitation spectrum]
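A minimal sketch of cepstral liftering follows, under strong simplifying assumptions. Instead of a real recording it uses a synthetic log-magnitude spectrum built as the sum of a slowly varying envelope and a rapidly varying excitation ripple; the naive `dft` helper, the function `lifter_envelope`, and the cutoff value are illustrative choices.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short toy inputs)."""
    N = len(x)
    sign = 1.0 if inverse else -1.0
    out = [
        sum(x[k] * cmath.exp(sign * 2j * math.pi * k * n / N) for k in range(N))
        for n in range(N)
    ]
    return [v / N for v in out] if inverse else out

def lifter_envelope(log_mag, cutoff):
    """Keep only low-quefrency cepstral coefficients and transform back."""
    N = len(log_mag)
    cepstrum = dft(log_mag, inverse=True)
    low = [c if (n <= cutoff or n >= N - cutoff) else 0.0
           for n, c in enumerate(cepstrum)]
    return [v.real for v in dft(low)]

N = 64
envelope = [math.cos(2.0 * math.pi * k / N) for k in range(N)]                # smooth part
excitation = [0.5 * math.cos(2.0 * math.pi * 16 * k / N) for k in range(N)]  # fine structure
log_mag = [e + x for e, x in zip(envelope, excitation)]                       # additive in the log domain

estimated_envelope = lifter_envelope(log_mag, cutoff=8)
estimated_excitation = [lm - e for lm, e in zip(log_mag, estimated_envelope)]
```

Because the two parts are additive in the log-magnitude domain and occupy different quefrency ranges, zeroing the high-quefrency coefficients recovers the envelope, and the remainder is the excitation.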

Machine Learning

Application to audio signals: speech recognition, speaker recognition, singing voice detection, genre classification, instrument recognition, chord recognition, etc. …

Machine Learning

Learning principles:

Unsupervised learning: find structures in the data.

Supervised learning: a human observer provides the “ground truth”.

Semi-supervised learning: combination of the above principles.

Reinforcement learning: feedback of “confident” classifications to the training.

The Feature Space

Geometric and algebraic interpretation of ML problems.

Features contain numerical values.

Concatenation of several features: dimensionality M.

The data set contains N observations: cardinality N.

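Illustrative Example: SFM & SCF of 6 complex tones

\[
\mathrm{SC} = \frac{\sum_{k=0}^{K-1} f(k)\, s(k)}{\sum_{k=0}^{K-1} s(k)}
\]

\[
\mathrm{SF} = \frac{\sqrt[K]{\prod_{k=0}^{K-1} s(k)}}{\frac{1}{K} \sum_{k=0}^{K-1} s(k)}
\]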
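The two spectral features on this slide can be sketched as follows. The code assumes that s(k) denotes a strictly positive spectral magnitude at bin k and f(k) the frequency assigned to that bin; the slide does not spell this out, and the toy spectra are invented for illustration.

```python
import math

def spectral_centroid(s, f):
    """Magnitude-weighted mean frequency: sum f(k) s(k) / sum s(k)."""
    return sum(fk * sk for fk, sk in zip(f, s)) / sum(s)

def spectral_flatness(s):
    """Geometric mean of s(k) divided by its arithmetic mean (requires s(k) > 0)."""
    K = len(s)
    geometric_mean = math.exp(sum(math.log(sk) for sk in s) / K)
    arithmetic_mean = sum(s) / K
    return geometric_mean / arithmetic_mean

f = [100.0 * k for k in range(8)]   # assumed bin frequencies in Hz
flat = [1.0] * 8                    # noise-like: all bins equally strong
peaky = [1e-6] * 8                  # tone-like: a single dominant bin
peaky[2] = 1.0

sc_flat, sf_flat = spectral_centroid(flat, f), spectral_flatness(flat)
sc_peak, sf_peak = spectral_centroid(peaky, f), spectral_flatness(peaky)
print(sc_flat, sf_flat, sc_peak, sf_peak)
```

A flat spectrum yields a flatness of 1 and a centroid in the middle of the frequency range, whereas a single dominant bin yields a flatness near 0 and a centroid near that bin's frequency.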

The Feature Space

Each feature has one value: M = 2

Number of observations: N = 6

Feature matrix (N rows, M columns):

File                 SpectralCentroid   SpectralFlatness
lpNoiseTone.wav      258.62             0.59
noiseTone.wav        512.73             0.99
hpNoiseTone.wav      550.13             0.92
harmonicNoise.wav    146.50             0.27
pianoTone.wav         47.93             0.01
harmonicTone.wav      43.95             0.01

[Figure: scatter plot of Spectral Flatness (x-axis, 0 to 1) vs. Spectral Centroid (y-axis, 0 to 600) for the six files]

The Feature Space

Each feature has one value: M = 2

Number of observations: N = 6

Mapping of features: SC to the y-axis, SF to the x-axis; scatter plot with unnormalized axes.

Target class labels: provided by manual annotation.

SpectralCentroid   SpectralFlatness   TargetLabels
258.62             0.59               0
512.73             0.99               0
550.13             0.92               0
146.50             0.27               1
 47.93             0.01               1
 43.95             0.01               1
⋮                  ⋮

Classification methods

k-Nearest Neighbours (kNN)

[Figure: feature space with singing voice, accompaniment, and unknown data points]

Classification methods

k-Nearest Neighbours (kNN)

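L1 distance (Manhattan):

\[
d_1(\mathbf{x}, \mathbf{y}) = \sum_{m=1}^{M} |x_m - y_m|
\]

L2 distance (Euclidean):

\[
d_2(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{m=1}^{M} (x_m - y_m)^2}
\]

L∞ distance (Maximum):

\[
d_\infty(\mathbf{x}, \mathbf{y}) = \max\left( |x_1 - y_1|, \ldots, |x_M - y_M| \right)
\]

[Figure: feature space with singing voice, accompaniment, and unknown data points]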
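The three distance measures and the kNN decision rule can be sketched as below. The feature values and target labels are the six observations from the feature-space example; the function names, the query points, and the choice k = 3 are assumptions. The features are deliberately left unnormalized, so the Spectral Centroid dominates every distance; in practice one would typically rescale the features first.

```python
from collections import Counter

def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """L2 distance: square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def maximum(x, y):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def knn_classify(query, features, labels, k=3, distance=euclidean):
    """Majority vote among the k training observations closest to the query."""
    ranked = sorted(zip(features, labels), key=lambda pair: distance(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# (SpectralCentroid, SpectralFlatness) and target labels from the example table.
features = [(258.62, 0.59), (512.73, 0.99), (550.13, 0.92),
            (146.50, 0.27), (47.93, 0.01), (43.95, 0.01)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_classify((100.0, 0.10), features, labels))  # unknown point near the tonal sounds
print(knn_classify((400.0, 0.80), features, labels))  # unknown point near the noisy sounds
```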

Classification methods

Decision Trees (DT)

[Figure: feature space with singing voice, accompaniment, and unknown data points]

Classification methods

Random Forests (RF)

Classification methods

Gaussian Mixture Models (GMM)

[Figure: feature space with singing voice, accompaniment, and unknown data points, with the Gauss components of the mixture models overlaid]

Classification methods

Support Vector Machines (SVM)

[Figure: feature space with singing voice, accompaniment, and unknown data points; the decision is taken via the sign function sgn(·)]

Classification methods

Deep Neural Networks (DNN)

[Figure: feature space with singing voice, accompaniment, and unknown data points; network training is driven by a loss function]

Classification methods

Further methods:

Hidden Markov Models: transition probabilities between GMMs.

Sparse Representation Classifier: sparse linear combination of training data.

Boosting: combine many weak classifiers.

Convolutional Neural Networks, Recurrent Neural Networks, Multiple Kernel Learning, others …


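Processing chain: frame, filter bank, Mel-scale Frequency Cepstral Coefficients (feature vector \(\mathbf{x}_t\)), Gaussian Mixture Model (GMM) with components \((\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), (\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2), \ldots, (\boldsymbol{\mu}_G, \boldsymbol{\Sigma}_G)\) and weights \(w_1, w_2, \ldots, w_G\):

\[
p(\mathbf{x} \mid \Theta) = \sum_{g=1}^{G} w_g \, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)
\]

Frame-wise labels: V V V V V N N N N N N N N V V N N N N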

Segment-by-Segment Classification

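\[
\sum_{i=0}^{W-1} \log p(\mathbf{x}_{tW+i} \mid \Theta_S) \;\underset{\text{Accompaniment}}{\overset{\text{Singing}}{\gtrless}}\; \sum_{i=0}^{W-1} \log p(\mathbf{x}_{tW+i} \mid \Theta_M)
\]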
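The segment-wise decision can be sketched as follows: for each segment of W frames, the log-likelihoods under a singing model and an accompaniment model are summed and compared. Everything concrete here is an assumption made for illustration: diagonal covariances, one-dimensional toy features in place of MFCC vectors, hand-picked model parameters instead of trained ones, and the labels "V" and "N" for the two outcomes.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | model) for a GMM, evaluated with the log-sum-exp trick."""
    terms = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify_segments(frames, gmm_s, gmm_m, W):
    """Compare summed log-likelihoods over non-overlapping segments of W frames."""
    decisions = []
    for start in range(0, len(frames) - W + 1, W):
        segment = frames[start:start + W]
        score_s = sum(gmm_log_likelihood(x, *gmm_s) for x in segment)
        score_m = sum(gmm_log_likelihood(x, *gmm_m) for x in segment)
        decisions.append("V" if score_s > score_m else "N")
    return decisions

# Hypothetical models: (weights, means, variances) per class.
gmm_singing = ([0.5, 0.5], [[1.0], [2.0]], [[0.25], [0.25]])
gmm_accompaniment = ([1.0], [[-1.0]], [[0.25]])
frames = [[1.1], [1.9], [0.2], [-0.9], [-1.2], [0.1]]

per_frame = classify_segments(frames, gmm_singing, gmm_accompaniment, W=1)
per_segment = classify_segments(frames, gmm_singing, gmm_accompaniment, W=3)
print(per_frame, per_segment)
```

In this toy run the last frame taken alone slightly favours the singing model, but within its segment of three frames the accompaniment model wins, which shows how the summation smooths out isolated frame-level decisions.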

Singing Voice Detection

Audio Mosaicing

Source signal: Bees

Target signal: Beatles – Let it be

Mosaic signal: Let it Bee

NMF-Inspired Audio Mosaicing

Non-negative matrix factorization (NMF): non-negative matrix (fixed) ≈ components (learned) · activations (learned)

Proposed audio mosaicing approach: target’s spectrogram (fixed) ≈ source’s spectrogram (fixed) · activations (learned) = mosaic’s spectrogram

[Driedger et al. ISMIR 2015]

[Figure: the target's spectrogram (frequency vs. time target) is approximated by the product of the source's spectrogram (frequency vs. time source) and the activation matrix (time source vs. time target)]
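The idea of keeping the source spectrogram fixed as the component matrix and learning only the activations can be sketched with a standard multiplicative NMF update. This is a minimal sketch under assumptions: it uses the well-known update rule for the Kullback-Leibler divergence, the slides do not state which cost function is used, and the function names and the tiny matrices are invented for illustration.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def learn_activations(V, W, iterations=200, eps=1e-12):
    """Learn H >= 0 such that V is approximated by W H, with W kept fixed.

    Multiplicative update for the KL divergence:
    H <- H * (W^T (V / (W H))) / (W^T 1)
    """
    K, M = len(W[0]), len(V[0])
    H = [[1.0] * M for _ in range(K)]
    Wt = [list(col) for col in zip(*W)]
    col_sums = [sum(col) + eps for col in Wt]  # W^T 1
    for _ in range(iterations):
        WH = matmul(W, H)
        ratio = [[v / (wh + eps) for v, wh in zip(v_row, wh_row)]
                 for v_row, wh_row in zip(V, WH)]
        numer = matmul(Wt, ratio)
        H = [[h * n / col_sums[k] for h, n in zip(H[k], numer[k])]
             for k in range(K)]
    return H

# Toy "source spectrogram" (3 frequency bins x 2 source frames), kept fixed.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Toy "target spectrogram" (3 frequency bins x 3 target frames).
V = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [1.0, 1.5, 1.0]]

H = learn_activations(V, W)   # learned activations (source frames x target frames)
mosaic = matmul(W, H)         # mosaic spectrogram built from source frames only
```

Each column of H states how strongly each source frame is used to imitate the corresponding target frame; the mosaic spectrogram is then composed purely of source material.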

Basic NMF-Inspired Audio Mosaicing

[Figure: spectrogram target (frequency vs. time target) ≈ spectrogram source (frequency vs. time source) · activation matrix (time source vs. time target) = spectrogram mosaic (frequency vs. time target)]


Core idea: support the development of sparse diagonal structures in the activation matrix during the iterative updates, so that the temporal context of the source is preserved.
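One plausible way to favour diagonal activation structures is to sum each activation with its neighbours along the diagonal between update iterations, so that runs of consecutive source frames used in consecutive target frames reinforce each other. The sketch below illustrates only this single step; the function name, the `context` parameter, and the toy matrix are assumptions, and the exact procedure in the cited paper may differ.

```python
def enhance_diagonals(H, context=1):
    """Sum each activation with its neighbours along the main-diagonal direction."""
    K, M = len(H), len(H[0])
    out = [[0.0] * M for _ in range(K)]
    for k in range(K):
        for m in range(M):
            for i in range(-context, context + 1):
                if 0 <= k + i < K and 0 <= m + i < M:
                    out[k][m] += H[k + i][m + i]
    return out

# Toy activation matrix (source frames x target frames):
# a diagonal run plus one isolated activation in the corner.
H = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

enhanced = enhance_diagonals(H, context=1)
for row in enhanced:
    print(row)
```

Entries that lie on a diagonal run grow relative to the isolated entry, which keeps its original value; repeated over the iterations, this pushes the activations toward diagonal stripes.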


Audio Mosaicing

Source signal: Whales

Target signal: Chic – Good times

Mosaic signal

Audio Mosaicing

Source signal: Race car

Target signal: Adele – Rolling in the Deep

Mosaic signal

https://www.audiolabs-erlangen.de/resources/MIR/2015-ISMIR-LetItBee

Drum Source Separation

[Figure: waveform (relative amplitude vs. time in seconds, 2 to 3.8) and log-frequency spectrograms labeled V, connected by STFT and iSTFT]

Drum Source Separation: Signal Model

Drum Sound Separation: Decomposition via NMFD

[Figure: NMFD decomposition of a log-frequency spectrogram over time (seconds, 0 to 6.5), showing the rows of H and the lateral slices from W. Side information: score-based information (drum notation) and audio-based information (training drum sounds)]

Drum Sound Separation

https://www.audiolabs-erlangen.de/resources/MIR/2016-IEEE-TASLP-DrumSeparation

[Figure: waveform; relative amplitude vs. time (seconds)]
