Vocal Melody Extraction from Polyphonic Audio with Pitched Accompaniment. Vishweshwara Rao (05407001), Ph.D. Defense. Guide: Prof. Preeti Rao. June 2011. Department of Electrical Engineering, Indian Institute of Technology Bombay.


  • Vocal Melody Extraction from Polyphonic Audio with Pitched

    Accompaniment

    Vishweshwara Rao (05407001), Ph.D. Defense

    Guide: Prof. Preeti Rao

    (June 2011)
    Department of Electrical Engineering

    Indian Institute of Technology Bombay

  • Department of Electrical Engineering, IIT Bombay (slide 2 of 40)

    OUTLINE

    Introduction: Objective, background, motivation, approaches & issues; Indian music

    Proposed melody extraction system: Design; Evaluation

    Problems: Competing pitched accompanying instrument

    Enhancements for increasing robustness to pitched accompaniment:
      Dual-F0 tracking
      Identification of vocal segments by a combination of static and dynamic features
      Signal-sparsity-driven window length adaptation

    Graphical User Interface for melody extraction

    Conclusions and Future work


    INTRODUCTION: Objective

    Vocal melody extraction from polyphonic audio
      Polyphony: multiple musical sound sources present
      Vocal: the lead melodic instrument is the singing voice
      Melody: a sequence of notes (note frequency vs. time); a symbolic representation of music; here, the pitch contour of the singing voice


    Pitch: a perceptual attribute of sound, closely related to periodicity or fundamental frequency (F0)

    [Figure: example waveforms with F0 = 1/T0 = 100 Hz and F0 = 1/T0 = 300 Hz; a vocal pitch contour]


    INTRODUCTION: Motivation, Complexity and Approaches

    Motivation
      Music Information Retrieval applications: query-by-singing/humming (QBSH), artist ID, cover-song ID
      Music edutainment: singing learning, karaoke creation
      Musicology

    Problem complexity
      Singing: large F0 range, pitch dynamics; diversity (inter-singer, across cultures)
      Polyphony: crowded signal; percussive & tonal instruments

    Approaches: understanding without separation; source separation [Lag08]; classification [Pol05]

    System chain: Polyphonic audio signal -> Signal representation -> Multi-F0 analysis -> Predominant-F0 trajectory extraction -> Voicing detection -> Voice F0 contour


    INTRODUCTION: Indian classical music: Signal characteristics

    [Spectrogram, 0-20 sec, 0-2000 Hz, with labeled sources: tabla strokes "Ghe", "Na", "Tun" (percussion); Tanpura (drone); Singer; Harmonium (secondary melody)]


    INTRODUCTION: Melody extraction in Indian classical music

    Issues
      Signal complexity: singing, polyphony
      Variable tonic
      Non-availability of ground-truth data: almost completely improvised (no universally accepted notation)

    Example: tabla strokes "Thit", "Ke", "Tun"


    SYSTEM DESIGN: Our Approach

    Design considerations: singing; robustness to pitched accompaniment; flexibility

    System chain: Polyphonic audio signal -> Signal representation -> Multi-F0 analysis -> Predominant-F0 trajectory extraction -> Singing voice detection -> Voice F0 contour


    SYSTEM DESIGN: Signal Representation

    [Figure: frequency transform (magnitude in dB) of a 40 ms Hamming window]

    Short-time Fourier transform of the windowed signal x(m)w(n-m):
      X(n, k) = sum_{m=0}^{M-1} x(m) w(n-m) e^{-j 2 pi k m / M}

    Frequency-domain representation: pitched sounds have harmonic spectra; short-time analysis and DFT
      Window length chosen to resolve harmonics of the minimum expected F0

    Sinusoidal representation: more compact & relevant
      Different methods of sinusoid identification: magnitude-based, phase-based, main-lobe matching
      Main-lobe matching (sinusoidality) [Grif88] found to be the most reliable: the frequency transform of the window has a known shape, and local peaks whose shape closely matches the window main lobe are declared sinusoids
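The main-lobe matching test can be sketched in code. This is a minimal illustration, not the thesis implementation: the sinusoidality score here is a cosine similarity between a local spectral peak and the sampled Hamming-window main lobe, and all function names and the `width` parameter are assumptions of the sketch.

```python
import numpy as np

def hamming_mainlobe(n_fft, win_len, width=5):
    """Magnitude of the Hamming window's frequency transform near DC:
    the 'known shape' that spectral peaks are matched against."""
    w = np.hamming(win_len)
    W = np.abs(np.fft.rfft(w, n_fft))
    return W[:width] / W[0]  # normalized main-lobe samples

def sinusoidality(mag_spec, peak_bin, lobe):
    """Cosine similarity between a local spectral peak and the ideal
    symmetric window main lobe; values near 1 indicate a sinusoid."""
    half = len(lobe) - 1
    seg = mag_spec[peak_bin - half: peak_bin + half + 1]
    ideal = np.concatenate([lobe[::-1], lobe[1:]])  # mirror the half-lobe
    seg = seg / np.max(seg)
    return float(np.dot(seg, ideal) /
                 (np.linalg.norm(seg) * np.linalg.norm(ideal)))
```

A windowed pure tone scores close to 1 under this test, so a threshold such as the 0.8 used later for "well-formed sinusoids" separates genuine partials from noise peaks.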


    SYSTEM DESIGN: Multi-F0 Analysis

    Objective: to reliably detect the voice F0 in polyphony with a high salience

    F0-candidate identification: sub-multiples of well-formed sinusoids (sinusoidality > 0.8)

    F0-salience function; typical salience functions:
      Maximize the autocorrelation function (ACF)
      Maximize comb-filter output
      Harmonic-sieve type [Pol07]: sensitive to strong harmonic sounds
      Two-way mismatch (TWM) [Mah94]: error function sensitive to the deviation of measured partials/sinusoids from ideal harmonic locations

    F0-candidate pruning: sort in ascending order of TWM error; prune weaker F0 candidates in close vicinity (25 cents) of stronger F0 candidates

    Presenter notes: The TWM error is biased towards the voice because of the voice's large spectral spread (up to 5 kHz); harmonics are always predicted up to 5 kHz.
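A simplified two-way-mismatch error may make the salience function concrete. This sketch keeps only the two mismatch terms and a single weighting constant `rho`; the actual TWM error of [Mah94] uses additional amplitude-dependent weighting parameters, so treat this as an illustration rather than the thesis implementation.

```python
import numpy as np

def twm_error(f0, meas_freqs, meas_amps, n_harm=10, rho=0.33):
    """Simplified two-way-mismatch salience: lower error = more salient F0.
    Combines predicted->measured and measured->predicted mismatch."""
    pred = f0 * np.arange(1, n_harm + 1)          # ideal harmonic locations
    # predicted-to-measured: each predicted harmonic to its nearest sinusoid
    e_pm = np.mean([np.min(np.abs(meas_freqs - p)) / p for p in pred])
    # measured-to-predicted: each sinusoid to its nearest predicted harmonic,
    # weighted by normalized amplitude
    w = meas_amps / np.max(meas_amps)
    e_mp = np.mean([wi * np.min(np.abs(pred - f)) / f
                    for f, wi in zip(meas_freqs, w)])
    return e_pm + rho * e_mp
```

For sinusoids measured at exact harmonics of 200 Hz, `twm_error(200, ...)` comes out lower than for a spurious candidate such as 270 Hz, which is the behavior the candidate sorting above relies on.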


    SYSTEM DESIGN: Predominant-F0 Trajectory Extraction

    [Figures: normalized distribution of adjacent-frame pitch transitions for male & female singers (hop = 10 ms); the two smoothness cost functions vs. log change in pitch (std. dev. = 0.1, OJC = 1.0)]

    Smoothness cost functions (p and p' are the F0s in the previous and current frames, respectively):
      W(p, p') = OJC * (log2(p'/p))^2
      W(p, p') = 1 - exp( -(log2 p' - log2 p)^2 / (2 sigma^2) )

    Objective: to find the path through the F0-candidate vs. time space that best represents the predominant-F0 trajectory

    Dynamic-programming [Ney83] based path finding
      Measurement cost = TWM error
      Smoothness cost must be based on musicological considerations
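The DP path search over per-frame F0 candidates can be sketched as follows, using the Gaussian-shaped smoothness cost on the log pitch jump described above. The function and variable names, and `sigma = 0.1`, are illustrative assumptions of the sketch.

```python
import numpy as np

def track_f0(cands, costs, sigma=0.1):
    """Dynamic-programming path search over per-frame F0 candidates.
    cands[t]: candidate F0s at frame t; costs[t]: measurement costs
    (e.g. TWM errors). Smoothness cost is Gaussian in the log pitch jump."""
    def smooth(p_prev, p):
        d = np.log2(p) - np.log2(p_prev)
        return 1.0 - np.exp(-d**2 / (2 * sigma**2))

    n = len(cands)
    total = [np.array(costs[0], float)]
    back = []
    for t in range(1, n):
        prev = total[-1]
        cur = np.empty(len(cands[t]))
        bp = np.empty(len(cands[t]), int)
        for j, p in enumerate(cands[t]):
            trans = prev + np.array([smooth(q, p) for q in cands[t - 1]])
            bp[j] = int(np.argmin(trans))
            cur[j] = trans[bp[j]] + costs[t][j]
        total.append(cur)
        back.append(bp)
    # backtrack the minimum-cost path
    path = [int(np.argmin(total[-1]))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return [cands[t][path[t]] for t in range(n)]
```

With this cost structure, a single frame whose measurement cost momentarily favors a distant F0 does not flip the track, because the large log-pitch jump is penalized twice (leaving and returning).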


    Predominant-F0 extraction results:

      Genre                   Audio content                              PA (%)   CA (%)
      Indian classical music  Voice + percussion                         97.4     99.5
                              Voice + percussion + drone                 92.9     98.2
                              Voice + percussion + drone + harmonium     67.3     73.04
      Indian pop music        Voice + guitar                             96.7     96.7

    Parameter settings:
      Frame length: 40 ms; Hop: 10 ms
      Lower limit on F0: 100 Hz; Upper limit on F0: 1280 Hz
      Upper limit on spectral content: 5000 Hz

    Data: Classical: 4 min of multi-track data; Film: 2 min of multi-track data
    Ground truth: output of the YIN PDA [Chev02] on clean voice tracks, with manual correction

    Evaluation metrics
      Pitch Accuracy (PA) = % of vocal frames whose pitch has been correctly tracked (within 50 cents)
      Chroma Accuracy (CA) = PA, except that octave errors are forgiven

    EVALUATION: Predominant-F0 extraction: Indian Music
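The PA and CA metrics can be computed as sketched below. The 50-cent tolerance and the octave folding follow the definitions above, while the function name and array conventions are assumptions of the sketch.

```python
import numpy as np

def pitch_metrics(est_f0, ref_f0, tol_cents=50):
    """Pitch accuracy (PA): % of vocal frames whose estimated pitch is
    within tol_cents of the reference. Chroma accuracy (CA): the same
    after folding the error into one octave (octave errors forgiven)."""
    est = np.asarray(est_f0, float)
    ref = np.asarray(ref_f0, float)
    voiced = ref > 0                      # only vocal (voiced) frames count
    cents = 1200 * np.log2(est[voiced] / ref[voiced])
    pa = 100 * np.mean(np.abs(cents) <= tol_cents)
    chroma = (cents + 600) % 1200 - 600   # map error into [-600, 600) cents
    ca = 100 * np.mean(np.abs(chroma) <= tol_cents)
    return pa, ca
```

An exact octave error (1200 cents) folds to 0 cents and is therefore counted as correct by CA but not by PA, which is why CA is always at least as high as PA in the tables above.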


    SYSTEM DESIGN: Voicing Detection

    Pipeline: Polyphonic signal -> Feature extraction -> Classifier -> Boundary detector -> Grouping -> Decision labels

    Features: FS1: 13 MFCCs; FS2: 7 static timbral features; FS3: normalized harmonic energy (NHE)
    Classifier: GMM, 4 mixtures per class
    Boundary detector: audio novelty detector [Foote] with NHE
    Data: 23 min of Hindustani training data; 7 min of Hindustani testing data
    Recall: % of actual frames that were correctly labeled

    Results on testing data:

                   Frame-level                    After grouping
      Feature set  Vocal        Instrumental     Vocal        Instrumental
                   recall (%)   recall (%)       recall (%)   recall (%)
      FS1          92.17        66.43            97.86        61.61
      FS2          92.38        66.29            96.28        68.98
      FS3          89.05        92.10            93.15        96.58

    Presenter notes: Combining FS2 and FS3 was tried but did not increase performance. FS2 was arrived at using mutual information and cross-validation experiments.
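One plausible reading of the NHE feature, sketched in code: the energy at the predicted harmonic bins of the predominant F0, normalized by the total spectral energy of the frame. This interpretation and all names are assumptions of the sketch, not the thesis implementation.

```python
import numpy as np

def normalized_harmonic_energy(mag_spec, f0, fs, n_fft, n_harm=10):
    """Energy at the predicted harmonic bins of f0, normalized by the
    total spectral energy (one plausible reading of NHE)."""
    bins = np.round(np.arange(1, n_harm + 1) * f0 * n_fft / fs).astype(int)
    bins = bins[bins < len(mag_spec)]      # discard harmonics above Nyquist
    harm = np.sum(mag_spec[bins] ** 2)
    return harm / np.sum(mag_spec ** 2)
```

A frame dominated by harmonics of the tracked F0 scores near 1, while an instrumental or noise frame scores low, so thresholding this value (followed by grouping over homogeneous segments) yields frame-level vocal/instrumental labels.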


    EVALUATION: Submission to MIREX 2008 & 2009

    Block diagram: Music signal -> Signal representation (DFT, main-lobe matching, parabolic interpolation) -> sinusoid frequencies and magnitudes -> Multi-F0 analysis (sub-multiples of sinusoids within the F0 search range, TWM error computation, sorting in ascending order, vicinity pruning) -> F0 candidates and measurement costs -> Predominant-F0 trajectory extraction (dynamic-programming-based optimal path finding) -> predominant-F0 contour -> Voicing detection (thresholding normalized harmonic energy, grouping over homogeneous segments) -> vocal segment pitch tracks

    Music Information Retrieval Evaluation eXchange (MIREX)
      Started in 2004 by the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL)
      Common platform for evaluation on common datasets
      Tasks: audio genre/artist/mood classification, audio melody extraction, audio beat tracking, audio key detection, query by singing/humming, audio chord estimation


    EVALUATION: MIREX 2008 & 2009 Datasets & Evaluation

    Data
      ADC 2004: publicly available data; 20 excerpts (about 20 sec each) from pop, opera, jazz & MIDI
      MIREX 2005: secret data; 25 excerpts (10-40 sec) from rock, R&B, pop, jazz, solo piano
      MIREX 2008: ICM data; 4 excerpts of 1 minute each from a male and a female Hindustani vocal performance; 2 min each with and without a loud harmonium
      MIREX 2009: MIR-1K data; 374 karaoke recordings of Chinese songs, each mixed at 3 different signal-to-accompaniment ratios (SARs): {-5, 0, 5 dB}

    Evaluation metrics
      Pitch evaluation: pitch accuracy (PA) and chroma accuracy (CA)
      Voicing evaluation: vocal recall (Vx recall) and vocal false-alarm rate (Vx false alm)
      Overall accuracy: % of correctly detected vocal frames with correctly detected pitch
      Run-time

    Presenter notes: Overall accuracy is computed as the proportion of frames correctly labeled with both pitch and voicing. Since there is a strong bias (about 80/20) towards voiced frames, the more voiced frames detected, the better the overall accuracy. We set our voicing threshold at the best trade-off between vocal and instrumental frame detection. If the threshold is biased towards detecting more voiced frames, the overall accuracy improves and comes on par with the best systems.


    EVALUATION: MIREX 2009 & 2010, MIREX05 dataset (vocal)

    2009

    Participant  Vx Recall  Vx False Alm  Pitch acc.  Chroma acc.  Overall acc.  Runtime (dd:hh:mm)
    cl1          91.2267    67.7071       70.807      73.924       59.7095       -
    cl2          80.9512    44.8062       70.807      73.924       64.4609       -
    dr1          93.7543    53.5255       76.1145     77.7138      66.9613       -
    dr2          88.0635    38.0207       70.9258     75.9161      66.5175       -
    hjc1         65.8441    19.8206       62.6594     73.4868      54.8538       -
    hjc2         65.8441    19.8206       54.1294     69.4221      52.2905       -
    jjy          88.8546    41.9813       76.2696     79.3236      66.3123       -
    kd           82.6017    15.2554       77.4622     80.8218      76.962        -
    mw           99.9345    99.7947       75.7398     80.3791      53.7323       -
    pc           75.6436    21.0879       71.7068     72.5089      70.465        -
    rr           92.908     56.3639       75.9506     79.1084      65.7745       -
    toos         91.2267    67.7071       70.807      73.924       59.7095       -

    2010

    HJ1          70.80      44.90         71.70       74.93        53.89         00:59:31
    TOOS1        84.65      41.93         68.87       74.62        60.84         00:50:31
    JJY2         96.86      69.62         70.17       78.32        60.81         01:09:30
    JJY1         97.33      70.16         71.58       78.61        61.54         03:48:02
    SG1          76.39      22.78         61.80       73.70        62.13         00:08:15



    EVALUATION: MIREX 2009 & 2010, MIREX09 dataset (0 dB mix)

    2009

    Participant  Vx Recall  Vx False Alm  Pitch acc.  Chroma acc.  Overall acc.  Runtime (dd:hh:mm)
    cl1          92.4858    83.5749       59.138      62.9508      43.9659       00:00:28
    cl2          77.2085    59.7352       59.138      62.9508      49.2294       00:00:33
    dr1          91.8716    55.3555       69.8804     72.5138      60.1294       16:00:00
    dr2          87.3985    47.3422       66.549      70.7923      59.5076       00:08:44
    hjc1         34.1722    1.7909        72.6577     75.2906      53.1752       00:05:44
    hjc2         34.1722    1.7909        51.6871     70.002       51.7469       00:09:38
    jjy          38.906     19.4063       75.9354     80.2461      49.686        02:14:06
    kd           91.1846    47.7842       80.4565     81.8811      68.2237       00:00:24
    mw           99.992     99.4688       67.2905     71.0018      43.6365       00:02:12
    pc           73.1175    43.4773       50.8895     53.3672      51.5001       03:05:57
    rr           88.8091    50.7595       68.6242     71.3714      60.7733       00:00:26
    toos         99.9829    99.4185       82.2943     85.7474      53.5623       01:00:28

    2010

    HJ1          82.06      14.27         83.15       84.23        76.17         14:39:16
    TOOS1        94.17      38.58         82.59       86.18        72.23         12:07:21
    JJY2         98.33      70.62         81.29       83.83        62.55         14:06:20
    JJY1         98.34      70.65         82.20       84.57        62.90         65:21:11
    SG1          89.65      30.22         80.05       85.50        73.59         01:56:27



    No marked improvement in melody extraction over the last 3 years (2007-2009) [Dres2010]

    Errors due to loud pitched accompaniment
      Accompaniment pitch tracked instead of the voice: error in predominant-F0 trajectory extraction
      Accompaniment pitch tracked along with the voice: error in voicing detection

    Errors due to signal dynamics
      Octave errors due to a fixed window length: error in signal representation

    EVALUATION: Problems in Melody Extraction


    Incorrect tracking of loud pitched accompaniment
      ICM data: largest reduction in accuracy for audio in which the voice displays large, rapid modulations while the instrument pitch is flat

    Predominant-F0 trajectory: DP-based path finding, based on suitably defined measurement and smoothness costs

    Accompaniment errors
      Bias in the measurement cost: salient (spectrally rich) instrument
      Bias in the smoothness cost: stable-pitched instrument

    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Problems


    Extension of DP to track ordered pairs of F0 candidates (nodes), called Dual-F0 tracking

    Node formation
      All possible pairs: computationally expensive (10P2 = 90); an F0 and its (sub)multiple may be tracked together
      Pairing of harmonically related F0 candidates is therefore prohibited; a low harmonic threshold of 5 cents still allows pairing of the voice F0 and an octave-separated instrument F0, because of voice detuning

    Node measurement cost: joint TWM error [Mah94]
    Node smoothness cost: sum of the constituent F0 candidates' smoothness costs
    Final selection of the predominant-F0 contour: based on voice-harmonic instability

    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Design & Implementation

    Presenter notes: nPk = n!/(n-k)!. Interference is also modeled: at most one other competing pitched instrument is assumed to be present.
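The node-formation rule (all ordered pairs, minus harmonically related ones) can be sketched as below. The harmonic test here, which measures how far the candidates' frequency ratio lies (in cents) from the nearest integer, is an illustrative reading of the 5-cent threshold, and all names are assumptions of the sketch.

```python
import numpy as np

def make_nodes(f0_cands, min_cents=5.0):
    """Form ordered F0-candidate pairs (nodes), prohibiting pairs whose
    members are harmonically related within min_cents, so an F0 and its
    own (sub)multiple are not tracked jointly."""
    nodes = []
    for i, fa in enumerate(f0_cands):
        for j, fb in enumerate(f0_cands):
            if i == j:
                continue
            ratio = max(fa, fb) / min(fa, fb)
            nearest = max(1, round(ratio))  # nearest integer harmonic ratio
            cents = abs(1200 * np.log2(ratio / nearest))
            if cents > min_cents:
                nodes.append((fa, fb))
    return nodes
```

With the tight 5-cent threshold, a slightly detuned voice F0 an octave away from an instrument F0 survives the test and can still be paired, as the design above requires.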


    Harmonic Sinusoidal Model (HSM): partial-tracking algorithm used in SMS [Serra98]; tracks are indexed and linked by harmonic number

    Std. dev. pruning: prune tracks in 200 ms segments whose std. dev.


    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Final Implementation Block Diagram

    Block diagram: Music signal -> Signal representation (DFT, main-lobe matching, parabolic interpolation) -> sinusoid frequencies and magnitudes -> Multi-F0 analysis (sub-multiples of sinusoids within the F0 search range, TWM error computation, sorting in ascending order, vicinity pruning) -> F0 candidates and saliences -> Predominant-F0 trajectory extraction: ordered pairing of F0 candidates with harmonic constraint, joint TWM error computation -> nodes (F0 pairs) -> optimal path finding (dual-F0) and optimal path finding (single-F0) -> melodic contours -> vocal pitch identification

    Presenter notes: Insert box between the dual-F0 output and the single-F0 output.


    Participating systems
      TWMDP (single- and dual-F0)
      LIWANG [LiWang07]: uses an HMM to track the predominant F0; includes the possibility of a 2-pitch hypothesis but finally outputs a single F0; shown to be superior to other contemporary systems
      Same F0 search range [80-500 Hz]

    Evaluation metrics
      Multi-F0 stage: % presence of the true voice F0 in the candidate list
      Predominant-F0 extraction (PA & CA): single-F0 and dual-F0
      Final contour accuracy
      Either-Or accuracy: the correct pitch is present in at least one of the two outputs

    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Experimental evaluation: Setup
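The Either-Or accuracy defined above can be sketched in code; the function name and the 50-cent tolerance default are assumptions of the sketch.

```python
import numpy as np

def either_or_accuracy(est1, est2, ref, tol_cents=50):
    """% of voiced frames where at least one of the two dual-F0 outputs
    is within tol_cents of the reference pitch."""
    est1, est2, ref = (np.asarray(a, float) for a in (est1, est2, ref))
    v = ref > 0

    def ok(est):
        return np.abs(1200 * np.log2(est[v] / ref[v])) <= tol_cents

    return 100 * np.mean(ok(est1) | ok(est2))
```

Because a frame counts if either contour is correct, this metric upper-bounds the final accuracy, which is why the gap between the two indicates remaining headroom in the final contour selection.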


    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Experimental Evaluation: Data & Results

    Data:

      Dataset  Description                                                   Vocal (sec)  Total (sec)
      1        Li & Wang data                                                55.4         97.5
      2        Examples from MIR-1K dataset with loud pitched accompaniment  61.8         98.1
      3        Examples from MIREX08 data (Indian classical music)           91.2         99.2
      Total                                                                  208.4        294.8

    Multi-F0 evaluation: percentage presence of the voice F0 (%):

      Dataset  Top 5 candidates  Top 10 candidates
      1        92.9              95.4
      2        88.5              95.1
      3        90.0              94.1

    [Plots: (a) pitch accuracies (%) and (b) chroma accuracies (%) vs. SAR (10 to -5 dB) for LIWANG and TWMDP: comparison of the TWMDP single-F0 tracker and LIWANG for dataset 1]

    Presenter notes: Insert sound examples.


    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Experimental evaluation: Results (Dual-F0)

    TWMDP (% improvement over LIWANG (A1)):

      Dataset  Metric  Single-F0 (A2)  Dual-F0 Either-Or  Dual-F0 Final (A3)
      1        PA (%)  88.5 (8.3)      89.3 (0.9)         84.1 (2.9)
      1        CA (%)  90.2 (6.4)      92.0 (1.1)         88.8 (3.9)
      2        PA (%)  57.0 (24.5)     74.2 (-6.8)        69.1 (50.9)
      2        CA (%)  61.1 (14.2)     81.2 (-5.3)        74.1 (38.5)
      3        PA (%)  66.0 (11.3)     85.7 (30.2)        73.9 (24.6)
      3        CA (%)  66.5 (9.7)      87.1 (18.0)        76.3 (25.9)

    TWMDP single-F0 is significantly better than the LIWANG system for all datasets

    TWMDP dual-F0 is significantly better than TWMDP single-F0 for datasets 2 & 3

    Scope for further improvement in the final predominant-F0 identification, indicated by the difference between the TWMDP dual-F0 Either-Or and Final accuracies

    [Bar chart: PA and CA (%) for datasets 2 & 3 (D2, D3), comparing A1 (LIWANG), A2 (TWMDP single-F0) and A3 (TWMDP dual-F0 final)]

    Presenter notes: State reasons why TWMDP is better than LIWANG.


    ENHANCEMENTS: PREDOMINANT-F0 EXTRACTION: Example of F0 collisions

    Contour switching occurs at F0 collisions

    [Figure (F0 in octaves ref. 110 Hz, 0-4 sec):
      (a) Single-F0 tracking: ground-truth voice pitch, ground-truth harmonium pitch, single-F0 output
      (b) Dual-F0 tracking (intermediate): ground-truth voice pitch, dual-F0 contour 1, dual-F0 contour 2
      (c) Dual-F0 tracking (final): ground-truth voice pitch, dual-F0 final output]

    Presenter notes: Insert a vocal harmony example.


    ENHANCEMENTS: VOICING DETECTION: Problems


    ENHANCEMENTS: VOICING DETECTION: Features

    Feature sets:

      C1: Static timbral          C2: Dynamic timbral                   C3: Dynamic F0-harmonic
      F0                          Std. dev. of SC for 0.5, 1 & 2 sec    Mean & median of F0
      10 harmonic powers          MER of SC for 0.5, 1 & 2 sec          Mean, median & std. dev. of harmonic energy [0-2 kHz]
      Spectral centroid (SC)      Std. dev. of SE for 0.5, 1 & 2 sec    Mean, median & std. dev. of harmonic energy [2-5 kHz]
      Sub-band energy (SE)        MER of SE for 0.5, 1 & 2 sec          Mean, median & std. dev. of harmonics 1 to 5
                                                                        Mean, median & std. dev. of harmonics 6 to 10
                                                                        Mean, median & std. dev. of harmonics 1 to 10
                                                                        Ratio of mean, median & std. dev. of harmonics 1-5 : harmonics 6-10

    Proposed feature set: a combination of static & dynamic features
    Features extracted using a harmonic sinusoidal model representation
    Feature selection within each feature set using information entropy [Weka]
    MER: modulation energy ratio


    ENHANCEMENTS: VOICING DETECTION: Data

    Genre characteristics (singing; dominant instrument):
      I. Western: syllabic, no large pitch modulations, voice often softer than instrument; mainly flat-note instruments (piano, guitar), pitch range overlapping with the voice
      II. Greek: syllabic, replete with fast pitch modulations; equal occurrence of flat-note plucked-string/accordion and pitch-modulated violin
      III. Bollywood: syllabic, more pitch modulations than Western but fewer than other Indian genres; mainly pitch-modulated woodwind & bowed instruments, pitches often much higher than the voice
      IV. Hindustani: syllabic and melismatic, varying from long, pitch-flat, vowel-only notes to large & rapid modulations; mainly flat-note harmonium, pitch range overlapping with the voice
      V. Carnatic: syllabic and melismatic, replete with fast pitch modulations; mainly pitch-modulated violin, F0 range generally higher than the voice but with some overlap

    Durations:

      Genre           Songs  Vocal    Instrumental  Overall
      I. Western      11     7m 19s   7m 02s        14m 21s
      II. Greek       10     6m 30s   6m 29s        12m 59s
      III. Bollywood  13     6m 10s   6m 26s        12m 36s
      IV. Hindustani  8      7m 10s   5m 24s        12m 54s
      V. Carnatic     12     6m 15s   5m 58s        12m 13s
      Total           45     33m 44s  31m 19s       65m 03s


    [Plot: vocal precision vs. recall curves for the MFCC, C1, and C1+C2+C3 feature sets across genres in the 'leave 1 song out' experiment]

    ENHANCEMENTS: VOICING DETECTION: Evaluation

    Two cross-validation experiments: intra-genre (leave 1 song out) and inter-genre (leave 1 genre out)
    Feature combination: concatenation vs. classifier combination
    Baseline features: 13 MFCCs [Roc07]
    Evaluation: vocal recall (%) and precision (%)

    Overall results
      C1 better than the baseline
      C1+C2+C3 better than C1
      Classifier combination better than feature concatenation


    ENHANCEMENTS: VOICING DETECTION: Evaluation (contd.)

    Leave 1 song out (Recall %), for the semi-automatic and fully-automatic F0-driven HSM:

      Feature set  I     II    III   IV    V     Total
      Baseline     77.2  66.0  65.6  82.6  83.2  74.9
      F0-MFCCs     76.9  72.3  70.0  78.9  83.0  76.2
      C1           81.1  67.8  74.8  78.9  84.5  77.4
      C1+C2        82.9  78.5  82.8  79.5  85.0  81.7
      C1+C3        81.5  72.9  77.6  83.9  84.9  80.2
      C1+C2+C3     82.1  81.1  83.5  83.0  84.7  82.8

      Feature set  I     II    III   IV    V     Total
      Baseline     77.2  66.0  65.6  82.6  83.2  74.9
      F0-MFCCs     78.9  77.8  78.0  85.9  85.9  81.2
      C1           79.6  77.4  79.3  82.3  87.1  81.0
      C1+C2        82.3  83.6  85.4  83.3  86.8  84.2
      C1+C3        80.2  83.4  81.7  89.7  88.2  84.5
      C1+C2+C3     81.1  86.9  86.4  88.5  87.3  85.9

    Genre-specific feature set adaptation: C1+C2 for Western; C1+C3 for Hindustani


    Relation between window length and signal characteristics
      Dense spectrum (multiple harmonic sources) -> long window
      Non-stationarity (rapid pitch modulations) -> short window

    Adaptive time segmentation for signal modeling and synthesis [Good97]: based on minimizing the reconstruction error between the synthesized and original signals; high computational cost

    Easily computable measures for adapting window length: signal sparsity (a sparse spectrum has concentrated components)

    Window length selection (23.2, 46.4, 92.9 ms) based on maximizing signal sparsity

    ENHANCEMENTS: SIGNAL REPRESENTATION: Sparsity-driven window length adaptation

    Sparsity measures (X_n(k): magnitude spectrum at frame n; X̄: its mean; N: number of bins; X~(k): coefficients sorted in ascending order):

      L2 norm:             L2 = ( sum_k X_n(k)^2 )^(1/2)
      Normalized kurtosis: KU = [ (1/N) sum_k (X_n(k) - X̄)^4 ] / [ (1/N) sum_k (X_n(k) - X̄)^2 ]^2
      Gini index:          GI = 1 - 2 sum_k ( X~(k) / ||X_n||_1 ) * ( (N - k + 1/2) / N )
      Hoyer measure:       HO = ( sqrt(N) - ( sum_k X_n(k) ) / ( sum_k X_n(k)^2 )^(1/2) ) / ( sqrt(N) - 1 )
      Spectral flatness:   SF = ( prod_k X_n(k)^2 )^(1/N) / ( (1/N) sum_k X_n(k)^2 )

    Presenter notes: X̄ is the mean spectral amplitude. X~(k) is X(k) sorted in ascending order, giving the ordered magnitude spectral coefficients. KU, HO and GI are maximized; L2 and SF are minimized.
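Kurtosis-driven window-length selection can be sketched as follows. Only the normalized-kurtosis measure is implemented here, and the FFT size and function names are assumptions of the sketch.

```python
import numpy as np

def kurtosis_sparsity(mag_spec):
    """Normalized kurtosis of the magnitude spectrum; larger = sparser."""
    x = np.asarray(mag_spec, float)
    m = x.mean()
    return np.mean((x - m) ** 4) / (np.mean((x - m) ** 2) ** 2)

def best_window(x, fs, lengths_ms=(23.2, 46.4, 92.9)):
    """Pick the analysis window length whose Hamming-windowed spectrum
    maximizes kurtosis (the sparsity measure found most reliable)."""
    best, best_s = None, -np.inf
    for ms in lengths_ms:
        n = int(round(fs * ms / 1000))
        if n > len(x):
            continue
        seg = x[:n] * np.hamming(n)
        s = kurtosis_sparsity(np.abs(np.fft.rfft(seg, 4096)))
        if s > best_s:
            best, best_s = ms, s
    return best
```

For a steady tone the longest window wins (its narrower main lobe concentrates the spectrum), while rapid pitch modulation smears the long-window spectrum and pushes the choice towards shorter windows.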


    ENHANCEMENTS: SIGNAL REPRESENTATION: Sparsity-driven window length adaptation [contd.]

    Experimental comparison between fixed and adaptive schemes
      Fixed and adaptive window lengths (different sparsity measures); sinusoid detection by main-lobe matching

    Data
      Simulations: two-sound mixtures (polyphony) and a vibrato signal
      Real: Western pop (Whitney, Mariah) and Hindustani taans

    Evaluation metrics: recall (%) and frequency deviation (Hz); expected harmonic locations computed from the ground-truth pitch

    Results
      1. Adaptive: higher recall and lower frequency deviation
      2. Kurtosis-driven adaptation is superior to the other sparsity measures


    A generalized music transcription system is still unavailable

    Solution [Wang08]: a semi-automatic approach with application-specific design, e.g. music tutoring

    Two, possibly independent, aspects of melody extraction
      Voice pitch extraction: manually difficult
      Vocal segment detection: manually easier

    Semi-automatic tool
      Goal: to facilitate the extraction & validation of the voice pitch in polyphonic recordings with minimal human intervention
      Design considerations: accurate pitch detection; completely parametric control; user-friendly control for vocal segment detection

    GRAPHICAL USER INTERFACE: Motivation

    Presenter notes: We are not neglecting voicing detection; in fact, we have devised a feature that exploits pitch instability to discriminate between the voice and stable-pitch instruments [ICSLP09].


    Salient features
      Melody extraction back-end
      Validation: visual (spectrogram) and aural (re-synthesis)
      Segmental parameter variation
      Easy non-vocal labeling
      Saving of the final result & parameters
      Selective use of the dual-F0 tracker; switching between contours

    GRAPHICAL USER INTERFACE: Design

    GUI layout: A: waveform viewer; B: spectrogram & pitch view; C: menu bar; D: controls for viewing, scrolling, playback & volume; E: parameter window; F: log viewer



    CONCLUSIONS AND FUTURE WORK: Final system block diagram

    Block diagram: Music signal -> Signal representation (DFT, main-lobe matching, parabolic interpolation) -> sinusoid frequencies and magnitudes -> Multi-F0 analysis (sub-multiples of sinusoids within the F0 search range, TWM error computation, sorting in ascending order, vicinity pruning) -> F0 candidates and saliences -> Predominant-F0 trajectory extraction: ordered pairing of F0 candidates with harmonic constraint, joint TWM error computation -> nodes (F0 pairs and saliences) -> optimal path finding -> predominant-F0 contour (alongside single-F0 optimal path finding -> predominant-F0 contour) -> vocal pitch identification.

    Voicing detector: harmonic sinusoidal model -> feature extraction -> classifier -> boundary detection -> grouping -> voice pitch contour


    State-of-the-art melody extraction system designed by making careful choices for the system modules

    Enhancements to the above system increase robustness to loud, pitched accompaniment
      Dual-F0 tracking for predominant-F0 extraction
      Combination of static & dynamic, timbral & F0-harmonic features for voicing detection

    Fully-automatic, high-accuracy melody extraction is still not feasible
      Large variability in the underlying signal conditions due to the diversity of music
      A priori knowledge of music and signal conditions helps: male/female singer, rate of pitch variation

    High-accuracy melodic contours can be extracted using a semi-automatic approach

    CONCLUSIONS AND FUTURE WORK: Conclusions


    CONCLUSIONS AND FUTURE WORK: Summary of contributions

    Design & validation of a novel, practically useful melody extraction system with increased robustness to pitched accompaniment
      Signal representation: choice of the main-lobe matching criterion for sinusoid identification; improved sinusoid detection by signal-sparsity-driven window length adaptation
      Multi-F0 analysis: choice of the TWM error as the salience function; improved voice-F0 detection by separating F0-candidate identification & salience computation
      Predominant-F0 trajectory extraction: Gaussian log smoothness cost; dual-F0 tracking; final predominant-F0 contour identification by voice-harmonic instability
      Voicing detection: use of a predominant-F0-derived signal representation; combination of static and dynamic, timbral and F0-harmonic features

    Design of a novel graphical user interface for semi-automatic use of the melody extraction system


    Melody extraction
      Identification of a single predominant-F0 contour from the dual-F0 output: use of dynamic features
      F0 collisions: detection based on minima in the difference of the constituent F0s of nodes; correction by allowing pairing of an F0 with itself around these locations; use of prediction-based partial tracking [Lag07]
      Validation across larger, more diverse datasets
      Incorporation of predictive path-finding in the DP algorithm
      Extension of the algorithm to instrumental pitch tracking in polyphony: homophonic music (lead instrument, e.g. flute, with accompaniment); polyphonic instruments (sitar)

    Applications of melody extraction: singing evaluation & feedback; QBSH systems; musicological studies

    CONCLUSIONS AND FUTURE WORKFuture work
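The F0-collision detection proposed above (flagging frames where the two contours of the dual-F0 output nearly meet) can be sketched as a local-minimum search on the inter-contour distance. The cent-based distance and the 50-cent threshold below are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def collision_frames(f0a, f0b, thresh_cents=50.0):
    """Flag frames where two tracked F0 contours nearly collide.

    f0a, f0b: arrays of F0 values (Hz), one per frame, no zeros.
    A collision is marked at local minima of the inter-contour
    distance (in cents) that fall below thresh_cents.
    Names and the threshold are illustrative, not from the thesis.
    """
    cents = np.abs(1200.0 * np.log2(f0a / f0b))
    hits = []
    for t in range(1, len(cents) - 1):
        is_local_min = cents[t] <= cents[t - 1] and cents[t] <= cents[t + 1]
        if is_local_min and cents[t] < thresh_cents:
            hits.append(t)
    return hits
```

Around the returned frames, a corrected tracker could then allow a candidate to pair with itself, as the future-work item suggests, instead of forcing two distinct F0s through the crossing point.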


CONCLUSIONS AND FUTURE WORK: List of related publications

International Journals

V. Rao, P. Gaddipati and P. Rao, "Signal-driven window adaptation for sinusoid identification in polyphonic music," IEEE Transactions on Audio, Speech, and Language Processing, 2011 (accepted).

V. Rao and P. Rao, "Vocal melody extraction in the presence of pitched accompaniment in polyphonic music," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2145–2154, Nov. 2010.

International Conferences

V. Rao, C. Gupta and P. Rao, "Context-aware features for singing voice detection in polyphonic music," 9th International Workshop on Adaptive Multimedia Retrieval, 2011 (submitted for review).

V. Rao, S. Ramakrishnan and P. Rao, "Singing voice detection in polyphonic music using predominant pitch," in Proceedings of InterSpeech, Brighton, U.K., 2009.

V. Rao and P. Rao, "Improving polyphonic melody extraction by dynamic programming-based dual-F0 tracking," in Proceedings of the 12th International Conference on Digital Audio Effects (DAFx), Como, Italy, 2009.

V. Rao and P. Rao, "Vocal melody detection in the presence of pitched accompaniment using harmonic matching methods," in Proceedings of the 11th International Conference on Digital Audio Effects (DAFx), Espoo, Finland, 2008.

A. Bapat, V. Rao and P. Rao, "Melodic contour extraction for Indian classical vocal music," in Proceedings of Music-AI (International Workshop on Artificial Intelligence and Music) at IJCAI, Hyderabad, India, 2007.

V. Rao and P. Rao, "Melody extraction using harmonic matching," in Proceedings of the Music Information Retrieval Evaluation eXchange (MIREX), 2008 & 2009. URL: http://www.music-ir.org/mirex/abstracts/2009/RR.pdf


CONCLUSIONS AND FUTURE WORK: List of related publications [contd.]

National Conferences

S. Pant, V. Rao and P. Rao, "A melody detection user interface for polyphonic music," in Proceedings of the National Conference on Communication (NCC), Chennai, India, 2010.

N. Santosh, S. Ramakrishnan, V. Rao and P. Rao, "Improving singing voice detection in the presence of pitched accompaniment," in Proceedings of the National Conference on Communication (NCC), Guwahati, India, 2009.

V. Rao, S. Pant, M. Bhaskar and P. Rao, "Applications of a semi-automatic melody extraction interface for Indian music," in Proceedings of the International Symposium on Frontiers of Research in Speech and Music (FRSM), Gwalior, India, Dec. 2009.

V. Rao, S. Ramakrishnan and P. Rao, "Singing voice detection in north Indian classical music," in Proceedings of the National Conference on Communication (NCC), Mumbai, India, 2008.

V. Rao and P. Rao, "Objective evaluation of a melody extractor for north Indian classical vocal performances," in Proceedings of the International Symposium on Frontiers of Research in Speech and Music (FRSM), Kolkata, India, 2008.

V. Rao and P. Rao, "Vocal trill and glissando thresholds for Indian listeners," in Proceedings of the International Symposium on Frontiers of Research in Speech and Music (FRSM), Mysore, India, 2007.

Patent

P. Rao, V. Rao and S. Pant, "A device and method for scoring a singing voice," Indian Patent Application No. 1338/MUM/2009, filed June 2, 2009.


REFERENCES

[Pol07] G. Poliner, D. Ellis, A. Ehmann, E. Gomez, S. Streich and B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1247–1256, May 2007.

[Grif88] D. Griffin and J. Lim, "Multiband excitation vocoder," IEEE Trans. Acoust., Speech, Sig. Process., vol. 36, no. 8, pp. 1223–1235, 1988.

[Wang08] Y. Wang and B. Zhang, "Application-specific music transcription for tutoring," IEEE Multimedia, vol. 15, no. 3, pp. 70–74, 2008.

[Chev02] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Amer., vol. 111, no. 4, pp. 1917–1930, 2002.

[LiWang07] Y. Li and D. Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1475–1487, 2007.

[Mah94] R. Maher and J. Beauchamp, "Fundamental frequency estimation of musical signals using a two-way mismatch procedure," J. Acoust. Soc. Amer., vol. 95, no. 4, pp. 2254–2263, Apr. 1994.

[Dres2010] K. Dressler, "Audio melody extraction for MIREX 2009," Ilmenau: Fraunhofer IDMT, 2010.

[Ney83] H. Ney, "Dynamic programming algorithm for optimal estimation of speech parameter contours," IEEE Trans. Systems, Man and Cybernetics, vol. SMC-13, no. 3, pp. 208–214, Apr. 1983.

[Lag07] M. Lagrange, S. Marchand and J. B. Rault, "Enhancing the tracking of partials for the sinusoidal modeling of polyphonic sounds," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1625–1634, 2007.

[Chao09] C. Hsu and R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Lang. Process., 2009 (accepted).

[Good97] M. Goodwin, "Adaptive signal models: Theory, algorithms and audio applications," Ph.D. dissertation, MIT, 1997.

[Lag08] M. Lagrange, L. Martins, J. Murdoch and G. Tzanetakis, "Normalized cuts for predominant melodic source separation," IEEE Trans. Audio, Speech, Lang. Process. (Special Issue on MIR), vol. 16, no. 2, pp. 278–290, Feb. 2008.

[Pol05] G. Poliner and D. Ellis, "A classification approach to melody transcription," in Proc. Intl. Conf. Music Information Retrieval, London, 2005.
