Vocal Melody Extraction from Polyphonic Audio with Pitched Accompaniment
Vishweshwara Rao (05407001)
Ph.D. Defense (June 2011)
Guide: Prof. Preeti Rao
Department of Electrical Engineering
Indian Institute of Technology Bombay
Department of Electrical Engineering , IIT Bombay 2 of 25of 40
OUTLINE
Introduction
  Objective, background, motivation, approaches & issues
  Indian music
Proposed melody extraction system
  Design
  Evaluation
  Problems: competing pitched accompanying instrument
Enhancements for increasing robustness to pitched accompaniment
  Dual-F0 tracking
  Identification of vocal segments by combination of static and dynamic features
  Signal-sparsity driven window length adaptation
Graphical User Interface for melody extraction
Conclusions and Future work
INTRODUCTION - Objective
Vocal melody extraction from polyphonic audio
  Polyphony: multiple musical sound sources present
  Vocal: lead melodic instrument is the singing voice
  Melody
    Sequence of notes
    Symbolic representation of music
    Pitch contour of the singing voice
[Figure: note frequency versus time]
INTRODUCTION - Background
Pitch
  Perceptual attribute of sound
  Closely related to periodicity or fundamental frequency (F0)
Vocal pitch contour
[Figure: two periodic waveforms, labelled F0 = 1/T0 = 300 Hz and F0 = 1/T0 = 100 Hz]
INTRODUCTION - Motivation, Complexity and Approaches
Motivation
  Music Information Retrieval applications: query-by-singing/humming (QBSH), artist ID, cover song ID
  Music edutainment: singing learning, karaoke creation
  Musicology
Problem complexity
  Singing: large F0 range, pitch dynamics; diversity (inter-singer, across cultures)
  Polyphony: crowded signal; percussive & tonal instruments
Approaches
  Understanding without separation
  Source separation [Lag08]
  Classification [Pol05]
[Block diagram: Polyphonic audio signal -> Signal representation -> Multi-F0 analysis -> Predominant-F0 trajectory extraction -> Voicing detection -> Voice F0 contour]
INTRODUCTION - Indian classical music: Signal characteristics
[Figure: spectrogram, time (0 to 20 sec) versus frequency (0 to 2000 Hz); labels: Ghe, Na, Tun; Tabla (percussion); Tanpura (drone); Singer; Harmonium (secondary melody)]
INTRODUCTION - Melody extraction in Indian classical music
Issues
  Signal complexity: singing, polyphony
  Variable tonic
  Non-availability of ground-truth data: almost completely improvised (no universally accepted notation)
Example
[Figure labels: Thit, Ke, Tun]
SYSTEM DESIGN - Our Approach
Design considerations
  Singing
  Robustness to pitched accompaniment
  Flexible
[Block diagram: Polyphonic audio signal -> Signal representation -> Multi-F0 analysis -> Predominant-F0 trajectory extraction -> Singing voice detection -> Voice F0 contour]
SYSTEM DESIGN - Signal Representation
Frequency domain representation
  Pitched sounds have harmonic spectra
  Short-time analysis and DFT: the signal x(m) is multiplied by a sliding window w(n-m) and transformed
    $X(n, k) = \sum_{m=0}^{M-1} x(m)\, w(n-m)\, e^{-i 2\pi k m / M}$
  Window length: chosen to resolve harmonics of minimum expected F0
[Figure: frequency transform of a 40 ms Hamming window; frequency (-500 to 500) versus magnitude (dB)]
Sinusoidal representation
  More compact & relevant
  Different methods of sinusoid ID: magnitude-based, phase-based
  Main-lobe matching (sinusoidality) [Grif88] method found to be most reliable
    Frequency transform of window has a known shape
    Local peaks whose shape closely matches window main-lobe are declared as sinusoids
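The window-length rule above can be illustrated with a small calculation. This is a sketch under a stated assumption, not the author's procedure: it uses the common rule of thumb that two adjacent harmonics are resolved when the window's null-to-null main-lobe width does not exceed their spacing (the F0), together with the known fact that a Hamming window's main lobe spans 4 DFT bins. The function names are hypothetical.

```python
# Rule-of-thumb sketch (assumption): harmonics spaced F0 Hz apart are resolved
# when the window main lobe (null-to-null) is no wider than F0.
# A Hamming window of duration T seconds has a main lobe about 4 / T Hz wide.

HAMMING_MAIN_LOBE_BINS = 4.0


def main_lobe_width_hz(window_seconds, main_lobe_bins=HAMMING_MAIN_LOBE_BINS):
    """Null-to-null main-lobe width in Hz for a window of the given duration."""
    return main_lobe_bins / window_seconds


def min_window_seconds(f0_min_hz, main_lobe_bins=HAMMING_MAIN_LOBE_BINS):
    """Shortest window whose main lobe is no wider than the harmonic spacing."""
    return main_lobe_bins / f0_min_hz


if __name__ == "__main__":
    # Lowest expected F0 of 100 Hz, as in the parameter table of this deck.
    print("window (s):", min_window_seconds(100.0))
    print("main lobe (Hz) for 40 ms:", main_lobe_width_hz(0.040))
```

Under this rule, a lowest expected F0 of 100 Hz leads to a window of about 40 ms, which agrees with the frame length listed in the evaluation parameters.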
SYSTEM DESIGN - Multi-F0 Analysis
Objective: to reliably detect the voice-F0 in polyphony with a high salience
F0-candidate identification: sub-multiples of well-formed sinusoids (sinusoidality > 0.8)
F0-salience function
  Typical salience functions: maximize auto-correlation function (ACF); maximize comb-filter output
  Harmonic sieve-type [Pol07]: sensitive to strong harmonic sounds
  Two-way mismatch (TWM) [Mah94]: error function sensitive to the deviation of measured partials/sinusoids from ideal harmonic locations
F0-candidate pruning
  Sort in ascending order of TWM errors
  Prune weaker F0-candidates in close vicinity (25 cents) of stronger F0 candidates
Presenter notes: TWM error is biased towards the voice because of the large spectral spread of the voice (up to 5 kHz). Harmonics are always predicted up to 5 kHz.
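The pruning step above (sort by ascending TWM error, then drop weaker candidates within 25 cents of a stronger one) can be sketched as follows. This is a minimal illustration, not the author's code: the function names and the (F0, error) tuple layout are assumptions, and the TWM errors are taken as already computed.

```python
import math


def cents(f_a, f_b):
    """Signed interval from f_b to f_a in cents (1200 cents per octave)."""
    return 1200.0 * math.log2(f_a / f_b)


def prune_candidates(candidates, vicinity_cents=25.0):
    """Keep the lowest-error candidate in every 25-cent neighbourhood.

    candidates: list of (f0_hz, twm_error) tuples (hypothetical layout).
    Returns the survivors sorted by ascending TWM error.
    """
    survivors = []
    # Lower TWM error means a stronger candidate, so visit those first.
    for f0, err in sorted(candidates, key=lambda c: c[1]):
        if all(abs(cents(f0, kept)) > vicinity_cents for kept, _ in survivors):
            survivors.append((f0, err))
    return survivors


if __name__ == "__main__":
    cands = [(220.0, 0.3), (221.0, 0.5), (440.0, 0.4)]
    print(prune_candidates(cands))
```

In the example, 221 Hz lies about 8 cents from 220 Hz and has the larger error, so it is pruned, while 440 Hz is an octave away and survives.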
SYSTEM DESIGN - Predominant-F0 Trajectory Extraction
Objective: to find the path through the F0-candidate v/s time space that best represents the predominant-F0 trajectory
Dynamic-programming [Ney83] based path finding
  Measurement cost = TWM error
  Smoothness cost must be based on musicological considerations
Cost functions (p and p' are F0s in current and previous frames resp.)
  $W(p, p') = \mathrm{OJC} \cdot \left(\log_2(p'/p)\right)^2$, with OJC = 1.0
  $W(p, p') = 1 - e^{-\frac{(\log_2 p' - \log_2 p)^2}{2\sigma^2}}$, with Std. Dev. $\sigma$ = 0.1
[Figure: normalized distribution of adjacent-frame pitch transitions for male & female singers (hop = 10 ms); x-axis: log change in pitch (-0.4 to 0.4)]
[Figure: cost functions; x-axis: log change in pitch (-1 to 1)]
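The two smoothness costs and the dynamic-programming search can be sketched together. This is an illustrative sketch only: the base-2 logarithm, the function names and the per-frame list layout are assumptions, and the measurement costs stand in for TWM errors computed elsewhere.

```python
import math


def ney_cost(p_prev, p_cur, ojc=1.0):
    """Quadratic log-pitch cost: OJC * (log2(p'/p))^2."""
    return ojc * math.log2(p_prev / p_cur) ** 2


def gaussian_cost(p_prev, p_cur, sigma=0.1):
    """Gaussian log smoothness cost: 1 - exp(-(log2 p' - log2 p)^2 / (2 sigma^2))."""
    d = math.log2(p_prev) - math.log2(p_cur)
    return 1.0 - math.exp(-(d * d) / (2.0 * sigma * sigma))


def best_path(frames, smoothness=gaussian_cost):
    """Minimum-cost F0 path through per-frame candidate lists.

    frames: list (one entry per frame) of lists of (f0_hz, measurement_cost).
    Returns one F0 value per frame.
    """
    # cost[j]: cheapest total cost of any path ending at candidate j.
    cost = [m for _, m in frames[0]]
    back = []  # back[t-1][j]: best predecessor index for candidate j of frame t
    for t in range(1, len(frames)):
        new_cost, pointers = [], []
        for f0, m in frames[t]:
            options = [cost[i] + smoothness(prev_f0, f0)
                       for i, (prev_f0, _) in enumerate(frames[t - 1])]
            i_best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[i_best] + m)
            pointers.append(i_best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [frames[t][j][0] for t, j in enumerate(path)]


if __name__ == "__main__":
    frames = [
        [(200.0, 0.2), (400.0, 0.3)],
        [(202.0, 0.3), (400.0, 0.25)],
        [(204.0, 0.2), (401.0, 0.3)],
    ]
    print(best_path(frames))
```

Both costs are symmetric in p and p', so the order of the arguments does not matter. The Gaussian cost saturates at 1 for large jumps, whereas the quadratic cost keeps growing with the size of the jump.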
EVALUATION - Predominant-F0 extraction: Indian Music
Data
  Classical: 4 min. of multi-track data; Film: 2 min. of multi-track data
  Ground truth: output of YIN PDA [Chev02] on clean voice tracks with manual correction
Evaluation metrics
  Pitch Accuracy (PA) = % of vocal frames whose pitch has been correctly tracked (within 50 cents)
  Chroma Accuracy (CA) = PA except that octave errors are forgiven

Parameter                         Value
Frame length                      40 ms
Hop                               10 ms
Lower limit on F0                 100 Hz
Upper limit on F0                 1280 Hz
Upper limit on spectral content   5000 Hz

Genre                    Audio content                             PA (%)   CA (%)
Indian classical music   Voice + percussion                        97.4     99.5
Indian classical music   Voice + percussion + drone                92.9     98.2
Indian classical music   Voice + percussion + drone + harmonium    67.3     73.04
Indian pop music         Voice + guitar                            96.7     96.7
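The two metrics defined above can be computed as in the sketch below. It is a minimal illustration under assumptions: the function name is hypothetical, the inputs are per-frame F0 values in Hz restricted to ground-truth vocal frames, and an estimate of 0 Hz is treated as "no pitch reported".

```python
import math


def pitch_and_chroma_accuracy(est_hz, ref_hz, tol_cents=50.0):
    """Return (PA, CA) in percent over ground-truth vocal frames.

    est_hz, ref_hz: equal-length lists; every ref value is a voiced F0 > 0.
    """
    pa_hits = ca_hits = 0
    for est, ref in zip(est_hz, ref_hz):
        if est <= 0.0:
            continue  # no pitch reported for this frame: counts as a miss
        diff = 1200.0 * math.log2(est / ref)
        if abs(diff) <= tol_cents:
            pa_hits += 1
        # Fold the error into [-600, 600) cents so octave errors are forgiven.
        folded = (diff + 600.0) % 1200.0 - 600.0
        if abs(folded) <= tol_cents:
            ca_hits += 1
    n = len(ref_hz)
    return 100.0 * pa_hits / n, 100.0 * ca_hits / n


if __name__ == "__main__":
    est = [220.0, 440.0, 230.0, 0.0]
    ref = [220.0, 220.0, 220.0, 220.0]
    print(pitch_and_chroma_accuracy(est, ref))
```

In the example the second frame is exactly one octave high, so it counts towards CA but not PA; the third frame is about 77 cents sharp and counts towards neither.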
SYSTEM DESIGN - Voicing Detection
[Block diagram: Polyphonic signal -> Feature extraction -> Classifier -> Grouping (with boundary detector) -> Decision labels]
Features
  FS1: 13 MFCCs
  FS2: 7 static timbral features
  FS3: Normalized harmonic energy (NHE)
Classifier: GMM, 4 mixtures per class
Boundary detector: audio novelty detector [Foote] with NHE
Data
  23 min. of Hindustani training data
  7 min. of Hindustani testing data
Results on testing data (recall: % of actual frames that were correctly labeled)

              Frame-level                                After grouping
Feature set   Vocal recall (%)   Instrumental recall (%)   Vocal recall (%)   Instrumental recall (%)
FS1           92.17              66.43                     97.86              61.61
FS2           92.38              66.29                     96.28              68.98
FS3           89.05              92.10                     93.15              96.58

Presenter notes: Tried combining FS2 and FS3; this did not increase performance. FS2 was arrived at using MI and cross-validation experiments.
EVALUATION - Submission to MIREX 2008 & 2009
[Block diagram of the submitted system]
  Signal representation: music signal -> DFT -> main-lobe matching -> parabolic interpolation -> sinusoid frequencies and magnitudes
  Multi-F0 analysis: sub-multiples of sinusoids within the F0 search range -> TWM error computation -> sort (ascending) -> vicinity pruning -> F0 candidates and measurement costs
  Predominant-F0 trajectory extraction: dynamic programming-based optimal path finding -> predominant F0 contour
  Voicing detection: thresholding normalized harmonic energy -> grouping over homogeneous segments -> vocal segment pitch tracks
Music Information Retrieval Evaluation eXchange
  Started in 2004
  International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL)
  Common platform for evaluation on common datasets
  Tasks
    Audio genre, artist, mood classification
    Audio melody extraction
    Audio beat tracking
    Audio key detection
    Query by singing/humming
    Audio chord estimation
EVALUATION - MIREX 2008 & 2009 Datasets & Evaluation
Data
  ADC 2004: publicly available data. 20 excerpts (about 20 sec each) from pop, opera, jazz & MIDI
  MIREX 2005: secret data. 25 excerpts (10-40 sec) from rock, R&B, pop, jazz, solo piano
  MIREX 2008: ICM data. 4 excerpts of 1 minute each from a male and female Hindustani vocal performance; 2 min. each with and without a loud harmonium
  MIREX 2009: MIR-1K data. 374 karaoke recordings of Chinese songs. Each recording is mixed at 3 different signal-to-accompaniment ratios (SARs) {-5, 0, 5 dB}
Evaluation metrics
  Pitch evaluation: pitch accuracy (PA) and chroma accuracy (CA)
  Voicing evaluation: vocal recall (Vx recall) and vocal false alarm rate (Vx false alm)
  Overall accuracy: % of correctly detected vocal frames with correctly detected pitch
  Run-time
Presenter notes: Overall accuracy is computed as the proportion of frames correctly labeled with pitch and voicing. Since there is a strong bias (80/20) towards voiced frames, the more voiced frames are detected, the better the overall accuracy. We set our voicing threshold as per the best trade-off obtained between voiced/instrumental frame detection. If we bias the voicing threshold towards detecting more voiced frames, the overall accuracy was found to improve and come on par with the best systems.
EVALUATION - MIREX 2009 & 2010: MIREX05 dataset (vocal)

2009
Participant   Vx Recall   Vx False Alm   Pitch accuracy   Chroma accuracy   Overall accuracy   Runtime (dd:hh:mm)
cl1           91.2267     67.7071        70.807           73.924            59.7095            -
cl2           80.9512     44.8062        70.807           73.924            64.4609            -
dr1           93.7543     53.5255        76.1145          77.7138           66.9613            -
dr2           88.0635     38.0207        70.9258          75.9161           66.5175            -
hjc1          65.8441     19.8206        62.6594          73.4868           54.8538            -
hjc2          65.8441     19.8206        54.1294          69.4221           52.2905            -
jjy           88.8546     41.9813        76.2696          79.3236           66.3123            -
kd            82.6017     15.2554        77.4622          80.8218           76.962             -
mw            99.9345     99.7947        75.7398          80.3791           53.7323            -
pc            75.6436     21.0879        71.7068          72.5089           70.465             -
rr            92.908      56.3639        75.9506          79.1084           65.7745            -
toos          91.2267     67.7071        70.807           73.924            59.7095            -

2010
Participant   Vx Recall   Vx False Alm   Pitch accuracy   Chroma accuracy   Overall accuracy   Runtime (dd:hh:mm)
HJ1           70.80       44.90          71.70            74.93             53.89              00:59:31
TOOS1         84.65       41.93          68.87            74.62             60.84              00:50:31
JJY2          96.86       69.62          70.17            78.32             60.81              01:09:30
JJY1          97.33       70.16          71.58            78.61             61.54              03:48:02
SG1           76.39       22.78          61.80            73.70             62.13              00:08:15
EVALUATION - MIREX 2009 & 2010: MIREX09 dataset (0 dB mix)

2009
Participant   Vx Recall   Vx False Alm   Pitch accuracy   Chroma accuracy   Overall accuracy   Runtime (dd:hh:mm)
cl1           92.4858     83.5749        59.138           62.9508           43.9659            00:00:28
cl2           77.2085     59.7352        59.138           62.9508           49.2294            00:00:33
dr1           91.8716     55.3555        69.8804          72.5138           60.1294            16:00:00
dr2           87.3985     47.3422        66.549           70.7923           59.5076            00:08:44
hjc1          34.1722     1.7909         72.6577          75.2906           53.1752            00:05:44
hjc2          34.1722     1.7909         51.6871          70.002            51.7469            00:09:38
jjy           38.906      19.4063        75.9354          80.2461           49.686             02:14:06
kd            91.1846     47.7842        80.4565          81.8811           68.2237            00:00:24
mw            99.992      99.4688        67.2905          71.0018           43.6365            00:02:12
pc            73.1175     43.4773        50.8895          53.3672           51.5001            03:05:57
rr            88.8091     50.7595        68.6242          71.3714           60.7733            00:00:26
toos          99.9829     99.4185        82.2943          85.7474           53.5623            01:00:28

2010
Participant   Vx Recall   Vx False Alm   Pitch accuracy   Chroma accuracy   Overall accuracy   Runtime (dd:hh:mm)
HJ1           82.06       14.27          83.15            84.23             76.17              14:39:16
TOOS1         94.17       38.58          82.59            86.18             72.23              12:07:21
JJY2          98.33       70.62          81.29            83.83             62.55              14:06:20
JJY1          98.34       70.65          82.20            84.57             62.90              65:21:11
SG1           89.65       30.22          80.05            85.50             73.59              01:56:27
EVALUATION - Problems in Melody Extraction
No marked improvement in melody extraction performance over the last 3 years (2007-2009) [Dres2010]
Errors due to loud pitched accompaniment
  Accompaniment pitch tracked instead of voice: error in predominant-F0 trajectory extraction
  Accompaniment pitch tracked along with voice: error in voicing detection
Errors due to signal dynamics
  Octave errors due to fixed window length: error in signal representation
ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Problems
Incorrect tracking of loud pitched accompaniment
ICM data: largest reduction in accuracy for audio in which
  Voice displays large, rapid modulations
  Instrument pitch is flat
Predominant-F0 trajectory: DP-based path finding, based on suitably defined
  Measurement cost
  Smoothness cost
Accompaniment errors
  Bias in measurement cost: salient (spectrally rich) instrument
  Bias in smoothness cost: stable-pitched instrument
ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Design & Implementation
Extension of DP to track ordered pairs of F0 candidates (nodes), called dual-F0 tracking
Node formation
  All possible pairs
    Computationally expensive (10P2 = 90)
    F0 and (sub)multiple may be tracked
  Prohibit pairing of harmonically related F0 candidates
    Low harmonic threshold of 5 cents
    Allows pairing of voice F0 and octave-separated instrument F0 because of voice detuning
Node measurement cost computation: joint TWM error [Mah94]
Node smoothness cost computation: sum of corresponding F0 candidate smoothness costs
Final selection of predominant-F0 contour: based on voice-harmonic instability
Presenter notes: nPk = fact(n)/fact(n-k). Also modeling interference. At most one other competing pitched instrument will be present.
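Node formation with the harmonic constraint can be sketched as below. This is an assumption-laden illustration rather than the author's implementation: two candidates are treated as harmonically related when the higher one lies within 5 cents of an integer multiple of the lower one, the `max_multiple` bound is hypothetical, and with a multiple of 1 the check also rejects near-identical candidates.

```python
import math
from itertools import permutations


def harmonically_related(f_a, f_b, threshold_cents=5.0, max_multiple=10):
    """True if the higher F0 is within the threshold of k * lower F0, k = 1..max_multiple."""
    lo, hi = sorted((f_a, f_b))
    for k in range(1, max_multiple + 1):
        if abs(1200.0 * math.log2(hi / (k * lo))) <= threshold_cents:
            return True
    return False


def form_nodes(f0_candidates, threshold_cents=5.0):
    """Ordered F0 pairs (nodes), excluding harmonically related pairs."""
    return [(a, b) for a, b in permutations(f0_candidates, 2)
            if not harmonically_related(a, b, threshold_cents)]


if __name__ == "__main__":
    # Ten candidates give 10P2 = 90 ordered pairs before any constraint.
    print(len(list(permutations(range(10), 2))))
    # An exact octave is rejected; an octave detuned by about 9 cents is kept.
    print(form_nodes([200.0, 400.0]))
    print(form_nodes([200.0, 402.0]))
```

The tight 5-cent threshold is what lets a slightly detuned voice pair with an octave-separated instrument, as noted on the slide.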
ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Selection of Predominant-F0 contour
Harmonic Sinusoidal Model (HSM)
  Partial tracking algorithm used in SMS [Serra98]
  Tracks are indexed and linked by harmonic number
Std. dev. pruning
  Prune tracks in 200 ms segments whose std. dev.
ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Final Implementation Block Diagram
[Block diagram]
  Signal representation: music signal -> DFT -> main-lobe matching -> parabolic interpolation -> sinusoid frequencies and magnitudes
  Multi-F0 analysis: sub-multiples of sinusoids within the F0 search range -> TWM error computation -> sorting (ascending) -> vicinity pruning -> F0 candidates and saliences
  Predominant-F0 trajectory extraction
    Single-F0: optimal path finding -> melodic contour
    Dual-F0: ordered pairing of F0 candidates with harmonic constraint -> joint TWM error computation -> nodes (F0 pairs) -> optimal path finding -> vocal pitch identification -> melodic contour
Presenter notes: Insert box between dual-F0 output and single-F0 output
ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Experimental evaluation: Setup
Participating systems
  TWMDP (single- and dual-F0)
  LIWANG [LiWang07]
    Uses HMM to track predominant-F0
    Includes the possibility of a 2-pitch hypothesis but finally outputs a single F0
    Shown to be superior to other contemporary systems
  Same F0 search range [80-500 Hz]
Evaluation metrics
  Multi-F0 stage: % presence of true voice F0 in candidate list
  Predominant-F0 extraction (PA & CA)
    Single-F0
    Dual-F0
      Final contour accuracy
      Either-Or accuracy: correct pitch is present in at least one of the two outputs
ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Experimental Evaluation: Data & Results
Data

Dataset   Description                                                     Vocal (sec)   Total (sec)
1         Li & Wang data                                                  55.4          97.5
2         Examples from MIR-1k dataset with loud pitched accompaniment    61.8          98.1
3         Examples from MIREX08 data (Indian classical music)             91.2          99.2
TOTAL                                                                     208.4         294.8

Multi-F0 evaluation: percentage presence of voice-F0 (%)

Dataset   Top 5 candidates   Top 10 candidates
1         92.9               95.4
2         88.5               95.1
3         90.0               94.1

[Figure: comparison of TWMDP single-F0 tracker and LIWANG for dataset 1 at SARs of 10, 5, 0 and -5 dB; (a) pitch accuracies (%), (b) chroma accuracies (%)]
Presenter notes: Insert sound examples
ENHANCEMENTS: PREDOMINANT-F0 TRACKING - Experimental evaluation: Results (Dual-F0)
TWMDP (% improvement over LIWANG (A1))

Dataset   Metric   Single-F0 (A2)   Dual-F0 Either-Or   Dual-F0 Final (A3)
1         PA (%)   88.5 (8.3)       89.3 (0.9)          84.1 (2.9)
1         CA (%)   90.2 (6.4)       92.0 (1.1)          88.8 (3.9)
2         PA (%)   57.0 (24.5)      74.2 (-6.8)         69.1 (50.9)
2         CA (%)   61.1 (14.2)      81.2 (-5.3)         74.1 (38.5)
3         PA (%)   66.0 (11.3)      85.7 (30.2)         73.9 (24.6)
3         CA (%)   66.5 (9.7)       87.1 (18.0)         76.3 (25.9)

TWMDP single-F0 significantly better than LIWANG system for all datasets
TWMDP dual-F0 significantly better than TWMDP single-F0 for datasets 2 & 3
Scope for further improvement in final predominant-F0 identification
  Indicated by difference between TWMDP dual-F0 Either-Or and Final accuracies
[Figure: bar chart (0 to 80) of D2 (PA), D2 (CA), D3 (PA), D3 (CA) for A1, A2 and A3]
Presenter notes: State reasons for TWMDP better than LIWANG
ENHANCEMENTS: PREDOMINANT-F0 EXTRACTION - Example of F0 collisions
Contour switching occurs at F0 collisions
[Figure: F0 (octaves ref. 110 Hz) versus time (0 to 4 sec), three panels]
  (a) Single-F0 tracking: ground truth voice pitch, ground truth harmonium pitch, single-F0 output
  (b) Dual-F0 tracking (intermediate): ground truth voice pitch, dual-F0 contour 1, dual-F0 contour 2
  (c) Dual-F0 tracking (final): ground truth voice pitch, dual-F0 final output
Presenter notes: Insert vocal harmony example
ENHANCEMENTS: VOICING DETECTION - Problems
ENHANCEMENTS: VOICING DETECTION - Features
Proposed feature set: combination of static & dynamic features
Features extracted using a harmonic sinusoidal model representation
Feature selection in each feature set using information entropy [Weka]
C1: Static timbral
  F0
  10 Harmonic powers
  Spectral centroid (SC)
  Sub-band energy (SE)
C2: Dynamic timbral
  10 Harmonic powers
  SC & SE
  Std. Dev. of SC for 0.5, 1 & 2 sec
  MER of SC for 0.5, 1 & 2 sec
  Std. Dev. of SE for 0.5, 1 & 2 sec
  MER of SE for 0.5, 1 & 2 sec
C3: Dynamic F0-Harmonic
  Mean & median of F0
  Mean, median & Std. Dev. of Harmonic [0-2 kHz]
  Mean, median & Std. Dev. of Harmonic [2-5 kHz]
  Mean, median & Std. Dev. of Harmonics 1 to 5
  Mean, median & Std. Dev. of Harmonics 6 to 10
  Mean, median & Std. Dev. of Harmonics 1 to 10
  Ratio of mean, median & Std. Dev. of Harmonics 1 to 5 : Harmonics 6 to 10
MER: modulation energy ratio
ENHANCEMENTS: VOICING DETECTION - Data
I. Western
  Singing: Syllabic. No large pitch modulations. Voice often softer than instrument.
  Dominant instrument: Mainly flat-note (piano, guitar). Pitch range overlapping with voice.
II. Greek
  Singing: Syllabic. Replete with fast pitch modulations.
  Dominant instrument: Equal occurrence of flat-note plucked-string/accordion and of pitch-modulated violin.
III. Bollywood
  Singing: Syllabic. More pitch modulations than Western but fewer than other Indian genres.
  Dominant instrument: Mainly pitch-modulated wood-wind & bowed instruments. Pitches often much higher than voice.
IV. Hindustani
  Singing: Syllabic and melismatic. Varies from long, pitch-flat, vowel-only notes to large & rapid modulations.
  Dominant instrument: Mainly flat-note harmonium (woodwind). Pitch range overlapping with voice.
V. Carnatic
  Singing: Syllabic and melismatic. Replete with fast pitch modulations.
  Dominant instrument: Mainly pitch-modulated violin. F0 range generally higher than voice but has some overlap in pitch range.

Genre            Number of songs   Vocal duration   Instrumental duration   Overall duration
I. Western       11                7m 19s           7m 02s                  14m 21s
II. Greek        10                6m 30s           6m 29s                  12m 59s
III. Bollywood   13                6m 10s           6m 26s                  12m 36s
IV. Hindustani   8                 7m 10s           5m 24s                  12m 54s
V. Carnatic      12                6m 15s           5m 58s                  12m 13s
Total            45                33m 44s          31m 19s                 65m 03s
ENHANCEMENTS: VOICING DETECTION - Evaluation
Two cross-validation experiments
  Intra-genre: leave 1 song out
  Inter-genre: leave 1 genre out
Feature combination
  Concatenation
  Classifier combination
Baseline features: 13 MFCCs [Roc07]
Evaluation: vocal recall (%) and precision (%)
Overall results
  C1 better than baseline
  C1+C2+C3 better than C1
  Classifier combination better than feature concatenation
[Figure: vocal precision v/s recall curves for different feature sets (MFCC, C1, C1+C2+C3) across genres in the 'Leave 1 song out' experiment]
ENHANCEMENTS: VOICING DETECTION - Evaluation (contd.)
I II III IV V Total
Baseline 77.2 66.0 65.6 82.6 83.2 74.9
F0-MFCCs 76.9 72.3 70.0 78.9 83.0 76.2
C1 81.1 67.8 74.8 78.9 84.5 77.4
C1+C2 82.9 78.5 82.8 79.5 85.0 81.7
C1+C3 81.5 72.9 77.6 83.9 84.9 80.2
C1+C2+C3 82.1 81.1 83.5 83.0 84.7 82.8
I II III IV V Total
Baseline 77.2 66.0 65.6 82.6 83.2 74.9
F0-MFCCs 78.9 77.8 78.0 85.9 85.9 81.2
C1 79.6 77.4 79.3 82.3 87.1 81.0
C1+C2 82.3 83.6 85.4 83.3 86.8 84.2
C1+C3 80.2 83.4 81.7 89.7 88.2 84.5
C1+C2+C3 81.1 86.9 86.4 88.5 87.3 85.9
Tables: Leave 1 song out (Recall %) per genre (I to V), for semi-automatic F0-driven HSM and fully-automatic F0-driven HSM
Genre-specific feature set adaptation: C1+C2 for Western, C1+C3 for Hindustani
ENHANCEMENTS: SIGNAL REPRESENTATION - Sparsity-driven window length adaptation
Relation between window length and signal characteristics
  Dense spectrum (multiple harmonic sources) -> long window
  Non-stationarity (rapid pitch modulations) -> short window
Adaptive time segmentation for signal modeling and synthesis [Good97]
  Based on minimizing reconstruction error between synthesized and original signals
  High computational cost
Easily computable measures for adapting window length
  Signal sparsity: a sparse spectrum has concentrated components
  Window length selection (23.2, 46.4, 92.9 ms) based on maximizing signal sparsity
Sparsity measures
  L2 norm: $L2 = \sqrt{\sum_{k} |X_n(k)|^2}$
  Normalized kurtosis: $KU = \dfrac{\frac{1}{N}\sum_{k} |X_n(k) - \bar{X}|^4}{\left(\frac{1}{N}\sum_{k} |X_n(k) - \bar{X}|^2\right)^2}$
  Gini index: $GI = 1 - 2\sum_{k} \dfrac{X_n^k}{\lVert X_n \rVert_1}\left(\dfrac{N - k + 0.5}{N}\right)$
  Hoyer measure: $HO = \left(\sqrt{N} - \dfrac{\sum_{k} |X_n(k)|}{\sqrt{\sum_{k} |X_n(k)|^2}}\right)\left(\sqrt{N} - 1\right)^{-1}$
  Spectral flatness: $SF = \dfrac{\sqrt[N]{\prod_{k} |X_n(k)|^2}}{\frac{1}{N}\sum_{k} |X_n(k)|^2}$
Presenter notes: $\bar{X}$ is the mean spectral amplitude. $X_n^k$ is $X_n(k)$ sorted in ascending order, giving the ordered magnitude spectral coefficients. Maximize KU, HO, GI and minimize L2 and ASF.
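Three of the sparsity measures, and the idea of picking the window length that maximizes sparsity, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, each spectrum is given as a plain list of magnitudes for one frame, a perfectly flat spectrum is assigned a kurtosis of 0 to avoid division by zero, and spectral flatness and the L2 norm are left out for brevity.

```python
import math


def kurtosis(mag):
    """Normalized kurtosis of the magnitude spectrum (higher = sparser)."""
    n = len(mag)
    mean = sum(mag) / n
    m2 = sum((x - mean) ** 2 for x in mag) / n
    m4 = sum((x - mean) ** 4 for x in mag) / n
    if m2 == 0.0:
        return 0.0  # assumption: treat a perfectly flat spectrum as least sparse
    return m4 / (m2 * m2)


def gini(mag):
    """Gini index over magnitudes sorted in ascending order (higher = sparser)."""
    n = len(mag)
    l1 = sum(mag)
    return 1.0 - 2.0 * sum((x / l1) * ((n - k + 0.5) / n)
                           for k, x in enumerate(sorted(mag), start=1))


def hoyer(mag):
    """Hoyer measure: 1 for a single non-zero bin, 0 for a flat spectrum."""
    n = len(mag)
    l1 = sum(mag)
    l2 = math.sqrt(sum(x * x for x in mag))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


def pick_window_ms(spectra_by_window, measure=kurtosis):
    """Return the window length (ms) whose spectrum maximizes the sparsity measure."""
    return max(spectra_by_window, key=lambda w: measure(spectra_by_window[w]))


if __name__ == "__main__":
    sparse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # one concentrated peak
    smeared = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]  # same peak spread over 3 bins
    print(kurtosis(sparse), gini(sparse), hoyer(sparse))
    print(pick_window_ms({23.2: smeared, 46.4: sparse}))
```

In practice the spectra would come from analysing the same frame with each candidate window length; the toy spectra here only show that a concentrated peak scores higher than a smeared one on all three measures.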
ENHANCEMENTS: SIGNAL REPRESENTATION - Sparsity-driven window length adaptation [contd.]
Experimental comparison between fixed and adaptive schemes
  Fixed and adaptive window lengths (different sparsity measures)
  Sinusoid detection by main-lobe matching
Data
  Simulations: two-sound mixtures (polyphony) and vibrato signal
  Real: Western pop (Whitney, Mariah) and Hindustani taans
Evaluation metrics
  Recall (%) and frequency deviation (Hz)
  Expected harmonic locations computed from ground-truth pitch
Results
  1. Adaptive schemes give higher recall and lower frequency deviation
  2. Kurtosis-driven adaptation is superior to the other sparsity measures
GRAPHICAL USER INTERFACE - Motivation
Generalized music transcription system still unavailable
Solution [Wang08]
  Semi-automatic approach
  Application-specific design, e.g. music tutoring
Two, possibly independent, aspects of melody extraction
  Voice pitch extraction: manually difficult
  Vocal segment detection: manually easier
Semi-automatic tool
  Goal: to facilitate the extraction & validation of the voice pitch in polyphonic recordings with minimal human intervention
  Design considerations
    Accurate pitch detection
    Completely parametric control
    User-friendly control for vocal segment detection
Presenter notes: We are not going to neglect voicing detection. In fact, we have devised a feature that exploits pitch instability to discriminate between voice and stable-pitch instruments [ICSLP09].
GRAPHICAL USER INTERFACE - Design
Salient features
  Melody extraction back-end
  Validation: visual (spectrogram) and aural (re-synthesis)
  Segmental parameter variation
  Easy non-vocal labeling
  Saving final result & parameters
  Selective use of dual-F0 tracker; switching between contours
[Figure: GUI screenshot with labelled regions: A waveform viewer; B spectrogram & pitch view; C menu bar; D controls for viewing, scrolling, playback & volume control; E parameter window; F log viewer]
CONCLUSIONS AND FUTURE WORK - Final system block diagram
[Block diagram]
  Signal representation: music signal -> DFT -> main-lobe matching -> parabolic interpolation -> sinusoid frequencies and magnitudes
  Multi-F0 analysis: sub-multiples of sinusoids within the F0 search range -> TWM error computation -> sorting (ascending) -> vicinity pruning -> F0 candidates and saliences
  Predominant-F0 trajectory extraction
    Single-F0: optimal path finding -> predominant F0 contour
    Dual-F0: ordered pairing of F0 candidates with harmonic constraint -> joint TWM error computation -> nodes (F0 pairs and saliences) -> optimal path finding -> vocal pitch identification -> predominant F0 contour
  Voicing detector: Harmonic Sinusoidal Model -> Feature Extraction -> Classifier -> Grouping (with Boundary Deletion) -> Voice Pitch Contour
CONCLUSIONS AND FUTURE WORK - Conclusions
State-of-the-art melody extraction system designed by making careful choices for system modules
Enhancements to the above system increase robustness to loud, pitched accompaniment
  Dual-F0 tracking for predominant-F0 extraction
  Combination of static & dynamic, timbral & F0-harmonic features for voicing detection
Fully-automatic, high-accuracy melody extraction still not feasible
  Large variability in underlying signal conditions due to diversity of music
  A priori knowledge of music and signal conditions: male/female singer; rate of pitch variation
High-accuracy melodic contours can be extracted using a semi-automatic approach
CONCLUSIONS AND FUTURE WORK - Summary of contributions
Design & validation of a novel, practically useful melody extraction system with increased robustness to pitched accompaniment
  Signal representation
    Choice of main-lobe matching criterion for sinusoid identification
    Improved sinusoid detection by signal-sparsity-driven window length adaptation
  Multi-F0 analysis
    Choice of TWM error as salience function
    Improved voice-F0 detection by separation of F0 candidate identification & salience computation
  Predominant-F0 trajectory extraction
    Gaussian log smoothness cost
    Dual-F0 tracking
    Final predominant-F0 contour identification by voice-harmonic instability
  Voicing detection
    Use of predominant-F0-derived signal representation
    Combination of static and dynamic, timbral and F0-harmonic features
Design of a novel graphical user interface for semi-automatic use of the melody extraction system
CONCLUSIONS AND FUTURE WORK - Future work
Melody extraction
  Identification of single predominant-F0 contour from dual-F0 output: use of dynamic features
  F0 collisions
    Detection based on minima in difference of constituent F0s of nodes
    Correction allowing pairing of F0 with itself around these locations
    Use of prediction-based partial tracking [Lag07]
  Validation across larger, more diverse datasets
  Incorporate predictive path-finding in DP algorithm
  Extend algorithm to instrumental pitch tracking in polyphony
    Homophonic music: lead instrument (e.g. flute) with accompaniment
    Polyphonic instruments (sitar)
Applications of melody extraction
  Singing evaluation & feedback
  QBSH systems
  Musicological studies
CONCLUSIONS AND FUTURE WORK - List of related publications
International Journals
V. Rao, P. Gaddipati and P. Rao, Signal-driven window adaptation for sinusoid identification in polyphonic music, IEEE Transactions on Audio, Speech, and Language Processing, 2011. (accepted)
V. Rao and P. Rao, Vocal melody extraction in the presence of pitched accompaniment in polyphonic music, IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 8, pp. 2145-2154, Nov. 2010.
International Conferences
V. Rao, C. Gupta and P. Rao, Context-aware features for singing voice detection in polyphonic music, 9th International Workshop on Adaptive Multimedia Retrieval, 2011. (submitted for review)
V. Rao, S. Ramakrishnan and P. Rao, Singing voice detection in polyphonic music using predominant pitch, in Proceedings of InterSpeech, Brighton, U.K., 2009.
V. Rao and P. Rao, Improving polyphonic melody extraction by dynamic programming-based dual-F0 tracking, in Proceedings of the 12th International Conference on Digital Audio Effects (DAFx), Como, Italy, 2009.
V. Rao and P. Rao, Vocal melody detection in the presence of pitched accompaniment using harmonic matching methods, in Proceedings of the 11th International Conference on Digital Audio Effects (DAFx), Espoo, Finland, 2008.
A. Bapat, V. Rao and P. Rao, Melodic contour extraction for Indian classical vocal music, in Proceedings of Music-AI (International Workshop on Artificial Intelligence and Music) in IJCAI, Hyderabad, India, 2007.
V. Rao and P. Rao, Melody extraction using harmonic matching, in Proceedings of the Music Information Retrieval Exchange MIREX 2008 & 2009, URL: http://www.music-ir.org/mirex/abstracts/2009/RR.pdf
CONCLUSIONS AND FUTURE WORK - List of related publications [contd.]
National Conferences
S. Pant, V. Rao and P. Rao, A melody detection user interface for polyphonic music, in Proceedings of National Conference on Communication (NCC), Chennai, India, 2010.
N. Santosh, S. Ramakrishnan, V. Rao and P. Rao, Improving singing voice detection in the presence of pitched accompaniment, in Proceedings of National Conference on Communication (NCC), Guwahati, India, 2009.
V. Rao, S. Pant, M. Bhaskar and P. Rao, Applications of a semi-automatic melody extraction interface for Indian music, in Proceedings of International Symposium on Frontiers of Research in Speech and Music (FRSM), Gwalior, India, Dec. 2009.
V. Rao, S. Ramakrishnan and P. Rao, Singing voice detection in north Indian classical music, in Proceedings of National Conference on Communication (NCC), Mumbai, India, 2008.
V. Rao and P. Rao, Objective evaluation of a melody extractor for north Indian classical vocal performances, in Proceedings of International Symposium on Frontiers of Research in Speech and Music (FRSM), Kolkata, India, 2008.
V. Rao and P. Rao, Vocal trill and glissando thresholds for Indian listeners, in Proceedings of International Symposium on Frontiers of Research in Speech and Music (FRSM), Mysore, India, 2007.
Patent
P. Rao, V. Rao and S. Pant, A device and method for scoring a singing voice, Indian Patent Application, No. 1338/MUM/2009, Filed June 2, 2009.
REFERENCES
[Pol07] G. Poliner, D. Ellis, A. Ehmann, E. Gomez, S. Streich and B. Ong, Melody transcription from music audio: Approaches and evaluation, IEEE Trans. Audio, Speech, Lang., Process., vol. 15, no. 4, pp. 1247-1256, May 2007.
[Grif88] D. Griffin and J. Lim, Multiband Excitation Vocoder, IEEE Trans. on Acoust., Speech and Sig. Process., vol. 36, no. 8, pp. 1223-1235, 1988.
[Wang08] Y. Wang and B. Zhang, Application-specific music transcription for tutoring, IEEE Multimedia, vol. 15, no. 3, pp. 70-74, 2008.
[Chev02] A. de Cheveigné and H. Kawahara, YIN, a Fundamental Frequency Estimator for Speech and Music, J. Acoust. Soc. America, vol. 111, no. 4, pp. 1917-1930, 2002.
[LiWang07] Y. Li and D. Wang, Separation of singing voice from music accompaniment for monoaural recordings, IEEE Trans. Audio, Speech and Lang. Process., vol. 15, no. 4, pp. 1475-1487, 2007.
[Mah94] R. Maher and J. Beauchamp, Fundamental Frequency Estimation of Musical Signals using a Two-Way Mismatch Procedure, J. Acoust. Soc. Amer., vol. 95, no. 4, pp. 2254-2263, Apr. 1994.
[Dres2010] K. Dressler, Audio melody extraction for MIREX 2009, Ilmenau: Fraunhofer IDMT, 2010.
[Ney83] H. Ney, Dynamic Programming Algorithm for Optimal Estimation of Speech Parameter Contours, IEEE Trans. Systems, Man and Cybernetics, vol. SMC-13, no. 3, pp. 208-214, Apr. 1983.
[Lag07] M. Lagrange, S. Marchand and J. B. Rault, Enhancing the tracking of partials for the sinusoidal modeling of polyphonic sounds, IEEE Trans. Audio, Speech and Lang. Process., vol. 15, no. 5, pp. 1625-1634, 2007.
[Chao09] C. Hsu and R. Jang, On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset, IEEE Trans. Audio, Speech, and Lang. Process., 2009, accepted.
[Good97] M. Goodwin, Adaptive signal models: Theory, algorithms and audio applications, Ph.D. dissertation, MIT, 1997.
[Lag08] M. Lagrange, L. Martins, J. Murdoch and G. Tzanetakis, Normalised cuts for predominant melodic source separation, IEEE Trans. Audio, Speech, Lang., Process. (Sp. Issue on MIR), vol. 16, no. 2, pp. 278-290, Feb. 2008.
[Pol05] G. Poliner and D. Ellis, A classification approach to melody transcription, in Proc. Intl. Conf. Music Information Retrieval, London, 2005.