Vocal Melody Extraction from Polyphonic Audio with Pitched Accompaniment. Vishweshwara Rao (05407001), Ph.D. Defense. Guide: Prof. Preeti Rao. June 2011. Department of Electrical Engineering, Indian Institute of Technology Bombay.


  • Vocal Melody Extraction from Polyphonic Audio with Pitched

    Accompaniment

    Vishweshwara Rao (05407001), Ph.D. Defense

    Guide: Prof. Preeti Rao

    (June 2011)
    Department of Electrical Engineering

    Indian Institute of Technology Bombay

  • Department of Electrical Engineering, IIT Bombay (slide 2 of 40)

    OUTLINE

    Introduction: Objective, background, motivation, approaches & issues; Indian music

    Proposed melody extraction system: Design; Evaluation

    Problems: Competing pitched accompanying instrument

    Enhancements for increasing robustness to pitched accompaniment:
      Dual-F0 tracking
      Identification of vocal segments by a combination of static and dynamic features
      Signal-sparsity-driven window length adaptation

    Graphical User Interface for melody extraction

    Conclusions and Future work


    INTRODUCTION: Objective

    Vocal melody extraction from polyphonic audio
      Polyphony: multiple musical sound sources present
      Vocal: the lead melodic instrument is the singing voice
      Melody: a sequence of notes (note frequency vs. time); a symbolic representation of music; here, the pitch contour of the singing voice


    Pitch: a perceptual attribute of sound, closely related to periodicity or fundamental frequency (F0)

    [Figure: example waveforms with F0 = 1/T0 = 100 Hz and F0 = 1/T0 = 300 Hz; a vocal pitch contour]


    INTRODUCTION: Motivation, Complexity and Approaches

    Motivation
      Music Information Retrieval applications: query-by-singing/humming (QBSH), artist ID, cover-song ID
      Music edutainment: singing learning, karaoke creation
      Musicology

    Problem complexity
      Singing: large F0 range, pitch dynamics; diversity (inter-singer, across cultures)
      Polyphony: crowded signal; percussive & tonal instruments

    Approaches: understanding without separation; source separation [Lag08]; classification [Pol05]

    System chain: Polyphonic audio signal -> Signal representation -> Multi-F0 analysis -> Predominant-F0 trajectory extraction -> Voicing detection -> Voice F0 contour


    INTRODUCTION: Indian classical music: Signal characteristics

    [Spectrogram, 0-20 sec, 0-2000 Hz, with labeled sources: tabla strokes "Ghe", "Na", "Tun" (percussion); Tanpura (drone); Singer; Harmonium (secondary melody)]


    INTRODUCTION: Melody extraction in Indian classical music

    Issues
      Signal complexity: singing, polyphony
      Variable tonic
      Non-availability of ground-truth data: almost completely improvised (no universally accepted notation)

    Example: tabla strokes "Thit", "Ke", "Tun"


    SYSTEM DESIGN: Our Approach

    Design considerations: singing; robustness to pitched accompaniment; flexibility

    System chain: Polyphonic audio signal -> Signal representation -> Multi-F0 analysis -> Predominant-F0 trajectory extraction -> Singing voice detection -> Voice F0 contour


    SYSTEM DESIGN: Signal Representation

    [Figure: frequency transform (magnitude in dB) of a 40 ms Hamming window]

    Short-time Fourier transform of the windowed signal x(m)w(n-m):
      X(n, k) = sum_{m=0}^{M-1} x(m) w(n-m) e^{-j 2 pi k m / M}

    Frequency-domain representation: pitched sounds have harmonic spectra; short-time analysis and DFT
      Window length chosen to resolve harmonics of the minimum expected F0

    Sinusoidal representation: more compact & relevant
      Different methods of sinusoid identification: magnitude-based, phase-based, main-lobe matching
      Main-lobe matching (sinusoidality) [Grif88] found to be the most reliable: the frequency transform of the window has a known shape, and local peaks whose shape closely matches the window main lobe are declared sinusoids
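The main-lobe matching test can be sketched in code. This is a minimal illustration, not the thesis implementation: the sinusoidality score here is a cosine similarity between a local spectral peak and the sampled Hamming-window main lobe, and all function names and the `width` parameter are assumptions of the sketch.

```python
import numpy as np

def hamming_mainlobe(n_fft, win_len, width=5):
    """Magnitude of the Hamming window's frequency transform near DC:
    the 'known shape' that spectral peaks are matched against."""
    w = np.hamming(win_len)
    W = np.abs(np.fft.rfft(w, n_fft))
    return W[:width] / W[0]  # normalized main-lobe samples

def sinusoidality(mag_spec, peak_bin, lobe):
    """Cosine similarity between a local spectral peak and the ideal
    symmetric window main lobe; values near 1 indicate a sinusoid."""
    half = len(lobe) - 1
    seg = mag_spec[peak_bin - half: peak_bin + half + 1]
    ideal = np.concatenate([lobe[::-1], lobe[1:]])  # mirror the half-lobe
    seg = seg / np.max(seg)
    return float(np.dot(seg, ideal) /
                 (np.linalg.norm(seg) * np.linalg.norm(ideal)))
```

A windowed pure tone scores close to 1 under this test, so a threshold such as the 0.8 used later for "well-formed sinusoids" separates genuine partials from noise peaks.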


    SYSTEM DESIGN: Multi-F0 Analysis

    Objective: to reliably detect the voice F0 in polyphony with a high salience

    F0-candidate identification: sub-multiples of well-formed sinusoids (sinusoidality > 0.8)

    F0-salience function; typical salience functions:
      Maximize the autocorrelation function (ACF)
      Maximize comb-filter output
      Harmonic-sieve type [Pol07]: sensitive to strong harmonic sounds
      Two-way mismatch (TWM) [Mah94]: error function sensitive to the deviation of measured partials/sinusoids from ideal harmonic locations

    F0-candidate pruning: sort in ascending order of TWM error; prune weaker F0 candidates in close vicinity (25 cents) of stronger F0 candidates

    Presenter notes: The TWM error is biased towards the voice because of the voice's large spectral spread (up to 5 kHz); harmonics are always predicted up to 5 kHz.
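A simplified two-way-mismatch error may make the salience function concrete. This sketch keeps only the two mismatch terms and a single weighting constant `rho`; the actual TWM error of [Mah94] uses additional amplitude-dependent weighting parameters, so treat this as an illustration rather than the thesis implementation.

```python
import numpy as np

def twm_error(f0, meas_freqs, meas_amps, n_harm=10, rho=0.33):
    """Simplified two-way-mismatch salience: lower error = more salient F0.
    Combines predicted->measured and measured->predicted mismatch."""
    pred = f0 * np.arange(1, n_harm + 1)          # ideal harmonic locations
    # predicted-to-measured: each predicted harmonic to its nearest sinusoid
    e_pm = np.mean([np.min(np.abs(meas_freqs - p)) / p for p in pred])
    # measured-to-predicted: each sinusoid to its nearest predicted harmonic,
    # weighted by normalized amplitude
    w = meas_amps / np.max(meas_amps)
    e_mp = np.mean([wi * np.min(np.abs(pred - f)) / f
                    for f, wi in zip(meas_freqs, w)])
    return e_pm + rho * e_mp
```

For sinusoids measured at exact harmonics of 200 Hz, `twm_error(200, ...)` comes out lower than for a spurious candidate such as 270 Hz, which is the behavior the candidate sorting above relies on.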


    SYSTEM DESIGN: Predominant-F0 Trajectory Extraction

    [Figures: normalized distribution of adjacent-frame pitch transitions for male & female singers (hop = 10 ms); the two smoothness cost functions vs. log change in pitch (std. dev. = 0.1, OJC = 1.0)]

    Smoothness cost functions (p and p' are the F0s in the previous and current frames, respectively):
      W(p, p') = OJC * (log2(p'/p))^2
      W(p, p') = 1 - exp( -(log2 p' - log2 p)^2 / (2 sigma^2) )

    Objective: to find the path through the F0-candidate vs. time space that best represents the predominant-F0 trajectory

    Dynamic-programming [Ney83] based path finding
      Measurement cost = TWM error
      Smoothness cost must be based on musicological considerations
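The DP path search over per-frame F0 candidates can be sketched as follows, using the Gaussian-shaped smoothness cost on the log pitch jump described above. The function and variable names, and `sigma = 0.1`, are illustrative assumptions of the sketch.

```python
import numpy as np

def track_f0(cands, costs, sigma=0.1):
    """Dynamic-programming path search over per-frame F0 candidates.
    cands[t]: candidate F0s at frame t; costs[t]: measurement costs
    (e.g. TWM errors). Smoothness cost is Gaussian in the log pitch jump."""
    def smooth(p_prev, p):
        d = np.log2(p) - np.log2(p_prev)
        return 1.0 - np.exp(-d**2 / (2 * sigma**2))

    n = len(cands)
    total = [np.array(costs[0], float)]
    back = []
    for t in range(1, n):
        prev = total[-1]
        cur = np.empty(len(cands[t]))
        bp = np.empty(len(cands[t]), int)
        for j, p in enumerate(cands[t]):
            trans = prev + np.array([smooth(q, p) for q in cands[t - 1]])
            bp[j] = int(np.argmin(trans))
            cur[j] = trans[bp[j]] + costs[t][j]
        total.append(cur)
        back.append(bp)
    # backtrack the minimum-cost path
    path = [int(np.argmin(total[-1]))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return [cands[t][path[t]] for t in range(n)]
```

With this cost structure, a single frame whose measurement cost momentarily favors a distant F0 does not flip the track, because the large log-pitch jump is penalized twice (leaving and returning).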


    Predominant-F0 extraction results:

      Genre                   Audio content                              PA (%)   CA (%)
      Indian classical music  Voice + percussion                         97.4     99.5
                              Voice + percussion + drone                 92.9     98.2
                              Voice + percussion + drone + harmonium     67.3     73.04
      Indian pop music        Voice + guitar                             96.7     96.7

    Parameter settings:
      Frame length: 40 ms; Hop: 10 ms
      Lower limit on F0: 100 Hz; Upper limit on F0: 1280 Hz
      Upper limit on spectral content: 5000 Hz

    Data: Classical: 4 min of multi-track data; Film: 2 min of multi-track data
    Ground truth: output of the YIN PDA [Chev02] on clean voice tracks, with manual correction

    Evaluation metrics
      Pitch Accuracy (PA) = % of vocal frames whose pitch has been correctly tracked (within 50 cents)
      Chroma Accuracy (CA) = PA, except that octave errors are forgiven

    EVALUATION: Predominant-F0 extraction: Indian Music
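The PA and CA metrics can be computed as sketched below. The 50-cent tolerance and the octave folding follow the definitions above, while the function name and array conventions are assumptions of the sketch.

```python
import numpy as np

def pitch_metrics(est_f0, ref_f0, tol_cents=50):
    """Pitch accuracy (PA): % of vocal frames whose estimated pitch is
    within tol_cents of the reference. Chroma accuracy (CA): the same
    after folding the error into one octave (octave errors forgiven)."""
    est = np.asarray(est_f0, float)
    ref = np.asarray(ref_f0, float)
    voiced = ref > 0                      # only vocal (voiced) frames count
    cents = 1200 * np.log2(est[voiced] / ref[voiced])
    pa = 100 * np.mean(np.abs(cents) <= tol_cents)
    chroma = (cents + 600) % 1200 - 600   # map error into [-600, 600) cents
    ca = 100 * np.mean(np.abs(chroma) <= tol_cents)
    return pa, ca
```

An exact octave error (1200 cents) folds to 0 cents and is therefore counted as correct by CA but not by PA, which is why CA is always at least as high as PA in the tables above.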


    SYSTEM DESIGN: Voicing Detection

    Pipeline: Polyphonic signal -> Feature extraction -> Classifier -> Boundary detector -> Grouping -> Decision labels

    Features: FS1: 13 MFCCs; FS2: 7 static timbral features; FS3: normalized harmonic energy (NHE)
    Classifier: GMM, 4 mixtures per class
    Boundary detector: audio novelty detector [Foote] with NHE
    Data: 23 min of Hindustani training data; 7 min of Hindustani testing data
    Recall: % of actual frames that were correctly labeled

    Results on testing data:

                   Frame-level                    After grouping
      Feature set  Vocal        Instrumental     Vocal        Instrumental
                   recall (%)   recall (%)       recall (%)   recall (%)
      FS1          92.17        66.43            97.86        61.61
      FS2          92.38        66.29            96.28        68.98
      FS3          89.05        92.10            93.15        96.58

    Presenter notes: Combining FS2 and FS3 was tried but did not increase performance. FS2 was arrived at using mutual information and cross-validation experiments.
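One plausible reading of the NHE feature, sketched in code: the energy at the predicted harmonic bins of the predominant F0, normalized by the total spectral energy of the frame. This interpretation and all names are assumptions of the sketch, not the thesis implementation.

```python
import numpy as np

def normalized_harmonic_energy(mag_spec, f0, fs, n_fft, n_harm=10):
    """Energy at the predicted harmonic bins of f0, normalized by the
    total spectral energy (one plausible reading of NHE)."""
    bins = np.round(np.arange(1, n_harm + 1) * f0 * n_fft / fs).astype(int)
    bins = bins[bins < len(mag_spec)]      # discard harmonics above Nyquist
    harm = np.sum(mag_spec[bins] ** 2)
    return harm / np.sum(mag_spec ** 2)
```

A frame dominated by harmonics of the tracked F0 scores near 1, while an instrumental or noise frame scores low, so thresholding this value (followed by grouping over homogeneous segments) yields frame-level vocal/instrumental labels.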


    EVALUATION: Submission to MIREX 2008 & 2009

    Block diagram: Music signal -> Signal representation (DFT, main-lobe matching, parabolic interpolation) -> sinusoid frequencies and magnitudes -> Multi-F0 analysis (sub-multiples of sinusoids within the F0 search range, TWM error computation, sorting in ascending order, vicinity pruning) -> F0 candidates and measurement costs -> Predominant-F0 trajectory extraction (dynamic-programming-based optimal path finding) -> predominant-F0 contour -> Voicing detection (thresholding normalized harmonic energy, grouping over homogeneous segments) -> vocal segment pitch tracks

    Music Information Retrieval Evaluation eXchange (MIREX)
      Started in 2004 by the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL)
      Common platform for evaluation on common datasets
      Tasks: audio genre/artist/mood classification, audio melody extraction, audio beat tracking, audio key detection, query by singing/humming, audio chord estimation


    EVALUATION: MIREX 2008 & 2009 Datasets & Evaluation

    Data
      ADC 2004: publicly available data; 20 excerpts (about 20 sec each) from pop, opera, jazz & MIDI
      MIREX 2005: secret data; 25 excerpts (10-40 sec) from rock, R&B, pop, jazz, solo piano
      MIREX 2008: ICM data; 4 excerpts of 1 minute each from a male and a female Hindustani vocal performance; 2 min each with and without a loud harmonium
      MIREX 2009: MIR-1K data; 374 karaoke recordings of Chinese songs, each mixed at 3 different signal-to-accompaniment ratios (SARs): {-5, 0, 5 dB}

    Evaluation metrics
      Pitch evaluation: pitch accuracy (PA) and chroma accuracy (CA)
      Voicing evaluation: vocal recall (Vx recall) and vocal false-alarm rate (Vx false alm)
      Overall accuracy: % of correctly detected vocal frames with correctly detected pitch
      Run-time

    Presenter notes: Overall accuracy is computed as the proportion of frames correctly labeled with both pitch and voicing. Since there is a strong bias (about 80/20) towards voiced frames, the more voiced frames detected, the better the overall accuracy. We set our voicing threshold at the best trade-off between vocal and instrumental frame detection. If the threshold is biased towards detecting more voiced frames, the overall accuracy improves and comes on par with the best systems.


    EVALUATION: MIREX 2009 & 2010, MIREX05 dataset (vocal)

    2009

    Participant  Vx Recall  Vx False Alm  Pitch acc.  Chroma acc.  Overall acc.  Runtime (dd:hh:mm)
    cl1          91.2267    67.7071       70.807      73.924       59.7095       -
    cl2          80.9512    44.8062       70.807      73.924       64.4609       -
    dr1          93.7543    53.5255       76.1145     77.7138      66.9613       -
    dr2          88.0635    38.0207       70.9258     75.9161      66.5175       -
    hjc1         65.8441    19.8206       62.6594     73.4868      54.8538       -
    hjc2         65.8441    19.8206       54.1294     69.4221      52.2905       -
    jjy          88.8546    41.9813       76.2696     79.3236      66.3123       -
    kd           82.6017    15.2554       77.4622     80.8218      76.962        -
    mw           99.9345    99.7947       75.7398     80.3791      53.7323       -
    pc           75.6436    21.0879       71.7068     72.5089      70.465        -
    rr           92.908     56.3639       75.9506     79.1084      65.7745       -
    toos         91.2267    67.7071       70.807      73.924       59.7095       -

    2010

    HJ1          70.80      44.90         71.70       74.93        53.89         00:59:31
    TOOS1        84.65      41.93         68.87       74.62        60.84         00:50:31
    JJY2         96.86      69.62         70.17       78.32        60.81         01:09:30
    JJY1         97.33      70.16         71.58       78.61        61.54         03:48:02
    SG1          76.39      22.78         61.80       73.70        62.13         00:08:15



    EVALUATION: MIREX 2009 & 2010, MIREX09 dataset (0 dB mix)

    2009

    Participant  Vx Recall  Vx False Alm  Pitch acc.  Chroma acc.  Overall acc.  Runtime (dd:hh:mm)
    cl1          92.4858    83.5749       59.138      62.9508      43.9659       00:00:28
    cl2          77.2085    59.7352       59.138      62.9508      49.2294       00:00:33
    dr1          91.8716    55.3555       69.8804     72.5138      60.1294       16:00:00
    dr2          87.3985    47.3422       66.549      70.7923      59.5076       00:08:44
    hjc1         34.1722    1.7909        72.6577     75.2906      53.1752       00:05:44
    hjc2         34.1722    1.7909        51.6871     70.002       51.7469       00:09:38
    jjy          38.906     19.4063       75.9354     80.2461      49.686        02:14:06
    kd           91.1846    47.7842       80.4565     81.8811      68.2237       00:00:24
    mw           99.992     99.4688       67.2905     71.0018      43.6365       00:02:12
    pc           73.1175    43.4773       50.8895     53.3672      51.5001       03:05:57
    rr           88.8091    50.7595       68.6242     71.3714      60.7733       00:00:26
    toos         99.9829    99.4185       82.2943     85.7474      53.5623       01:00:28

    2010

    HJ1          82.06      14.27         83.15       84.23        76.17         14:39:16
    TOOS1        94.17      38.58         82.59       86.18        72.23         12:07:21
    JJY2         98.33      70.62         81.29       83.83        62.55         14:06:20
    JJY1         98.34      70.65         82.20       84.57        62.90         65:21:11
    SG1          89.65      30.22         80.05       85.50        73.59         01:56:27



    No marked improvement in melody extraction over the last 3 years (2007-2009) [Dres2010]

    Errors due to loud pitched accompaniment
      Accompaniment pitch tracked instead of the voice: error in predominant-F0 trajectory extraction
      Accompaniment pitch tracked along with the voice: error in voicing detection

    Errors due to signal dynamics
      Octave errors due to a fixed window length: error in signal representation

    EVALUATION: Problems in Melody Extraction


    Incorrect tracking of loud pitched accompaniment
      ICM data: largest reduction in accuracy for audio in which the voice displays large, rapid modulations while the instrument pitch is flat

    Predominant-F0 trajectory: DP-based path finding, based on suitably defined measurement and smoothness costs

    Accompaniment errors
      Bias in the measurement cost: salient (spectrally rich) instrument
      Bias in the smoothness cost: stable-pitched instrument

    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Problems


    Extension of DP to track ordered pairs of F0 candidates (nodes), called Dual-F0 tracking

    Node formation
      All possible pairs: computationally expensive (10P2 = 90); an F0 and its (sub)multiple may be tracked together
      Pairing of harmonically related F0 candidates is therefore prohibited; a low harmonic threshold of 5 cents still allows pairing of the voice F0 and an octave-separated instrument F0, because of voice detuning

    Node measurement cost: joint TWM error [Mah94]
    Node smoothness cost: sum of the constituent F0 candidates' smoothness costs
    Final selection of the predominant-F0 contour: based on voice-harmonic instability

    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Design & Implementation

    Presenter notes: nPk = n!/(n-k)!. Interference is also modeled: at most one other competing pitched instrument is assumed to be present.
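The node-formation rule (all ordered pairs, minus harmonically related ones) can be sketched as below. The harmonic test here, which measures how far the candidates' frequency ratio lies (in cents) from the nearest integer, is an illustrative reading of the 5-cent threshold, and all names are assumptions of the sketch.

```python
import numpy as np

def make_nodes(f0_cands, min_cents=5.0):
    """Form ordered F0-candidate pairs (nodes), prohibiting pairs whose
    members are harmonically related within min_cents, so an F0 and its
    own (sub)multiple are not tracked jointly."""
    nodes = []
    for i, fa in enumerate(f0_cands):
        for j, fb in enumerate(f0_cands):
            if i == j:
                continue
            ratio = max(fa, fb) / min(fa, fb)
            nearest = max(1, round(ratio))  # nearest integer harmonic ratio
            cents = abs(1200 * np.log2(ratio / nearest))
            if cents > min_cents:
                nodes.append((fa, fb))
    return nodes
```

With the tight 5-cent threshold, a slightly detuned voice F0 an octave away from an instrument F0 survives the test and can still be paired, as the design above requires.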


    Harmonic Sinusoidal Model (HSM): partial-tracking algorithm used in SMS [Serra98]; tracks are indexed and linked by harmonic number

    Std. dev. pruning: prune tracks in 200 ms segments whose std. dev.


    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Final Implementation Block Diagram

    Block diagram: Music signal -> Signal representation (DFT, main-lobe matching, parabolic interpolation) -> sinusoid frequencies and magnitudes -> Multi-F0 analysis (sub-multiples of sinusoids within the F0 search range, TWM error computation, sorting in ascending order, vicinity pruning) -> F0 candidates and saliences -> Predominant-F0 trajectory extraction: ordered pairing of F0 candidates with harmonic constraint, joint TWM error computation -> nodes (F0 pairs) -> optimal path finding (dual-F0) and optimal path finding (single-F0) -> melodic contours -> vocal pitch identification

    Presenter notes: Insert box between the dual-F0 output and the single-F0 output.


    Participating systems
      TWMDP (single- and dual-F0)
      LIWANG [LiWang07]: uses an HMM to track the predominant F0; includes the possibility of a 2-pitch hypothesis but finally outputs a single F0; shown to be superior to other contemporary systems
      Same F0 search range [80-500 Hz]

    Evaluation metrics
      Multi-F0 stage: % presence of the true voice F0 in the candidate list
      Predominant-F0 extraction (PA & CA): single-F0 and dual-F0
      Final contour accuracy
      Either-Or accuracy: the correct pitch is present in at least one of the two outputs

    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Experimental evaluation: Setup
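The Either-Or accuracy defined above can be sketched in code; the function name and the 50-cent tolerance default are assumptions of the sketch.

```python
import numpy as np

def either_or_accuracy(est1, est2, ref, tol_cents=50):
    """% of voiced frames where at least one of the two dual-F0 outputs
    is within tol_cents of the reference pitch."""
    est1, est2, ref = (np.asarray(a, float) for a in (est1, est2, ref))
    v = ref > 0

    def ok(est):
        return np.abs(1200 * np.log2(est[v] / ref[v])) <= tol_cents

    return 100 * np.mean(ok(est1) | ok(est2))
```

Because a frame counts if either contour is correct, this metric upper-bounds the final accuracy, which is why the gap between the two indicates remaining headroom in the final contour selection.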


    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Experimental Evaluation: Data & Results

    Data:

      Dataset  Description                                                   Vocal (sec)  Total (sec)
      1        Li & Wang data                                                55.4         97.5
      2        Examples from MIR-1K dataset with loud pitched accompaniment  61.8         98.1
      3        Examples from MIREX08 data (Indian classical music)           91.2         99.2
      Total                                                                  208.4        294.8

    Multi-F0 evaluation: percentage presence of the voice F0 (%):

      Dataset  Top 5 candidates  Top 10 candidates
      1        92.9              95.4
      2        88.5              95.1
      3        90.0              94.1

    [Plots: (a) pitch accuracies (%) and (b) chroma accuracies (%) vs. SAR (10 to -5 dB) for LIWANG and TWMDP: comparison of the TWMDP single-F0 tracker and LIWANG for dataset 1]

    Presenter notes: Insert sound examples.


    ENHANCEMENTS: PREDOMINANT-F0 TRACKING: Experimental evaluation: Results (Dual-F0)

    TWMDP (% improvement over LIWANG (A1)):

      Dataset  Metric  Single-F0 (A2)  Dual-F0 Either-Or  Dual-F0 Final (A3)
      1        PA (%)  88.5 (8.3)      89.3 (0.9)         84.1 (2.9)
      1        CA (%)  90.2 (6.4)      92.0 (1.1)         88.8 (3.9)
      2        PA (%)  57.0 (24.5)     74.2 (-6.8)        69.1 (50.9)
      2        CA (%)  61.1 (14.2)     81.2 (-5.3)        74.1 (38.5)
      3        PA (%)  66.0 (11.3)     85.7 (30.2)        73.9 (24.6)
      3        CA (%)  66.5 (9.7)      87.1 (18.0)        76.3 (25.9)

    TWMDP single-F0 is significantly better than the LIWANG system for all datasets

    TWMDP dual-F0 is significantly better than TWMDP single-F0 for datasets 2 & 3

    Scope for further improvement in the final predominant-F0 identification, indicated by the difference between the TWMDP dual-F0 Either-Or and Final accuracies

    [Bar chart: PA and CA (%) for datasets 2 & 3 (D2, D3), comparing A1 (LIWANG), A2 (TWMDP single-F0) and A3 (TWMDP dual-F0 final)]

    Presenter notes: State reasons why TWMDP is better than LIWANG.


    ENHANCEMENTS: PREDOMINANT-F0 EXTRACTION: Example of F0 collisions

    Contour switching occurs at F0 collisions

    [Figure (F0 in octaves ref. 110 Hz, 0-4 sec):
      (a) Single-F0 tracking: ground-truth voice pitch, ground-truth harmonium pitch, single-F0 output
      (b) Dual-F0 tracking (intermediate): ground-truth voice pitch, dual-F0 contour 1, dual-F0 contour 2
      (c) Dual-F0 tracking (final): ground-truth voice pitch, dual-F0 final output]

    Presenter notes: Insert a vocal harmony example.


    ENHANCEMENTS: VOICING DETECTION: Problems


    ENHANCEMENTS: VOICING DETECTION: Features

    Feature sets:

      C1: Static timbral          C2: Dynamic timbral                   C3: Dynamic F0-harmonic
      F0                          Std. dev. of SC for 0.5, 1 & 2 sec    Mean & median of F0
      10 harmonic powers          MER of SC for 0.5, 1 & 2 sec          Mean, median & std. dev. of harmonic energy [0-2 kHz]
      Spectral centroid (SC)      Std. dev. of SE for 0.5, 1 & 2 sec    Mean, median & std. dev. of harmonic energy [2-5 kHz]
      Sub-band energy (SE)        MER of SE for 0.5, 1 & 2 sec          Mean, median & std. dev. of harmonics 1 to 5
                                                                        Mean, median & std. dev. of harmonics 6 to 10
                                                                        Mean, median & std. dev. of harmonics 1 to 10
                                                                        Ratio of mean, median & std. dev. of harmonics 1-5 : harmonics 6-10

    Proposed feature set: a combination of static & dynamic features
    Features extracted using a harmonic sinusoidal model representation
    Feature selection within each feature set using information entropy [Weka]
    MER: modulation energy ratio


    ENHANCEMENTS: VOICING DETECTION: Data

    Genre characteristics (singing; dominant instrument):
      I. Western: syllabic, no large pitch modulations, voice often softer than instrument; mainly flat-note instruments (piano, guitar), pitch range overlapping with the voice
      II. Greek: syllabic, replete with fast pitch modulations; equal occurrence of flat-note plucked-string/accordion and pitch-modulated violin
      III. Bollywood: syllabic, more pitch modulations than Western but fewer than other Indian genres; mainly pitch-modulated woodwind & bowed instruments, pitches often much higher than the voice
      IV. Hindustani: syllabic and melismatic, varying from long, pitch-flat, vowel-only notes to large & rapid modulations; mainly flat-note harmonium, pitch range overlapping with the voice
      V. Carnatic: syllabic and melismatic, replete with fast pitch modulations; mainly pitch-modulated violin, F0 range generally higher than the voice but with some overlap

    Durations:

      Genre           Songs  Vocal    Instrumental  Overall
      I. Western      11     7m 19s   7m 02s        14m 21s
      II. Greek       10     6m 30s   6m 29s        12m 59s
      III. Bollywood  13     6m 10s   6m 26s        12m 36s
      IV. Hindustani  8      7m 10s   5m 24s        12m 54s
      V. Carnatic     12     6m 15s   5m 58s        12m 13s
      Total           45     33m 44s  31m 19s       65m 03s


    [Plot: vocal precision vs. recall curves for the MFCC, C1, and C1+C2+C3 feature sets across genres in the 'leave 1 song out' experiment]

    ENHANCEMENTS: VOICING DETECTION: Evaluation

    Two cross-validation experiments: intra-genre (leave 1 song out) and inter-genre (leave 1 genre out)
    Feature combination: concatenation vs. classifier combination
    Baseline features: 13 MFCCs [Roc07]
    Evaluation: vocal recall (%) and precision (%)

    Overall results
      C1 better than the baseline
      C1+C2+C3 better than C1
      Classifier combination better than feature concatenation


    ENHANCEMENTS: VOICING DETECTION: Evaluation (contd.)

    Leave 1 song out (Recall %), for the semi-automatic and fully-automatic F0-driven HSM:

      Feature set  I     II    III   IV    V     Total
      Baseline     77.2  66.0  65.6  82.6  83.2  74.9
      F0-MFCCs     76.9  72.3  70.0  78.9  83.0  76.2
      C1           81.1  67.8  74.8  78.9  84.5  77.4
      C1+C2        82.9  78.5  82.8  79.5  85.0  81.7
      C1+C3        81.5  72.9  77.6  83.9  84.9  80.2
      C1+C2+C3     82.1  81.1  83.5  83.0  84.7  82.8

      Feature set  I     II    III   IV    V     Total
      Baseline     77.2  66.0  65.6  82.6  83.2  74.9
      F0-MFCCs     78.9  77.8  78.0  85.9  85.9  81.2
      C1           79.6  77.4  79.3  82.3  87.1  81.0
      C1+C2        82.3  83.6  85.4  83.3  86.8  84.2
      C1+C3        80.2  83.4  81.7  89.7  88.2  84.5
      C1+C2+C3     81.1  86.9  86.4  88.5  87.3  85.9

    Genre-specific feature set adaptation: C1+C2 for Western; C1+C3 for Hindustani


    Relation between window length and signal characteristics
      Dense spectrum (multiple harmonic sources) -> long window
      Non-stationarity (rapid pitch modulations) -> short window

    Adaptive time segmentation for signal modeling and synthesis [Good97]: based on minimizing the reconstruction error between the synthesized and original signals; high computational cost

    Easily computable measures for adapting window length: signal sparsity (a sparse spectrum has concentrated components)

    Window length selection (23.2, 46.4, 92.9 ms) based on maximizing signal sparsity

    ENHANCEMENTS: SIGNAL REPRESENTATION: Sparsity-driven window length adaptation

    Sparsity measures (X_n(k): magnitude spectrum at frame n; X̄: its mean; N: number of bins; X~(k): coefficients sorted in ascending order):

      L2 norm:             L2 = ( sum_k X_n(k)^2 )^(1/2)
      Normalized kurtosis: KU = [ (1/N) sum_k (X_n(k) - X̄)^4 ] / [ (1/N) sum_k (X_n(k) - X̄)^2 ]^2
      Gini index:          GI = 1 - 2 sum_k ( X~(k) / ||X_n||_1 ) * ( (N - k + 1/2) / N )
      Hoyer measure:       HO = ( sqrt(N) - ( sum_k X_n(k) ) / ( sum_k X_n(k)^2 )^(1/2) ) / ( sqrt(N) - 1 )
      Spectral flatness:   SF = ( prod_k X_n(k)^2 )^(1/N) / ( (1/N) sum_k X_n(k)^2 )

    Presenter notes: X̄ is the mean spectral amplitude. X~(k) is X(k) sorted in ascending order, giving the ordered magnitude spectral coefficients. KU, HO and GI are maximized; L2 and SF are minimized.
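Kurtosis-driven window-length selection can be sketched as follows. Only the normalized-kurtosis measure is implemented here, and the FFT size and function names are assumptions of the sketch.

```python
import numpy as np

def kurtosis_sparsity(mag_spec):
    """Normalized kurtosis of the magnitude spectrum; larger = sparser."""
    x = np.asarray(mag_spec, float)
    m = x.mean()
    return np.mean((x - m) ** 4) / (np.mean((x - m) ** 2) ** 2)

def best_window(x, fs, lengths_ms=(23.2, 46.4, 92.9)):
    """Pick the analysis window length whose Hamming-windowed spectrum
    maximizes kurtosis (the sparsity measure found most reliable)."""
    best, best_s = None, -np.inf
    for ms in lengths_ms:
        n = int(round(fs * ms / 1000))
        if n > len(x):
            continue
        seg = x[:n] * np.hamming(n)
        s = kurtosis_sparsity(np.abs(np.fft.rfft(seg, 4096)))
        if s > best_s:
            best, best_s = ms, s
    return best
```

For a steady tone the longest window wins (its narrower main lobe concentrates the spectrum), while rapid pitch modulation smears the long-window spectrum and pushes the choice towards shorter windows.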


    ENHANCEMENTS: SIGNAL REPRESENTATION: Sparsity-driven window length adaptation [contd.]

    Experimental comparison between fixed and adaptive schemes
      Fixed and adaptive window lengths (different sparsity measures); sinusoid detection by main-lobe matching

    Data
      Simulations: two-sound mixtures (polyphony) and a vibrato signal
      Real: Western pop (Whitney, Mariah) and Hindustani taans

    Evaluation metrics: recall (%) and frequency deviation (Hz); expected harmonic locations computed from the ground-truth pitch

    Results
      1. Adaptive: higher recall and lower frequency deviation
      2. Kurtosis-driven adaptation is superior to the other sparsity measures


    A generalized music transcription system is still unavailable

    Solution [Wang08]: a semi-automatic approach with application-specific design, e.g. music tutoring

    Two, possibly independent, aspects of melody extraction
      Voice pitch extraction: manually difficult
      Vocal segment detection: manually easier

    Semi-automatic tool
      Goal: to facilitate the extraction & validation of the voice pitch in polyphonic recordings with minimal human intervention
      Design considerations: accurate pitch detection; completely parametric control; user-friendly control for vocal segment detection

    GRAPHICAL USER INTERFACE: Motivation

    Presenter notes: We are not neglecting voicing detection; in fact, we have devised a feature that exploits pitch instability to discriminate between the voice and stable-pitch instruments [ICSLP09].


    Salient features
      Melody extraction back-end
      Validation: visual (spectrogram) and aural (re-synthesis)
      Segmental parameter variation
      Easy non-vocal labeling
      Saving of the final result & parameters
      Selective use of the dual-F0 tracker; switching between contours

    GRAPHICAL USER INTERFACE: Design

    GUI layout: A: waveform viewer; B: spectrogram & pitch view; C: menu bar; D: controls for viewing, scrolling, playback & volume; E: parameter window; F: log viewer



    CONCLUSIONS AND FUTURE WORK: Final system block diagram

    Block diagram: Music signal -> Signal representation (DFT, main-lobe matching, parabolic interpolation) -> sinusoid frequencies and magnitudes -> Multi-F0 analysis (sub-multiples of sinusoids within the F0 search range, TWM error computation, sorting in ascending order, vicinity pruning) -> F0 candidates and saliences -> Predominant-F0 trajectory extraction: ordered pairing of F0 candidates with harmonic constraint, joint TWM error computation -> nodes (F0 pairs and saliences) -> optimal path finding -> predominant-F0 contour (alongside single-F0 optimal path finding -> predominant-F0 contour) -> vocal pitch identification.

    Voicing detector: harmonic sinusoidal model -> feature extraction -> classifier -> boundary detection -> grouping -> voice pitch contour


    State-of-the-art melody extraction system designed by making careful choices for the system modules

    Enhancements to the above system increase robustness to loud, pitched accompaniment
      Dual-F0 tracking for predominant-F0 extraction
      Combination of static & dynamic, timbral & F0-harmonic features for voicing detection

    Fully-automatic, high-accuracy melody extraction is still not feasible
      Large variability in the underlying signal conditions due to the diversity of music
      A priori knowledge of music and signal conditions helps: male/female singer, rate of pitch variation

    High-accuracy melodic contours can be extracted using a semi-automatic approach

    CONCLUSIONS AND FUTURE WORK: Conclusions


    CONCLUSIONS AND FUTURE WORK: Summary of contributions

    Design & validation of a novel, practically useful melody extraction system with increased robustness to pitched accompaniment
      Signal representation: choice of the main-lobe matching criterion for sinusoid identification; improved sinusoid detection by signal-sparsity-driven window length adaptation
      Multi-F0 analysis: choice of the TWM error as the salience function; improved voice-F0 detection by separating F0-candidate identification & salience computation
      Predominant-F0 trajectory extraction: Gaussian log smoothness cost; dual-F0 tracking; final predominant-F0 contour identification by voice-harmonic instability
      Voicing detection: use of a predominant-F0-derived signal representation; combination of static and dynamic, timbral and F0-harmonic features

    Design of a novel graphical user interface for semi-automatic use of the melody extraction system


    Melody extraction
      Identification of a single predominant-F0 contour from the dual-F0 output: use of dynamic features
      F0 collisions: detection based on minima in the difference of the constituent F0s of nodes; correction by allowing pairing of an F0 with itself around these locations; use of prediction-based partial tracking [Lag07]
      Validation across larger, more diverse datasets
      Incorporation of predictive path-finding in the DP algorithm
      Extension of the algorithm to instrumental pitch tracking in polyphony: homophonic music (lead instrument, e.g. flute, with accompaniment); polyphonic instruments (sitar)

    Applications of melody extraction: singing evaluation & feedback; QBSH systems; musicological studies

    CONCLUSIONS AND FUTURE WORKFuture work
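The F0-collision detection proposed above (flagging frames where the two contours of the dual-F0 output nearly meet) can be sketched as a local-minimum search on the inter-contour distance. The cent-based distance and the 50-cent threshold below are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def collision_frames(f0a, f0b, thresh_cents=50.0):
    """Flag frames where two tracked F0 contours nearly collide.

    f0a, f0b: arrays of F0 values (Hz), one per frame, no zeros.
    A collision is marked at local minima of the inter-contour
    distance (in cents) that fall below thresh_cents.
    Names and the threshold are illustrative, not from the thesis.
    """
    cents = np.abs(1200.0 * np.log2(f0a / f0b))
    hits = []
    for t in range(1, len(cents) - 1):
        is_local_min = cents[t] <= cents[t - 1] and cents[t] <= cents[t + 1]
        if is_local_min and cents[t] < thresh_cents:
            hits.append(t)
    return hits
```

Around the returned frames, a corrected tracker could then allow a candidate to pair with itself, as the future-work item suggests, instead of forcing two distinct F0s through the crossing point.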


CONCLUSIONS AND FUTURE WORK: List of related publications

International Journals

V. Rao, P. Gaddipati and P. Rao, "Signal-driven window adaptation for sinusoid identification in polyphonic music," IEEE Transactions on Audio, Speech, and Language Processing, 2011 (accepted).

V. Rao and P. Rao, "Vocal melody extraction in the presence of pitched accompaniment in polyphonic music," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2145–2154, Nov. 2010.

International Conferences

V. Rao, C. Gupta and P. Rao, "Context-aware features for singing voice detection in polyphonic music," 9th International Workshop on Adaptive Multimedia Retrieval, 2011 (submitted for review).

V. Rao, S. Ramakrishnan and P. Rao, "Singing voice detection in polyphonic music using predominant pitch," in Proceedings of InterSpeech, Brighton, U.K., 2009.

V. Rao and P. Rao, "Improving polyphonic melody extraction by dynamic programming-based dual-F0 tracking," in Proceedings of the 12th International Conference on Digital Audio Effects (DAFx), Como, Italy, 2009.

V. Rao and P. Rao, "Vocal melody detection in the presence of pitched accompaniment using harmonic matching methods," in Proceedings of the 11th International Conference on Digital Audio Effects (DAFx), Espoo, Finland, 2008.

A. Bapat, V. Rao and P. Rao, "Melodic contour extraction for Indian classical vocal music," in Proceedings of Music-AI (International Workshop on Artificial Intelligence and Music) at IJCAI, Hyderabad, India, 2007.

V. Rao and P. Rao, "Melody extraction using harmonic matching," in Proceedings of the Music Information Retrieval Evaluation eXchange (MIREX), 2008 & 2009. URL: http://www.music-ir.org/mirex/abstracts/2009/RR.pdf


CONCLUSIONS AND FUTURE WORK: List of related publications [contd.]

National Conferences

S. Pant, V. Rao and P. Rao, "A melody detection user interface for polyphonic music," in Proceedings of the National Conference on Communication (NCC), Chennai, India, 2010.

N. Santosh, S. Ramakrishnan, V. Rao and P. Rao, "Improving singing voice detection in the presence of pitched accompaniment," in Proceedings of the National Conference on Communication (NCC), Guwahati, India, 2009.

V. Rao, S. Pant, M. Bhaskar and P. Rao, "Applications of a semi-automatic melody extraction interface for Indian music," in Proceedings of the International Symposium on Frontiers of Research in Speech and Music (FRSM), Gwalior, India, Dec. 2009.

V. Rao, S. Ramakrishnan and P. Rao, "Singing voice detection in north Indian classical music," in Proceedings of the National Conference on Communication (NCC), Mumbai, India, 2008.

V. Rao and P. Rao, "Objective evaluation of a melody extractor for north Indian classical vocal performances," in Proceedings of the International Symposium on Frontiers of Research in Speech and Music (FRSM), Kolkata, India, 2008.

V. Rao and P. Rao, "Vocal trill and glissando thresholds for Indian listeners," in Proceedings of the International Symposium on Frontiers of Research in Speech and Music (FRSM), Mysore, India, 2007.

Patent

P. Rao, V. Rao and S. Pant, "A device and method for scoring a singing voice," Indian Patent Application No. 1338/MUM/2009, filed June 2, 2009.


REFERENCES

[Pol07] G. Poliner, D. Ellis, A. Ehmann, E. Gomez, S. Streich and B. Ong, "Melody transcription from music audio: Approaches and evaluation," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1247–1256, May 2007.

[Grif88] D. Griffin and J. Lim, "Multiband excitation vocoder," IEEE Trans. Acoust., Speech, Sig. Process., vol. 36, no. 8, pp. 1223–1235, 1988.

[Wang08] Y. Wang and B. Zhang, "Application-specific music transcription for tutoring," IEEE Multimedia, vol. 15, no. 3, pp. 70–74, 2008.

[Chev02] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Amer., vol. 111, no. 4, pp. 1917–1930, 2002.

[LiWang07] Y. Li and D. Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1475–1487, 2007.

[Mah94] R. Maher and J. Beauchamp, "Fundamental frequency estimation of musical signals using a two-way mismatch procedure," J. Acoust. Soc. Amer., vol. 95, no. 4, pp. 2254–2263, Apr. 1994.

[Dres2010] K. Dressler, "Audio melody extraction for MIREX 2009," Ilmenau: Fraunhofer IDMT, 2010.

[Ney83] H. Ney, "Dynamic programming algorithm for optimal estimation of speech parameter contours," IEEE Trans. Systems, Man and Cybernetics, vol. SMC-13, no. 3, pp. 208–214, Apr. 1983.

[Lag07] M. Lagrange, S. Marchand and J. B. Rault, "Enhancing the tracking of partials for the sinusoidal modeling of polyphonic sounds," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1625–1634, 2007.

[Chao09] C. Hsu and R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech, Lang. Process., 2009 (accepted).

[Good97] M. Goodwin, "Adaptive signal models: Theory, algorithms and audio applications," Ph.D. dissertation, MIT, 1997.

[Lag08] M. Lagrange, L. Martins, J. Murdoch and G. Tzanetakis, "Normalized cuts for predominant melodic source separation," IEEE Trans. Audio, Speech, Lang. Process. (Special Issue on MIR), vol. 16, no. 2, pp. 278–290, Feb. 2008.

[Pol05] G. Poliner and D. Ellis, "A classification approach to melody transcription," in Proc. Intl. Conf. Music Information Retrieval, London, 2005.
