Description et Classification automatique des sons instrumentaux Geoffroy Peeters Ircam (Analysis/Synthesis Team) [email protected]


1. Introduction

Musical Instrument Sound Classification
- numerous studies on sound classification
- few of them address the problem of generalization of sound sources (recognition of the same source possibly recorded in different conditions, with various instrument manufacturers and players)

Evaluation of the system performance
- training on a subset of the database, evaluation on the rest of the database
- does not prove any applicability for the classification of sounds which do not belong to the database
- Martin [1999]: 76% (family), 39% for 14 instruments
- Eronen [2001]: 77% (family), 35% for 16 instruments

Goal of this study
- study large database classification
- How? New classification system:
  - extract a large amount of features
  - new feature selection algorithm
  - compare flat and hierarchical Gaussian classifiers

[Figure: system overview. Feature extraction -> feature selection -> feature transform -> classification -> evaluation (confusion matrix, which features, classes organization).]

2. Feature extraction

Processing chain: features extraction -> temporal modeling -> feature transform (Gaussianity) -> feature selection (IRMFSP) -> feature transform (LDA) -> class modeling

Extracted features
- instantaneous (frame-based) features:
  - harmonic features
  - spectral shape features
  - perceptual features
  - MFCC, xcorr, zcr
  - MPEG-7 LLDs (spectral flatness, crest)
- global features (attack time, increase/decrease)
- temporal modeling: mean, variance, derivative, modulation, polynomial

Features for sound recognition come from
- the speech recognition community,
- previous studies on musical instrument sound classification,
- results of psycho-acoustical studies.
Each feature set is supposed to perform well for a specific task.

Principle:
1) extract a large set of features
2) filter the feature set a posteriori with a feature selection algorithm

[Figure: whole set of features (A B C D E F G H I J K L M N ...) and classes -> feature selection algorithm -> reduced set of features (C K N).]

2. Feature extraction: Audio features taxonomy

[Figure: signal descriptors extraction module; blocks: fundamental frequency, segmentation, instantaneous descriptors, global descriptors, temporal modeling descriptors.]

- Global descriptors
- Instantaneous descriptors
- Temporal modeling: mean, variance, modulation (pitch, energy)

2. Feature extraction: Audio features taxonomy

[Figure: descriptor extraction flow; blocks: signal, signal frame, energy envelope, FFT, sinusoidal harmonic model, perceptual model; outputs: global temporal descriptors and instantaneous temporal, spectral, harmonic and perceptual descriptors.]

- DT: temporal descriptors
- DE: energy descriptors
- DS: spectral descriptors
- DH: harmonic descriptors
- DP: perceptual descriptors

2. Feature extraction: DT/DE, temporal/energy descriptors

Signal chain: sound -> energy envelope
- DT.zero-crossing rate
- DT.auto-correlation
- DT.log-attack time
- DT.temporal increase
- DT.temporal decrease
- DT.temporal centroid
- DT.effective duration
- DE.total energy
- DE.energy of harmonic part
- DE.energy of noise part
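Two of the temporal descriptors listed above, the zero-crossing rate and the temporal centroid, can be sketched as follows. This is only an illustrative sketch: the function names, the sign convention for zero samples and the toy signal are assumptions, not the definitions used in the study.

```python
def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose signs differ."""
    if len(signal) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(signal, signal[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(signal) - 1)


def temporal_centroid(envelope, sample_rate):
    """Energy-weighted mean time (in seconds) of an energy envelope."""
    total = sum(envelope)
    if total == 0:
        return 0.0
    weighted = sum(n * e for n, e in enumerate(envelope))
    return weighted / (total * sample_rate)


# Toy data (hypothetical): an alternating signal and a symmetric envelope.
signal = [1.0, -1.0, 1.0, -1.0, 1.0]
envelope = [0.0, 1.0, 2.0, 1.0, 0.0]
zcr = zero_crossing_rate(signal)
tc = temporal_centroid(envelope, sample_rate=4)
```

A percussive sound concentrates its energy early, so its temporal centroid is small compared with a sustained sound of the same duration.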

2. Feature extraction: DS, spectral descriptors

Signal chain: sound -> window -> FFT
- DS.centroid, DS.spread, DS.skewness, DS.kurtosis
- DS.slope, DS.decrease, DS.roll-off
- DS.variation
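The first four spectral shape descriptors (centroid, spread, skewness, kurtosis) are the statistical moments of the spectrum when its normalized amplitudes are read as a probability distribution over frequency. The sketch below assumes that reading; the function name and the three-bin spectrum are hypothetical.

```python
import math


def spectral_moments(freqs, amps):
    """Centroid, spread, skewness and kurtosis of a magnitude spectrum."""
    total = sum(amps)
    if total == 0:
        raise ValueError("spectrum has no energy")
    # Normalized amplitudes act as probabilities attached to each frequency.
    p = [a / total for a in amps]
    centroid = sum(f * w for f, w in zip(freqs, p))
    variance = sum((f - centroid) ** 2 * w for f, w in zip(freqs, p))
    spread = math.sqrt(variance)
    if spread == 0:
        return centroid, 0.0, 0.0, 0.0
    skewness = sum((f - centroid) ** 3 * w for f, w in zip(freqs, p)) / spread ** 3
    kurtosis = sum((f - centroid) ** 4 * w for f, w in zip(freqs, p)) / spread ** 4
    return centroid, spread, skewness, kurtosis


# Toy symmetric spectrum (hypothetical values).
freqs = [100.0, 200.0, 300.0]
amps = [1.0, 2.0, 1.0]
centroid, spread, skewness, kurtosis = spectral_moments(freqs, amps)
```

The same computation applies to harmonic peaks (DH) or to perceptual bands (DP) by changing what `freqs` and `amps` contain.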

2. Feature extraction: DH, harmonic descriptors

Signal chain: sound -> window -> FFT -> sinusoidal model
- DH.Centroid, DH.Spread, DH.Skewness, DH.Kurtosis
- DH.Slope, DH.Decrease, DH.Roll-off
- DH.Variation
- DH.Fundamental frequency
- DH.Noisiness, DH.OddEvenRatio, DH.Inharmonicity
- DH.Tristimulus
- DH.Deviation

2. Feature extraction: DP, perceptual descriptors / DV, various descriptors

Signal chain: sound -> window -> FFT -> perception (mid-ear filtering, Bark scale, Mel scale)
- DP.Centroid, DP.Spread, DP.Skewness, DP.Kurtosis
- DP.Slope, DP.Decrease, DP.Roll-off
- DP.Variation
- DP.Loudness, DP.RelativeSpecificLoudness
- DP.Sharpness, DP.Spread
- DP.Roughness, DP.FluctuationStrength
- DV.MFCC, DV.Delta-MFCC, DV.Delta-Delta-MFCC
- DV.SpectralFlatness, DV.SpectralCrest


2. Feature extraction: Audio features design

No consensus on the use of amplitude and frequency scales
- all features are computed using the following scales:
  - frequency scale: linear / log / Bark bands
  - amplitude scale: linear / power / log
- note: log(0.0) = -inf -> normalization to 24 bits

Features must be independent of the recording level
- normalization in linear and in power scale (spectral centroid sc, frequencies f, amplitudes a, recording gain \alpha):

$$ sc = \frac{\sum f \cdot (\alpha a)}{\sum \alpha a} = \frac{\sum f \cdot a}{\sum a}, \qquad sc = \frac{\sum f \cdot (\alpha a)^2}{\sum (\alpha a)^2} = \frac{\sum f \cdot a^2}{\sum a^2} $$

- normalization in logarithmic scale (the gain becomes an additive offset \log \alpha and does not cancel):

$$ sc = \frac{\sum f \cdot \log(\alpha a)}{\sum \log(\alpha a)} $$

Features must be independent of the sampling rate
- maximum frequency taken into account: 11025/2 Hz
- resampling (for zcr, xcorr)
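The requirement that features be independent of the recording level can be checked numerically on the spectral centroid. The sketch below is an illustration under assumptions: the three-bin spectrum and the gain of 8 are hypothetical, and the centroid is used only as a representative amplitude-weighted feature.

```python
import math


def centroid(freqs, weights):
    """Weighted mean frequency."""
    return sum(f * w for f, w in zip(freqs, weights)) / sum(weights)


freqs = [100.0, 200.0, 300.0]
amps = [0.5, 0.25, 0.125]          # hypothetical amplitudes
gain = 8.0                          # hypothetical change of recording level
loud = [gain * a for a in amps]

# Linear and power scales: the gain factors out of numerator and denominator.
lin_a, lin_b = centroid(freqs, amps), centroid(freqs, loud)
pow_a = centroid(freqs, [a ** 2 for a in amps])
pow_b = centroid(freqs, [a ** 2 for a in loud])

# Log scale: log(gain * a) = log(gain) + log(a), so the gain does not cancel.
log_a = centroid(freqs, [math.log(a) for a in amps])
log_b = centroid(freqs, [math.log(a) for a in loud])
```

In this toy case the linear and power centroids are unchanged by the gain, while the log-scale value moves, which is why the logarithmic scale needs its own normalization.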






3. Feature selection algorithm (FSA)

Problem: using a high number of features
- some features can be irrelevant for the given task
- overfitting of the model to the training set (especially with LDA)
- classification models are difficult for humans to interpret

Goal of the feature selection algorithm (FSA): find the minimal set of
- criterion 1) informative features with respect to the classes
- criterion 2) features that provide non-redundant information

Forms of feature selection algorithm
- embedded: the FSA is part of the classifier
- filter: the FSA is distinct from the classifier and used before the classifier
- wrapper: the FSA makes use of the classification results


3. Feature selection algorithm: IRMFSP

IRMFSP: Inertia Ratio Maximization using Feature Space Projection

Criterion 1: informative features with respect to the classes
- principle: "feature values for sounds belonging to a specific class should be separated from the values for all the other classes"
- measure: for a specific feature i, the ratio r of the between-class inertia B to the total inertia T

$$ r = \frac{B}{T} = \frac{\sum_{k=1}^{K} \frac{N_k}{N} (m_{i,k} - m_i)(m_{i,k} - m_i)'}{\frac{1}{N} \sum_{n=1}^{N} (f_{i,n} - m_i)(f_{i,n} - m_i)'} $$

where K is the number of classes, N_k the number of sounds in class k, N the total number of sounds, f_{i,n} the value of feature i for sound n, m_{i,k} the mean of feature i over class k and m_i its mean over all sounds.

Criterion 2: features that provide non-redundant information
- apply an orthogonalization process of the feature space after the selection of each new feature (Gram-Schmidt orthogonalization)

$$ g_i = \frac{f_i}{\lVert f_i \rVert}, \qquad f_j' = f_j - (f_j \cdot g_i)\, g_i \quad \forall j \in F $$

Algorithm
1) start from the whole set of features
2) compute the inertia ratio for all features
3) take the feature with the largest ratio
4) project the whole feature space on the selected feature, and repeat from step 2

[Figure: two classes with within-class scatters W1 and W2 and class means m1 and m2 around the global mean m (offsets m - m1 and m - m2), shown on axes f_i and f_i+1.]
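The selection loop (compute the inertia ratio of every feature, keep the best one, remove its direction from the remaining features) can be sketched in plain Python. This is a minimal sketch, not the author's implementation: the function names, the `max_features` stopping rule, the numerical tolerance and the toy data are assumptions.

```python
import math


def inertia_ratio(values, labels, eps=1e-12):
    """Between-class inertia B over total inertia T for one feature."""
    n = len(values)
    mean = sum(values) / n
    total = sum((v - mean) ** 2 for v in values) / n
    if total < eps:                      # constant feature carries no information
        return 0.0
    between = 0.0
    for k in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == k]
        class_mean = sum(members) / len(members)
        between += len(members) / n * (class_mean - mean) ** 2
    return between / total


def irmfsp(features, labels, max_features):
    """features: dict name -> list of values (one value per sound)."""
    # Work on centered copies so that projections act on deviations from the mean.
    work = {}
    for name, col in features.items():
        mu = sum(col) / len(col)
        work[name] = [v - mu for v in col]
    selected = []
    while work and len(selected) < max_features:
        best = max(work, key=lambda name: inertia_ratio(work[name], labels))
        selected.append(best)
        chosen = work.pop(best)
        norm = math.sqrt(sum(v * v for v in chosen))
        if norm == 0:
            break
        g = [v / norm for v in chosen]
        # Gram-Schmidt step: remove the selected direction from every other feature.
        for name in list(work):
            dot = sum(c * u for c, u in zip(work[name], g))
            work[name] = [c - dot * u for c, u in zip(work[name], g)]
    return selected


# Toy data (hypothetical): "copy" is redundant with "decrease".
labels = ["a", "a", "b", "b", "c", "c"]
features = {
    "decrease": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
    "copy": [0.0, 0.0, 2.0, 2.0, 4.0, 4.0],
    "centroid": [0.0, 1.0, 3.0, 4.0, 0.0, 1.0],
}
chosen = irmfsp(features, labels, max_features=2)
```

After "decrease" is selected, the projection reduces "copy" to a zero vector, so the second pick is "centroid" even though "copy" had a perfect inertia ratio on its own.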


3. Feature selection algorithm: IRMFSP

Example: sustained / non-sustained sound separation
- computation of the B/T ratio for each feature
- feature with the weakest ratio (r = 6.9e-6): specific loudness m8 mean
- feature with the highest ratio (r = 0.58): energy temporal decrease
- first three selected dimensions:
  - 1st dim: temporal decrease
  - 2nd dim: spectral centroid
  - 3rd dim: temporal increase





4. Feature transformation: LDA

Linear Discriminant Analysis
- find linear combinations of the features that maximize the discrimination between classes: F -> F'
- total inertia:

$$ T = \frac{1}{n} \sum_{i=1}^{n} (d_i - m)(d_i - m)' $$

- between-class inertia:

$$ B = \frac{1}{n} \sum_{k=1}^{K} n_k (m_k - m)(m_k - m)' $$

- transform the initial feature space F by a transformation matrix U in order to maximize the ratio

$$ r_u = \frac{u' B u}{u' T u} $$

- solution: the eigenvectors of $T^{-1} B$, associated with the eigenvalues $\lambda$ (discriminative power)
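Why the solution is an eigenvector problem can be seen from a short derivation. It is a sketch that assumes T is invertible and uses the symmetry of B and T; setting the gradient of the ratio with respect to u to zero gives:

```latex
\nabla_u r_u
  = \frac{2\,B u\,(u' T u) - 2\,T u\,(u' B u)}{(u' T u)^2} = 0
\;\Longrightarrow\;
B u = r_u\, T u
\;\Longrightarrow\;
T^{-1} B\, u = r_u\, u
```

Each stationary direction u is therefore an eigenvector of $T^{-1}B$, and its eigenvalue equals the ratio $r_u$ reached along that direction, which is why the eigenvalue reads as a discriminative power.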





5. Class modeling: flat classifiers

Flat Gaussian classifier (F-GC)
- "flat" = all classes are considered on the same level
- training: model each class k by a multi-dimensional Gaussian pdf (mean vector, covariance matrix)
- evaluation: Bayes formula

Flat KNN classifier (F-KNN)
- instance-based algorithm
- assign to the input sound the majority class among its K nearest neighbors in the feature space
- Euclidean distance => weighting of the axes?
- applied to the output of the LDA (implicit weighting of the axes)

[Figure: training and evaluation flow. Training: feature selection (best set of features f1, f2, ..., fN?), feature transformation (Linear Discriminant Analysis matrix?), then, for each class, Gaussian pdf parameter estimation. Evaluation: feature selection (use only f1, f2, ..., fN), feature transformation (apply matrix), then, for each class, evaluate the Bayes formula.]
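The flat Gaussian classifier can be sketched as one Gaussian per class plus the Bayes formula at evaluation time. The sketch below makes a simplifying assumption that is not in the slides: it uses a diagonal covariance (one variance per dimension) instead of a full covariance matrix, to stay free of matrix inversion. Function names, the variance floor and the toy data are hypothetical.

```python
import math


def train_gaussian(samples, labels):
    """Return {class: (prior, means, variances)} with diagonal covariance."""
    model = {}
    for k in set(labels):
        rows = [x for x, lab in zip(samples, labels) if lab == k]
        dim = len(rows[0])
        means = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        variances = [
            # Small floor avoids a zero variance on tiny training sets.
            max(sum((r[d] - means[d]) ** 2 for r in rows) / len(rows), 1e-6)
            for d in range(dim)
        ]
        model[k] = (len(rows) / len(samples), means, variances)
    return model


def log_posteriors(model, x):
    """Unnormalized log p(class | x) = log prior + log likelihood (Bayes)."""
    scores = {}
    for k, (prior, means, variances) in model.items():
        log_like = sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances)
        )
        scores[k] = math.log(prior) + log_like
    return scores


def classify(model, x):
    scores = log_posteriors(model, x)
    return max(scores, key=scores.get)


# Toy two-class data in a two-dimensional feature space (hypothetical).
samples = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [5.2, 4.9]]
labels = ["piano", "piano", "violin", "violin"]
model = train_gaussian(samples, labels)
```

Working with log posteriors avoids numerical underflow when many feature dimensions are multiplied together.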


5. Class modeling: hierarchical classifiers

Hierarchical Gaussian classifier (H-GC)
- training: a tree of flat Gaussian classifiers; each node has its own FSA, FTA (feature transformation) and F-GC
- tree construction is supervised (as opposed to a decision tree)
- only the subset of sounds belonging to the classes of the current node is used
- evaluation: the local probability decides which branch of the tree to follow

Advantages of H-GC
- learning facilities: it is easier to learn differences in a small subset of classes
- reduced class confusion: benefit from the higher recognition rate at the higher levels of the tree

Hierarchical KNN classifier (H-KNN)

[Figure: classifier tree (top node, node i, nodes j-1, j, j+1, ...), with training and evaluation performed at each node.]
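Evaluation in a hierarchical classifier follows, at each node, the branch with the highest local probability. The sketch below shows only that descent. The tree layout (a dict with "classifier" and "children" keys), the two node classifiers and their thresholds are hypothetical stand-ins for trained per-node models.

```python
def classify_hierarchical(node, x):
    """Descend the tree, following the most probable branch at each node."""
    path = []
    while node is not None:
        probs = node["classifier"](x)        # dict: branch name -> local probability
        branch = max(probs, key=probs.get)
        path.append(branch)
        node = node["children"].get(branch)  # None when the branch is a leaf
    return path


# Hypothetical node classifiers standing in for per-node trained models.
def sustain_node(x):
    p = 0.9 if x["attack"] > 0.1 else 0.1
    return {"sustained": p, "non-sustained": 1.0 - p}


def sustained_family_node(x):
    p = 0.8 if x["centroid"] > 1000.0 else 0.2
    return {"air reeds": p, "brass": 1.0 - p}


tree = {
    "classifier": sustain_node,
    "children": {
        "sustained": {"classifier": sustained_family_node, "children": {}},
    },
}

path_a = classify_hierarchical(tree, {"attack": 0.3, "centroid": 2500.0})
path_b = classify_hierarchical(tree, {"attack": 0.01, "centroid": 2500.0})
```

Because each node only sees the classes below it, an error at a high level cannot be recovered lower down, which is the price paid for the easier local decisions.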


Decision trees:
- Binary Entropy Reduction Tree (BERT)
- C4.5
- Partial Decision Tree (PART)





6. Evaluation: Taxonomy used

Three different levels
- T1: sustained / non-sustained sounds
- T2: instrument families
- T3: instrument names

Instrument
- Non sustained
  - Struck strings: piano
  - Plucked strings: guitar, harp
  - Pizz strings: violin, viola, cello, double
- Sustained
  - Bowed strings: violin, viola, cello, double
  - Brass: trumpet, cornet, trombone, French horn, tuba
  - Single/double reeds
    - Single reeds: clarinet, tenor sax, alto sax, soprano sax, accordion
    - Double reeds: oboe, bassoon, English horn
  - Air reeds: flute, piccolo, recorder


6. Evaluation: Test set

6 databases
- Ircam Studio OnLine (SOL): 1323 sounds, 16 instruments
- Iowa University database: 816 sounds, 12 instruments
- McGill University database: 585 sounds, 23 instruments
- Microsoft "Musical Instruments" CD-ROM: 216 sounds, 20 instruments
- two commercial databases: Pro (532 sounds, 20 instruments) and Vi (691 sounds, 18 instruments)
- total = 4163 sounds

Notes
- 27 instruments have been considered
- a large pitch range has been considered (4 octaves on average)
- no muted or martelé/staccato sounds

[Figure: number of sounds (0 to 400) per instrument for each database (Vi, Pro, Microsoft, McGill, Iowa, SOL).]


6. Evaluation: Evaluation process

1) Random 66%/33% partition of the database (50 sets)
2) One to One (O2O) [Livshin 2003]: each database is used in turn to classify all the other databases
3) Leave One Database Out (LODO) [Livshin 2003]: all databases except one are used in turn to classify the remaining one

[Figure: databases DB1 to DB6 used as training and test sets.]
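The two cross-database protocols differ only in how training and test databases are paired. A minimal sketch follows; the function names and the representation of an experiment as a (training databases, test databases) pair are assumptions.

```python
from itertools import permutations


def one_to_one(databases):
    """O2O: each database trains a model that is tested on every other database."""
    return [([train], [test]) for train, test in permutations(databases, 2)]


def leave_one_database_out(databases):
    """LODO: all databases but one train a model tested on the left-out one."""
    return [
        ([db for db in databases if db != held_out], [held_out])
        for held_out in databases
    ]


databases = ["SOL", "Iowa", "McGill", "Microsoft", "Pro", "Vi"]
o2o = one_to_one(databases)
lodo = leave_one_database_out(databases)
```

With six databases this yields 6*5 = 30 O2O experiments and 6 LODO experiments; in LODO each model sees five recordings of an instrument instead of one.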


6. Evaluation: Results O2O (II)

Recognition rates (%):

          T1   T2   T3
F-GC      89   57   30
H-GC      93   63   38

                                 T1           T2           T3
LDA                              96           89           86
CFS weka                         99.0 (0.5)   93.2 (0.8)   60.8 (12.9)
IRMFSP (t=0.01, nbdescmax=20)    99.2 (0.4)   95.8 (1.2)   95.1 (1.2)

                     T1   T2   T3
F-GC                 98   78   55
F-GC (BC+LDA)        99   81   54
F-KNN (K=10, LDA)    99   77   51
H-GC                 98   80   57
H-GC (BC+LDA)        99   85   64
H-KNN (K=10, LDA)    99   84   64
BERT                 95   65   42
C4.5                 65, 48 (only two values given)
PART                 71, 42 (only two values given)

O2O (mean value over the 30 (6*5) experiments)
- discussion: low recognition rate for O2O compared to the 66%/33% partition -> problem of generalization?
- the system mainly learns the instrument instance instead of the instrument (each database contains a single instance of an instrument)

LODO (mean value over the 6 left-out databases)
- goal: to increase the number of instances of each instrument
- how: by combining several databases





6. Evaluation: Confusion matrix

Rows: recognized class; columns: original class, in the order piano, guitar, harp, viola-pizz, bass-pizz, cello-pizz, violin-pizz, viola, bass, cello, violin, french-horn, cornet, trombone, trumpet, tuba, flute, piccolo, recorder, bassoon, clarinet, english-horn, oboe.

Non-empty entries of each row, from left to right (empty cells are omitted, so column positions are not preserved):

piano:        36 3 4 4 2 2 1 1 2
guitar:       29 48 12 1 8
harp:         24 22 68 2 3 5 2
viola-pizz:   1 6 85 1 9 4 2
bass-pizz:    1 3 76 12
cello-pizz:   2 20 2 4 18 71 1 1
violin-pizz:  3 1 6 1 88 2 4 3
viola:        44 5 14
bass:         2 93 4 6 1 2 1 3
cello:        1 37 5 68 16 1 2 1 1 5
violin:       14 3 55 1 10 1
french-horn:  1 1 50 13 1 15 4 5 2
cornet:       2 1 30 3 13 2 1 4
trombone:     15 15 49 7 1 1 2
trumpet:      1 47 10 61 3 2 2
tuba:         2 2 23 7 79
flute:        1 2 5 2 3 4 1 77 10 10 1 23 2 4
piccolo:      1 4 71 5 5 8
recorder:     1 2 4 59
bassoon:      1 4 5 1 2 2 12 3 81 12 1
clarinet:     1 1 2 1 7 2 4 5 1 10 46 10 20
english-horn: 1 1 1 3 3 1 12 4
oboe:         4 1 4 9 3 1 3 1 14 49 58

Number of sounds per original class (column order): 146 159 130 54 186 170 97 225 280 356 264 242 53 202 157 140 323 83 39 203 212 41 184

Observation: low confusion between sustained / non-sustained sounds
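Per-class recognition rates are read off a confusion matrix as the diagonal entry divided by the total of its original class. The sketch below assumes the layout used here (rows are recognized classes, columns are original classes); the three-class counts are hypothetical and are not taken from the results above.

```python
def recognition_rates(confusion, classes):
    """confusion[r][c]: sounds of original class c recognized as class r."""
    rates = {}
    for c, name in enumerate(classes):
        column_total = sum(row[c] for row in confusion)
        rates[name] = confusion[c][c] / column_total if column_total else 0.0
    return rates


# Hypothetical counts for three classes (not from the study).
classes = ["piano", "guitar", "harp"]
confusion = [
    [8, 1, 0],   # recognized as piano
    [2, 6, 3],   # recognized as guitar
    [0, 3, 7],   # recognized as harp
]
rates = recognition_rates(confusion, classes)
mean_rate = sum(rates.values()) / len(rates)
```

Averaging the per-class rates, rather than counting correct sounds overall, keeps large classes from hiding poor results on small ones.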


Further observations on the confusion matrix
- largest confusions inside each instrument family
- lowest recognition rates -> smallest training sets
- confusion piano / guitar-harp
- cross-family confusions:
  - cornet -> bassoon
  - cornet -> English horn
  - flute -> clarinet
  - oboe -> flute
  - trombone -> flute





6. Evaluation: Main selected features

By FSA (IRMFSP)

Table columns (classification task): sustained/non-sustained | among non-sustained | among sustained | among bowed strings | among brass | among air reeds | among single/double reeds

Table rows (cells separated by |; rows with fewer than seven cells have lost their column alignment):

temporal increase | temporal decrease | temporal decrease | temporal decrease
temporal decrease | temporal centroid
temporal log-attack
spectral centroid | spectral centroid | spectral spread | spectral centroid | spectral centroid | spectral skewness | spectral centroid
spectral spread | spectral spread | spectral skewness | spectral spread | spectral skewness | spectral kurtosis + std | spectral spread
spectral skewness | spectral kurtosis + std | sharpness | spectral kurtosis std | spectral slope | spectral skewness
spectral variation | spectral skewness std | spectral variation std
spectral decrease std | spectral kurtosis
harmonic deviation | harmonic deviation | tristimulus | noisiness | harmonic deviation | tristimulus
tristimulus std | harmonic deviation
mfcc2,6 std | various mfcc | mfcc3,4,6 | xcorr 3, 6, 8 | xcorr3 | xcorr3


6. Evaluation: Main selected features

By decision tree (C4.5)

DTg_decr <= 10.033592
| DPi_specloud_m17-mm <= 0.013381
| | DSi_sc_v4-ss <= 0.164903
| | | DPi_specloud_m5-mm <= 0.0124: htb (18.0/11.0)
| | | DPi_specloud_m5-mm > 0.0124
| | | | DPi_sc_v2-mm <= 443.455871
| | | | | DPi_ss_v1-mm <= 477.501186
| | | | | | DPi_loud_v-mm <= 5.759929: cb-pizz (10.0/6.0)
| | | | | | DPi_loud_v-mm > 5.759929
| | | | | | | DTi_xcorr_m11-mm <= -0.272094: cor (50.0/7.0)
| | | | | | | DTi_xcorr_m11-mm > -0.272094
| | | | | | | | DPg_flustr_v7 <= 0.006614: tubb (10.0/5.0)
| | | | | | | | DPg_flustr_v7 > 0.006614
| | | | | | | | | DPi_Dmfcc_m3-mm <= -0.013356: cor (19.0/7.0)
| | | | | | | | | DPi_Dmfcc_m3-mm > -0.013356: tubb (67.0/3.0)
DTg_decr > 10.033592
| DTg_incr <= -0.744688
| | DPi_tri_v7-mm <= 0.035614
| | | DPg_roughn_v4 <= 0.120563
| | | | DTg_ed <= 0.278571
| | | | | DSi_skew_v6-mm <= -1.673777: vln-pizz (53.0)
| | | | | DSi_skew_v6-mm > -1.673777: alto-pizz (34.0/9.0)
| | | | DTg_ed > 0.278571: harp (10.0/4.0)
| | | DPg_roughn_v4 > 0.120563: picc (11.0/7.0)
| | DPi_tri_v7-mm > 0.035614


6. Evaluation: Main selected features

By decision tree with grouped decisions (PART)

DTg_incr <= -1.670978 AND
DTg_lat <= -0.982531 AND
DPi_specloud_m1-mm <= 0.012608 AND
DSi_variation_v1-mm > 0.001828 AND
DSi_kurto_v6-mm > 6.786784: vln-pizz (82.0/1.0)

DPi_ss_v4-mm > 0.897333 AND
DHi_devs_v3-mm > 2.790707 AND
DHi_oeratio_v1-mm > 2.250247: clsb (74.0/5.0)

DPi_ss_v4-mm > 0.950127 AND
DPi_DDmfcc_m7-ss > 0.009458 AND
DPg_roughn_v6 > 0.079858 AND
DHg_mod_am > 0.000158 AND
DPi_specloud_m21-mm > 0.026443 AND
DPi_specloud_m5-mm <= 0.114309 AND
DPi_DDmfcc_m3-mm > -0.000202: vln (66.0/8.0)





7. Instrument Class Similarity

Goal: check that the proposed tree structure corresponds to a natural class organization

How?
- most people use the Martin hierarchy
- 1) check the grouping among the decision tree leaves
- 2) MDS?

[Figure: grouping of the decision tree leaves (guitar, piano, violin pizz, trumpet, flute, harp, viola pizz, cello pizz, double pizz, violin, viola, cello, double, cornet, trombone, French horn, tuba, piccolo, recorder) set against the T1/T2/T3 taxonomy used for the evaluation.]

MDS on acoustic features? [Herrera AES114th]
- compute the dissimilarity between each pair of classes
  - how? compute the between-group F-matrix between class models
- observe the dissimilarity between the classes
  - how? MDS (multi-dimensional scaling) analysis
- MDS preserves the distances between the data as much as possible and allows representing them in a lower-dimensional space
- MDS is usually used for representing dissimilarity judgements (timbre similarity); it is used here on acoustic features
- MDS (Kruskal's STRESS formula 1 scaling method), 3-dimensional space
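For reference, Kruskal's STRESS formula 1 is the criterion that such an MDS minimizes. The notation below is an assumption made for this note: $d_{ij}$ is the distance between classes i and j in the low-dimensional configuration and $\hat{d}_{ij}$ is the disparity, a monotone transformation of the input dissimilarity.

```latex
\mathrm{STRESS}_1
  = \sqrt{\frac{\sum_{i<j} \left(d_{ij} - \hat{d}_{ij}\right)^2}
               {\sum_{i<j} d_{ij}^{2}}}
```

A value of 0 means that the 3-dimensional configuration reproduces the rank order of the class dissimilarities exactly; larger values indicate a poorer fit.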



[Figure: 3-dimensional MDS space of the instrument classes.]

Legend
- non-sustained: PIAN piano, GUI guitar, HARP harp, VLNP violin pizz, VLAP viola pizz, CELLP cello pizz, DBLP double pizz
- sustained: VLN violin, VLA viola, CELL cello, DBL double, TRPU trumpet, COR cornet, TBTB trombone, FHOR French horn, TUBB tuba, FLTU flute, PICC piccolo, RECO recorder, CLA clarinet, SAXTE tenor sax, SAXAL alto sax, SAXSO soprano sax, ACC accordion, OBOE oboe, BS bassoon, EHOR English horn

Clusters?
- non-sustained sounds
- bowed-string sounds
- brass sounds (TRPU?)
- mix between single/double reeds and brass instruments

Dimension 1: separates sustained sounds / non-sustained sounds
- negative values: PIAN, GUI, HARP, VLNP, VLAP, CELLP, DBLP
- -> attack time, decrease time

Dimension 2: brightness
- dark sounds: TUBB, BSN, TBTB, FHOR
- bright sounds: PICC, CLA, FLUT
- problem: DBL?

Dimension 3: ?
- separation of bowed strings (VLN, VLA, CELL, DBL)
- amount of modulation?


Conclusion

State of the art
- Martin [1999]: 76% (family), 39% for 14 instruments
- Eronen [2001]: 77% (family), 35% for 16 instruments

This study
- 85% (family), 64% for 23 instruments
- increased recognition rates mainly explained by the use of new features

Perspectives
- derive the tree structure automatically (analysis of decision trees?)
- test other classification algorithms (GMM, SVM, ...)
- test the system for other sound classes (non-instrumental sounds, sound FX)
- extend the system to musical phrases
- extend the system to polyphonic sounds
- extend the system to multi-source sounds

Links:
- http://www.cuidado.mu
- http://www.cs.waikato.ac.nz/ml/weka/
