Automatic Description and Classification of Instrumental Sounds
Geoffroy Peeters, Ircam (Analysis/Synthesis Team)
1. Introduction

Musical instrument sound classification
- Numerous studies on sound classification.
- Few of them address the problem of generalization of sound sources (recognition of the same source possibly recorded in different conditions, with various instrument manufacturers and players).

Evaluation of the system performance
- Training on a subset of the database and evaluating on the rest of the database does not prove any applicability to the classification of sounds which do not belong to the database.
- Martin [1999]: 76% (family), 39% for 14 instruments
- Eronen [2001]: 77% (family), 35% for 16 instruments

Goal of this study
- Study large database classification.
- How? New classification system:
  - extract a large number of features
  - new feature selection algorithm
  - compare flat and hierarchical gaussian classifiers
Outline: Feature extraction, Feature selection, Feature transform, Classification, Evaluation; Confusion matrix, Which features, Classes organization

Processing chain: Features extraction -> Temporal modeling -> Feature transform (Gaussianity) -> Feature selection (IRMFSP) -> Feature transform (LDA) -> Class modeling

2. Feature extraction

Instantaneous (frame-based) features
- harmonic features
- spectral shape features
- perceptual features
- MFCC, xcorr, zcr
- MPEG-7 LLDs (spectral flatness, crest)

Temporal modeling: mean, variance, derivative, modulation, polynomial

Global features (attack time, increase/decrease)
Features for sound recognition come from:
- the speech recognition community,
- previous studies on musical instrument sound classification,
- results of psycho-acoustical studies.
Each feature set is supposed to perform well for a specific task.

Principle:
1) extract a large set of features
2) filter the feature set a posteriori with a feature selection algorithm

[Figure: whole set of features (A B C D E F G H I J K L M N ...) and classes -> feature selection algorithm -> reduced set of features (C K N)]
2. Feature extraction: Audio features taxonomy

- Global descriptors
- Instantaneous descriptors
- Temporal modeling: mean, variance, modulation (pitch, energy)

[Figure: signal descriptors extraction module. Blocks: signal, segmentation, fundamental frequency; outputs: instantaneous descriptors, global descriptors, temporal modeling descriptors.]

[Figure: block diagram. Blocks: signal, signal frame, energy envelope, FFT, sinusoidal harmonic model, perceptual model; outputs: global temporal descriptors and instantaneous temporal, spectral, harmonic and perceptual descriptors.]

Descriptor families
- DT: temporal descriptors
- DE: energy descriptors
- DS: spectral descriptors
- DH: harmonic descriptors
- DP: perceptual descriptors
2. Feature extraction: DT/DE, temporal and energy descriptors

Processing: sound -> energy envelope

- DT.zero-crossing rate, DT.auto-correlation
- DT.log-attack time, DT.temporal increase, DT.temporal decrease, DT.temporal centroid, DT.effective duration
- DE.total energy, DE.energy of harmonic part, DE.energy of noise part
2. Feature extraction: DS, spectral descriptors

Processing: sound -> window -> FFT

- DS.centroid, DS.spread, DS.skewness, DS.kurtosis
- DS.slope, DS.decrease, DS.roll-off
- DS.variation
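As an illustration of the DS family, here is a minimal sketch of the first four spectral shape descriptors computed on one frame. The function name `spectral_shape`, the Hann window and the choice of treating the normalized magnitude spectrum as a distribution are assumptions made for this example; the slides do not give the exact definitions.

```python
import numpy as np

def spectral_shape(frame, sr):
    """Return (centroid, spread, skewness, kurtosis) of one signal frame."""
    window = np.hanning(len(frame))
    mag = np.abs(np.fft.rfft(frame * window))        # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)  # bin frequencies in Hz
    p = mag / mag.sum()                              # spectrum seen as a distribution
    centroid = np.sum(freqs * p)
    spread = np.sqrt(np.sum((freqs - centroid) ** 2 * p))
    skewness = np.sum((freqs - centroid) ** 3 * p) / spread ** 3
    kurtosis = np.sum((freqs - centroid) ** 4 * p) / spread ** 4
    return centroid, spread, skewness, kurtosis

# Usage on a synthetic 1 kHz sinusoid (hypothetical test signal)
sr = 11025
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 1000.0 * t)
centroid, spread, skewness, kurtosis = spectral_shape(frame, sr)
```

For a pure sinusoid the centroid stays close to the sinusoid frequency and the spread is small; a silent frame would need a guard against division by zero, which this sketch omits.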
2. Feature extraction: DH, harmonic descriptors

Processing: sound -> window -> FFT -> sinusoidal model

- DH.Centroid, DH.Spread, DH.Skewness, DH.Kurtosis
- DH.Slope, DH.Decrease, DH.Roll-off
- DH.Variation
- DH.Fundamental frequency
- DH.Noisiness, DH.OddEvenRatio, DH.Inharmonicity
- DH.Tristimulus
- DH.Deviation
2. Feature extraction: DP, perceptual descriptors / DV, various descriptors

Processing: sound -> window -> FFT -> perception (mid-ear filtering, Bark scale, Mel scale)

- DP.Centroid, DP.Spread, DP.Skewness, DP.Kurtosis
- DP.Slope, DP.Decrease, DP.Roll-off
- DP.Variation
- DP.Loudness, relative specific loudness
- DP.Sharpness, DP.Spread
- DP.Roughness, DP.FluctuationStrength
- DV.MFCC, DV.Delta-MFCC, DV.Delta-Delta-MFCC
- DV.SpectralFlatness, DV.SpectralCrest
2. Feature extraction: Audio features design

No consensus on the use of amplitude and frequency scales. All features are computed using the following scales:
- Frequency scale: linear / log / bark-bands
- Amplitude scale: linear / power / log
  (note: log(0.0) = -infinity -> normalization 24 bits)

Features must be independent of the recording level
- normalization in linear and in power scale
- normalization in logarithmic scale

Features must be independent of the sampling rate
- maximum frequency taken into account: 11025/2 Hz
- resampling (for zcr, xcorr)
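Spectral centroid $sc$ for amplitudes $a$ at frequencies $f$, with the amplitude multiplied by 2:

Linear scale: $sc = \frac{\sum f \cdot 2a}{\sum 2a} = \frac{\sum f \cdot a}{\sum a}$

Power scale: $sc = \frac{\sum f \cdot (2a)^2}{\sum (2a)^2} = \frac{\sum f \cdot a^2}{\sum a^2}$

Logarithmic scale: $sc = \frac{\sum f \cdot \log a}{\sum \log a}$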
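The requirement that features be independent of the recording level can be checked numerically. The sketch below uses an amplitude-weighted centroid as the example feature (the helper `centroid` and the toy values are assumptions for illustration): a gain applied to the amplitudes cancels out in linear and power scale, but not when the weights are log amplitudes, which is why a separate normalization is needed in logarithmic scale.

```python
import numpy as np

def centroid(freqs, weights):
    """Weighted mean frequency: sum(f * w) / sum(w)."""
    return np.sum(freqs * weights) / np.sum(weights)

freqs = np.array([100.0, 200.0, 300.0, 400.0])   # toy frequencies (Hz)
amps = np.array([4.0, 8.0, 2.0, 16.0])           # toy amplitudes
gain = 2.0                                        # change of recording level

sc_lin, sc_lin_gain = centroid(freqs, amps), centroid(freqs, gain * amps)
sc_pow, sc_pow_gain = centroid(freqs, amps ** 2), centroid(freqs, (gain * amps) ** 2)
sc_log, sc_log_gain = centroid(freqs, np.log(amps)), centroid(freqs, np.log(gain * amps))

level_independent = {
    "linear": bool(np.isclose(sc_lin, sc_lin_gain)),
    "power": bool(np.isclose(sc_pow, sc_pow_gain)),
    "log": bool(np.isclose(sc_log, sc_log_gain)),
}
```

In log scale the gain becomes an additive offset, log(gain * a) = log(gain) + log(a), so it no longer cancels between numerator and denominator.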
3. Feature selection algorithm (FSA)

Problem: using a high number of features
- some features can be irrelevant for the given task
- overfitting of the model to the training set (especially with LDA)
- classification models are difficult for humans to interpret

Goal of the feature selection algorithm (FSA): find the minimal set of
- criterion 1) features that are informative with respect to the classes
- criterion 2) features that provide non-redundant information

Forms of feature selection algorithm
- embedded: the FSA is part of the classifier
- filter: the FSA is distinct from the classifier and used before the classifier
- wrapper: the FSA makes use of the classification results
3. Feature selection algorithm: IRMFSP

Criterion 1: informative features with respect to the classes
- principle: "feature values for sounds belonging to a specific class should be separated from the values for all the other classes"
- measure: for a specific feature i, the ratio r of the between-class inertia B to the total inertia T

Criterion 2: features that provide non-redundant information
- apply an orthogonalization process to the feature space after the selection of each new feature (Gram-Schmidt orthogonalization)
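$$ r_i = \frac{B_i}{T_i} = \frac{\sum_{k=1}^{K} \frac{N_k}{N} (m_{i,k} - m_i)(m_{i,k} - m_i)'}{\frac{1}{N} \sum_{n=1}^{N} (f_{i,n} - m_i)(f_{i,n} - m_i)'} $$

$$ g_i = \frac{f_i}{\| f_i \|}, \qquad f_j' = f_j - (f_j \cdot g_i)\, g_i \quad \forall j \in F $$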
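IRMFSP: Inertia Ratio Maximization using Feature Space Projection

Starting from the whole set of features, repeat:
1) compute the inertia ratio for all features
2) take the feature with the largest ratio
3) project the whole feature space on the selected feature

[Figure: two classes with means m1 and m2, global mean m, deviations m-m1 and m-m2, within-class scatters W1 and W2; features f_i and f_i+1]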
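The selection loop (inertia ratio, then Gram-Schmidt projection) can be sketched as follows. This is an illustrative reading of the slides, not the author's implementation: the function name `irmfsp`, the initial centering of the features, and the stopping rule (stop when the best ratio falls below a fraction `threshold` of the first selected ratio, or after `max_features` features) are assumptions.

```python
import numpy as np

def irmfsp(X, y, max_features=20, threshold=0.01):
    """Return the indices of the selected features, in selection order.

    X: (n_sounds, n_features) feature matrix, y: (n_sounds,) class labels.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)              # assumption: features are centered first
    y = np.asarray(y)
    classes = np.unique(y)
    n = len(X)
    selected, first_ratio = [], None
    for _ in range(min(max_features, X.shape[1])):
        # X stays centered after each projection, so the global mean m_i is 0
        total = (X ** 2).mean(axis=0)                    # T_i
        between = np.zeros(X.shape[1])                   # B_i
        for k in classes:
            Xk = X[y == k]
            between += len(Xk) / n * Xk.mean(axis=0) ** 2
        # features with (numerically) zero inertia get a ratio of 0
        ratio = np.where(total > 1e-12, between / np.maximum(total, 1e-12), 0.0)
        if selected:
            ratio[selected] = 0.0
        best = int(np.argmax(ratio))
        if first_ratio is None:
            first_ratio = ratio[best]
        if ratio[best] <= 0.0 or ratio[best] / first_ratio < threshold:
            break
        selected.append(best)
        g = X[:, best] / np.linalg.norm(X[:, best])      # g_i = f_i / ||f_i||
        X = X - np.outer(g, g @ X)                       # f_j' = f_j - (f_j . g_i) g_i
    return selected

# Toy usage: f1 duplicates f0, f2 carries different information, f3 is noise
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2, 3], 50)
f0 = 4.0 * (y >= 2) + rng.normal(size=200)    # separates classes {0,1} from {2,3}
f1 = 2.0 * f0 + 3.0                           # redundant with f0
f2 = 3.0 * (y % 2) + rng.normal(size=200)     # separates even from odd classes
f3 = rng.normal(size=200)                     # irrelevant
selected = irmfsp(np.column_stack([f0, f1, f2, f3]), y, max_features=2)
```

Because the redundant copy collapses to zero after the projection, the second selected feature is the one carrying new information rather than the duplicate.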
Example: sustained/non-sustained sound separation
- computation of the B/T ratio for each feature
- feature with the weakest ratio (r=6.9e-6): specific loudness m8 mean
- feature with the highest ratio (r=0.58): energy temporal decrease
- first three selected dimensions:
  1st dim: temporal decrease
  2nd dim: spectral centroid
  3rd dim: temporal increase
4. Feature transformation: LDA

Linear Discriminant Analysis
- find linear combinations of the features that maximize the discrimination between classes: F -> F'
- total inertia: $T = \frac{1}{n} \sum_{i=1}^{n} (d_i - m)(d_i - m)'$
- between-class inertia: $B = \sum_{k=1}^{K} \frac{n_k}{n} (m_k - m)(m_k - m)'$
- transform the initial feature space F by a transformation matrix U in order to maximize the ratio $r_u = \frac{u'Bu}{u'Tu}$
- solution: the eigenvectors of $T^{-1}B$, associated with the eigenvalues (discriminative power)
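A minimal NumPy sketch of this transform, assuming T is the total inertia matrix and B the between-class inertia matrix of the data, and taking the LDA axes as the eigenvectors of T^-1 B sorted by eigenvalue. The function name `lda_matrix` and the toy data are illustrative assumptions.

```python
import numpy as np

def lda_matrix(D, y, n_components):
    """Return (U, eigenvalues): columns of U are the leading eigenvectors of T^-1 B."""
    D = np.asarray(D, dtype=float)
    y = np.asarray(y)
    n = len(D)
    m = D.mean(axis=0)
    T = (D - m).T @ (D - m) / n                       # total inertia
    B = np.zeros_like(T)                              # between-class inertia
    for k in np.unique(y):
        Dk = D[y == k]
        diff = (Dk.mean(axis=0) - m)[:, None]
        B += len(Dk) / n * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(T, B))   # T^-1 B
    order = np.argsort(eigvals.real)[::-1][:n_components]     # largest first
    return eigvecs.real[:, order], eigvals.real[order]

# Toy usage: two classes that differ only along the first axis
rng = np.random.default_rng(0)
D = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([3, 0], 1.0, (200, 2))])
y = np.repeat([0, 1], 200)
U, power = lda_matrix(D, y, n_components=1)
projected = D @ U                                     # F -> F'
```

Each eigenvalue equals the ratio u'Bu / u'Tu reached along its eigenvector, so it lies between 0 and 1 and can be read as the discriminative power of that axis.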
5. Class modeling: flat classifiers

Flat gaussian classifier (F-GC)
- "Flat" = all classes are considered on the same level
- Training: model each class k by a multi-dimensional gaussian pdf (mean vector, covariance matrix)
- Evaluation: Bayes formula

Flat KNN classifier (F-KNN)
- instance-based algorithm
- assign to the input sound the majority class among its K nearest neighbors in the feature space
- Euclidean distance => weighting of the axes?
- applied to the output of the LDA (implicit weighting of the axes)

[Figure: processing at one node (top; node j-1, node j, node j+1, ...)]
TRAINING
- feature selection: best set of features f1, f2, ..., fN?
- feature transformation: Linear Discriminant Analysis matrix?
- for each class: gaussian pdf parameters estimation
EVALUATION
- feature selection: use only f1, f2, ..., fN
- feature transformation: apply matrix
- for each class: evaluate Bayes formula
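The flat gaussian classifier can be sketched in a few lines: one mean vector and covariance matrix per class at training time, and Bayes formula (here in the log domain) at evaluation time. The class name, the use of class frequencies as priors and the small diagonal term added to each covariance are assumptions for this example.

```python
import numpy as np

class FlatGaussianClassifier:
    """One multi-dimensional gaussian per class; decision by Bayes formula."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.params_ = []
        for k in self.classes_:
            Xk = X[y == k]
            mean = Xk.mean(axis=0)
            cov = np.atleast_2d(np.cov(Xk, rowvar=False))
            cov = cov + 1e-6 * np.eye(X.shape[1])     # assumption: keeps cov invertible
            log_prior = np.log(len(Xk) / len(X))      # assumption: prior = class frequency
            self.params_.append(
                (mean, np.linalg.inv(cov), np.linalg.slogdet(cov)[1], log_prior))
        return self

    def log_posterior(self, X):
        """log p(x|k) + log p(k), up to a constant shared by all classes."""
        X = np.asarray(X, dtype=float)
        scores = []
        for mean, inv_cov, logdet, log_prior in self.params_:
            d = X - mean
            maha = np.einsum("ij,jk,ik->i", d, inv_cov, d)   # Mahalanobis distances
            scores.append(log_prior - 0.5 * (maha + logdet))
        return np.column_stack(scores)

    def predict(self, X):
        return self.classes_[np.argmax(self.log_posterior(X), axis=1)]

# Toy usage with two well separated classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([6, 6], 1.0, (100, 2))])
y = np.array(["flute"] * 100 + ["piano"] * 100)
clf = FlatGaussianClassifier().fit(X, y)
accuracy = np.mean(clf.predict(X) == y)
```

In the chain described on the slides, `X` would be the output of the feature selection and LDA steps rather than raw features.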
5. Class modeling: hierarchical classifiers

Hierarchical gaussian classifier (H-GC)
- Training: a tree of flat gaussian classifiers; each node has its own feature selection (FSA), feature transformation (FTA) and F-GC
  - the tree construction is supervised (unlike a decision tree)
  - only the subset of sounds belonging to the classes of the current node is used
- Evaluation: the local probability decides which branch of the tree to follow

Advantages of H-GC
- Learning facilities: it is easier to learn differences in a small subset of classes
- Reduced class confusion: benefits from the higher recognition rate at the higher levels of the tree

Hierarchical KNN classifier (H-KNN)

[Figure: tree with a top node, node i and nodes j-1, j, j+1, ...; the TRAINING and EVALUATION steps are run at each node]
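The tree logic can be sketched independently of the per-node classifier. In the example below each node trains its own classifier on the sounds of the classes below it, and evaluation follows the locally best branch down to a leaf. The `NearestMean` class is only a stand-in for the per-node chain (feature selection, LDA, gaussian classifier); the tree encoding as nested dictionaries and all names are assumptions.

```python
import numpy as np

class NearestMean:
    """Stand-in for the per-node chain (feature selection, LDA, gaussian classifier)."""
    def fit(self, X, labels):
        self.labels_ = sorted(set(labels))
        labels = np.asarray(labels)
        self.means_ = np.array([X[labels == l].mean(axis=0) for l in self.labels_])
        return self

    def predict_one(self, x):
        return self.labels_[int(np.argmin(np.linalg.norm(self.means_ - x, axis=1)))]

def leaves(tree):
    """Leaf class names below a node; a node is a dict {child name: subtree or None}."""
    out = []
    for name, sub in tree.items():
        out += [name] if sub is None else leaves(sub)
    return out

def train(tree, X, y):
    """Train one classifier per node, using only the sounds of the classes below it."""
    y = np.asarray(y)
    branch_of = {leaf: name for name, sub in tree.items()
                 for leaf in ([name] if sub is None else leaves(sub))}
    mask = np.isin(y, list(branch_of))
    clf = NearestMean().fit(X[mask], [branch_of[c] for c in y[mask]])
    children = {name: train(sub, X, y) for name, sub in tree.items() if sub is not None}
    return clf, children

def classify(model, x):
    """Follow the locally chosen branch until a leaf class is reached."""
    clf, children = model
    branch = clf.predict_one(x)
    return classify(children[branch], x) if branch in children else branch

# Toy usage: two families of two instruments each, in a 2-D feature space
rng = np.random.default_rng(2)
centers = {"piano": (0, 0), "guitar": (0, 2), "violin": (10, 0), "flute": (10, 2)}
X = np.vstack([rng.normal(c, 0.1, (20, 2)) for c in centers.values()])
y = np.repeat(list(centers), 20)
tree = {"non-sustained": {"piano": None, "guitar": None},
        "sustained": {"violin": None, "flute": None}}
model = train(tree, X, y)
```

Calling `classify(model, np.array([10.1, 1.9]))` first picks the "sustained" branch and then decides between violin and flute only.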
Decision trees: Binary Entropy Reduction Tree (BERT), C4.5, Partial Decision Tree (PART)
6. Evaluation: Taxonomy used

Three different levels
- T1: sustained/non-sustained sounds
- T2: instrument families
- T3: instrument names

[Figure: taxonomy tree]
Instrument
- Non sustained
  - Struck strings: piano
  - Plucked strings: guitar, harp
  - Pizz strings: violin, viola, cello, double bass
- Sustained
  - Bowed strings: violin, viola, cello, double bass
  - Brass: trumpet, cornet, trombone, French horn, tuba
  - Air reeds: flute, piccolo, recorder
  - Single/double reeds: clarinet, tenor sax, alto sax, soprano sax, accordion, oboe, bassoon, English horn
6. Evaluation: Test set

6 databases
- Ircam Studio OnLine (SOL): 1323 sounds, 16 instruments
- Iowa University database: 816 sounds, 12 instruments
- McGill University database: 585 sounds, 23 instruments
- Microsoft "Musical Instruments" CD-ROM: 216 sounds, 20 instruments
- two commercial databases: Pro (532 sounds, 20 instruments) and Vi (691 sounds, 18 instruments)
- total = 4163 sounds

Notes
- 27 instruments have been considered
- a large pitch range has been considered (4 octaves on average)
- no muted or martelé/staccato sounds

[Figure: number of sounds (0 to 400) per instrument and per database (Vi, Pro, Microsoft, McGill, Iowa, SOL). Instruments: piano, guitar, harp, viola-pizz, double-pizz, cello-pizz, violin-pizz, viola, double, cello, violin, french horn, cornet, trombone, trumpet, tuba, flute, piccolo, recorder, accordion, bassoon, clarinet, english-horn, oboe, sax sop, sax alto, sax tenor]
6. Evaluation: Evaluation process

1) Random 66%/33% partition of the database (50 sets)
2) One to One (O2O) [Livshin 2003]: each database is used in turn to classify all other databases
3) Leave One Database Out (LODO) [Livshin 2003]: all databases except one are used in turn to classify the remaining one

[Figure: the six databases DB1 to DB6 used as training and test sets]
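The two cross-database protocols reduce to simple enumerations of (training, test) pairs. The sketch below is illustrative; the function names and the short database labels are assumptions.

```python
def one_to_one(databases):
    """O2O: each database is used in turn to classify every other database."""
    for train in databases:
        for test in databases:
            if train != test:
                yield [train], [test]

def leave_one_database_out(databases):
    """LODO: all databases except one are used to classify the remaining one."""
    for test in databases:
        yield [db for db in databases if db != test], [test]

dbs = ["SOL", "Iowa", "McGill", "Microsoft", "Pro", "Vi"]
o2o = list(one_to_one(dbs))                 # 6 * 5 = 30 experiments
lodo = list(leave_one_database_out(dbs))    # 6 experiments
```

With six databases this gives 30 O2O experiments and 6 LODO experiments; the reported figure for each protocol is then a mean over those runs.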
6. Evaluation: Results

Comparison of LDA, CFS (Weka) and IRMFSP:

                                 T1           T2           T3
LDA                              96           89           86
CFS weka                         99.0 (0.5)   93.2 (0.8)   60.8 (12.9)
IRMFSP (t=0.01, nbdescmax=20)    99.2 (0.4)   95.8 (1.2)   95.1 (1.2)

O2O (mean value over the 30 (6*5) experiments):

        T1   T2   T3
F-GC    89   57   30
H-GC    93   63   38

Discussion
- low recognition rate for O2O compared to 66%/33% -> problem of generalization?
- the system mainly learns the instrument instance instead of the instrument (each database contains a single instance of an instrument)

LODO (mean value over the 6 left-out databases)
- Goal: to increase the number of instances of each instrument
- How: by combining several databases

                      T1   T2   T3
F-GC                  98   78   55
F-GC (BC+LDA)         99   81   54
F-KNN (K=10, LDA)     99   77   51
H-GC                  98   80   57
H-GC (BC+LDA)         99   85   64
H-KNN (K=10, LDA)     99   84   64
BERT                  95   65   42
C4.5                   -   65   48
PART                   -   71   42
6. Evaluation: Confusion matrix

[Figure: confusion matrix (axes: original class, recognized class) over the 23 classes listed below. Each row lists its non-empty cells from left to right.]

Classes, in column order: piano, guitar, harp, viola-pizz, bass-pizz, cello-pizz, violin-pizz, viola, bass, cello, violin, french-horn, cornet, trombone, trumpet, tuba, flute, piccolo, recorder, bassoon, clarinet, english-horn, oboe

piano: 36 3 4 4 2 2 1 1 2
guitar: 29 48 12 1 8
harp: 24 22 68 2 3 5 2
viola-pizz: 1 6 85 1 9 4 2
bass-pizz: 1 3 76 12
cello-pizz: 2 20 2 4 18 71 1 1
violin-pizz: 3 1 6 1 88 2 4 3
viola: 44 5 14
bass: 2 93 4 6 1 2 1 3
cello: 1 37 5 68 16 1 2 1 1 5
violin: 14 3 55 1 10 1
french-horn: 1 1 50 13 1 15 4 5 2
cornet: 2 1 30 3 13 2 1 4
trombone: 15 15 49 7 1 1 2
trumpet: 1 47 10 61 3 2 2
tuba: 2 2 23 7 79
flute: 1 2 5 2 3 4 1 77 10 10 1 23 2 4
piccolo: 1 4 71 5 5 8
recorder: 1 2 4 59
bassoon: 1 4 5 1 2 2 12 3 81 12 1
clarinet: 1 1 2 1 7 2 4 5 1 10 46 10 20
english-horn: 1 1 1 3 3 1 12 4
oboe: 4 1 4 9 3 1 3 1 14 49 58
number of sounds: 146 159 130 54 186 170 97 225 280 356 264 242 53 202 157 140 323 83 39 203 212 41 184

Observations
- Low confusion between sustained / non-sustained sounds
- Largest confusions inside each instrument family
- Lowest recognition rates -> smallest training sets
- Confusion piano / guitar-harp
- Cross-family confusions:
  Cornet -> Bassoon
  Cornet -> English horn
  Flute -> Clarinet
  Oboe -> Flute
  Trombone -> Flute
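A confusion matrix of this kind can be computed from pairs of (original, recognized) labels. The sketch below normalizes each original class to percentages so that classes with different numbers of sounds are comparable; the orientation (rows = recognized class, columns = original class) and the function name are assumptions for this example.

```python
import numpy as np

def confusion_percent(original, recognized, classes):
    """Rows: recognized class, columns: original class; each column sums to 100."""
    index = {c: i for i, c in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)))
    for o, r in zip(original, recognized):
        counts[index[r], index[o]] += 1
    totals = counts.sum(axis=0, keepdims=True)        # number of sounds per original class
    return 100.0 * counts / np.where(totals == 0, 1, totals)

# Toy usage: one of the two piano sounds is recognized as harp
classes = ["piano", "harp"]
original = ["piano", "piano", "harp", "harp"]
recognized = ["piano", "harp", "harp", "harp"]
cm = confusion_percent(original, recognized, classes)
```

The diagonal of `cm` then holds the per-class recognition rates, and off-diagonal cells show which classes absorb the errors.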
6. Evaluation: Main selected features, by FSA (IRMFSP)

Classification tasks (columns): sust./non-sust. | among non-sust. | among sust. | among bowed-string | among brass | among air reeds | among sing/dbl reeds

Selected features (each row lists its non-empty cells from left to right):
temporal increase | temporal decrease | temporal decrease | temporal decrease
temporal decrease | temporal centroid
temporal log-attack
spectral centroid | spectral centroid | spectral spread | spectral centroid | spectral centroid | spectral skewness | spectral centroid
spectral spread | spectral spread | spectral skewness | spectral spread | spectral skewness | spectral kurtosis + std | spectral spread
spectral skewness | spectral kurtosis + std | sharpness | spectral kurtosis std | spectral slope | spectral skewness
spectral variation | spectral skewness std | spectral variation std
spectral decrease std | spectral kurtosis
harmonic deviation | harmonic deviation | tristimulus | noisiness | harmonic deviation | tristimulus
tristimulus std | harmonic deviation
mfcc2,6 std | various mfcc | mfcc3,4,6 | xcorr 3, 6, 8 | xcorr3 | xcorr3
6. Evaluation: Main selected features, by decision tree (C4.5)

DTg_decr <= 10.033592
| DPi_specloud_m17-mm <= 0.013381
| | DSi_sc_v4-ss <= 0.164903
| | | DPi_specloud_m5-mm <= 0.0124: htb (18.0/11.0)
| | | DPi_specloud_m5-mm > 0.0124
| | | | DPi_sc_v2-mm <= 443.455871
| | | | | DPi_ss_v1-mm <= 477.501186
| | | | | | DPi_loud_v-mm <= 5.759929: cb-pizz (10.0/6.0)
| | | | | | DPi_loud_v-mm > 5.759929
| | | | | | | DTi_xcorr_m11-mm <= -0.272094: cor (50.0/7.0)
| | | | | | | DTi_xcorr_m11-mm > -0.272094
| | | | | | | | DPg_flustr_v7 <= 0.006614: tubb (10.0/5.0)
| | | | | | | | DPg_flustr_v7 > 0.006614
| | | | | | | | | DPi_Dmfcc_m3-mm <= -0.013356: cor (19.0/7.0)
| | | | | | | | | DPi_Dmfcc_m3-mm > -0.013356: tubb (67.0/3.0)
DTg_decr > 10.033592
| DTg_incr <= -0.744688
| | DPi_tri_v7-mm <= 0.035614
| | | DPg_roughn_v4 <= 0.120563
| | | | DTg_ed <= 0.278571
| | | | | DSi_skew_v6-mm <= -1.673777: vln-pizz (53.0)
| | | | | DSi_skew_v6-mm > -1.673777: alto-pizz (34.0/9.0)
| | | | DTg_ed > 0.278571: harp (10.0/4.0)
| | | DPg_roughn_v4 > 0.120563: picc (11.0/7.0)
| | DPi_tri_v7-mm > 0.035614
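The numbers in parentheses at each leaf can be read as follows, assuming the tree was printed with Weka's usual convention: the first value is the number of training instances that reach the leaf, and the value after the slash (when present) is how many of them are misclassified. The sketch below parses a few of the lines quoted in this section; the helper names `parse_leaf` and `leaf_purity` are illustrative and not part of any tool.

```python
import re

# Matches the leaf part of a tree line, e.g. ": cor (50.0/7.0)" or ": vln-pizz (53.0)".
LEAF = re.compile(r":\s*([\w-]+)\s*\(([\d.]+)(?:/([\d.]+))?\)\s*$")

def parse_leaf(line):
    """Return (label, reached, misclassified), or None for a non-leaf line."""
    match = LEAF.search(line)
    if match is None:
        return None
    label, reached, wrong = match.groups()
    return label, float(reached), float(wrong) if wrong else 0.0

def leaf_purity(line):
    """Fraction of the instances at the leaf that carry the leaf's label."""
    label, reached, wrong = parse_leaf(line)
    return (reached - wrong) / reached

lines = [
    "| | | DPi_specloud_m5-mm <= 0.0124: htb (18.0/11.0)",
    "| | | | | | | DTi_xcorr_m11-mm <= -0.272094: cor (50.0/7.0)",
    "| | | | | DSi_skew_v6-mm <= -1.673777: vln-pizz (53.0)",
    "| DTg_incr <= -0.744688",  # internal node, no leaf annotation
]
for line in lines:
    leaf = parse_leaf(line)
    if leaf is not None:
        print(leaf[0], round(leaf_purity(line), 2))
```

Leaves with a low purity, such as the `htb` leaf above, cover few instances and many errors, which helps to judge how much weight a given feature test deserves.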
Main selected features: by decision tree, grouped decisions (PART)
DTg_incr <= -1.670978 AND
DTg_lat <= -0.982531 AND
DPi_specloud_m1-mm <= 0.012608 AND
DSi_variation_v1-mm > 0.001828 AND
DSi_kurto_v6-mm > 6.786784: vln-pizz (82.0/1.0)

DPi_ss_v4-mm > 0.897333 AND
DHi_devs_v3-mm > 2.790707 AND
DHi_oeratio_v1-mm > 2.250247: clsb (74.0/5.0)

DPi_ss_v4-mm > 0.950127 AND
DPi_DDmfcc_m7-ss > 0.009458 AND
DPg_roughn_v6 > 0.079858 AND
DHg_mod_am > 0.000158 AND
DPi_specloud_m21-mm > 0.026443 AND
DPi_specloud_m5-mm <= 0.114309 AND
DPi_DDmfcc_m3-mm > -0.000202: vln (66.0/8.0)
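PART produces a decision list: the rules are tried from top to bottom, and the first rule whose conditions all hold gives the class. The sketch below applies two of the rules quoted in this section in that way. The feature values of the samples are hypothetical, and the `"unknown"` fallback only stands in for the rules that are not reproduced here.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt}

# Two of the PART rules quoted in this section: (conditions, class label).
RULES = [
    ([("DTg_incr", "<=", -1.670978),
      ("DTg_lat", "<=", -0.982531),
      ("DPi_specloud_m1-mm", "<=", 0.012608),
      ("DSi_variation_v1-mm", ">", 0.001828),
      ("DSi_kurto_v6-mm", ">", 6.786784)], "vln-pizz"),
    ([("DPi_ss_v4-mm", ">", 0.897333),
      ("DHi_devs_v3-mm", ">", 2.790707),
      ("DHi_oeratio_v1-mm", ">", 2.250247)], "clsb"),
]

def classify(features, rules=RULES, default="unknown"):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(OPS[op](features[name], value) for name, op, value in conditions):
            return label
    return default  # stand-in for the rules not reproduced here

# Hypothetical feature vectors, NOT measurements from the study.
pizz_like = {
    "DTg_incr": -2.0, "DTg_lat": -1.2, "DPi_specloud_m1-mm": 0.01,
    "DSi_variation_v1-mm": 0.002, "DSi_kurto_v6-mm": 7.5,
    "DPi_ss_v4-mm": 0.5, "DHi_devs_v3-mm": 1.0, "DHi_oeratio_v1-mm": 1.0,
}
clsb_like = dict(pizz_like, **{
    "DTg_incr": 0.0, "DPi_ss_v4-mm": 0.95,
    "DHi_devs_v3-mm": 3.0, "DHi_oeratio_v1-mm": 2.5,
})
print(classify(pizz_like), classify(clsb_like))
```

Because the order of the rules matters, a sample that satisfies several rules is always assigned the class of the earliest one.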
7. Instrument Class Similarity
Goal: check that the proposed tree structure corresponds to the natural class organization
How?
  Most people use the Martin hierarchy
  1) check the grouping among the decision tree leaves
  2) MDS?
[Figure: instrument labels Guitar, Piano, Violin pizz, Trumpet, Flute, Harp, Viola pizz, Cello pizz, Double pizz, Violin, Viola, Cello, Double, Cornet, Trombone, French Horn, Tuba, Piccolo, Recorder; group labels Strings, Woodwinds]
Class tree (levels T1, T2, T3):
Instrument
  Non Sustained
    Struck Strings: Piano
    Plucked Strings: Guitar, Harp
    Pizz Strings: Violin, Viola, Cello, Double
  Sustained
    Bowed Strings: Violin, Viola, Cello, Double
    Brass: Trumpet, Cornet, Trombone, French Horn, Tuba
    Single/Double Reeds
      Single Reeds: Clarinet, Tenor sax, Alto sax, Soprano sax, Accordion
      Double Reeds: Oboe, Bassoon, English horn
    Air Reeds: Flute, Piccolo, Recorder
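One way to compare such a tree with a measured class similarity is to turn the tree itself into a dissimilarity. The sketch below stores the class tree as a nested dictionary and counts on how many levels two instruments differ. This is only an illustration: the distance `tree_distance` is a hypothetical measure, not one used in the study, single and double reeds are kept as two separate families for simplicity, and the pizzicato leaves are renamed "Violin pizz" etc. to keep the leaf names unique.

```python
# Class tree as read from the slide; the leaves are instrument names.
TREE = {
    "Non Sustained": {
        "Struck Strings": ["Piano"],
        "Plucked Strings": ["Guitar", "Harp"],
        "Pizz Strings": ["Violin pizz", "Viola pizz", "Cello pizz", "Double pizz"],
    },
    "Sustained": {
        "Bowed Strings": ["Violin", "Viola", "Cello", "Double"],
        "Brass": ["Trumpet", "Cornet", "Trombone", "French Horn", "Tuba"],
        "Air Reeds": ["Flute", "Piccolo", "Recorder"],
        "Single Reeds": ["Clarinet", "Tenor sax", "Alto sax", "Soprano sax", "Accordion"],
        "Double Reeds": ["Oboe", "Bassoon", "English horn"],
    },
}

def path(instrument):
    """Return the (group, family, instrument) path from the root to a leaf."""
    for group, families in TREE.items():
        for family, leaves in families.items():
            if instrument in leaves:
                return (group, family, instrument)
    raise KeyError(instrument)

def tree_distance(a, b):
    """Number of levels below the deepest node shared by both paths (0 to 3)."""
    pa, pb = path(a), path(b)
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return len(pa) - shared

for a, b in [("Cornet", "Trombone"), ("Cornet", "Bassoon"), ("Piano", "Tuba")]:
    print(a, b, tree_distance(a, b))
```

With this measure a within-family pair is at distance 1, a cross-family pair inside the sustained group at distance 2, and a sustained / non-sustained pair at distance 3.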
MDS on acoustic features? [Herrera AES114th]
  Compute the dissimilarity between each class
    How? Compute the between-group F-matrix between class models
  Observe the dissimilarity between the classes
    How? MDS (multi-dimensional scaling) analysis
      MDS preserves the distances between the data as much as possible and allows representing them in a lower-dimensional space
      MDS is usually used for representing dissimilarity judgements (timbre similarity); here it is applied to acoustic features
      MDS (Kruskal's STRESS formula 1 scaling method), 3-dimensional space
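Kruskal's STRESS formula 1 measures how well a low-dimensional configuration reproduces the target dissimilarities: it is the square root of the sum of squared differences between the configuration distances and the target values, divided by the sum of squared configuration distances. The sketch below evaluates that quantity for a given set of points. It is a simplification: the raw dissimilarities are used directly as targets, whereas Kruskal's non-metric method first replaces them by monotonically regressed disparities. The function names and the coordinates are illustrative assumptions.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distance for every unordered pair (i, j) with i < j."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }

def kruskal_stress1(dissimilarities, points):
    """STRESS-1 of a configuration, using raw dissimilarities as targets."""
    d = pairwise_distances(points)
    num = sum((d[pair] - dissimilarities[pair]) ** 2 for pair in d)
    den = sum(d[pair] ** 2 for pair in d)
    return math.sqrt(num / den)

# Hypothetical 3-D coordinates for four classes, NOT values from the study.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]

# Targets that the configuration reproduces exactly give a stress of zero.
exact = pairwise_distances(points)
print(kruskal_stress1(exact, points))

# Targets that are twice as large as the configuration distances.
doubled = {pair: 2.0 * value for pair, value in exact.items()}
print(kruskal_stress1(doubled, points))
```

An MDS algorithm moves the points so as to minimize this value; the sketch only scores a fixed configuration.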
Clusters?
  non-sustained sounds
  bowed-string sounds
  brass sounds (TRPU?)
  mix between single/double reeds and brass instruments
Legend:
  PIAN Piano, GUI Guitar, HARP Harp, VLNP Violin pizz, VLAP Viola pizz, CELLP Cello pizz, DBLP Double pizz
  VLN Violin, VLA Viola, CELL Cello, DBL Double, TRPU Trumpet, COR Cornet, TBTB Trombone, FHOR French horn, TUBB Tuba, FLTU Flute, PICC Piccolo, RECO Recorder, CLA Clarinet, SAXTE Tenor sax, SAXAL Alto sax, SAXSO Soprano sax, ACC Accordion, OBOE Oboe, BS Bassoon, EHOR English horn
Dimension 1: separates sustained sounds from non-sustained sounds
  negative values: PIAN, GUI, HARP, VLNP, VLAP, CELLP, DBLP
  -> attack time, decrease time
Dimension 2: brightness
  dark sounds: TUBB, BSN, TBTB, FHOR
  bright sounds: PICC, CLA, FLUT
  problem: DBL?
Dimension 3: ?
  separation of bowed strings (VLN, VLA, CELL, DBL)
  amount of modulation?
Conclusion
State of the art
  Martin [1999]: 76% (family), 39% for 14 instruments
  Eronen [2001]: 77% (family), 35% for 16 instruments
This study
  85% (family), 64% for 23 instruments
  increased recognition rates mainly explained by the use of new features
Perspectives
  derive the tree structure automatically (analysis of the decision tree?)
  test other classification algorithms (GMM, SVM, …)
  test the system on other sound classes (non-instrumental sounds, sound FX)
  extend the system to musical phrases
  extend the system to polyphonic sounds
  extend the system to multi-source sounds
Links:
  http://www.cuidado.mu
  http://www.cs.waikato.ac.nz/ml/weka/