Towards Characterization of Music Timbre based on Contour Chroma

Hassan Ezzaidi1, Mohammed Bahoura2, and Glenn Eric Hall1

1 Université du Québec à Chicoutimi, 555, boul. de l'Université, Chicoutimi, Qc, Canada, G7H 2B1.

{hezzaidi,glennerichall}@uqac.ca

2 Université du Québec à Rimouski, 300, allée des Ursulines, Rimouski, Qc, Canada, G5L 3A1.

[email protected]

Abstract. Chroma, which conveys mainly tonal content, is considered a powerful representation and is widely used in music information retrieval applications. In this paper, a new description of music timbre based only on chromagram contours is investigated, allowing the identification of both the tonal content and, in particular, the instrument timbre (identity). After several pre-processing and transformation steps, four classification methods are investigated: support vector machine (SVM), neural network, invariant moments, and template matching based on cross-correlation. All methods use only one pattern per instrument in the training phase. Results are very promising, and graphical analysis demonstrates that the contours depend on the musical instrument. In this first investigation, a performance of about 70% is obtained with the template matching and SVM techniques.

Keywords: instrument; recognition; chroma; contour; multimedia.

1 Introduction

The explosion of mass media, particularly the Internet and digital audio formats, and the large amount of musical content available on public and private media generate new needs such as maintenance, classification, and authentication tasks. One way to accomplish these tasks is to extract from the raw audio signal compact and pertinent descriptors that can be used as basic prototypes for matching techniques. Musical sound is generally characterized by pitch, loudness, and timbre. Timbre is what distinguishes two instruments playing the same note with the same pitch and loudness, and timbre descriptors are viewed as among the most important features for conveying the identity of a musical instrument. It is possible to notice analogies and similarities between speech recognition and musical analysis systems. Identifying musical instruments is comparable to speaker identification, though very different in practice. In the majority of speaker identification systems, the identity of the speaker is represented by just a set of Mel Frequency Cepstral Coefficients (MFCC) as features and a Gaussian Mixture Model (GMM) as classifier. In the case of musical instruments, features and models vary considerably according to contextual factors (musician, instrument manufacturer, tuning, etc.). Until now, there has been no consensus on the attributes determining the exact signature of musical instruments, in contrast to speaker recognition systems, in which MFCC features alone, without any aggregation of other parameters, achieve the best performance. Previous works have opted to represent musical timbre by concatenating several descriptors exploiting different assumptions related to the human perceptual dimensions or to psycho-acoustical knowledge. In principle, the components estimated from the acoustic signal, assumed to determine the dimensions of timbre, are generally derived from the spectral envelope, temporal envelope, MFCC, Linear Predictive Coding (LPC), statistical moments, pitch, and onset. Psycho-acousticians sketch timbre as a geometric construction built from similarity ratings. Multidimensional scaling is generally used to find the sound attributes that correlate best with the perceptual dimensions (brightness, smoothness, compactness, etc.) [1,2]. This variety of parameters supports the view that it is not possible, with the available technical tools, to represent timbre by a single instrument-based feature, and therefore by a common family-based descriptor for different musical sounds. The challenge thus becomes to determine which attributes best characterize the multidimensional perceptual timbre.

In this work, the main research contribution is to propose and investigate a new, single representation describing and characterizing the timbre of music without any concatenation of descriptors. The approach consists of extracting several contour levels from the chromagram, called the chromatimbre, which appears to be unique for each instrument family and conveys both tonal content and identity information. A classical feed-forward neural network (NN), a support vector machine (SVM), template matching based on the cross-correlation principle, and the statistical invariant moments are used as classifiers in the experiments of this work. Results show great potential and a promising approach for future work.

2 State of the art

All features proposed in recent years attempt to describe the multidimensional vector representing the perceptual human sensation in the timbre space. For several decades, various parameters derived from attack time, release time, spectral centroid, harmonic partials, onset, and cutoff frequency have exhibited relevant information for characterizing quality attributes of instrument timbre, such as orchestral instruments, bowed strings, brightness, and harmonic or inharmonic structure [1,2,3]. Recently, many features characterizing the sound source excitation and the resonant instrument structure, extracted from a transformed correlogram, were suggested in [4]. The 31 features extracted from each tone, based on statistical measures, relate to pitch, harmonic structure, attack, tremolo, and vibrato properties, and are assumed to capture partial information about tone color (timbre). In addition, assuming that the human auditory perception system is organized and recognizes sounds in a hierarchical manner, a similar classification scheme was suggested and compared in the same work [4]. Results show a score improvement of about 6% for individual instrument recognition and 8% for instrument family recognition. Eronen [5] instead exploited psychoacoustic knowledge to determine feature parameters describing music timbre. Essentially, statistical measures based on pitch, onset, amplitude modulation, MFCC, LPC, and their derivatives are investigated as parameters. Results show that MFCC and their derivatives, extracted from the onset and steady-state segments, give the best performance compared with the other aggregated features. A performance comparison between direct and hierarchical classification techniques was examined in [4,6], showing a particular interest in the latter. In particular, Hall et al. [6] used 6698 notes with the hierarchical classification proposed in [5] and constructed a system where the feature vector is dynamic and changes depending on each level and each node of the hierarchical tree. The feature vector was optimized and determined with the Sequential Forward Selection (SFS) algorithm. Using the Real World Computing (RWC) music database [7], the results showed a gain in musical instrument recognition performance. Kitahara et al. [8,9] used pitch-dependent algorithms such as an F0-dependent multivariate normal distribution, where each element of the mean vector is a function of F0.

3 Database

The "RWC Music Database for Musical Instrument Sound" [7] is chosen for this work. In this database, each audio file contains the signal of a single instrument played as isolated notes. The use of isolated notes, that is, only one note at a time, has significant advantages for feature extraction: sophisticated acoustic descriptors are hardly calculable from a continuous flow of notes, which are likely to overlap. The database provides multiple recordings for each instrument: different manufacturers for the same instrument and different musicians took part in generating the recordings, providing a range of instrumental signatures. For each instrument, the musician plays each note individually at semitone intervals over the entire possible range of the instrument. For string instruments, the full range of each string is played. Dynamics are also varied, with forte, mezzo, and piano intensities. In this experiment, 9 instruments (with no duplicated pitch instances per instrument) were selected with various notes, for a total of 50 notes: accordion (reeds), acoustic guitar (steel string), electric guitar (string), clarinet (reeds), alto sax (brass), piano (string), violin (string), flute, and trumpet (brass).

4 Proposed system

4.1 Chromagram estimation

The chromagram is defined as the whole spectral audio information mapped into one octave, with each octave divided into 12 bins, one per semitone. The same strategy based on instantaneous frequency (IF), presented in [10], is adopted in this work to compute the chroma features. The audio signal, sampled at 11025 Hz, is split into frames of 1024 points overlapped by 512 points. The motivation behind the IF is to track only real harmonics.
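For concreteness, the following Python sketch shows how a chromagram with this frame layout can be computed. The paper follows the IF-based method of [10]; the simplified magnitude-based folding below is an assumption used only to illustrate the 12-bin mapping, and the function name and C4 reference pitch are illustrative.

    import numpy as np
    from scipy.signal import stft

    def simple_chromagram(x, fs=11025, n_fft=1024, hop=512, f_ref=261.63):
        """Fold STFT magnitudes into 12 pitch-class bins (one per semitone)."""
        f, t, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
        mag = np.abs(X)
        chroma = np.zeros((12, mag.shape[1]))
        for k, fk in enumerate(f):
            if fk < 50.0:                    # skip DC and very low frequencies
                continue
            # semitone distance from the reference pitch, folded into one octave
            pc = int(round(12 * np.log2(fk / f_ref))) % 12
            chroma[pc] += mag[k]
        return chroma / (chroma.max() + 1e-12)  # scale to [0, 1] for contouring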

4.2 Chromagram contour or chromatimbre estimation

Each two-dimensional chroma matrix has a time axis and a frequency-bin axis (semitone note). We use the contour function of MATLAB, which determines 10 contour levels by linear interpolation. Each tracked contour represents the intensity variation with respect to a fixed threshold, yielding a segmentation of the chromagram representation (image) into several regions. Hence, the contours delimiting the region frontiers give a description equivalent to the activity of the auditory acoustic scene. To deal with level variability, all contours are set to the same intensity, similar to transforming a color image to black and white. This binary encoding approximation is used simply to accelerate and facilitate this exploratory study. Fig. 1 and Fig. 2 illustrate both the chroma representation (left side) and the timbre description for the piano and flute instruments playing different tones. According to the geometrical shapes of the contours, it is clear that the chroma shows a large energy concentration in a small interval centred at bin number 4 for the two different notes (C4 and C6). The same effect is obtained with the piano playing the G4 and G6 pitch intervals, which confirms that chroma preserves mainly the tonal content. On the other hand, the contour representations (right side of Fig. 1 and Fig. 2) for the same instrument exhibit, beyond the tonal content, a particular pattern shape assumed to characterize timbre information. The illustrated pattern shapes seem to conserve the same geometrical properties when an instrument plays different notes. An additional illustration is given in Fig. 3, where 6 instruments are inspected. Further graphical analysis reveals several visual attributes that can be targeted to characterize instrument identity from the chromatimbre image. Pizzicato is especially clearly visible, and there is no ambiguity in distinguishing sustained instruments from pizzicato instruments. The chromatimbre of the accordion is especially easy to recognize because of its unique signature. However, the shapes of the chromatimbre are not trivial, and it would be difficult to enumerate all the characteristics each instrument can have. In addition, the chromatimbre can take many forms for a single instrument. Being a projection onto the time-frequency plane, the chromatimbre provides an overview of the envelope and the spectrum of the note. One can easily see in the chromatimbre pattern the spectral spread, amplitude and frequency modulations, attack time, and the sustain and release of the note. In this context, elaborating simplified methods to explore the potential of the chromatimbre approach is crucial.
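As a rough Python counterpart to the MATLAB contour step described above (a sketch, assuming scikit-image is available; its marching-squares routine stands in for MATLAB's linear-interpolation contouring), the 10 level curves can be rasterized into a single binary image, which also reproduces the black-and-white flattening used in this study:

    import numpy as np
    from skimage import measure

    def chromatimbre(chroma, n_levels=10):
        """Rasterize contour lines at 10 interior levels into one binary image."""
        lo, hi = chroma.min(), chroma.max()
        levels = np.linspace(lo, hi, n_levels + 2)[1:-1]   # 10 interior levels
        img = np.zeros(chroma.shape, dtype=np.uint8)
        for lv in levels:
            for contour in measure.find_contours(chroma, lv):
                rr = np.clip(np.round(contour[:, 0]).astype(int), 0, img.shape[0] - 1)
                cc = np.clip(np.round(contour[:, 1]).astype(int), 0, img.shape[1] - 1)
                img[rr, cc] = 1    # every level flattened to the same intensity
        return img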

4.3 Contour pattern normalization

The first challenge with contour parametrization is to deal with the time-scale variation (stretching and compression) and the frequency hopping and/or spread caused by playing different tones.

Fig. 1. Description by chromagram and the proposed chromatimbre: piano instrument playing C4 and C6 notes.

Fig. 2. Description by chromagram and the proposed chromatimbre: flute instrument playing C4 and C6 notes.

Hence, all chromatimbre descriptions are transformed to the Portable Gray Map (PGM) format, yielding a matrix of size 315x420. Here, the frequency octave is represented by 315 points and the time axis by 420 points, ensuring invariance to note-duration variability.
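A minimal sketch of this size normalization, assuming the binary image produced by the previous snippet (scikit-image's resize is an assumption; the paper specifies only the 315x420 target, and the PGM serialization is omitted as a file-format detail):

    import numpy as np
    from skimage.transform import resize

    def normalize_size(img, shape=(315, 420)):
        """Resize a binary chromatimbre image to the fixed 315x420 grid."""
        out = resize(img.astype(float), shape, order=0)  # nearest neighbour
        return (out > 0.5).astype(np.uint8)              # keep the image binary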

Fig. 3. Spectro-temporal descriptions transformed to the Portable Gray Map (PGM) format (315x420 matrices): inter-instrument variability of the chromatimbre.

The appearance of secondary contours represents a second challenge to take into account. Extracting the dominant contour can be done simply by first projecting the temporal contour variation onto the 12-bin tone axis to form a histogram (distribution), and then using both the zero crossings and the maximum value of the histogram to eliminate the secondary contour patterns, as sketched below.
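The following Python sketch illustrates this dominant-contour selection under the assumption that img is the binary chromatimbre image with the tone bins along the first axis; walking from the histogram peak out to the surrounding zero crossings is one plausible reading of the procedure described above.

    import numpy as np

    def dominant_contour(img):
        """Keep only the band of bins around the projection-histogram peak."""
        hist = img.sum(axis=1)          # projection onto the tone-bin axis
        peak = int(np.argmax(hist))
        lo = peak
        while lo > 0 and hist[lo - 1] > 0:         # walk down to a zero crossing
            lo -= 1
        hi = peak
        while hi < len(hist) - 1 and hist[hi + 1] > 0:  # walk up likewise
            hi += 1
        out = np.zeros_like(img)
        out[lo:hi + 1] = img[lo:hi + 1]  # secondary contours outside are dropped
        return out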

4.4 Preparation of training and test data and feature estimation

Assuming that the chromatimbre is characteristic of timbre, the entire pattern of contours is processed to extract suitable features representing instrument timbre identity. In the first case, only the contours localized in the attack phase of the signal are proposed as a template for identifying the timbre of the 9 musical instruments. The contours extracted for each instrument are represented by a matrix of 50x54 points; hence, no clustering algorithm is necessary for partitioning the data. This constitutes the first type of features proposed in this experiment. In the second case, the entire image is taken into consideration by splitting it with a rectangular window moved horizontally and vertically. Each fragment yields a small image of 50x54 points. The idea is to cover all phases of the note's dynamics (attack, release, and sustain). Images whose rate of contour points (on pixels) is less than 15% are ignored and considered silence images (see the sketch below). This preparation corresponds to the second type of features presented to the recognition systems. Notice that the first set of pattern features has size 9, one per instrument type. In the second case, the size is increased by more than 100%, which is substantial but allows more precision in following the evolution of the note's dynamics.
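A sketch of this fragmentation, assuming the normalized 315x420 binary image from sub-section 4.3; the paper does not state the window hop, so non-overlapping steps of one window size are assumed here.

    import numpy as np

    def fragment(img, h=50, w=54, min_rate=0.15):
        """Split a binary image into 50x54 patches, dropping near-empty ones."""
        patches = []
        for r in range(0, img.shape[0] - h + 1, h):
            for c in range(0, img.shape[1] - w + 1, w):
                p = img[r:r + h, c:c + w]
                if p.mean() >= min_rate:   # below 15% on pixels: silence image
                    patches.append(p)
        return patches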

4.5 Model classification

At the classification and decision level, four strategies are examined for the proposed chromatimbre features (see sub-section 4.4), with or without vector normalization, as follows:

• Template matching: compute the cross-correlation between the unknown input pattern and the reference pattern of each note. The maximum correlation computed over all reference patterns determines the winning instrument (see the sketch after this list). Here, one pattern image (the instrument's chromatimbre) is memorized as a prototype, for a total of 9. As an example, Fig. 4 illustrates the correlation between all 50 music instrument notes. Notice that the diagonal is set to zero in order to highlight the inter-instrument similarity measures.

• Invariant moments: the statistical moments, widely used in image processing and assumed to be invariant under translation, scale, and rotation, are computed for each instrument pattern and stored as codewords [11]. Each codeword is a vector composed of the seven statistical invariant moments. A nearest-neighbour (k-NN) algorithm makes the classification decision. The same training data as in the template matching method are used.

• Support vector machine (SVM): this technique has shown strong performance in recognition systems [12]. Dichotomy (two-class) classification, following the original formulation of the SVM, is used in this work with a quadratic programming algorithm.

• Neural network (NN): a feed-forward network with 200 cells in the hidden layer is investigated as another tool for chromatimbre identification and classification.
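The sketch below illustrates the first two classifiers under stated assumptions: peak normalized cross-correlation for template matching (via SciPy), and Hu's seven invariant moments via OpenCV as one possible implementation of [11]; the prototype handling and function names are illustrative, not the authors' code.

    import numpy as np
    import cv2
    from scipy.signal import correlate2d

    def match_template(pattern, prototypes):
        """Return the index of the prototype with maximal cross-correlation."""
        scores = []
        for proto in prototypes:
            # full 2-D correlation; illustrative only, O(N^2) per pair
            c = correlate2d(pattern.astype(float), proto.astype(float), mode='same')
            norm = np.linalg.norm(pattern) * np.linalg.norm(proto) + 1e-12
            scores.append(c.max() / norm)
        return int(np.argmax(scores))

    def hu_distance(a, b):
        """Euclidean distance between log-scaled Hu moment vectors."""
        ha = cv2.HuMoments(cv2.moments(a.astype(np.float32))).ravel()
        hb = cv2.HuMoments(cv2.moments(b.astype(np.float32))).ravel()
        la = -np.sign(ha) * np.log10(np.abs(ha) + 1e-30)
        lb = -np.sign(hb) * np.log10(np.abs(hb) + 1e-30)
        return float(np.linalg.norm(la - lb))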

In both cases, SVM and NN, the second feature set, with many more samples, is used in the training and testing sessions, as sketched below. Recall that our ambition is principally to explore the meaningful information conveyed by the chromatimbre to identify the musical instrument.
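For the SVM and NN, scikit-learn models are shown here as stand-ins for the quadratic-programming SVM and the 200-cell feed-forward network described above; patches and labels are hypothetical variables holding the second feature set and its per-patch instrument labels.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier

    X = np.array([p.ravel() for p in patches], dtype=float)
    y = np.array(labels)                        # one instrument label per patch

    svm = SVC(kernel='linear', decision_function_shape='ovo')  # pairwise dichotomies
    nn = MLPClassifier(hidden_layer_sizes=(200,), max_iter=500)
    svm.fit(X, y)
    nn.fit(X, y)
    print(svm.predict(X[:1]), nn.predict(X[:1]))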

5 Results

As mentioned previously, the invariant moments and template matching techniques used the first feature set, composed of little data (one pattern per class), in the training session. The remaining classification techniques employed the second feature set, with much larger training data. In all training sessions, only the nine different instruments were presented, but in the testing session 50 notes originating from the nine instruments were used.

Fig. 4. Cross-correlation matrix between the 50 different instrument tones (9 families).

The result obtained by the invariant moments was the lowest score; this technique was therefore set aside without seeking any improvement. On the other hand, template matching based on cross-correlation gave the highest score (69.7%) using only a few normalized features. Comparable scores were obtained with the SVM (69.4%) and NN (67.7%) classifiers, although both systems were trained with more than 100% more data than the template matching technique. Details of all the performance results are given in Table 1, where the NN1 column indicates training data patterns presented sequentially from each contour image, while in the NN2 column the input data were selected randomly. Lines I1 to I9 correspond to the performance recorded for the following instruments: I1 = flute, I2 = piano, I3 = trumpet (brass), I4 = violin (string), I5 = accordion (reeds), I6 = clarinet (reeds), I7 = alto sax (brass), I8 = acoustic guitar (steel string), and I9 = electric guitar (string). The clarinet was observed to be the trickiest instrument to recognize using template matching. The SVM and NN recognize some instruments perfectly, to the detriment of others. The use of a larger database would probably enhance the scores. However, template matching based on cross-correlation remained the most efficient, due to its almost balanced performance across the different instruments and principally because it uses a complete pattern of the instrument's chromatimbre. The uniform pattern fragmentation process may lose the temporal coordination of the original image, and the approximations used can disregard details often useful for discriminating between instruments with similar timbre properties. Finally, note that on the training data all the proposed systems obtained a perfect score.

Table 1. Recognition rates obtained with the SVM, cross-correlation, and feed-forward neural network (NN) classification techniques: NN1 is trained with data presented sequentially, while NN2 is trained with data presented randomly.

With the database used, some instruments had clearly distinctive chromatimbres. This was especially true of the guitar, the piano, and the accordion, since their excitation source is much less variable than that of mouth-blown wind instruments, for which the musician can easily produce many more intensity variations and sustained amplitude modulations. The clarinet and saxophone have rather similar shapes (both reed instruments), particularly when calculated with only 12 bins. Using more bins in the calculation of the chromagram makes it easier to distinguish inharmonicity and the frequency shifts of the attack. For example, the clarinet seemed to always have an upward frequency shift in the attack and a downward frequency shift upon release of the note, perhaps due to reed transients. With more bins, the clarinet was easily distinguishable, at least visually. The impact of the number of bins on classification is that more bins reveal significant distinctive details but make it harder to extract a single contour, since the chromatimbre shape spreads over multiple bins. Ultimately, an efficient system using chromatimbre features will have to use more than 12 bins per octave when seeking contours. Characterization of timbre by the chromatimbre (contour representation) appears to be a very promising approach that can be refined in the future with more robust encoding approaches for the contour representation.

6 Conclusion

The main research topic of this work is to present a set of new features to identify music timbre. The features are extracted from the contours of the chromagram after some post-processing and transformation. Chroma is considered to convey mainly tonal content and has been used as a powerful representation in many music information retrieval applications. The proposed shape-based chromatimbre contours provide an interesting alternative representation allowing the identification of both the tonal content and, in particular, the instrument timbre (identity). In addition, no concatenation of descriptors is necessary; the processing is very simple and suitable for real-time implementation. The invariant moments, template matching, support vector machine, and feed-forward neural network have been proposed for the classification task. Contrary to what is usually done in practice, only 18% of the data were used during the learning phase. Despite the small amount of training data and the several approximations applied to the original image, the performance scores remain close to 70% for the three retained techniques. In particular, template matching based on cross-correlation and the SVM give the best performance. The SVM classification operates by dichotomy, which presents the additional challenge of finding an optimal arrangement of the class sequences; in our case, a class order was adopted randomly. The results are very promising, given that the visual graphical analysis also convincingly shows the relevance of the proposed approach. However, several challenges remain regarding effective processing to extract and encode the contours.

References

1. J. W. Beauchamp. Time-variant spectra of violin tones. Journal of the Acoustical Society of America, 56(3):995–1004, 1974.

2. M. D. Freedman. Analysis of musical instrument tones. Journal of the Acoustical Society of America, 41(4A):793–806, 1967.

3. J. M. Grey. Multidimensional perceptual scaling of musical timbres. Journal of the Acoustical Society of America, 61(5):1270–1277, 1977.

4. K. D. Martin and Y. E. Kim. Musical instrument identification: A pattern-recognition approach. Presented at the 136th meeting of the Acoustical Society of America, 1998.

5. A. Eronen. Automatic Musical Instrument Recognition. Master's thesis, Department of Information Technology, Tampere University of Technology, Tampere, Finland, 2001.

6. G.-E. Hall, H. Ezzaidi, and M. Bahoura. Hierarchical parametrization and classification for instrument recognition. In the 11th International Conference on Information Science, Signal Processing and their Applications (ISSPA), pages 1066–1071, Montreal, Canada, 2-5 July 2012.

7. M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC Music Database: Music Genre Database and Musical Instrument Sound Database. In the 4th International Conference on Music Information Retrieval (ISMIR 2003), pages 229–230, Baltimore, Maryland, 26-30 October 2003.

8. T. Kitahara, M. Goto, and H. G. Okuno. Pitch-dependent identification of musical instrument sounds. Applied Intelligence, 23:267–275, 2005.

9. T. Kitahara, M. Goto, and H. G. Okuno. Musical instrument identification based on F0-dependent multivariate normal distribution. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), volume 5, pages 421–424, 6-10 April 2003.

10. D. Ellis. Classifying music audio with timbral and chroma features. In the 8th International Conference on Music Information Retrieval (ISMIR 2007), pages 339–340, Vienna, Austria, 23-30 September 2007.

11. M. K. Hu. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, 8(2):179–187, 1962.

12. V. N. Vapnik. Statistical learning theory. John Wiley & Sons, Inc., New York, NY,1998.