
Audio Engineering Society

Convention Paper
Presented at the 130th Convention
2011 May 13–16 London, UK

The papers at this Convention have been selected on the basis of a submitted abstract and extended precis that have been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

A comprehensive and modular framework for audio content extraction, aimed at research, pedagogy and digital library management

Olivier Lartillot 1

1 Finnish Centre of Excellence in Interdisciplinary Music Research, University of Jyväskylä, 40014, Finland

Correspondence should be addressed to Olivier Lartillot ([email protected])

ABSTRACT
We present a framework for audio analysis and the extraction of low-level features, mid-level structures and high-level concepts, altogether studied as a fully interwoven complex system. Composite operations are constructed via an intuitive programming language on top of Matlab. Datasets of any size can be processed thanks to implicit memory management mechanisms. The data structure enables a tight articulation between signal and symbolic layers in a unified framework. The resulting technology can be used as a pedagogical tool for the understanding of audio, speech and musical processes and concepts, and for content-based discovery of digital libraries. Other applications include intelligent browsing and structuring of digital libraries, information retrieval, and the design of content-based audio interfaces.

This paper introduces an open-source project for signal processing that includes a new, simple and adaptive syntactic layer on top of the Matlab environment for the control of high-level operators. The MiningSuite is a complete redesign of our previous framework MIRtoolbox [1], initially focused on the domain of Music Information Retrieval, that can now be applied to extra-musical domains as well.

This software tool is the product of an ongoing research project, presented in section 1, related to the extraction of three layers of content from audio and music. The comprehensive scope of the investigation requires a highly modular methodological framework, which is discussed in section 2. The main characteristics of the framework are described in section 3 and examples of operators and their applications are given in section 4. Particular memory management capabilities are discussed in section 5


and concrete application areas are outlined in section 6. The framework is offered to the scientific community as an open-source project, as explained in section 7.

1. A COMPREHENSIVE FRAMEWORK

We present a general framework for the analysis of audio recordings aimed at the extraction of content, applied in particular to music analysis, which we propose to locate along three layers of representation.

• On the lowest layer, a large range of descriptions of sounds (mainly timbral, but also simple representations of rhythm and tonality, for instance) are based on standard signal processing operations (spectral analysis, envelope extraction, etc.), as shown in the "features" part of Table 1.

• On the middle layer, the signal is structurally organised through the emergence of symbolic events (notes, speech utterances, etc.) as well as more elaborate structures, such as groups of events, phrases, structural parts, etc. In the current version of our released framework, this corresponds to event detection (events and notes in the "structures" part of Table 1) and to the use of a novelty curve (novelty) based on post-processing of a similarity matrix (simatrix). Significant research is being carried out on the development and integration of innovative methods of structural analysis.1

• Finally, high-level concepts are inferred, such as musical genres, speech styles, or other semantic categories, as well as emotional classes. The current version of the framework includes emotion [3, 4].

The large scope of this framework (and of the underlying research study) is, in our view, generally beneficial for the problem of content extraction, and even necessary. Indeed, offering a general overview of techniques for content extraction makes it possible to build

1 The MiningSuite will also include PatMinr, a package dedicated to the automated detection of pattern repetitions in symbolic sequences [2], which will be tightly integrated with the other packages presented in this paper.

a comprehensive state of the art, and fosters the improvement and combination of approaches. Besides, we support the opinion that a detailed understanding of music in particular requires careful consideration of the complex interdependencies between the various representational dimensions, such as pitch, rhythm or structure. The proposed framework enables a complete modelling of such complex interdependencies.

Whereas the extraction of low-level features can often rely on purely signal processing methods, the determination of middle-level structures and high-level concepts requires an interdisciplinary collaboration between psychoacoustics, cognitive science (in particular, auditory scene analysis, in order to make explicit the rules governing the emergence of notes, complex polyphonies and structures), artificial intelligence, social science and neuroscience.2

2. A MODULAR ARCHITECTURE

In the proposed framework, analytical processes are conceived, not as a succession of low-level commands, but as a flowchart composed of high-level customisable operators. Corresponding to distinct and clearly defined representations of signal, sound and music, these building blocks can be combined in many ways and offer a variety of options.

As shown in Table 1, operators are organized into packages corresponding to separate domains of study: signal processing (SigMinr), audio analysis and auditory modeling (AudiMinr) and music analysis (MusiMinr).

2.1. A Coherent Integration of Expertise

Certain operators offer separate expertise in each particular domain, which can be combined:

2 For instance, a current collaboration with experts in intercultural music studies (in particular, Mondher Ayari, University of Strasbourg) and in music perception and cognition (in particular, Stephen McAdams, McGill University, and Petri Toiviainen, University of Jyväskylä) has among its objectives to reveal the complex interactions between the perception of structural aspects of music and the cultural background. This collaborative project, called Creativity, Music, Culture, is funded by the French research agency ANR for the years 2011–2013.


             SigMinr      AudiMinr     MusiMinr
features     input        input        input
             spectrum     spectrum     spectrum
             filterbank   filterbank   filterbank
             zerocross    brightness
             flux         roughness
             stat         mfcc
             cepstrum     pitch        pitch
             envelope     envelope     tempo
             peaks                     key, mode
structures   events       events       notes
             simatrix     attack       beats
             novelty                   tempo
                                       key, mode
concepts                  emotion      emotion

Table 1: Overview of operators in The MiningSuite.

• input simply loads the data from an input file, and offers filtering tools to select particular temporal regions, particular channels, etc. SigMinr accepts various audio formats such as WAV or MP3, whereas MusiMinr can load symbolic music representations such as MIDI files.

• spectrum displays the distribution of energy along frequencies (see the sketch after this list). In the SigMinr module, it corresponds to the FFT operation but also to the autocorrelation function3; the AudiMinr module integrates perceptual modeling such as Terhardt's outer-ear filtering [5], resonance curves that emphasize frequencies that are more easily perceived [6], auditory scales such as Mel and Bark bands, and masking effects in critical bands; the MusiMinr module considers the decomposition of the energy into cents and along musical scales, corresponding to a chromagram representation.

• filterbank decomposes the temporal signal into frequency bands. The AudiMinr module offers particular decompositions that correspond to auditory modelling, such as Gammatone filters [7]. The MusiMinr module adds particular filterbank decompositions that optimize musically oriented operations, such as note pitch extraction.

3 Cf. section 4 for some more technical details.

• envelope extracts the amplitude envelope of the signal, with the help of a large range of options such as the Hilbert transform, down- and upsampling, various low-pass filters, spectrogram, wave rectification, logarithm, differentiation, etc. AudiMinr offers some auditory models: for instance, the decomposition of the input signal into a bank of Gammatone filters, followed by envelope extraction on each band and summation across bands; or particular auditory mechanisms such as the mu-law [8]. The envelope is used in AudiMinr for event detection and in MusiMinr for beat and tempo estimation and note detection.

• emotion evaluates the emotional content of the recordings along various emotional dimensions and classes. We built a model based on musical recordings [3], but it seems that a part of the emotional content stems from extra-musical audio characteristics (timbral, energetic, etc.), so a version for the AudiMinr module is under investigation.
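As a minimal illustration of this layered expertise, the same operator name can be requested from each package; the calls below are inferred from Table 1 and the descriptions above, and the comments summarize rather than specify the exact default behaviours:

a = sig.spectrum('myfile')   % plain signal-processing spectrum (FFT)
b = audi.spectrum('myfile')  % adds auditory modeling (outer-ear filtering, auditory scales)
c = musi.spectrum('myfile')  % musical decomposition into cents (chromagram)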

2.2. Advantages of Modularity

The modular conception of signal processing offers particular advantages:

• The design of new analytical processes can be conceived directly as a succession of high-level operators, where minor low-level technical considerations can be hidden, thus saving a significant amount of effort; many MIRtoolbox users have reported being spared the development of redundant code.

• This fusion of pluridisciplinary scientific expertise yields rich operators with a large collection of options.

• This schematism stimulates a modular vision of analytic processes and the development of high-level operators. For instance, the integration of spectral flux required the design of a particular operator for flux operations, which turned


out to be fruitful in the conception of advanced musical analytical processes.

3. MAIN DESIGN PRINCIPLES

The operators are available via an innovative language as an additional layer on top of the Matlab language. The maximal simplicity of its syntax helps users to concentrate on the chain of operations leading to the desired analytical process. The low-level technical considerations can be modified via options associated with those operators, or simply ignored. The output data encapsulates all required technical details, such as the sampling rate.

3.1. An Intuitive and Flexible Language

In The MiningSuite syntax, any command consists of the name of an operator followed, as arguments, by the name of the input (a file name, with or without its extension), for instance:

sig.input('filename.wav')

and by keywords related to options, when desired:

sig.input('filename','Sampling',44100)

An operator can be applied automatically to all valid files in a given directory by simply writing 'Folder' as the first argument, or even to subdirectories recursively, keeping track of the complete directory structure, by using the 'Folders' keyword.
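For instance, a batch analysis over a directory tree might read as follows (assuming mfcc accepts the same generic keywords as the other operators):

a = audi.mfcc('Folder')   % all valid audio files in the current directory
b = audi.mfcc('Folders')  % same, recursing into subdirectories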

The output can be stored in a variable, which can be sent as input to another operator, and so on:

a = sig.input('myfile')
b = sig.spectrum(a)
c = sig.cepstrum(b)

Instead of specifying a list of operations, it is possible to call only the final operator corresponding to the desired output:

sig.cepstrum('myfile')

In order to perform multiple post-processing operations, a single operator can be called several times successively:

a = sig.spectrum('myfile')
b = sig.spectrum(a,'Max',5000)
c = sig.spectrum(b,'Log')

or only once, with all the options enumerated as successive keywords (in any order):

sig.spectrum('myfile','Max',5000,'Log')

By threading successive operations, complex flowcharts returning multiple outputs can be designed easily, enabling a factorization of operations and the suppression of redundant computation:

a = sig.spectrum('myfile')
b = audi.brightness(a)
c = audi.mfcc(a)

3.2. Data Decomposition

Various methods for decomposing and recombining signals are unified into a single framework.

3.2.1. Frame Decomposition

Frame decomposition is performed by simply adding the 'Frame' keyword as an additional argument of the operator where this decomposition needs to be performed, for instance when calling a single operator:

musi.tempo('myfile','Frame')

In this case, the decomposition follows the default frame configuration associated with tempo estimation. These parameters can be changed as well, for instance for a frame size of 3 seconds and a hop of 1 second:

musi.tempo('myfile','Frame',3,'s',1,'s')

The decomposition process is implicitly integrated into the operators at the most suitable places. For instance, analyzing the temporal evolution of tempo requires a frame decomposition after the envelope extraction.


Any operator called with an input argument that has been previously frame-decomposed is automatically applied to each frame separately. Hence the previous operation is roughly equivalent to the following script:

a = audi.envelope('myfile',...
    'Frame',3,'s',1,'s')
b = musi.tempo(a)

It is also possible to combine several layers of frame decomposition, where large frames (for instance, 5 seconds long) are further decomposed into small frames (say, via a .1-second-long, half-overlapped spectrogram):

a = sig.input('myfile','Frame',5,'s')
b = sig.spectrum(a,...
    'Frame',.1,'s',.05,'s')
c = sig.flux(b)
d = sig.stat(c)

In that example, the statistics (stat) are computed within each large frame separately.

Equivalently, the large frames can result from a recombination of small frames:

a = sig.spectrum('myfile','Frame')
b = sig.flux(a)
c = sig.stat(b,'Frame',5,'s')

3.2.2. Channel Decomposition

Any operator called with an input argument that has been previously decomposed into separate channels using filterbank is automatically applied to each channel separately.

a = audi.filterbank('myfile')
b = audi.envelope(a)
c = sig.sum(b)

In the following example, each channel contains independent notes, such that any further operator will be applied to each note of each channel separately. For instance, the pitch content can be estimated for each separate note of the previous example, and the result can be saved in a multi-channel MIDI file.

a = audi.filterbank('myfile')
b = musi.note(a)
c = musi.pitch(b)
musi.save(c,'output.mid')

3.3. Multi-Layer Representation

3.3.1. Symbolic Inference

The MiningSuite includes a set of analytical tools that highlight and select particular points in the input signal:

• peaks features a large range of peak-picking options. These have usually been developed for particular domains (periodicity analysis, envelope extraction, spectral analysis, etc.), but their integration into a single interdisciplinary module enables, once again, a productive sharing of knowledge between research communities. In particular, peaks offers the possibility of tracking peaks along successive frames4 (see the sketch after this list).

• events highlights particular temporal regions corresponding to candidate events, based on various strategies (silence, local discontinuities, novelty, etc.). A complex event can be decomposed into a series of sub-events.

• Whereas events is a general operator integrated in AudiMinr, notes is a specialization in MusiMinr for the detection and characterization of elementary musical events. A note can be decomposed into sub-events, corresponding for instance to its attack, sustain and release phases.
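The following sketch illustrates how these operators could be chained; the 'Track' option name is hypothetical, and the operator names follow Table 1:

a = sig.spectrum('myfile','Frame')
b = sig.peaks(a,'Track')    % hypothetical keyword: track peaks across frames
c = audi.events('myfile')   % candidate events from the audio signal
d = musi.notes(c)           % musical specialization: note detection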

3.3.2. Symbolic Layers

The complex network of information produced by these analytical operators forms one or several symbolic layers superposed on top of the original audio layer. These symbols (events, notes, etc.) point to particular temporal regions of the signal and are described by a set of parameters.

In the musical context, these layers can be organized, for instance, in the following way:

4 Cf. section 4 for some more technical details.


• A note layer containing note events. A MIDI file, for instance, once loaded using musi.input, is translated into a symbolic layer containing notes, without any audio layer underneath. The symbolic layer can be either a simple succession of events, or a far more complex graph where events enter into relations of temporal succession and/or superposition, and where branches represent channels of variable durations. In the note layer, this corresponds to the musical concept of polyphony, in its most general sense.

• A metric layer containing a hierarchy of beats, i.e., pulsations.

• A structure layer containing a complex configuration – not necessarily hierarchical – of structures encompassing notes, structures of structures, etc.

3.3.3. Articulation between Signal and Symbolic Layers

A large range of signal processing methods included in the framework are generalized to this heterogeneous and multi-modal data representation. In MIRtoolbox, audio could be segmented into a sequence of successive non-overlapping parts, onto which further analyses could be carried out automatically. In The MiningSuite, this is generalized by allowing not only segments but any construction of symbolic events related to particular temporal regions of the input signal. For instance, a musical recording can be analyzed by focusing particularly on the parts of the signal related to actual notes, and discarding transient events.

This data structure also allows tight interconnections between several input signals, such as a purely symbolic representation of a piece of music (a score) and one or several recordings of the same piece. The interconnection can consist of an alignment of the note or metric layers of the input signals. This will enable, for instance, the study of temporal fluctuations across various musical performances, or more general comparisons, and advanced analyses of the audio input guided by the referential score.

3.4. Observation of the Results

Any result of the analysis process (the final output or any intermediary step) can be immediately visualized by a graphical display of its content within a dedicated figure window. A large range of outputs can also be directly sonified using a play method. This makes it possible to hear, in particular, the successive frames of a moving window, the separate channels of a filterbank decomposition, the successive segments of a segmentation, or the shape of an envelope (modulating white noise), as well as pitch, beats, chromagram, etc. A database can also be quickly browsed in the form of audio snippets played in ascending order of a specified extracted feature.
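For instance, a sonification session might look as follows (the play call follows the textual description above; its exact signature is not detailed in this paper):

a = audi.filterbank('myfile')
b = audi.envelope(a)
play(b)   % hear each channel's envelope modulating white noise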

Statistical and data-mining post-processing operations can also be applied directly, using dedicated operators or through export to other software.

3.5. Quality Checking

Each operator in The MiningSuite keeps a record of the list of operations successively performed on the input. In this way, it is possible for each stored result to trace back the complete list of operations, for verification and quality checking purposes.

As part of the development quality requirements, each operator in The MiningSuite includes a range of tests checking whether the operator is used coherently and reasonably connected to the whole flowchart. Tests also check that the input data fulfill particular requirements ensuring the good quality of the analysis. Warning and error messages are displayed if the quality of the result cannot be assured. For instance, tonal analysis in music requires a sufficiently high frequency resolution of the spectral decomposition over a particular range of frequencies. If the input data does not meet this constraint, compromises are automatically made and a specific warning message is displayed.

4. EXAMPLES OF OPERATORS

4.1. Examples of Signal Processing Operators

The computation of the autocorrelation function (one method integrated in the spectrum operator) can benefit from improvements developed in various areas. Side-border distortion can be suppressed by dividing the autocorrelation by the autocorrelation of its window (preferably Hanning) [9]. A magnitude compression of the amplitude decreases the width of the peaks in the autocorrelation curve, suitable for instance for multi-pitch extraction [10]. The subharmonics implicitly included in the autocorrelation function can be tentatively suppressed on the wave-rectified output by subtracting time-scaled versions of the output. The standard normalization of the autocorrelation function (which forces a value of 1.0 at zero lag) requires, for a multi-channel input, taking into account the relative amount of energy along channels.
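As a minimal raw-Matlab sketch of the window compensation described above (not the actual MiningSuite implementation; x is assumed to be one frame of the signal, as a column vector):

N = length(x);
w = hanning(N);               % Hanning window, as recommended above
rx = xcorr(x .* w, 'coeff');  % normalized autocorrelation of the windowed frame
rw = xcorr(w, 'coeff');       % autocorrelation of the window itself
r = rx(N:end) ./ rw(N:end);   % divide out the window taper [9], non-negative lags
% only lags well below N are reliable, since rw vanishes near the maximal lag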

Fig. 1: Peak picking and tracking.

Peak picking (the peaks operator) can use an adaptive threshold: a given local maximum will then be considered as a peak if its distance from the previous and successive local minima (if any) is higher than a specified threshold, expressed with respect to the total amplitude of the input signal. This method proves to offer quite reliable results in many applications, and has been used extensively in MIRtoolbox. It is also possible, for instance, to automatically extract the lobe of each peak and compute statistical moments (centroid, spread, etc.) for each of these peaks separately. In frame-decomposed data, peaks can be tracked along time, by connecting successive peaks that are sufficiently aligned (cf. Figure 1). This method, initially developed for speech analysis [11], was first used in MIRtoolbox for the tracking of spectral harmonics. We recently improved the method by allowing gaps between successive peaks, and generalized it to the creation of a whole graph of connections between peaks.

Fig. 2: Similarity matrix and corresponding novelty curve.

A signal can be automatically segmented into a series of homogeneous sections (considered as one possible kind of event) through the estimation of discontinuities in the temporal evolution of particular features. This is estimated as follows: first of all, feature-based distances between all possible frame pairs are stored in a similarity matrix (simatrix, cf. Figure 2). Convolution along the main diagonal of the matrix using a Gaussian checkerboard kernel yields a novelty curve that indicates the temporal locations of significant textural changes [12]. Peak detection returns the temporal positions of feature discontinuities, which can be used for the actual segmentation of the audio sequence.
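This computation can be sketched compactly in raw Matlab (assuming S is an n-by-n similarity matrix already computed by simatrix; the kernel half-size L is a free parameter):

L = 32;                               % kernel half-size, in frames
g = exp(-4 * linspace(-1,1,2*L).^2);  % Gaussian taper
K = (g' * g) .* kron([1 -1; -1 1], ones(L));  % Gaussian checkerboard kernel [12]
n = size(S,1);
novelty = zeros(1,n);
for i = L+1 : n-L                     % slide the kernel along the main diagonal
    novelty(i) = sum(sum(S(i-L:i+L-1, i-L:i+L-1) .* K));
end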

4.2. Examples of Auditory Modeling Applications

Timbral characterization of sounds includes the description of the attack of events (for instance through the computation of the attack slope in the envelope curve, cf. Figure 3), of sound brightness [13], the computation of Mel-Frequency Cepstral Coefficients (mfcc), the characterization of roughness based on beating effects between components close in frequency [14], etc.

4.3. Examples of Music Analysis Applications

Concerning rhythmic analysis, tempo is estimated through the estimation of periodicities in the amplitude envelope curve5.

5 This envelope curve can be computed both from the audio signal and from a note representation [6].


Fig. 3: Envelope curve with events and their related attack phases.

Particular frequency regions that are more easily perceived are emphasized by applying a cognitively-based resonance curve. The same analysis can be performed directly from symbolic data by summing Gaussian kernels located at the onset points of each note, the height of each Gaussian kernel being proportional to the duration of the respective note. In both cases, diverse descriptions of the resulting autocorrelation curve (global minimum or maximum, kurtosis of the main peak, etc.) lead to an assessment of pulse clarity [15].
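The summation of Gaussian kernels for the symbolic case can be sketched as follows (raw Matlab, not the actual MiningSuite code; onsets and durations are vectors in seconds, and the sampling rate and kernel width are illustrative choices):

fs = 100;                        % onset-curve sampling rate (Hz)
t = 0 : 1/fs : max(onsets) + 2;  % time axis
sigma = 0.05;                    % Gaussian kernel width (s)
env = zeros(size(t));
for k = 1:length(onsets)         % one kernel per note onset, with height
    env = env + durations(k) * exp(-(t - onsets(k)).^2 / (2*sigma^2));
end                              % proportional to the note duration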

Concerning tonal analysis in the audio domain, the spectrum is converted from the frequency domain to the pitch domain by applying a log-frequency transformation. This distribution of energy along pitches, called a "chromagram" and included in musi.spectrum, is then wrapped, showing the distribution of energy with respect to the twelve possible pitch classes (C, C#, D, etc.). The tonality is assessed by comparing, through cross-correlation, the chromagram to a theoretical pitch distribution associated with each possible tonality [16]. The most prevalent tonality is considered to be the key candidate. A richer representation of the tonality estimation can be drawn with the help of a self-organizing map (SOM), trained on the 24 tonal profiles. The key is then estimated by projecting the wrapped chromagram onto the SOM [17] (cf. Figure 4).
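The profile-matching step can be sketched as follows (raw Matlab; chroma is the wrapped 12-bin chromagram and profile a 12-bin key template such as those used in [16]):

scores = zeros(1,12);
for k = 0:11
    p = circshift(profile(:), k);  % transpose the template to each candidate key
    r = corrcoef(chroma(:), p);
    scores(k+1) = r(1,2);          % correlation with the chromagram
end
[~, best] = max(scores);           % index of the most prevalent key candidate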

5. MEMORY MANAGEMENT

Datasets of any size can be processed thanks to implicit memory management mechanisms.

5.1. Dataflow Design and Evaluation

Fig. 4: Self-organized map related to key structures.

The input of an operator, of a series of operations, or of a more complex flowchart need not necessarily be connected to one particular input (a file, a set of files, etc.), but can be stored as an abstract flowchart design, which can subsequently be applied to any input data. The abstract flowchart is progressively constructed using the same minimalist syntax presented in section 3.1. The only difference is that instead of specifying a specific input file, the keyword 'Design' is used. For instance:

a = sig.spectrum('Design','Frame')
b = sig.struct
b.brightness = audi.brightness(a)
b.mfcc = audi.mfcc(a)

The flowchart (here, b) can then be evaluated on a particular file or on folder(s) of files:

sig.eval(b,'Folders')

One particular interest (and importance) of this approach is to allow the application of such flowcharts to large sets of audio files, or to long single audio files. The decomposition of the input into chunks of reasonable size, as well as the recombination of the results of these successive chunks, is automatically taken care of by the underlying code. Thanks to these implicit memory management processes, datasets of any size can be processed automatically.

5.2. Architecture Details

The adaptiveness of the operator syntax, the acceptance of various input types, as well as the capability to analyze big databases and long audio files without memory overflow, impose a clarification and unification of the operator code structure, which is divided into three main phases:

AES 130th Convention, London, UK, 2011 May 13–16

Page 8 of 10

Page 9: MiningSuite paper presented at 130th AES Convention, London

Lartillot A comprehensive and modular framework for audio content extraction

• An initialization phase connects the given operator to the general dataflow: it unrolls the series of preliminary operators implicitly called by the operator, before the specific set of operations strictly related to the operator itself. For instance, tempo('myfile') unrolls the list of preliminary operations necessary for tempo estimation: envelope, spectrum, etc.

• The core operations perform the essential aspects of the operator, which generally reduce the amount of data available from the input, and output the data in the expected new format.

• The post-processing operations apply a number of additional operations to the output data.

This distinction is all the more important when considering the analysis of long audio files, which requires the decomposition of the input data into successive chunks:

• The initialization makes it possible to draw the complete dataflow, which is used as a basis for specifying the characteristics of the chunk decomposition, given the underlying memory requirements and the available resources.

• Each successive chunk is passed to the core operations, where significant data reductions usually take place. The output can then be recombined chunk after chunk, through concatenation, summation, averaging, etc., depending on the operator, forming a single output signal in the end.

• The post-processing operations can then be performed directly on the whole data.
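Conceptually, the chunk mechanism behaves like the following hand-written loop (a simplified sketch, not the actual internals: read_chunk, core_operation and post_process are hypothetical placeholders, f and nchunks are assumed given, and the recombination rule depends on the operator):

out = [];
for c = 1:nchunks
    x = read_chunk(f, c);   % load one chunk of the long input file
    y = core_operation(x);  % core phase: strong data reduction
    out = [out, y];         % recombination (here, concatenation)
end
out = post_process(out);    % post-processing phase on the whole result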

The proposed memory management makes it possible to avoid the use of temporary files, which nevertheless remain available for the most demanding cases. For example, the possible use of zero-phase filtering in envelope extraction requires two scans of the input data, in both directions. Advanced techniques have been added to the framework, such as the correction of rounding errors that could occur if resampling were performed on each chunk separately.

6. APPLICATION AREAS

Three main application areas are considered.

6.1. Perceptive and Cognitive Modeling

First, from a purely scientific point of view, the resulting framework offers a detailed explanation of the perceptive and cognitive mechanisms underlying the perception of content from audio.

6.2. Pedagogical Tool

Secondly, the resulting tool has a large range of pedagogical uses (as we already experienced with the previous version of our platform, MIRtoolbox, with thousands of users from the academic world), offering a very intuitive language for experimentation with signal processing and with the extraction and visualisation of content of all kinds. This technology offers not only the possibility of understanding the nature of those types of content and the way they are extracted, but also rich perspectives on digital libraries along selected dimensions of representation, such as musical genres, emotions, musical concepts, etc.

6.3. Digital Library Management

A third application area concerns the use of the proposed framework for intelligent browsing and structuring of digital libraries, information retrieval, and the design of content-based audio interfaces.

7. THE MININGSUITE

The proposed framework, called The MiningSuite, is offered to the signal processing community as an open-source project with a source repository, discussion lists for users, developers and commits, wiki pages, etc.6 The Matlab environment offers advanced capabilities: currently, analyses can be performed in parallel on different processor cores and on clusters of computers; GPUs can be used as well. In future work, we plan real-time versions, platform-specific compiled versions, web services, etc.

The MiningSuite being an open-source project, the source code related to all these operators is freely

6 The project website can be accessed at the following address: http://code.google.com/p/miningsuite


available. A set of development quality requirements encourages developers to consider the building block represented by each operator as an actual white box, where the complete code is clearly structured and sufficiently commented, in order to foster close collaboration between developers and users via the open-source community network. This ensures better control of code errors and faster development of new features.

A Software Development Kit offers users the possibility of developing their own operators: metafunctions hide all the aforementioned complex mechanisms, so that operators can be designed and coded using very simple templates.
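Purely as an illustration of the template idea (every name below is hypothetical; the actual SDK metafunctions and template fields may differ):

function out = myoperator(varargin)
    % declare which operators this one implicitly builds upon
    opt.init = @(x) sig.spectrum(x, 'Frame');
    % core computation, applied chunk by chunk by the framework
    opt.main = @(s) sum(s.data .^ 2);
    % hand everything over to the (hypothetical) metafunction
    out = sig.operate('myoperator', opt, varargin{:});
end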

8. REFERENCES

[1] O. Lartillot and P. Toiviainen, "A Matlab toolbox for musical feature extraction from audio," presented at the 10th International Conference on Digital Audio Effects, Bordeaux, France, 2007 September 10–15

[2] O. Lartillot, "Multi-dimensional motivic pattern extraction founded on adaptive redundancy filtering," Journal of New Music Research, 34 (2005), no. 4, 375–393

[3] T. Eerola, O. Lartillot and P. Toiviainen, "Prediction of Multidimensional Emotional Ratings in Music From Audio Using Multivariate Regression Models," presented at the 10th International Conference on Music Information Retrieval, Kobe, Japan, 2009 October 26–30

[4] P. Saari, T. Eerola and O. Lartillot, "Generalizability and simplicity as criteria in feature selection: Application to mood classification in music," IEEE Transactions on Audio, Speech, and Language Processing, in press, TASL.2010.2101596

[5] E. Terhardt, "Calculating virtual pitch," Hearing Research, 1 (1979), 155–182

[6] P. Toiviainen and J. Snyder, "Tapping to Bach: Resonance-based modeling of pulse," Music Perception, 21 (2003), no. 1, 43–80

[7] R. D. Patterson et al., "Complex sounds and auditory images," in Auditory Physiology and Perception, edited by Y. Cazals et al., Oxford, 1992, 429–446

[8] A. Klapuri, A. Eronen and J. Astola, "Analysis of the meter of acoustic musical signals," IEEE Transactions on Audio, Speech and Language Processing, 14 (2006), no. 1, 342–355

[9] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," Institute of Phonetic Sciences Proceedings, 17 (1993), 97–110

[10] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, 8 (2000), no. 6, 708–716

[11] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on Acoustics, Speech and Signal Processing, 34 (1986), no. 4, 744–754

[12] J. Foote and M. Cooper, "Media Segmentation using Self-Similarity Decomposition," Storage and Retrieval for Multimedia Databases, SPIE Proceedings 5021 (2003), 167–175

[13] P. N. Juslin, "Cue utilization in communication of emotion in music performance: relating performance to perception," Journal of Experimental Psychology: Human Perception and Performance, 26 (2000), no. 6, 1797–1813

[14] W. A. Sethares, Tuning, Timbre, Spectrum, Scale, Springer-Verlag, 1998

[15] O. Lartillot, T. Eerola, P. Toiviainen and J. Fornari, "Multi-feature modeling of pulse clarity: Design, validation, and optimization," presented at the 9th International Conference on Music Information Retrieval, Philadelphia, USA, 2008 September 14–18

[16] E. Gómez, Tonal description of music audio signals, PhD thesis, Universitat Pompeu Fabra, Barcelona, 2006

[17] P. Toiviainen and C. Krumhansl, "Measuring and modeling real-time responses to music: The dynamics of tonality induction," Perception, 32 (2003), no. 6, 741–766
