
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 2, FEBRUARY 2011

Controlling the Perceived Material in an Impact Sound Synthesizer

Mitsuko Aramaki, Member, IEEE, Mireille Besson, Richard Kronland-Martinet, Senior Member, IEEE, and Sølvi Ystad

Abstract—In this paper, we focused on the identification of the perceptual properties of impacted materials to provide an intuitive control of an impact sound synthesizer. To investigate such properties, impact sounds from everyday life objects, made of different materials (wood, metal and glass), were recorded and analyzed. These sounds were synthesized using an analysis–synthesis technique and tuned to the same chroma. Sound continua were created to simulate progressive transitions between materials. Sounds from these continua were then used in a categorization experiment to determine sound categories representative of each material (called typical sounds). We also examined changes in electrical brain activity [using the event related potentials (ERPs) method] associated with the categorization of these typical sounds. Moreover, acoustic analysis was conducted to investigate the relevance of acoustic descriptors known to be relevant for both timbre perception and material identification. Both acoustic and electrophysiological data confirmed the importance of damping and highlighted the relevance of spectral content for material perception. Based on these findings, controls for damping and spectral shaping were tested in synthesis applications. A global control strategy, with a three-layer architecture, was proposed for the synthesizer, allowing the user to intuitively navigate in a "material space" and define impact sounds directly from the material label. A formal perceptual evaluation was finally conducted to validate the proposed control strategy.

Index Terms—Analysis–synthesis, control, event related potentials, impact sounds, mapping, material, sound categorization, timbre.

I. INTRODUCTION

THE current study describes the construction of a synthesizer dedicated to impact sounds that can be piloted using high-level verbal descriptors referring to material categories (i.e., wood, metal and glass). This issue is essential for sound design and virtual reality, where sounds coherent with visual scenes are to be constructed.

Manuscript received April 27, 2009; revised October 08, 2009; accepted March 18, 2010. Date of publication April 08, 2010; date of current version October 27, 2010. This work was supported by the Human Frontier Science Program under Grant HFSP #RGP0053 to M. Besson and by a grant from the French National Research Agency (ANR, JC05-41996, "senSons") to S. Ystad. The work of M. Aramaki was supported by a postdoctoral grant, first from the HFSP and then from the ANR. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Gaël Richard.

M. Aramaki and M. Besson are with the CNRS-Institut de Neurosciences Cognitives de la Méditerranée, 13402 Marseille Cedex 20, France, and also with Aix-Marseille-Université, 13284 Marseille Cedex 07, France (e-mail: [email protected]; [email protected]).

R. Kronland-Martinet and S. Ystad are with the CNRS-Laboratoire de Mécanique et d'Acoustique, 13402 Marseille Cedex 20, France (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TASL.2010.2047755

Control strategies for synthesis (also called mapping) constitute an important issue that has interested the computer music community ever since it became possible to produce music with computers [1]. A large number of interfaces and control strategies have been proposed by several authors [2]–[10]. Most of these interfaces were designed for musical purposes and are generally not adapted to building the environmental sounds used in sound design and virtual reality. As opposed to music-oriented interfaces, which generally focus on the control of acoustic factors such as pitch, loudness, or rhythmic deviations, a more intuitive control based on verbal descriptors that can be used by non-experts is needed in these new domains. This issue requires knowledge of the acoustical properties of sounds and of how they are perceived. As a first approach towards the design of such an environmental sound synthesizer, we focus on the class of impact sounds and on the control of the perceived material. In particular, our aim is to develop efficient mapping strategies between words referring to certain material categories (i.e., wood, metal and glass) and signal parameters, to allow for an intuitive sound synthesis based on a small number of control parameters.

To point out the perceptual properties that characterize the categories, a listening test was conducted. Stimuli were created first by recording impact sounds from everyday life objects made of different materials. Then, these recorded sounds were synthesized by analysis–synthesis techniques and tuned to the same chroma. Finally, we created continua from the tuned sounds to simulate progressive transitions between the categories by interpolating signal parameters. The use of sound continua was of interest to closely investigate transitions and limits between material categories. Sounds from these continua were used in a categorization task so as to be classified by participants as Wood, Metal, or Glass. From the percentages of responses, we determined sound categories representative of each material (called sets of typical sounds).

Then, we examined the acoustic characteristics that differ across typical Wood, Metal, and Glass sounds. For this purpose, we considered acoustic descriptors known to be relevant both for timbre perception and for material identification. Previous studies on the perception of sound categories have mainly been based on the notion of timbre. Several authors have used dissimilarity ratings to identify timbre spaces in which sounds from different musical instruments can be distinguished [11]–[14]. They found correlations between dimensions of these timbre spaces and acoustic descriptors such as attack time (the way the energy rises at the sound onset), spectral bandwidth (spectrum spread), or spectral centroid (center of gravity of the spectrum). More recently, roughness (distribution of interacting frequency components within the limits of a critical band) was considered


as a relevant dimension of timbre, since it is closely linked to the concept of consonance in a musical context [15], [16]. In the case of impact sounds, the perception of material seems mainly to correlate with the frequency-dependent damping of spectral components [17], [18] ([19], [20] in the case of struck bars), due to various loss mechanisms. Interestingly, damping remains a robust acoustic descriptor for identifying macro-categories (i.e., between wood–Plexiglas and steel–glass categories) across variations in the size of objects [21]. From an acoustical point of view, a global characterization of the damping can be given by the sound decay, measuring the decrease in sound energy as a function of time.

The above-mentioned timbre descriptors, namely attack time, spectral bandwidth, roughness, and normalized sound decay, were considered as potentially relevant signal features for the discrimination between sound categories. An acoustic analysis was conducted on these descriptors to investigate their relevance. At this stage, it is worth mentioning that signal descriptors found to be significant in traditional timbre studies may not be directly useful in the case of sound synthesis and control. Some descriptors might not give access to a sufficiently fine control of the perceived material, and it might be necessary to act on a combination of descriptors. To investigate more deeply the perceptual/cognitive aspects linked to sound categorization, we exploited electrophysiological measurements for synthesis purposes, since they provide complementary information regarding the nature of the sound characteristics that contribute to the differentiation of material categories from a perceptual/cognitive point of view. In particular, we examined changes in brain electrical activity [using event related potentials (ERPs)] associated with the perception and categorization of typical sounds (we refer the reader to a related article for more details [22]).

Based on the acoustic and electrophysiological results, the sound characteristics relevant for an accurate evocation of material were determined, and control strategies related to physical and perceptual considerations were proposed. The relevance of these strategies in terms of an intuitive manipulation of parameters was further tested in synthesis applications. High-level control was achieved through a calibration process that determined the ranges of the damping parameters specific to each material category. In particular, the use of sound continua in the categorization experiment highlighted transition zones between categories, which allowed for a continuous control between different materials.

The paper is organized as follows. First, the sound categorization experiment, with stimuli construction and results, is presented. Statistical analyses are then carried out on the set of sounds defined as typical, to determine the acoustic descriptors that best discriminate the sound categories. Then, the sound characteristics that are relevant for material perception are derived from physical considerations, timbre investigations, and electrophysiological measurements. Control strategies allowing for an intuitive manipulation of these sound characteristics, based on these findings and on our previous works [23]–[25], are proposed in a three-layer control architecture providing the synthesis of impact sounds directly from the material label. A formal perceptual evaluation of the proposed control strategy is finally presented.

II. SOUND CATEGORIZATION EXPERIMENT

A. Participants

Twenty-five participants (13 women and 12 men, 19 to 35 years old) were tested in this experiment, which lasted for about one hour. They were all right-handed non-musicians (no formal musical training), had normal audition, and had no known neurological disorders. They all gave written consent and were paid to participate in the experiment.

B. Stimuli

We first recorded 15 sounds by impacting everyday life objects made of three different materials (wooden beams, metallic plates, glass bowls), that is, five sounds per material. Synthetic versions of these recorded sounds were generated by an analysis–synthesis process and tuned to the same chroma. Then, we created 22-step sound continua that simulate progressive transitions between two sounds of different materials by acting on the amplitude and damping parameters. The different stages of the stimuli construction are detailed below.

1) Analysis–Synthesis of Natural Sounds: Recordings of natural sounds were made in an acoustically treated studio of the laboratory using a microphone placed 1 m from the source. The objects made of different materials were impacted by hand. We tried to control the impact on the object by using the same drumstick and the same impact force. The impact position on the different objects was chosen so that most modes were excited (near the center of the object for wooden beams and metallic plates; near the rim for glass bowls). Sounds were digitally recorded at a 44.1-kHz sampling frequency.

From a physical point of view, the vibrations of an impacted object (under free oscillations) can generally be modeled as a sum of exponentially damped sinusoids:

$$s(t) = \Theta(t) \sum_{m=1}^{M} A_m e^{-\alpha_m t} \sin(2\pi f_m t + \varphi_m) \qquad (1)$$

where $\Theta(t)$ is the Heaviside function and the parameters $A_m$, $\alpha_m$, $f_m$, and $\varphi_m$ are the amplitude, damping coefficient, frequency, and phase of the $m$th component, respectively. Based on the signal model corresponding to (1), we synthesized the recorded sounds at the same sampling frequency. Several techniques allow a precise estimation of the signal parameters based on high-resolution analysis, such as the Steiglitz–McBride technique [26] or, more recently, Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT), MUltiple SIgnal Classification (MUSIC), Least Squares or Maximum-Likelihood techniques [27]–[30] (see also [31], [32]). These latter methods provide an accurate estimation and can be used to conduct spectral analysis. We here used a simplified analysis technique based on the discrete Fourier transform (DFT), since we aimed at reproducing the main characteristics of the original sounds in terms of perceived material rather than achieving a perfect resynthesis.

The number of components $M$ to synthesize was estimated from the modulus of the spectral representation of the signal.


Only the most prominent components, whose amplitudes were larger than a threshold value fixed at 30 dB below the maximum amplitude of the spectrum, were synthesized. In addition, to keep the broadness of the original spectrum, we made sure that at least the most prominent component in each critical bandwidth was synthesized. Since Wood and Glass sounds had relatively poor spectra (i.e., few components), most of the components were synthesized. By contrast, Metal sounds had rich and broadband spectra. Some components were due to the nonlinear vibrations of the impacted object (favored by the low dissipation of Metal) and could not be reproduced by the signal model, which only considers linear vibrations. Thus, the number of components of a synthetic Metal sound was generally smaller than the number of components of the original sound.

The frequency values $f_m$ were directly inferred from the abscissa of the local maxima corresponding to the prominent components. Since the spectrum was obtained by computing a fast Fourier transform (FFT), the frequency precision of each component was equal to 0.76 Hz. Each component was isolated using a Gaussian window centered on the frequency $f_m$. The frequency bandwidth of the Gaussian window was adapted to numerically minimize the smoothing effects and to avoid the overlap of two successive components, which causes interference effects. The Gaussian window presents the advantage of preserving the exponential damping when convolved with an exponentially damped sine wave. Then, the analytic signal $a_m(t)$ of the windowed signal was calculated using the Hilbert transform, and the modulus of $a_m(t)$ was modeled by an exponentially decaying function

$$|a_m(t)| \simeq A_m e^{-\alpha_m t} \qquad (2)$$

Thus, by fitting the logarithm of $|a_m(t)|$ with a polynomial function of degree 1 at best in a least-squares sense, the amplitude $A_m$ was inferred from the ordinate at the origin, while the damping coefficient $\alpha_m$ was inferred from the slope. Finally, the phases $\varphi_m$ were set to 0 for all components. This choice is commonly adopted in synthesis processes since it avoids undesirable clicks at sound onset. It is worth noticing that this phase adjustment does not affect the perception of the material, because phase relationships between components mainly reflect the position of the microphone relative to the impacted object.
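For illustration, the following minimal Python sketch applies this estimation procedure to a single synthetic component of model (1); the one-component setting and all numerical values are illustrative assumptions (the paper isolates each component with a Gaussian window before this step).

```python
import numpy as np
from scipy.signal import hilbert

fs = 44100  # sampling frequency (Hz), as in the paper

def damped_sinusoid(A, alpha, f, dur, fs=fs):
    """One exponentially damped sinusoid of model (1), with zero phase."""
    t = np.arange(int(dur * fs)) / fs
    return A * np.exp(-alpha * t) * np.sin(2 * np.pi * f * t)

# Synthetic "recorded" component: amplitude 1.0, damping 8 s^-1, 440 Hz.
x = damped_sinusoid(1.0, 8.0, 440.0, dur=1.0)

# Estimation as in the text: Hilbert envelope, then a degree-1
# least-squares fit of its logarithm. The slope gives the damping
# coefficient, the intercept the amplitude (model (2)).
env = np.abs(hilbert(x))
t = np.arange(len(x)) / fs
slope, intercept = np.polyfit(t, np.log(env + 1e-12), 1)
print(f"estimated alpha = {-slope:.2f} s^-1, A = {np.exp(intercept):.2f}")
```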

2) Tuning: The pitches of the 15 synthetic sounds (five per material category) differed, since they resulted from impacts on various objects. Consequently, sounds were tuned to the same chroma to minimize pitch variations. Tuning is needed to build homogeneous sound continua with respect to pitch (Section II-B4) and to accurately investigate acoustic descriptors (Section III). In particular, the relationships between descriptors are better interpreted if they are computed on a set of tuned sounds with normalized pitches rather than on a set of sounds with various pitch values.

We first defined the initial pitch of the sounds from informal listening tests: four participants (different from those who participated in the categorization experiment) listened to each sound and were asked to evaluate the pitch by playing the matching note on a piano keyboard. For each sound, the pitch was defined by the note that was most often associated with the sound. Thus, we defined the pitches G♯ (fundamental frequency of 415.30 Hz), C♯ (277.18 Hz), F♯ (369.99 Hz), and two further notes for the 5 Wood sounds; A (440.00 Hz), D (1174.65 Hz), E (659.25 Hz), E (329.62 Hz), and one further note for the 5 Metal sounds; and C (1046.50 Hz), E (2637.02 Hz), C♯ (2217.46 Hz), F (1396.91 Hz), and one further note for the 5 Glass sounds. Then, we tuned the sounds to the closest note C with respect to the initial pitch, to minimize the signal transformations applied to the sounds: Wood sounds were tuned to the pitch C3, Metal sounds to C3 and C4, and Glass sounds to C5 and C6. Therefore, sounds differed by 1, 2, or 3 octaves depending upon the material. Based upon previous results showing high similarity ratings for tone pairs that differed by octaves [33], an effect known as octave equivalence, we assume that the octave differences between sounds belonging to the same category should have little influence on sound categorization.

In practice, tuned sounds were generated using the previous synthesis technique [(1)]. The amplitudes and phases of the components were kept unchanged, but the frequencies (noted $\tilde{f}_m$ for tuned sounds) and damping coefficients (noted $\tilde{\alpha}_m$) were recalculated as follows. The tuned frequencies were obtained by transposing the original ones with a dilation factor $\gamma$ defined from the fundamental frequency values (in Hz), noted $f_0$ and $\tilde{f}_0$, of the sound pitches before and after tuning, respectively:

$$\tilde{f}_m = \gamma f_m \quad \text{with} \quad \gamma = \frac{\tilde{f}_0}{f_0} \qquad (3)$$

The damping coefficient of each tuned component was recalculated by taking into account the frequency dependency of the damping. For instance, it is known that, in the case of wooden bars, the damping coefficients increase with frequency following an empirical expression of parabolic form whose parameters depend on the wood species [34]–[36]. To achieve our objectives, we defined a general expression of a damping law $\alpha(\omega)$, chosen as an exponential function

$$\alpha(\omega) = e^{\alpha_G + \alpha_R\,\omega} \qquad (4)$$

The exponential expression presents the advantage of easily fitting various realistic damping profiles with a reduced number of parameters. $\alpha(\omega)$ is defined by two parameters, $\alpha_G$ and $\alpha_R$, characteristic of the intrinsic properties of the material. The parameter $\alpha_G$ reflects global damping and the parameter $\alpha_R$ reflects frequency-relative damping (i.e., the difference between high-frequency component damping and low-frequency component damping). Thus, a damping law $\alpha(\omega)$ was estimated on the original sound by fitting the damping coefficients $\alpha_m$ with (4) at best in a least-squares sense. Then, the damping coefficient of the $m$th tuned component was recalculated according to this damping law (see also [37])

$$\tilde{\alpha}_m = \alpha(2\pi \tilde{f}_m) = e^{\alpha_G + 2\pi\alpha_R \tilde{f}_m} \qquad (5)$$
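The tuning chain of (3)–(5) can be sketched as follows; the function names and the toy partial values are illustrative assumptions, and the damping law (4) is applied in angular frequency as written above.

```python
import numpy as np

def fit_damping_law(freqs, alphas):
    """Least-squares fit of the damping law (4),
    alpha(omega) = exp(aG + aR * omega), from measured (f_m, alpha_m)."""
    omega = 2 * np.pi * np.asarray(freqs)
    aR, aG = np.polyfit(omega, np.log(alphas), 1)  # linear fit of log(alpha)
    return aG, aR

def tune(freqs, alphas, f0_old, f0_new):
    """Transpose all frequencies by the dilation factor (3), then
    recompute each damping coefficient from the fitted law (5)."""
    aG, aR = fit_damping_law(freqs, alphas)
    gamma = f0_new / f0_old
    new_freqs = gamma * np.asarray(freqs)
    new_alphas = np.exp(aG + aR * 2 * np.pi * new_freqs)
    return new_freqs, new_alphas

# Example: a G#-like sound (415.30 Hz) transposed down to C (261.63 Hz).
freqs = [415.3, 830.6, 1661.2]
alphas = [5.0, 9.0, 20.0]
print(tune(freqs, alphas, 415.30, 261.63))
```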

3) Gain Adjustment: Sounds were equalized by gain adjustments to avoid the influence of loudness on the categorization judgments. The gain adjustments were determined on the basis of a pretest with four participants (different from those who participated in the categorization experiment), who were asked to balance the loudness level of the tuned sounds.


These tuned sounds were previously normalized by a reference gain equal to the inverse of 1.5 times the largest value of the maxima of the signal modulus among the 15 tuned sounds. The coefficient 1.5 is a safety coefficient commonly used in gain adjustment tests to avoid saturation of the signals after adjustment. The gain values applied to the five Wood sounds were [70, 20, 30, 15, 30], to the five Metal sounds [3.5, 1.1, 1, 1.5, 1.3], and to the five Glass sounds [35, 15, 15, 30, 10].

Finally, the four participants were asked to evaluate the final sounds in terms of perceived material. Results showed that the sounds were categorized in the same material category as the original sounds by all participants, thereby showing that the main characteristics of the material were preserved.

4) Sound Continua: To closely investigate transitions between material categories, we created 15 22-step sound continua, five for each material transition: the five Wood–Metal continua were indexed 1 to 5, the five Wood–Glass continua 6 to 10 and, finally, the five Glass–Metal continua 11 to 15. Each continuum was composed of 22 hybrid sounds obtained by mixing the spectra and interpolating the damping laws of the two extreme sounds. We chose to mix spectra to fix the values of the frequency components, which allows minimizing pitch variations across sounds within a continuum (it is known that shifting components modifies pitch). We chose to interpolate damping laws to gradually modify the damping, which conveys fundamental information on material perception. Thus, the hybrid sound $s_k(t)$ at step $k$ of the continuum is expressed by

$$s_k(t) = \Theta(t)\Big[g_1(k)\sum_{m} A^{(1)}_m e^{-\alpha_k(2\pi f^{(1)}_m)\,t}\sin(2\pi f^{(1)}_m t) + g_2(k)\sum_{n} A^{(2)}_n e^{-\alpha_k(2\pi f^{(2)}_n)\,t}\sin(2\pi f^{(2)}_n t)\Big] \qquad (6)$$

where $\{A^{(1)}_m, f^{(1)}_m\}$ and $\{A^{(2)}_n, f^{(2)}_n\}$ correspond to the sets of amplitudes and frequencies of the two extreme sounds and $k$ varies from 1 to 22. The gains $G_1$ and $G_2$ correspond to the gains of the extreme sounds defined from the gain adjustment test according to a gain of reference (see Section II-B2). The gains $g_1(k)$ and $g_2(k)$ vary at each step on a logarithmic scale, i.e., linearly on a dB scale, so that the first extreme sound progressively fades out while the second fades in along the continuum (7). The damping variation along the continua is computed by interpolating the damping parameters $\alpha_G$ and $\alpha_R$ of the damping law [defined in (4)] estimated on the two extreme sounds (located at steps 1 and 22, respectively), leading to the determination of a hybrid damping law $\alpha_k(\omega)$ that progressively varies at each step of the continuum (see Fig. 1)

$$\alpha_k(\omega) = e^{\alpha_G(k) + \alpha_R(k)\,\omega} \qquad (8)$$

Fig. 1. Damping laws $\alpha_k$ as a function of frequency for a Wood–Metal continuum. Bold curves correspond to the damping laws of the Wood (bold solid) and Metal (bold dashed) sounds at the extreme positions.

with

$$\alpha_G(k) = \frac{22-k}{21}\,\alpha^{(1)}_G + \frac{k-1}{21}\,\alpha^{(2)}_G, \qquad \alpha_R(k) = \frac{22-k}{21}\,\alpha^{(1)}_R + \frac{k-1}{21}\,\alpha^{(2)}_R \qquad (9)$$

The use of an interpolation process on the damping allowed for a better merging of the extreme sounds, since the spectral components of the two spectra are damped following the same damping law $\alpha_k(\omega)$. As a consequence, hybrid sounds (in particular, at central positions of the continua) differed from sounds obtained by only mixing the extreme sounds.

The obtained sounds had different signal lengths (Metal sounds are longer than Wood or Glass sounds). To restrict the lengths to a maximum of 2 seconds, the sound amplitudes were smoothly faded out by multiplying the temporal signal with the decreasing half of a Hann window.

A total of 330 sounds was created. The whole set of sounds is available at [38]. The average sound duration was 861 ms across all sounds: 1053 ms in the Wood–Metal continua, 449 ms in the Wood–Glass continua, and 1081 ms in the Glass–Metal continua.
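A minimal sketch of the continuum construction of (6), (8), and (9) is given below; the linear crossfade of the gains (the paper interpolates them on a dB scale), the toy spectra, and the damping values are simplifying assumptions.

```python
import numpy as np

fs = 44100
K = 22  # number of steps per continuum

def hybrid_sound(k, specA, specB, lawA, lawB, gains, dur=2.0):
    """Hybrid sound at step k (1..22), following the scheme of (6),
    (8), (9): the two extreme spectra are mixed and every component
    is damped by the same interpolated damping law."""
    w = (k - 1) / (K - 1)                    # 0 at one extreme, 1 at the other
    aG = (1 - w) * lawA[0] + w * lawB[0]     # (9), linear interpolation
    aR = (1 - w) * lawA[1] + w * lawB[1]
    t = np.arange(int(dur * fs)) / fs
    s = np.zeros_like(t)
    # Simple linear crossfade of the gains; the paper interpolates
    # them on a dB scale instead.
    for mix, spec, g in ((1 - w, specA, gains[0]), (w, specB, gains[1])):
        for A, f in spec:
            alpha = np.exp(aG + aR * 2 * np.pi * f)   # hybrid law (8)
            s += g * mix * A * np.exp(-alpha * t) * np.sin(2 * np.pi * f * t)
    return s

# Toy Wood-like and Metal-like spectra: (amplitude, frequency) pairs.
wood = [(1.0, 261.6), (0.5, 680.0)]
metal = [(1.0, 261.6), (0.7, 523.2), (0.6, 830.6), (0.5, 1244.5)]
s_mid = hybrid_sound(11, wood, metal, (-1.0, 0.0009), (-4.5, 0.0002), (1.0, 1.0))
print(s_mid.shape)
```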

C. Procedure

The experiment was conducted in a quiet, electrically shielded (Faradized) room. Sounds were presented once (i.e., no repetition of the same sound) in random order through one loudspeaker (Tannoy S800) located 1 m in front of the participant. Participants were asked to categorize sounds as Wood, Metal, or Glass, as fast as possible, by pressing one of three buttons on a response box1 (right, middle, and left buttons; one button per material category label). The association between response buttons and material categories was balanced across participants to avoid any bias linked with the order of the buttons.

1 Since participants were not given the option to indicate that a sound did not belong to any of these three categories, results may be biased; however, this potential ambiguity would arise only for intermediate sounds.


Fig. 2. Percentage of classification as Wood (gray curves), Metal (black curves), or Glass (dashed curves) for each sound as a function of its position on the continuum, for the 15 continua. Sounds were considered typical if they were classified in one category by more than 70% of participants (threshold represented by a horizontal line). No data were collected for the extreme sounds (steps 1 and 22), since they were used in the training session.

Fig. 3. ERPs to typical sounds of Wood (dotted), Metal (circle marker), and Glass (gray) at electrodes (Fz, Cz, Pz) along the midline of the scalp (see [22] for lateral electrodes). The amplitude (in microvolts) is represented on the ordinate, with negativity plotted up. The time from sound onset is on the abscissa (in milliseconds).

A row of 4 crosses, i.e., "XXXX," was presented on the screen 2500 ms after sound onset, for 1500 ms, to give participants time to blink. The next sound was presented after a 500-ms silence. Participants' responses were collected for all sounds except the extreme sounds, since those were used in the training session. The electrical brain activity (electroencephalogram, EEG) was recorded continuously for each participant during the categorization task.

D. Results

1) Behavioral Data: Percentages of categorization as Wood, Metal, or Glass were obtained for each sound by averaging responses across participants. Fig. 2 shows these data as a function of sound position along the continua. From these data, we determined a set of typical sounds for each material category: sounds were considered typical if they were categorized as Wood, Metal, or Glass by more than 70% of participants (we refer the reader to [39] for more details on the determination of this threshold value). In addition, the sound positions delimiting the categories along the continua can be defined from the position ranges of typical sounds. Note that, due to the different acoustic characteristics of the sounds, the category limits are not located at the same position for all continua (see Fig. 2).

2) Electrophysiological Data: We examined ERPs time-locked to sound onset to analyze the different stages of information processing as they unfold in real time.2 ERP data were averaged separately for typical sounds of each material category (Fig. 3). We summarize here the main findings (we refer the reader to a related article for more details [22]).

Typical sounds from each category elicited small P100 components, with maximum amplitude around 65 ms post-sound onset, and large N100 and P200 components, followed by N280 components and a Negative Slow Wave (NSW) at fronto-central sites, or large P550 components at parietal sites. Statistical analyses revealed no significant differences on the P100 and N100 components as a function of material category. By contrast, they showed that typical Metal sounds elicited smaller P200 and P550 components, and larger N280 and NSW components, than typical Wood and Glass sounds. From an acoustic point of view, Metal sounds have richer spectra and longer durations (i.e., lower damping) than Wood and Glass sounds. The early differences on the P200 components most likely reflect the processing of spectral complexity (see [42] and [43]), while the later differences on the N280 and on the NSW are likely to reflect differences in sound duration (i.e., differences in damping; see [44] and [45]).

III. ACOUSTIC ANALYSIS

The typical sounds defined from the behavioral data (Section II-D1) form a set of sounds representative of each material category. To characterize these typical sounds from an acoustical point of view, we investigated the following descriptors: attack time (AT), spectral bandwidth (SB), roughness (R), and normalized sound decay $\tilde{\alpha}$, defined below. Then, we examined the relationships between the acoustic descriptors and their relevance for discriminating the material categories.

A. Definition of Acoustic Descriptors

Attack time is a temporal timbre descriptor which characterizes the signal onset.

2 The ERPs elicited by a stimulus (a sound, a light, etc.) are characterized by a succession of positive (P) and negative (N) deflections relative to a baseline (usually measured within the 100 or 200 ms preceding stimulus onset). These deflections (called components) are characterized by their polarity, their latency of maximum amplitude (relative to stimulus onset), their distribution across different electrodes located at standard positions on the scalp, and their functional significance. Typically, the P100, N100, and P200 components reflect the sensory and perceptual stages of information processing and are obligatory responses to the stimulation [40], [41]. Then, depending on the experimental design and on the task at hand, different late ERP components are elicited (N200, P300, N400, etc.).


It is defined by the time (in seconds) necessary for the signal energy to rise from a threshold level to the maximum energy of the temporal envelope (for a percussive sound) or to the sustained part (for a sustained sound with no decay part) [46], [47]. Different values have been proposed in the literature for both the minimum and maximum thresholds. For our concern, we chose to compute the attack time from 10% to 90% of the maximum amplitude of the temporal envelope, as in [48]. This descriptor is known to be relevant for distinguishing different classes of instrumental sounds. For instance, sounds from percussive and woodwind instruments have respectively short and long AT.

Spectral bandwidth (in Hz), commonly associated with the spectrum spread, is defined by [49]

$$\mathrm{SB} = \sqrt{\frac{\sum_{k} (f_k - \mathrm{SC})^2\,|\hat{s}(f_k)|}{\sum_{k} |\hat{s}(f_k)|}} \qquad (10)$$

where SC is the spectral centroid (in Hz) defined by [50]

$$\mathrm{SC} = \frac{\sum_{k} f_k\,|\hat{s}(f_k)|}{\sum_{k} |\hat{s}(f_k)|} \qquad (11)$$

and where $f_k$ represents frequency, $\hat{s}$ the Fourier transform of the signal estimated using the FFT algorithm, and $k$ the FFT bin index.
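These two descriptors can be computed directly from the FFT modulus, as in the following sketch; the FFT length and the test signal are arbitrary choices.

```python
import numpy as np

def spectral_centroid_bandwidth(x, fs=44100, nfft=2**16):
    """Spectral centroid SC (11) and spectral bandwidth SB (10),
    computed from the FFT modulus of the signal."""
    mag = np.abs(np.fft.rfft(x, n=nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    sc = np.sum(freqs * mag) / np.sum(mag)
    sb = np.sqrt(np.sum((freqs - sc) ** 2 * mag) / np.sum(mag))
    return sc, sb

# Example on a decaying 440-Hz sinusoid.
t = np.arange(44100) / 44100
x = np.exp(-5 * t) * np.sin(2 * np.pi * 440 * t)
print(spectral_centroid_bandwidth(x))
```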

Roughness (in asper) is commonly associated with the presence of several frequency components within the limits of a critical band. From a perceptual point of view, roughness is correlated with tonal consonance, based on results from experiments on consonance judgments conducted by [51]. From a signal point of view, [52] have shown that roughness and fluctuation strength are proportional to the square of the modulation factor of an amplitude-modulated pure tone. We computed roughness based on Vassilakis's model by summing the partial roughness over all pairs of frequency components contained in the sound [53]

$$R = \sum_{i}\sum_{j>i} R_{ij} \qquad (12)$$

where the partial roughness $R_{ij}$ is a function of the amplitudes $A_i$ and $A_j$ and of the frequencies $f_i$ and $f_j$ of components $i$ and $j$, respectively.
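The following sketch implements this pairwise summation; the partial-roughness expression and its constants follow Vassilakis's published model, which is assumed here since the paper only cites it.

```python
import numpy as np
from itertools import combinations

def partial_roughness(A1, f1, A2, f2):
    """Roughness contributed by one pair of partials, following the
    Vassilakis model assumed behind (12)."""
    fmin, fmax = min(f1, f2), max(f1, f2)
    Amin, Amax = min(A1, A2), max(A1, A2)
    X = (Amin * Amax) ** 0.1                  # dependence on overall level
    Y = (2.0 * Amin / (Amin + Amax)) ** 3.11  # amplitude-fluctuation degree
    s = 0.24 / (0.0207 * fmin + 18.96)        # critical-band scaling
    d = fmax - fmin
    Z = np.exp(-3.5 * s * d) - np.exp(-5.75 * s * d)
    return 0.5 * X * Y * Z

def roughness(spectrum):
    """Total roughness (12): sum over all pairs of components."""
    return sum(partial_roughness(A1, f1, A2, f2)
               for (A1, f1), (A2, f2) in combinations(spectrum, 2))

print(roughness([(1.0, 440.0), (0.8, 466.2)]))  # a rough close pair
```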

Finally, the sound decay (in s$^{-1}$) quantifies the amplitude decrease of the whole temporal signal and globally characterizes the damping in the case of impact sounds. In particular, it approximately corresponds to the decay of the spectral component with the longest duration (i.e., generally the lowest-frequency one). The sound decay is directly estimated by the slope of the logarithm of the temporal signal envelope. This envelope is given by calculating the analytic signal using the Hilbert transform and by filtering the modulus of this analytic signal using a second-order low-pass Butterworth filter with a cutoff frequency of 50 Hz [36].

TABLE I
COEFFICIENTS OF DETERMINATION BETWEEN THE ATTACK TIME AT, THE SPECTRAL BANDWIDTH SB, THE ROUGHNESS R, AND THE NORMALIZED SOUND DECAY. SINCE THE MATRIX IS SYMMETRIC, ONLY THE UPPER PART IS REPORTED. THE P-VALUES ARE ALSO REPORTED WHEN COEFFICIENTS ARE SIGNIFICANT (WITH BONFERRONI ADJUSTMENT)

Since damping is frequency dependent (Section II-B1), the sound decay depends on the spectral content of the sound. Consequently, we considered a normalized sound decay, denoted $\tilde{\alpha}$, with respect to the spectral localization of the energy, and we defined this dimensionless descriptor as the ratio of the sound decay to the SC value

$$\tilde{\alpha} = \frac{\text{sound decay}}{\mathrm{SC}} \qquad (13)$$
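The estimation of the normalized decay can be sketched as follows; the sign convention (returning a positive value for decaying sounds), the transient-skipping step, and the test signal are illustrative choices.

```python
import numpy as np
from scipy.signal import hilbert, butter, lfilter

def normalized_sound_decay(x, sc, fs=44100):
    """Normalized sound decay (13): the global decay, estimated as the
    slope of the log envelope (Hilbert modulus low-passed by a
    second-order 50-Hz Butterworth filter), divided by the spectral
    centroid sc."""
    env = np.abs(hilbert(x))
    b, a = butter(2, 50.0 / (fs / 2))   # second-order low-pass at 50 Hz
    env = lfilter(b, a, env)
    t = np.arange(len(x)) / fs
    skip = 1000                          # skip the filter's initial transient
    slope, _ = np.polyfit(t[skip:], np.log(np.abs(env[skip:]) + 1e-12), 1)
    return -slope / sc                   # positive for decaying sounds

t = np.arange(44100) / 44100
x = np.exp(-5 * t) * np.sin(2 * np.pi * 440 * t)
print(normalized_sound_decay(x, sc=440.0))
```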

B. Relationships Between Acoustic Descriptors

As a first step, we examined the relationships between the acoustic descriptors estimated on the typical sounds. Table I shows the coefficients of determination, i.e., the squares of the Bravais–Pearson correlation coefficients, between pairs of descriptors. The highest significant correlation (although not high in absolute value) was found between the two spectral descriptors SB and R. The lowest correlations were found between AT and the other descriptors, reflecting the fact that the sound onset has little influence on the spectral characteristics and does not depend on the decaying part of the sound (described by $\tilde{\alpha}$).

Second, a principal component analysis (PCA) was conducted on standardized values of the acoustic descriptors (i.e., values centered on the mean and scaled by the standard deviation). Results showed that the first two principal components explained about 72% of the total variance (the first component alone explained about 48%). As shown in Fig. 4, the first component was mainly correlated with the spectral descriptors (SB and R) and the second component with the temporal descriptors (AT and $\tilde{\alpha}$). Thus, the PCA revealed that sounds could reliably be represented in a reduced two-dimensional space whose orthogonal axes are mainly correlated with spectral (Component I) and temporal (Component II) descriptors, respectively. This result confirmed that spectral and temporal descriptors bring complementary information to the acoustic characterization of the sounds.
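The analysis pipeline amounts to standardizing the descriptor matrix and projecting it, as in this sketch; the random matrix only stands in for the measured descriptor values, which are not reproduced in the text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder descriptor matrix: one row per typical sound, columns
# AT, SB, R and the normalized decay (random stand-in values).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))

# Standardize (center on the mean, scale by the standard deviation),
# then project on the principal components.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
pca = PCA(n_components=2).fit(Xs)
print(pca.explained_variance_ratio_)  # the paper reports ~48% + ~24%
```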

C. Discrimination Between Material Categories

We examined the relevance of the acoustic descriptors for discriminating the material categories using a canonical discriminant analysis. This analysis was conducted using the Materials (Wood, Metal, and Glass) as groups and the standardized values of the acoustic descriptors (AT, SB, R, and $\tilde{\alpha}$) as independent variables. Since three sound categories were considered, two discriminant functions allowing the clearest separation between sound categories were computed (the number of discriminant functions is equal to the number of groups minus one).


Fig. 4. Biplot visualization: the observations corresponding to typical sounds are represented by dots and the acoustic descriptors by vectors. The contribution of each descriptor (attack time AT, spectral bandwidth SB, roughness R, and normalized sound decay $\tilde{\alpha}$) to each principal component (PC I and PC II) can be quantified by the $R^2$ statistics given by a regression analysis.

These two functions were expressed as linear combinations of the independent variables (14). Wilks's lambda tests showed that both discriminant functions were significant. The first function explains 96% of the variance, while the second explains the remaining variance. The coefficient associated with each descriptor indicates its relative contribution to the discriminant function. In particular, the first function is mainly related to $\tilde{\alpha}$ and allows a clear distinction, particularly between typical Wood and Metal sounds, as shown in Fig. 5. This result is in line with previous studies showing that damping is a fundamental cue in the perception of sounds from impacted materials (see the Introduction). The second axis is mainly related to the spectral descriptor R and allows for a distinction of the Glass sounds.

IV. CONTROL STRATEGY FOR THE SYNTHESIZER

Results from the acoustic and electrophysiological data are now discussed from the perspective of designing an intuitive control of the perceived material in an impact sound synthesizer. In particular, we aim at determining the sound characteristics relevant for an accurate evocation of different materials and at proposing intuitive control strategies associated with these characteristics. In practice, the synthesis engine and the control strategies were implemented in Max/MSP [54], thereby allowing for the manipulation of parameters in real time and, consequently, providing an easy way to evaluate the proposed controls.

Fig. 5. Scatter plot of the canonical variables allowing the clearest separation between the typical Wood, Metal, and Glass sound categories.

The observations and conclusions from these synthesis applications are also reported.

A. Determination of Sound Characteristics

As a starting point, results from the acoustic analysis revealed that $\tilde{\alpha}$ (characterizing the damping) was the most relevant descriptor for discriminating the material categories, thereby confirming several findings in the literature on the relevance of damping for material identification (see the Introduction). Thus, damping was kept as a relevant sound characteristic to control in the synthesizer.

Furthermore, the acoustic analysis showed that, in addition to the damping, a second dimension related to the spectral characteristics of sounds was significant, in particular for the distinction between Glass and Metal sounds. Interestingly, this result is supported by the electrophysiological data, which revealed ERP differences between Metal on one side and both Glass and Wood sounds on the other side. Since these differences were interpreted as reflecting the processing of sound duration (related to damping) and spectral content, the ERP data showed the relevance of both these aspects for material perception. Thus, from a general point of view, it is reasonable to assume that material perception is guided by additional cues (other than damping) that are most likely linked to the spectral content of sounds. This assumption is in line with several studies showing that material categorization can be affected by spectral characteristics, in particular, Glass being associated with higher frequencies than Metal sounds [20], [55].

In line with this assumption, synthesis applications confirmed that damping was relevant but, in some cases, not sufficient to evoke the intended material category. For instance, it was not possible to transform a given Wood or Glass sound into a Metal sound by only applying a Metal damping to a Wood or Glass spectrum: the resulting sound did not sound metallic enough. It was therefore necessary to also modify the spectral content, and in particular the spectral distribution, to emphasize the metallic aspect of the sounds (examples in [38]). We found another limitation of damping in the case of Glass synthesis. Indeed, Glass sounds span a wide range of damping values (from highly damped


jar sounds to highly resonant crystal glass sounds) and are most often characterized by a sparse distribution of spectral components (i.e., few distinct components). These observations indicated that the perceptual distinction between Glass and Metal may be due to the typically dissonant aspect of Metal, which can be accurately reflected by the roughness, highlighted in the acoustic analysis as the most relevant descriptor after the damping. Thus, we concluded on the necessity of taking into account a control of the spectral shape, in addition to the control of damping, for a more accurate evocation of the perceived material.

Besides, the electrophysiological data provided complementary information regarding the temporal dynamics of the brain processes associated with the perception and categorization of typical sounds. First, it is known that the P100 and N100 components are influenced by variations in sound onset parameters [56]. The lack of differences on these ERP components is taken to indicate similar brain processes for all typical sounds, showing that the information about the perceived material does not lie in the sound onset. As a synthesis outcome, this means that a modification of the sound onset does not affect the nature of the perceived material and, consequently, that AT is not a relevant parameter for the control of the perceived material. Second, it is known that the N100 component is also influenced by pitch variations [40]. While octave differences were largest between Glass and the other two categories, the lack of differences on the N100 component is taken to indicate that pitch may not affect sound categorization. Thus, the electrophysiological data support our previous assumption concerning the weak influence of the octave differences on sound categorization (Section II-B2).

Based on these considerations, damping and spectral shaping were determined as the sound characteristics relevant for material perception. Control strategies associated with these characteristics are proposed and detailed in the following sections.

B. Control of Damping

The control of damping was designed by acting on the parameters $\alpha_G$ and $\alpha_R$ of the damping law (4). This control gave an accurate manipulation of the sound dynamics (time evolution) with a reduced number of control parameters. In particular, the parameter $\alpha_G$ governed the global sound decay (quantified by the descriptor $\tilde{\alpha}$, Section III) and the parameter $\alpha_R$ allowed controlling the damping differently for high- and low-frequency components. This control made it possible to synthesize a wide variety of realistic sounds. Since, from a physical point of view, high-frequency components are generally more heavily damped than low-frequency ones, we expected both parameters $\alpha_G$ and $\alpha_R$ to be positive in the case of natural sounds.

C. Control of Spectral Shaping

We here propose two control strategies. The first one relies on spectral dilation and is based on physical considerations, in particular on dispersion phenomena. The second is based on amplitude and frequency modulations and relies on adding components to the original spectrum. This latter control has a smaller influence on pitch than the first one, since the original spectrum is not distorted. It also has interesting perceptual consequences. For instance, by creating components within a specific frequency band (i.e., a critical band of hearing), this control specifically influences the perception of roughness, which was highlighted as a relevant acoustic descriptor in the acoustic analysis (Section III). These two control strategies are detailed in the following sections.

1) Control by Spectral Dilation: From the analysis of natural sounds, and from physical models describing wave propagation in various media (i.e., various physical materials), two important phenomena can be observed: dispersion and dissipation [32], [57]. Dissipation is due to various loss mechanisms and is directly linked to the damping parameters ($\alpha_G$ and $\alpha_R$) described above. Dispersion is linked to the fact that the wave propagation speed varies with frequency. This phenomenon occurs when the phase velocity of a wave is not constant, and it introduces inharmonicity in the spectrum of the corresponding sound. An example of a dispersive medium is the stiff string, for which the $n$th partial is not located at $n f_0$ but at $n f_0 \sqrt{1 + B n^2}$, where $f_0$ is the fundamental frequency and $B$ is the coefficient of inharmonicity, which depends on the physical parameters of the string [58]. We based our first spectral shaping strategy on a spectral dilation (15) in which each component frequency $f_n$ is shifted toward a target frequency $\nu_n$ given by a generalized inharmonicity law (16), the shift being weighted by a window function $w$ (defined later in the text). Equation (16) generalizes the inharmonicity law previously defined for stiff strings, so that the expression is not limited to harmonic sounds but can be applied to any set of frequencies; it is governed by a global and a relative shaping parameter. The ranges of these two parameters are constrained so that the shifted frequencies are real-valued and remain ordered for all $n$ up to the number of components $M$: the global parameter must be lower bounded (17), and the relative parameter must satisfy a corresponding condition for all $n$ (18).

A window function $w$ provided a local control of the spectral shaping within a given frequency range. In particular, it was of interest to keep the first components unchanged during the spectral control, to reduce pitch variations. For instance, a window function supported between a minimum frequency $f_{\min}$ and $f_s/2$, where $f_s$ is the sampling frequency, can be applied to act only on frequencies higher than $f_{\min}$. In practice, we chose a Tukey (tapered cosine) window defined between $f_{\min}$ and $f_s/2$. The window is parameterized by a ratio (between 0 and 1) allowing the user to choose intermediate profiles between rectangular and Hann windows. Consequently, the user is able to act on the weight of the local control.
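A sketch of such a weighting profile is given below; the function name, the frequency grid, and the interpolation step are illustrative assumptions, since the exact form of (15) is not fully specified here.

```python
import numpy as np
from scipy.signal.windows import tukey

def shaping_weight(freqs, f_min, fs=44100, ratio=0.5, n_grid=1024):
    """Weight w(f) applied to the frequency shifts of the spectral
    dilation: zero below f_min, then a Tukey (tapered cosine) profile
    between f_min and fs/2. ratio=0 gives a rectangular profile,
    ratio=1 a Hann profile."""
    grid = np.linspace(f_min, fs / 2, n_grid)
    win = tukey(n_grid, alpha=ratio)
    return np.interp(freqs, grid, win, left=0.0, right=0.0)

# Partials below 1 kHz are left untouched; higher ones are shifted
# with a tapered weight.
print(shaping_weight(np.array([500.0, 2000.0, 12000.0]), f_min=1000.0))
```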


From an acoustic point of view, this control acts on the spectral descriptors SB and R in a global way. For example, a decrease of the global shaping parameter leads to a decrease of the SB value and, at the same time, to an increase of the R value.

2) Control by Amplitude and Frequency Modulations: Amplitude modulation creates two components on both sides of the original one, and the modulated output waveform for the $m$th component is expressed by

$$s_m(t) = A_m\left[1 + I\cos(2\pi f_{\text{mod}}\,t)\right]\sin(2\pi f_m t)$$

where $I$ is the modulation index, $f_{\text{mod}}$ the modulating frequency, $A_m$ the amplitude, and $f_m$ the frequency of the $m$th component.

Frequency modulation creates a set of components on both sides of the original one, and the modulated output waveform is expressed by

$$s_m(t) = A_m\sin\!\left[2\pi f_m t + I\sin(2\pi f_{\text{mod}}\,t)\right] = A_m\sum_{n=-\infty}^{+\infty} J_n(I)\sin\!\left[2\pi(f_m + n f_{\text{mod}})\,t\right]$$

where $J_n$ is the Bessel function of order $n$. The amplitudes of these additional components are given by the amplitude of the original partial and the values of $J_n(I)$ for a given modulation index $I$.

For both amplitude and frequency modulations, synthesis applications showed that applying the same value of the modulating frequency to all components led to synthetic sounds perceived as too artificial. To avoid this effect, we proposed a definition of the modulating frequency for each spectral component based on perceptual considerations. Thus, the modulating frequency of the $m$th component was expressed as a percentage $\kappa$ of the critical bandwidth $\mathrm{CB}(f_m)$ associated with that component [59]

$$f_{\text{mod}}(f_m) = \kappa\,\mathrm{CB}(f_m), \qquad \mathrm{CB}(f_m) = 25 + 75\left(1 + 1.4 f_m^2\right)^{0.69} \qquad (19)$$

where $f_m$ is expressed in kHz and $\mathrm{CB}$ in Hz. Since the critical bandwidth increases with frequency, the components created at high frequencies are more distant (in frequency) from the central component than the components created at low frequencies. This provided an efficient way to control roughness, since the addition of components within a critical bandwidth increases the perception of roughness. In particular, it is known that the maximum sensory dissonance corresponds to an interval between spectral components of about 25% of the critical bandwidth [51], [60].
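The following sketch applies amplitude modulation to one damped partial with a modulating frequency derived from (19), and prints the FM sideband weights $J_n(I)$; the Zwicker-style bandwidth formula is the assumption stated above, and all numerical values are illustrative.

```python
import numpy as np
from scipy.special import jv  # Bessel function of the first kind

fs = 44100

def critical_bandwidth(f_hz):
    """Critical bandwidth in Hz (frequency converted to kHz),
    using the formula assumed for (19)."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

def am_partial(A, f, alpha, dur=1.0, index=0.5, pct=0.25):
    """Amplitude-modulate one damped partial; the modulating frequency
    is a percentage of the critical bandwidth at f, so the two
    sidebands fall inside one critical band and increase roughness."""
    t = np.arange(int(dur * fs)) / fs
    f_mod = pct * critical_bandwidth(f)
    carrier = np.sin(2 * np.pi * f * t)
    return (A * np.exp(-alpha * t)
            * (1 + index * np.cos(2 * np.pi * f_mod * t)) * carrier)

# For frequency modulation, the sideband at f + n*f_mod has amplitude
# A * J_n(I); the first few weights for a modulation index I = 1:
print([round(float(jv(n, 1.0)), 3) for n in range(-2, 3)])
```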

Synthesis applications showed that both spectral shaping controls allowed for morphing, particularly between Glass and Metal sounds, while keeping the damping unchanged. In this case, the damping coefficients of the modified frequencies were recalculated according to the damping law. Both strategies provided a local control, since the modifications can be applied to each original component independently. The control based on amplitude and frequency modulations allowed subtler spectral modifications than the control based on spectral dilation and, in particular, led to interesting fine timbre effects such as cracked glass sounds (sound examples can be found in [38]).

D. Control of the Perceived Material

A global control strategy of the perceived material, integrating the previous damping and spectral shaping controls, is proposed in this section. This strategy is hierarchically built on three layers: the "Material space" (accessible to the user), the "Damping and Spectral shaping parameters," and the "Signal parameters" (related to the signal model). Note that the mapping strategy does not depend on the synthesis technique. As a consequence, the proposed control can be applied to any sound generation process.3 Note also that the proposed strategy is not unique and represents one among several possibilities [24], [25].

Fig. 6 illustrates the mapping between these three layers, based on the first spectral shaping control. The Material space is designed as a unit disk of center C with three fixed points corresponding to the three reference sounds (Wood, Metal, and Glass) equally distributed along the external circle. The Glass sound position is arbitrarily taken as the angle's origin; consequently, the Metal sound is positioned at $2\pi/3$ and the Wood sound at $4\pi/3$. The three reference sounds were synthesized from the same initial set of harmonic components (fundamental frequency of 500 Hz and 40 components), so that the Wood, Metal, and Glass sounds were obtained by only modifying the damping and spectral shaping parameters (values given in Table II and sound positions shown in Fig. 6). These parameters were chosen on the basis of the sound quality of the evoked material. Note that the reference sounds could be replaced by other sounds.

The user navigates in the Material space between Wood,Metal, and Glass sounds by moving a cursor and can synthe-size the sound corresponding to any position. When movingalong the circumference of the Material space circle, the corre-sponding sound characterized by its angle is generatedwith Damping and Spectral shaping parameters defined by

(20)where represent the parameter vector ofthe sound and of the reference sound of Glass , Metal

, and Wood . The function was defined so that theinterpolation process was exclusively made between two refer-ence sounds at a time

forforfor

(21)

Inside the circle, a sound characterized by its angle $\theta$ and its radius $\rho$ is generated with parameters defined by

$$\mathbf{v}(\theta, \rho) = \rho\,\mathbf{v}(\theta) + (1 - \rho)\,\mathbf{v}_C \qquad (22)$$

where $\mathbf{v}_C$ represents the parameter vector of the sound C, obtained by averaging the three values (corresponding to the Wood, Metal, and Glass reference sounds) for each parameter, and where $\mathbf{v}(\theta)$ is defined in (20).
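As an illustration of (20)–(22), the following Python sketch computes the parameter vector for an arbitrary cursor position $(\theta, \rho)$; the reference vectors are placeholders standing in for the Table II values, which are not reproduced here.

import numpy as np

# Angular positions of the reference sounds on the external circle.
THETA = {"glass": 0.0, "metal": 2 * np.pi / 3, "wood": 4 * np.pi / 3}

def g(theta):
    # Triangular weighting of (21), taken modulo 2*pi: equal to 1 at the
    # reference angle and decreasing linearly to 0 over one third of the
    # circle, so that at most two reference sounds are mixed at a time.
    theta = np.mod(theta, 2 * np.pi)
    if theta < 2 * np.pi / 3:
        return 1.0 - 3.0 * theta / (2 * np.pi)
    if theta < 4 * np.pi / 3:
        return 0.0
    return 3.0 * theta / (2 * np.pi) - 2.0

def material_params(theta, rho, v_ref):
    # Eq. (20): angular interpolation along the circle.
    v_theta = sum(g(theta - THETA[m]) * v for m, v in v_ref.items())
    # Eq. (22): radial interpolation toward the center sound C, whose
    # parameter vector is the average of the three reference vectors.
    v_center = sum(v_ref.values()) / 3.0
    return rho * v_theta + (1.0 - rho) * v_center

# Placeholder damping and spectral shaping vectors (not the Table II values).
v_ref = {"glass": np.array([0.1, 1.0, 0.2, 0.9]),
         "metal": np.array([0.02, 0.5, 0.8, 0.3]),
         "wood":  np.array([0.9, 1.5, 0.1, 0.1])}
v = material_params(np.pi / 3, 0.8, v_ref)  # a sound between Glass and Metal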

³In practice, we implemented an additive synthesis technique (sinusoids plus noise) in the synthesizer previously developed [23], since it was the most natural one according to the signal model. Other techniques could have been considered as well, such as frequency modulation (FM) synthesis, subtractive synthesis, etc.


Fig. 6. Control of the perceived material based on the three-layer architecture: the “Material space” (user interface), the “Damping and spectral shaping parameters,” and the “Signal parameters.” Navigation in the Material space (e.g., from one sound to another) involves modifications of both Damping and Spectral shaping parameters. The space was calibrated with a specific domain for each material category (borders defined in Fig. 7). In this example, the Spectral shaping parameters correspond to the first spectral shaping control. At the signal level, the damping coefficients are computed from (4) and the frequencies from (15). The Metal, Wood, and Glass reference sounds are constructed from the same initial harmonic sound located at point (0,1) in the space. The role of spectral shaping with the window function is illustrated at the bottom (the amplitude of the spectrum was arbitrarily reduced for the sake of clarity).

TABLE II: Values of the Damping and Spectral Shaping Parameters Corresponding to the Reference Sounds of the Metal, Wood, and Glass Categories in the Material Space

A similar strategy was designed for the mapping based on the second spectral shaping control (amplitude and frequency modulations); in that case, the parameter vector gathers the damping parameters together with the modulation parameters.

The second layer concerns the controls of the Damping and Spectral shaping parameters. For each control, the parameters are represented in two dimensions to propose an intuitive configuration, called the damping space and the spectral shaping space, respectively. The intuitive manipulation of the damping parameters was achieved by a calibration process that consisted in determining a specific domain for each material category in the damping space. Borders between material domains were determined based on the results of a predictive discriminant analysis. In practice, we calibrated the damping space delimited by the extreme values of the two damping parameters observed for the typical sounds (see Fig. 7). This space was sampled in 2500 evenly spaced points

Fig. 7. Calibration of the damping space from border sections between Metal and Glass (dashed line), between Metal and Wood (dotted line), and between Glass and Wood (solid line). The positions of the typical sounds for Wood, Metal, and Glass used for the classification process are also represented.

and each point was associated with a posterior probability of belonging to a material category. This probability was computed from a Bayesian rule based on the knowledge of the positions of the typical sounds in the damping space. Classification functions were determined between pairs of material categories and were expressed as quadratic combinations of the two damping parameters. Boundary curves were materialized from the set of points that have equal classification probabilities for both categories

$$P(c_i \mid \mathbf{x}) = P(c_j \mid \mathbf{x}) \qquad (23)$$


Fig. 8. Perceptual evaluation test of the control strategy. Left: positions of the sounds (black markers) used for the test in the Material space, as a function of the trajectory: along the external circle (first row), along the chords whose endpoints are the reference sounds (second row), and along the radii of the circle from the center C to the reference sounds (third row). Right: percentage of classification as Wood (gray curves), Metal (black curves), or Glass (dashed curves) corresponding to the three types of trajectories, for each sound as a function of its position along the continuum.

or equivalently

$$F_{ij}(\mathbf{x}) = P(c_i \mid \mathbf{x}) - P(c_j \mid \mathbf{x}) = 0 \qquad (24)$$

where $c_i$ and $c_j$ denote the categories and $\mathbf{x} = (x_1, x_2)$ the position in the damping space. For our concern, the border noted $F_{MG}$ between the Metal and Glass categories was defined by a quadratic classification function of the form

$$F_{MG}(\mathbf{x}) = \mathbf{x}^{\mathsf{T}} A_{MG}\,\mathbf{x} + \mathbf{b}_{MG}^{\mathsf{T}}\,\mathbf{x} + c_{MG} = 0 \qquad (25)$$

and all the points for which this function is negative were classified into the Metal category. The border $F_{WM}$ between Wood and Metal was defined by

$$F_{WM}(\mathbf{x}) = \mathbf{x}^{\mathsf{T}} A_{WM}\,\mathbf{x} + \mathbf{b}_{WM}^{\mathsf{T}}\,\mathbf{x} + c_{WM} = 0 \qquad (26)$$

and all the points for which this function is negative were classified into the Wood category. Finally, the border $F_{WG}$ between the Wood and Glass regions was defined by

$$F_{WG}(\mathbf{x}) = \mathbf{x}^{\mathsf{T}} A_{WG}\,\mathbf{x} + \mathbf{b}_{WG}^{\mathsf{T}}\,\mathbf{x} + c_{WG} = 0 \qquad (27)$$

and all the points for which this function is negative were classified into the Wood category; in each case, the coefficients $A$, $\mathbf{b}$, and $c$ result from the discriminant analysis. The calibration of the damping space was completed by keeping the section of each border that directly separates the two sound categories: as shown in Fig. 7, the border section kept for $F_{MG}$ is represented by a dashed line, the section kept for $F_{WM}$ by a dotted line, and the section kept for $F_{WG}$ by a solid line. These borders were reported in the middle layer (Fig. 6), allowing an intuitive manipulation of the damping parameters. Note that these borders do not represent a strict delimitation between sound categories; a narrow transition zone may be considered on both sides of the borders. In particular, sounds belonging to this transition zone may be perceived as ambiguous, like the sounds created at intermediate positions of the continua.
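A possible implementation of this calibration is sketched below in Python; the choice of scikit-learn’s QuadraticDiscriminantAnalysis as the Bayesian classifier with quadratic boundaries is ours, and the variable names are illustrative.

import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def calibrate_damping_space(typical_points, labels, n_grid=50):
    # Fit quadratic classification functions from the positions of the
    # typical sounds in the damping space (one 2-D point per sound).
    qda = QuadraticDiscriminantAnalysis()
    qda.fit(typical_points, labels)
    # Sample the space on an evenly spaced grid (50 x 50 = 2500 points,
    # matching the sampling reported above).
    x = np.linspace(typical_points[:, 0].min(), typical_points[:, 0].max(), n_grid)
    y = np.linspace(typical_points[:, 1].min(), typical_points[:, 1].max(), n_grid)
    gx, gy = np.meshgrid(x, y)
    grid = np.column_stack([gx.ravel(), gy.ravel()])
    # Posterior probability of each category at every grid point; a border
    # between categories i and j is the set of points where the two
    # posteriors are equal, cf. (23) and (24).
    posteriors = qda.predict_proba(grid)
    return grid, posteriors, qda.classes_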

Finally, the bottom layer concerns the signal parameters, determined as follows: the damping coefficients are computed from (4) with the damping parameter values, and the frequencies from (15) with the spectral shaping values. The amplitudes are assumed to be equal to one.
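For completeness, a bare-bones rendering of the deterministic part of the sinusoids-plus-noise model (see footnote 3) is sketched below; the arrays freqs_hz and alphas stand in for the outputs of (15) and (4), which are not reproduced here, and the noise part is omitted.

import numpy as np

def synthesize_impact(freqs_hz, alphas, dur=1.5, fs=44100):
    # Sum of exponentially damped sinusoids with unit initial amplitudes,
    # i.e., the deterministic part of an additive (sinusoids-plus-noise)
    # impact-sound model [23].
    t = np.arange(int(dur * fs)) / fs
    s = sum(np.exp(-a * t) * np.sin(2 * np.pi * f * t)
            for f, a in zip(freqs_hz, alphas))
    return s / np.max(np.abs(s))  # normalize to avoid clipping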

V. PERCEPTUAL EVALUATION OF THE CONTROL STRATEGY

The proposed control strategy for the perceived material was evaluated with a formal perceptual test. Twenty-three participants (9 women, 14 men) took part in the experiment. Sounds were selected in the Material space as shown in Fig. 8 (left). Three types of trajectories between two reference sounds were investigated: along the external circle (a 12-step continuum), along the chords whose endpoints are the reference sounds (a 7-step continuum), and along the radii of the circle from the center C to the reference sounds (a 7-step continuum). Sounds were presented once each, in random order, through headphones. The whole set of sounds is available at [38]. Participants were asked to categorize each sound as Wood, Metal, or Glass, as fast as possible, by selecting the corresponding label with a mouse on a computer screen. The order of the labels displayed on the screen was balanced across participants. The next sound was presented after a 2-second silence. Participants’ responses were collected and averaged for each category (Wood, Metal, and Glass) and for each sound.

Fig. 8 (right) shows the results as a function of the sound position along the continua. Sounds at extreme positions were classified by more than 70% of the participants in the correct category, leading


to the validation of the reference sounds as typical exemplars of their respective material categories. In the Wood–Metal transition, sounds at intermediate positions were classified as Glass, with the highest percentages for those along the trajectory via the center C. By contrast, in both the Wood–Glass and Glass–Metal transitions, intermediate sounds were most often classified in one of the two categories corresponding to the extreme sounds. From an acoustic point of view, this reflects the fact that the interpolation of the Damping parameters between the Metal and Wood sounds crosses the Glass category, while this is not the case for the other two transitions (see Fig. 6).

These results were in line with those obtained from the behavioral data in the first categorization experiment (Fig. 2) and consequently allowed us to validate the proposed control strategy as an efficient way to navigate in the Material space. Note that the interpolation process was computed on a linear scale between the parameters of the reference sounds [cf. (21) and (22)]. The next step will consist in taking these results into account and modifying the interpolation rules so that the metric distance between a given sound and a reference sound in the Material space closely reflects the perceptual distance.

VI. CONCLUSION

In this paper, we proposed a control strategy for the perceived material in an impact sound synthesizer. To this end, we investigated the sound characteristics relevant for an accurate evocation of material by conducting a sound categorization experiment. To design the stimuli, sounds produced by impacting three different materials, i.e., wood, metal, and glass, were recorded and synthesized using analysis–synthesis techniques. After tuning, sound continua simulating progressive transitions between material categories were built and used in the categorization experiment. Both behavioral data and electrical brain activity were collected. From the behavioral data, a set of typical sounds for each material category was determined, and an acoustic analysis including descriptors known to be relevant for timbre perception and material identification was conducted. The most relevant descriptors for discriminating between material categories were identified: the normalized sound decay (related to the damping) and the roughness. Electrophysiological data provided complementary information regarding the perceptual and cognitive aspects of sound categorization and were discussed in the context of synthesis. Based on the acoustic and ERP data, the results confirmed the importance of damping and highlighted the relevance of spectral descriptors for material perception. Control strategies for damping and spectral shaping were proposed and tested in synthesis applications. These strategies were further integrated in a three-layer control architecture allowing the user to navigate in a “Material space.” A formal perceptual evaluation confirmed the validity of the proposed control strategy. Such a control offers an intuitive manipulation of parameters and allows realistic impact sounds to be defined directly from the material label (i.e., Wood, Metal, or Glass).

REFERENCES

[1] M. V. Mathews, “The digital computer as a musical instrument,” Science, vol. 142, no. 3592, pp. 553–557, 1963.

[2] R. Moog, “Position and force sensors and their application to keyboards and related controllers,” in Proc. AES 5th Int. Conf.: Music and Digital Technology, New York, 1987, pp. 179–181.

[3] M. Battier, “L’approche gestuelle dans l’histoire de la lutherie électronique. Étude d’un cas: Le theremin [The gestural approach in the history of electronic instrument making. A case study: The theremin],” in Les nouveaux gestes de la musique, Proc. Colloque International, ser. Collection Eupalinos. Éditions Parenthèses, 1999.

[4] J. Tenney, “Sound-generation by means of a digital computer,” J. Music Theory, vol. 7, no. 1, Spring 1963.

[5] A. Camurri, M. Ricchetti, M. Di Stefano, and A. Stroscio, “EyesWeb—Toward gesture and affect recognition in dance/music interactive systems,” in Proc. Colloquio di Informatica Musicale, 1998.

[6] P. Gobin, R. Kronland-Martinet, G. A. Lagesse, T. Voinier, and S. Ystad, From Sounds to Music: Different Approaches to Event Piloted Instruments, ser. Lecture Notes in Computer Science, vol. 2771. Berlin, Germany: Springer-Verlag, 2003, pp. 225–246.

[7] M. Wanderley and M. Battier, “Trends in gestural control of music,” IRCAM-Centre Pompidou, Paris, France, 2000.

[8] J.-C. Risset and D. L. Wessel, “Exploration of timbre by analysis and synthesis,” in The Psychology of Music, ser. Cognition and Perception, 2nd ed. New York: Academic, 1999, pp. 113–169.

[9] D. L. Wessel, “Timbre space as a musical control structure,” Comput. Music J., vol. 3, no. 2, pp. 45–52, 1979.

[10] S. Ystad and T. Voinier, “A virtually-real flute,” Comput. Music J., vol. 25, no. 2, pp. 13–24, Summer 2001.

[11] J. M. Grey, “Multidimensional perceptual scaling of musical timbres,” J. Acoust. Soc. Amer., vol. 61, no. 5, pp. 1270–1277, 1977.

[12] C. L. Krumhansl, “Why is musical timbre so hard to understand?,” in Structure and Perception of Electroacoustic Sound and Music. Amsterdam, The Netherlands: Elsevier, 1989.

[13] J. Krimphoff, S. McAdams, and S. Winsberg, “Caractérisation du timbre des sons complexes. II: Analyses acoustiques et quantification psychophysique [Characterization of the timbre of complex sounds. II: Acoustical analyses and psychophysical quantification],” J. Phys., vol. 4, no. C5, pp. 625–628, 1994.

[14] S. McAdams, S. Winsberg, S. Donnadieu, G. De Soete, and J. Krimphoff, “Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes,” Psychol. Res., vol. 58, pp. 177–192, 1995.

[15] W. A. Sethares, “Local consonance and the relationship between timbre and scale,” J. Acoust. Soc. Amer., vol. 94, no. 3, pp. 1218–1228, 1993.

[16] P. N. Vassilakis, “Auditory roughness as a means of musical expression,” in Selected Reports in Ethnomusicology (Perspectives in Systematic Musicology), vol. 12. Dept. of Ethnomusicology, Univ. of California, 2005, pp. 119–144.

[17] R. P. Wildes and W. A. Richards, Recovering Material Properties From Sound, W. A. Richards, Ed. Cambridge, MA: MIT Press, 1988, ch. 25, pp. 356–363.

[18] W. W. Gaver, “How do we hear in the world? Explorations of ecological acoustics,” Ecol. Psychol., vol. 5, no. 4, pp. 285–313, 1993.

[19] R. Lutfi and E. Oh, “Auditory discrimination of material changes in a struck-clamped bar,” J. Acoust. Soc. Amer., vol. 102, no. 6, pp. 3647–3656, 1997.

[20] R. L. Klatzky, D. K. Pai, and E. P. Krotkov, “Perception of material from contact sounds,” Presence: Teleoperators and Virtual Environments, vol. 9, no. 4, pp. 399–410, 2000.

[21] B. L. Giordano and S. McAdams, “Material identification of real impact sounds: Effects of size variation in steel, wood, and Plexiglas plates,” J. Acoust. Soc. Amer., vol. 119, no. 2, pp. 1171–1181, 2006.

[22] M. Aramaki, M. Besson, R. Kronland-Martinet, and S. Ystad, “Timbre perception of sounds from impacted materials: Behavioral, electrophysiological and acoustic approaches,” in Computer Music Modeling and Retrieval—Genesis of Meaning of Sound and Music, ser. LNCS, vol. 5493, S. Ystad, R. Kronland-Martinet, and K. Jensen, Eds. Berlin, Heidelberg, Germany: Springer-Verlag, 2009, pp. 1–17.

[23] M. Aramaki, R. Kronland-Martinet, T. Voinier, and S. Ystad, “A percussive sound synthesizer based on physical and perceptual attributes,” Comput. Music J., vol. 30, no. 2, pp. 32–41, 2006.

[24] M. Aramaki, R. Kronland-Martinet, T. Voinier, and S. Ystad, “Timbre control of a real-time percussive synthesizer,” in Proc. 19th Int. Congr. Acoust. (CD-ROM), 2007, ISBN 84-87985-12-2.

[25] M. Aramaki, C. Gondre, R. Kronland-Martinet, T. Voinier, and S. Ystad, “Thinking the sounds: An intuitive control of an impact sound synthesizer,” in Proc. 15th Int. Conf. Auditory Display (ICAD 2009), 2009.

[26] K. Steiglitz and L. E. McBride, “A technique for the identification of linear systems,” IEEE Trans. Autom. Control, vol. AC-10, no. 10, pp. 461–464, Oct. 1965.


[27] R. Roy and T. Kailath, “ESPRIT—Estimation of signal parameters via rotational invariance techniques,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 7, pp. 984–995, Jul. 1989.

[28] R. O. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Trans. Antennas Propag., vol. AP-34, no. 3, pp. 276–280, Mar. 1986.

[29] R. Badeau, B. David, and G. Richard, “High-resolution spectral analysis of mixtures of complex exponentials modulated by polynomials,” IEEE Trans. Signal Process., vol. 54, no. 4, pp. 1341–1350, Apr. 2006.

[30] L.-M. Reissell and D. K. Pai, “High resolution analysis of impact sounds and forces,” in Proc. WHC ’07: 2nd Joint EuroHaptics Conf. and Symp. Haptic Interfaces for Virtual Environment and Teleoperator Syst., Washington, DC, 2007, pp. 255–260.

[31] M. Aramaki and R. Kronland-Martinet, “Analysis-synthesis of impact sounds by real-time dynamic filtering,” IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 2, pp. 695–705, Mar. 2006.

[32] R. Kronland-Martinet, P. Guillemain, and S. Ystad, “Modelling of natural sounds by time-frequency and wavelet representations,” Organised Sound, vol. 2, no. 3, pp. 179–191, 1997.

[33] R. Parncutt, Harmony—A Psychoacoustical Approach. Berlin/Heidelberg, Germany: Springer, 1989.

[34] A. Chaigne and C. Lambourg, “Time-domain simulation of damped impacted plates: I. Theory and experiments,” J. Acoust. Soc. Amer., vol. 109, no. 4, pp. 1422–1432, 2001.

[35] T. Ono and M. Norimoto, “Anisotropy of dynamic Young’s modulus and internal friction in wood,” Jpn. J. Appl. Phys., vol. 24, no. 8, pp. 960–964, 1985.

[36] S. McAdams, A. Chaigne, and V. Roussarie, “The psychomechanics of simulated sound sources: Material properties of impacted bars,” J. Acoust. Soc. Amer., vol. 115, no. 3, pp. 1306–1320, 2004.

[37] M. Aramaki, H. Baillères, L. Brancheriau, R. Kronland-Martinet, and S. Ystad, “Sound quality assessment of wood for xylophone bars,” J. Acoust. Soc. Amer., vol. 121, no. 4, pp. 2407–2420, 2007.

[38] 2010 [Online]. Available: http://www.lma.cnrs-mrs.fr/~kronland/Categorization/sounds.html, last checked: Oct. 2009.

[39] M. Aramaki, L. Brancheriau, R. Kronland-Martinet, and S. Ystad, “Perception of impacted materials: Sound retrieval and synthesis control perspectives,” in Computer Music Modeling and Retrieval—Genesis of Meaning of Sound and Music, ser. LNCS, vol. 5493, S. Ystad, R. Kronland-Martinet, and K. Jensen, Eds. Berlin, Heidelberg, Germany: Springer-Verlag, 2009, pp. 134–146.

[40] M. D. Rugg and M. G. H. Coles, “The ERP and cognitive psychology: Conceptual issues,” in Electrophysiology of Mind: Event-Related Brain Potentials and Cognition, ser. Oxford Psychology, no. 25. New York: Oxford Univ. Press, 1995, pp. 27–39.

[41] J. Eggermont and C. Ponton, “The neurophysiology of auditory perception: From single units to evoked potentials,” Audiol. Neuro-Otol., vol. 7, pp. 71–99, 2002.

[42] A. Shahin, L. E. Roberts, C. Pantev, L. J. Trainor, and B. Ross, “Modulation of P2 auditory-evoked responses by the spectral complexity of musical sounds,” NeuroReport, vol. 16, no. 16, pp. 1781–1785, 2005.

[43] S. Kuriki, S. Kanda, and Y. Hirata, “Effects of musical experience on different components of MEG responses elicited by sequential piano tones and chords,” J. Neurosci., vol. 26, no. 15, pp. 4046–4053, 2006.

[44] E. Kushnerenko, R. Ceponiene, V. Fellman, M. Huotilainen, and I. Winkler, “Event-related potential correlates of sound duration: Similar pattern from birth to adulthood,” NeuroReport, vol. 12, no. 17, pp. 3777–3781, 2001.

[45] C. Alain, B. M. Schuler, and K. L. McDonald, “Neural activity associated with distinguishing concurrent auditory objects,” J. Acoust. Soc. Amer., vol. 111, no. 2, pp. 990–995, 2002.

[46] S. McAdams, “Perspectives on the contribution of timbre to musical structure,” Comput. Music J., vol. 23, no. 3, pp. 85–102, 1999.

[47] H.-G. Kim, N. Moreau, and T. Sikora, MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. New York: Wiley, 2005.

[48] G. Peeters, “A large set of audio features for sound description (similarity and description) in the CUIDADO project,” IRCAM, Paris, France, Tech. Rep., 2004.

[49] J. Marozeau, A. de Cheveigné, S. McAdams, and S. Winsberg, “The dependency of timbre on fundamental frequency,” J. Acoust. Soc. Amer., vol. 114, pp. 2946–2957, 2003.

[50] J. W. Beauchamp, “Synthesis by spectral amplitude and ‘brightness’ matching of analyzed musical instrument tones,” J. Audio Eng. Soc., vol. 30, no. 6, pp. 396–406, 1982.

[51] R. Plomp and W. J. M. Levelt, “Tonal consonance and critical bandwidth,” J. Acoust. Soc. Amer., vol. 38, pp. 548–560, 1965.

[52] E. Terhardt, “On the perception of periodic sound fluctuations (roughness),” Acustica, vol. 30, no. 4, pp. 201–213, 1974.

[53] P. N. Vassilakis, “SRA: A web-based research tool for spectral and roughness analysis of sound signals,” in Proc. 4th Sound Music Comput. (SMC) Conf., 2007, pp. 319–325.

[54] 2009 [Online]. Available: http://www.cycling74.com/, last checked: Oct. 2009.

[55] D. Rocchesso and F. Fontana, Eds., The Sounding Object, 2003. [Online]. Available: http://www.soundobject.org/SobBook/SobBook_JUL03.pdf, last checked: Oct. 2009.

[56] M. Hyde, “The N1 response and its applications,” Audiol. Neuro-Otol., vol. 2, pp. 281–307, 1997.

[57] C. Valette and C. Cuesta, Mécanique de la corde vibrante (Mechanics of the Vibrating String), ser. Traité des Nouvelles Technologies, série Mécanique. London, U.K.: Hermès, 1993.

[58] N. H. Fletcher and T. D. Rossing, The Physics of Musical Instruments, 2nd ed. Berlin, Germany: Springer-Verlag, 1998.

[59] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models. Berlin, Germany: Springer-Verlag, 1990.

[60] H. L. F. von Helmholtz, On the Sensations of Tone as the Physiological Basis for the Theory of Music, 2nd ed. New York: Dover, 1877, reprinted 1954.

Mitsuko Aramaki (M’09) received the M.S. degree in mechanics (speciality in acoustics and dynamics of vibrations) from the University of Aix-Marseille II, Marseille, France, and the Ph.D. degree for her work at the Laboratoire de Mécanique et d’Acoustique, Marseille, France, in 2003, on the analysis and synthesis of impact sounds using physical and perceptual approaches.

She is currently a Researcher at the Institut de Neurosciences Cognitives de la Méditerranée, Marseille, where she works on a pluridisciplinary project combining sound modeling, perceptual and cognitive aspects of timbre, and neuroscience methods in the context of virtual reality.

Mireille Besson received the Ph.D. degree in neurosciences from the University of Aix-Marseille II, Marseille, France, in 1984.

After four years of postdoctoral studies at the Department of Cognitive Science, University of California at San Diego, La Jolla, working with Prof. M. Kutas, and at the Department of Psychology, University of Florida, Gainesville, working with Prof. I. Fischler, she obtained a permanent position at the National Center for Scientific Research (CNRS), Marseille, France. She is currently Director of Research at the CNRS, Institut de Neurosciences Cognitives de la Méditerranée (INCM), where she is the head of the “Language, Music, and Motor” team. Her primary research interests are centered on brain imaging of linguistic and nonlinguistic sound perception and on brain plasticity, mainly using event-related brain potentials. She is currently conducting a large research project on the influence of musical training on linguistic sound perception in normal-reading and dyslexic children.

Richard Kronland-Martinet (M’09–SM’10) received the M.S. degree in theoretical physics in 1980, the Ph.D. degree in acoustics from the University of Aix-Marseille II, Marseille, France, in 1983, and the “Doctorat d’Etat es Sciences” degree from the University of Aix-Marseille II in 1989 for his work on the analysis and synthesis of sounds using time-frequency and time-scale (wavelet) representations.

He is currently Director of Research at the National Center for Scientific Research (CNRS), Laboratoire de Mécanique et d’Acoustique, Marseille, where he is the Head of the group “Modeling, Synthesis and Control of Sound and Musical Signals.” His primary research interests are in the analysis and synthesis of sounds, with a particular emphasis on high-level control of synthesis processes. He recently addressed applications linked to musical interpretation and the semantic description of sounds using a pluridisciplinary approach associating signal processing, physics, perception, and cognition.


Sølvi Ystad received the Ph.D. degree in acoustics from the University of Aix-Marseille II, Marseille, France, in 1998.

She is currently a Researcher at the French National Research Center (CNRS) in the research team S2M (Synthesis and Control of Sounds and Musical Signals), Marseille, France. Her research activities are related to sound modeling, with a special emphasis on the identification of perceptually relevant sound structures to develop efficient synthesis models. She was in charge of the research project “Towards the sense of sounds,” financed by the French National Research Agency (ANR, http://www.sensons.cnrs-mrs.fr), from 2006 to 2009.