By Francis Rumsey
464 J. Audio Eng. Soc., Vol. 60, No. 6, 2012 June
FEATURE ARTICLE
Francis Rumsey, Staff Technical Writer
Audio event identification and sound-morphing techniques were two key themes of the AES 45th International Conference, Applications of Time-Frequency Processing in Audio
Time-frequency processing
As digital signal processing becomes more powerful, there is increasing scope for handling audio signals in the frequency domain: dividing the signal into narrow frequency bands and performing operations on those bands independently. This is not a new concept, having been used in applications such as low bit-rate coding for many years, but the amount of work in the field has become so great that the AES organized a conference recently in Finland, chaired by Ville Pulkki, devoted entirely to the topic. This article attempts to condense some of the contributions from that conference, taking what is essentially a heavily mathematical topic and presenting it in a more digestible form.
Because the human hearing mechanism is frequency-selective, and models of it tend to assume a number of so-called critical bands within which energy is integrated and masking takes place, a frequency-selective approach to audio signal processing lends itself well to applications that need to take into account features of human hearing. These include psychoacoustic models, low bit-rate coding, audio effects, quality enhancement, sound synthesis, semantic analysis, and speech processing. The concept of time-frequency (TF) processing comes into play because frequency-domain processing needs to take into account the temporal aspects of the signal. Transforms from the time to the frequency domain typically work on the basis of chopping up the audio signal into a series of time windows. It is a basic principle of the theory underlying this field that there is a direct relationship between the time and the frequency domains, the resolution available in one domain being directly connected to the resolution selected in the other. One of the keys to success in algorithms that process audio in this way is to determine the optimum trade-offs between time and frequency resolution to achieve the best quality, and this may be made variable or adaptive based on the signal content and other contextual information. For example, when a signal has rapid transients separated by relatively low-level continuous segments, there can be advantages to choosing relatively short time windows, whereas higher-level continuous signals can warrant the use of longer time windows.

We concentrate here on two specific applications of TF processing in audio engineering, represented prominently at the conference, namely the identification or classification of sounds and audio morphing.
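The window-length trade-off described above can be made concrete with a short sketch. This is not taken from any of the conference papers; it is a minimal numpy-only STFT (function names and parameters are illustrative) showing how a short window localizes a transient in time at the cost of coarse frequency bins, while a long window does the reverse:

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude STFT with a Hann window. Frequency resolution is
    fs/win_len; time resolution is on the order of win_len samples."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i*hop : i*hop+win_len] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)   # steady 1 kHz tone...
x[4000] += 5.0                     # ...plus one sharp transient

short = stft_mag(x, 64, 32)        # many frames, few bins: good time resolution
long_ = stft_mag(x, 1024, 512)     # few frames, many bins: good frequency resolution
print(short.shape, long_.shape)    # (frames, bins) for each analysis
```

An adaptive scheme of the kind mentioned in the text would switch between analyses like these two depending on whether the local signal content is transient or continuous.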
IDENTIFYING AUDIO EVENTS
One of the primary themes that emerged from the papers presented at the conference was the use of time-frequency processing in the identification of audio events. This can include the automatic isolation of specific features of an audio signal such as percussive sounds, or higher-level content recognition and classification such as the identification of sounds made by particular individuals or things.
THE BENEFITS OF BLT
In his paper "Detection of Audio Events by Boosted Learning of Local Time-Frequency Patterns," Aki Härmä talks about a classifier that he calls BLT, which in this case is not bacon, lettuce, and tomato but boosted learning of local time-frequency patterns, based on a previous algorithm known as AdaBoost. It is applied particularly to the detection of short audio events such as footsteps in environmental recordings and percussion sounds in music. Such sounds have a distinctive template that makes them stand out from others such as speech or musical tones. Härmä likens the problem of detecting them to similar problems of object recognition in machine vision. However, the prototype templates that are needed as training data for pattern matching can be confused by local variants in the signal such as reverberation and background noise, so there are advantages to deriving the training data from the local signal. In the BLT process the training data are derived from a small number of locally collected TF patterns of interesting events.
BLT uses a nonuniform frequency scale, partly because footsteps and other natural sounds have more energy at low frequencies than high, and partly because the human hearing system employs an approximately logarithmic frequency scale. These things inform an assumption that nonuniform frequency scaling may be beneficial in the recognition of natural sounds. BLT also takes account of a combination of spectrum and temporal envelope features, splitting the signal into a number of short-time power spectrum estimates, grouped into auditory bands. Template matching is undertaken using a series of weak classifiers, which are trained on successive similar events in the signal (e.g., a series of footsteps). Features of the signal are selected that most successfully divide the training data into events and nonevents, which requires in Härmä's case that training recordings of ten individuals walking in a hallway are manually annotated to separate real footstep sounds from other similar noises in the recording.
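The grouping of short-time power spectra into nonuniform bands can be sketched as follows. The paper's exact auditory scale is not given here, so logarithmically spaced bands stand in for it; everything in this snippet is illustrative rather than Härmä's implementation:

```python
import numpy as np

def group_log_bands(power_spec, fs, n_bands=12, f_min=50.0):
    """Sum FFT power-spectrum bins into logarithmically spaced bands,
    a rough stand-in for the auditory-band grouping described."""
    n_bins = len(power_spec)
    freqs = np.linspace(0, fs / 2, n_bins)
    edges = np.geomspace(f_min, fs / 2, n_bands + 1)  # log-spaced band edges
    bands = np.zeros(n_bands)
    for b in range(n_bands):
        mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
        bands[b] = power_spec[mask].sum()
    return bands

fs = 16000
spec = np.ones(513)                 # flat power spectrum from a 1024-point FFT
bands = group_log_bands(spec, fs)
print(np.round(bands))              # higher bands collect many more bins
```

Because each band of a log scale spans progressively more FFT bins, low-frequency detail (where footsteps carry most of their energy) is preserved while the high end is summarized coarsely.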
In testing the BLT algorithm on other real recordings, footstep sounds were combined with other sounds such as door slams, speech, air conditioning, and printer noise. Another round of testing used a song in which snare drum hits were to be identified from among guitars and a vocalist. BLT was compared with another typical sound classification method and found to perform considerably better in the case of the footstep sounds. In the case of the drum sounds the performances of the two methods were similar. Härmä concludes that BLT has the potential to perform well on noisy and difficult data.
SONIC HANDPRINTS
Antti Jylhä and his colleagues explain that nonspeech sounds such as hand claps may be able to convey enough information to identify a person. This is particularly interesting in security applications where multiple sources of information can be employed to identify individuals. In "Sonic Handprints: Person Identification with Hand Clapping Sounds by a Model-Based Method" they note that people can often distinguish their own hand claps from those of others, and that automatic recognition algorithms trained on one person's hand claps do not necessarily work well on another's.

Similarly to the previously mentioned work by Härmä, the challenge in this case is to distinguish subtle differences between percussive sounds, again using a form of template matching. The authors adapted a probabilistic model involving a hidden Markov model, developed by others for pitch tracking in streamed audio, turning it into one capable of dealing with predefined, short percussive events. Because reverberation in recordings can degrade the recognition accuracy of such systems, some post-processing was employed to skip over ten frames after the start of each detected event in order to reduce this problem.
When testing the algorithm, a sound bank of hand claps was captured using a laptop computer microphone from 16 different people, in a room with a reverberation time of about 0.7 seconds. No particular effort was made to control the style of clapping, in order to capture the most natural sounds possible. A series of spectral templates was derived from a subset of the hand-clap recordings used as training data, shown in Fig. 1. Here it can be seen how different the templates of each subject are. In the case of some people the way they used their hands changed over the clapping sequence, so the spectral shape of the templates is based on an overall pattern of what was essentially a changing phenomenon. The classification accuracy was 64%, which is well above the chance level of 6.25% that applies to this number of individuals. Inevitably the performance was particularly good if a person's template contained unique regions of high energy that differentiated them clearly from others, and also if they were particularly consistent in their clapping.
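The general idea of spectral-template identification, and the 1-in-16 chance level quoted above, can be illustrated with a toy nearest-template classifier. This is emphatically not the authors' HMM-based method: the "claps" below are synthetic noise bursts with a distinct resonant bump per "person," and all names and parameters are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, clap_len = 16, 256
centers = 5 + 8 * np.arange(n_subjects)      # distinct spectral peak per "person"

def synth_clap(center):
    """Noise shaped by a resonant bump: a crude stand-in for a real clap."""
    spec = np.fft.rfft(rng.standard_normal(clap_len))
    bins = np.arange(len(spec))
    spec *= np.exp(-0.5 * ((bins - center) / 4.0) ** 2) + 0.05
    return np.fft.irfft(spec, clap_len)

def make_template(claps):
    """Average magnitude spectrum over a person's training claps."""
    t = np.abs(np.fft.rfft(claps, axis=1)).mean(axis=0)
    return t / np.linalg.norm(t)

def classify(clap, templates):
    """Nearest template by cosine similarity of magnitude spectra."""
    s = np.abs(np.fft.rfft(clap))
    return int(np.argmax(templates @ (s / np.linalg.norm(s))))

people = [np.stack([synth_clap(c) for _ in range(5)]) for c in centers]
templates = np.stack([make_template(p[:3]) for p in people])  # 3 training claps each
correct = sum(classify(people[p][4], templates) == p for p in range(n_subjects))
print(f"{correct}/{n_subjects} identified (chance level 1/{n_subjects})")
```

With clean synthetic spectra the toy classifier does far better than the 64% reported for real claps, which is exactly the point made in the paper: reverberation, microphone limitations, and inconsistent clapping styles are what make the real task hard.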
EVALUATING BUTTON SOUNDS
Another possible application of TF processing in the identification of audio events is described by Kensuke Fujinoki and his colleagues in "Automated Evaluation for Button Sounds from Wavelet-Based Features." Here the aim is to find features of the audio signals arising from button-pushes that distinguish the quality of one from that of another. The authors describe their use in a previous study of a particular type of TF transform whose resolution in both domains corresponds closely to the characteristics of human hearing: the continuous wavelet transform (CWT).
The waveform of a typical push-button sound is shown in Fig. 2, along with its TF representation using wavelets. In this study a slightly different method using triangular biorthogonal wavelets is employed, which enables a complicated multiscale pyramid representation of the signal in three dimensions. From this the cumulative sound pressure and reverberation elements of the signal can be identified, with a better rejection of background noise than when simply measuring the sound pressure of the original signal.
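A minimal CWT of the kind shown in Fig. 2 can be sketched by convolving the signal with complex Morlet wavelets at a range of scales. This uses the Morlet wavelet of the figure rather than the triangular biorthogonal wavelets of the study itself, and the scale range and constants are illustrative:

```python
import numpy as np

def morlet(scale, w0=6.0):
    """Complex Morlet wavelet sampled at unit rate for a given scale."""
    n = int(10 * scale) | 1                 # odd-length support, ~10 scales wide
    t = (np.arange(n) - n // 2) / scale
    return np.exp(1j * w0 * t) * np.exp(-0.5 * t**2) / np.sqrt(scale)

def cwt(x, scales):
    """CWT magnitude: one convolution per scale. Small scales give fine
    time resolution at high frequencies; large scales the reverse."""
    return np.stack([np.abs(np.convolve(x, morlet(s), mode='same'))
                     for s in scales])

fs = 4000
click = np.zeros(fs)
click[2000] = 1.0                           # idealized button click (an impulse)
scales = np.geomspace(2, 64, 20)            # logarithmically spaced scales
tf = cwt(click, scales)
print(tf.shape)                             # (scales, samples)
```

The impulse response is narrowest in time at the smallest scales, which is the nonuniform resolution property that makes wavelet analysis attractive for short percussive events like button pushes.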
In a psychoacoustic experiment the authors employed a semantic differential method (one involving distinctions between two opposing verbal terms) to determine descriptive features of the button-push sounds that could be mapped to the features extracted from the audio signals. Some success was had in using the wavelet-based features to enable the automatic recognition of listeners' preference/annoyance scores for the different button-push sounds.
SOUND MORPHING USING WAVELETS
Aside from its use in the identification and classification of sounds, TF processing can also be used in sound-morphing applications. Morphing involves either the gradual transformation of one sound into another or the hybridization of two sounds, so that the characteristics of one are superimposed on those of another. It cannot usually be done by just a simple mixture of two sounds in the time domain, as the ear is very good at identifying the original components of linear mixtures and decomposing them perceptually. The new sound has to be a true product of the two original sounds, which usually means some sort of convolution or feature extraction and interpolation.

Fig. 1. Spectral templates of 16 subjects' hand claps used in template matching (courtesy Jylhä et al.)

WHAT'S A HIDDEN MARKOV MODEL?
Hidden Markov models are used quite widely in audio recognition systems. The Markov assumption is that a system's next state depends only on its current state (or, in higher-order models, on a small number of recent states) rather than on its entire history. A hidden Markov model attempts to infer information about a hidden process that is creating observable information, enabling it to predict with a certain degree of probability what are the most likely features of the hidden process.

Fig. 2. (a) Typical waveform of a push-button sound. (b) Time-frequency representation of the same sound using Morlet wavelets. Different colors represent the sound amplitude at a point on the plot, according to the right-hand scale (courtesy Fujinoki et al.).

Gabrielli and Squartini describe a new
process for achieving this in "Ibrida: A New DWT-Domain Sound Hybridization Tool." Like the sound-analysis method mentioned in the previous section, Ibrida employs a wavelet transform to decompose the original sound into its spectral elements, because of its nonuniform time-frequency resolution (preserving signal transients) and low computational requirements. This is followed by an inverse wavelet transform to put the signal back in the time domain. Advantages of the nonuniform characteristics of the wavelet transform over alternatives such as the short-time Fourier transform (STFT) include the likelihood that a greater number of the transformed frequency bands will cover the main content of music and speech signals. Uniform transforms tend to result in a lot of the bands being in regions of the spectrum where there isn't much action in the audio signal, so they are largely irrelevant.

The idea of hybridizing more than one sound source is taken back to its earliest instances, considering instruments such as the jaw harp (in which the vocal tract interacts with the instrument) and the didgeridoo, in which the resonances of the vocal tract interact with a resonating wood tube. In modern electronic music terms we tend to talk in terms of cross synthesis, vocoding, and morphing. Fig. 3 shows the basic process involved, in which a number of original signals are subjected to a discrete wavelet transform (DWT), resulting in a set of subband signals (in separated frequency bands). The equivalent subband signals from each source are mixed in appropriate proportions to create the new sound, then the resulting signal is subjected to an inverse discrete wavelet transform (IDWT) to return it to the time domain. By adjusting the proportions of the different signals mixed together in each frequency band, the characteristics of the output sound can be made more or less like one of the input signals in different regions of the spectrum. Another advantage of the DWT in this context is said to be that it has a relatively small number of control parameters, which makes for easy user interaction.

The authors implemented a version of Ibrida in the Pure Data graphical programming environment, which enables the hybridization of two mono inputs with control of the wavelet coefficients and mixing parameters. It can be downloaded in prototype form from http://a3lab.dibet.univpm.it/downloads/Ibrida-tool.
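The DWT → subband mix → IDWT signal flow of Fig. 3 can be sketched with a hand-rolled Haar DWT. Ibrida itself uses other wavelet families and runs in Pure Data; this toy (with illustrative names and a made-up weight scheme) only demonstrates the per-band mixing idea:

```python
import numpy as np

def haar_dwt(x, levels):
    """Multilevel Haar DWT: returns [approx, detail_coarse, ..., detail_fine]."""
    coeffs = []
    a = x.astype(float)
    for _ in range(levels):
        a, d = (a[0::2] + a[1::2]) / np.sqrt(2), (a[0::2] - a[1::2]) / np.sqrt(2)
        coeffs.append(d)
    return [a] + coeffs[::-1]

def haar_idwt(coeffs):
    """Exact inverse of haar_dwt."""
    a = coeffs[0]
    for d in coeffs[1:]:
        up = np.empty(2 * len(a))
        up[0::2] = (a + d) / np.sqrt(2)    # reconstruct even samples
        up[1::2] = (a - d) / np.sqrt(2)    # reconstruct odd samples
        a = up
    return a

def hybridize(x, y, weights, levels=4):
    """Mix the equivalent subbands of two sources (one weight per band:
    1.0 keeps source x in that band, 0.0 keeps source y), then invert."""
    cx, cy = haar_dwt(x, levels), haar_dwt(y, levels)
    mixed = [w * a + (1 - w) * b for w, a, b in zip(weights, cx, cy)]
    return haar_idwt(mixed)

n = 256
t = np.arange(n)
x = np.sin(2 * np.pi * 4 * t / n)          # low-frequency source
y = np.sin(2 * np.pi * 60 * t / n)         # high-frequency source
out = hybridize(x, y, weights=[1, 1, 1, 0, 0])  # x's low bands, y's high bands
```

Note how few control parameters there are: one mixing weight per band (five here, for a four-level decomposition), which is the ease-of-interaction point made about the DWT above.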
ALTERNATIVE TIME-FREQUENCY TRANSFORMS
The original TF transform devised by Fourier assumed that the sound being transformed continued over an infinite period without changing, resulting in a representation in the frequency domain by a set of sine-wave components and their phases. Short-time granular transforms such as the Gabor and wavelet transforms divide up the signal into a series of small time units, allowing the spectral components and their phases to be tracked as the signal changes over time. In the Gabor case the units are known as atoms, and time windowing involves multiplying the time-domain signal by a Gaussian function before using the short-time Fourier transform to move it into the frequency domain. Wavelet transforms involve the decomposition of complex signals into a set of wavelets, which are brief wave-like oscillations that start and end with zero amplitude, building to a peak in the center of the time period, such as shown here.

Typical pattern of a wavelet, in this case showing the real (blue) and imaginary (red dotted) parts of a complex Morlet wavelet. The Morlet (or Gabor) wavelet has characteristics closely related to human perception (courtesy Fujinoki et al.).

MORPHING USING A GABOR MASK
Olivero et al., in "Sound Morphing Strategies Based on Alterations of Time-Frequency Representations by Gabor Multipliers," define a sound morph as a hybrid sound whose timbre is intermediate between a source and a target sound, having the same fundamental frequency, duration, and loudness. Many traditional approaches to morphing are said to work according to a sinusoid-plus-noise model of the signal, in which the
original signals are represented as a summation of harmonic partials. (Morphing can be achieved in this approach by interpolating between the partials in the frequency domain to obtain a timbre that lies between the two original signals.) These authors, however, prefer not to assume a formal signal model and deal directly with the TF representation of sounds. One of the central planks of their approach is the Gabor TF representation of sounds. The Gabor transform is essentially a way of converting a signal from the time to the frequency domain that allows for the fact that real sounds change over time.

Gabor multipliers are essentially functions that alter the level and phase of groups of frequency bands in the TF representation of the signal so as to perform a modification to the spectrum during each short time frame. They can be used to create time-varying filters as long as the signal does not have massive TF shifts. The process is a form of convolution, whereby the filter function is convolved with the signal by point-wise multiplication in the frequency domain. The Gabor mask is the spectral shape or transfer function of the multiplier needed to achieve a convincing morph between the two original sounds.

Various approaches to estimating suitable masks for morphing are discussed by the authors of this paper, including values that give rise to conventional cross-synthesis morphing effects such as addition (simple mixing) and multiplication (essentially filtering of one signal by another) of the source and target spectrograms. As this represented Ph.D. work in progress, there was considerably more work to be done before a clear conclusion could be reached about the best choice of parameters and control methods for high-quality morphing using this approach.
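The point-wise "multiply in the TF domain, then resynthesize" idea behind a Gabor multiplier can be sketched with a plain STFT and overlap-add synthesis. A true Gabor frame multiplier involves properly matched analysis and synthesis windows; this simplification, with illustrative window and mask choices, only shows the time-varying-filter mechanism:

```python
import numpy as np

WIN_LEN, HOP = 256, 128

def stft(x):
    win = np.hanning(WIN_LEN)
    n_frames = 1 + (len(x) - WIN_LEN) // HOP
    frames = np.stack([x[i*HOP : i*HOP+WIN_LEN] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X):
    """Weighted overlap-add synthesis with window-squared normalization."""
    win = np.hanning(WIN_LEN)
    frames = np.fft.irfft(X, WIN_LEN, axis=1) * win
    n = (len(X) - 1) * HOP + WIN_LEN
    out, norm = np.zeros(n), np.zeros(n)
    for i, f in enumerate(frames):
        out[i*HOP : i*HOP+WIN_LEN] += f
        norm[i*HOP : i*HOP+WIN_LEN] += win**2
    return out / np.maximum(norm, 1e-12)

def gabor_multiply(x, mask):
    """Point-wise multiply the TF representation (frames x bins) by a mask:
    a time-varying filter in the spirit of a Gabor multiplier."""
    return istft(stft(x) * mask)

x = np.random.default_rng(1).standard_normal(8000)
X = stft(x)
mask = np.ones(X.shape)
mask[X.shape[0] // 2:, 40:] = 0.0   # from halfway on, suppress bins 40 and up
y = gabor_multiply(x, mask)
```

With a mask of all ones the analysis-synthesis chain reconstructs the signal; shaping the mask per frame is what lets the filter change over time, and estimating a mask that maps a source spectrogram onto a target one is the morphing problem the paper addresses.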
MORPHING OF PERCUSSIVE SOUNDS
The sinusoidal model of sounds mentioned at the beginning of the previous section is not particularly suitable for use when morphing noisy sounds such as percussion instruments. Andrea Primavera and his colleagues therefore discuss an alternative that can be used for automatic morphing of such sounds in their paper "Audio Morphing for Percussive Hybrid Sound Generation." It involves two main features, namely preprocessing of the original signals in the frequency domain, followed by linear interpolation in the time domain.

A basic block diagram of the approach is shown in Fig. 4. The main elements of preprocessing involve time alignment of the original signals and scaling of the release portion of the sound envelopes so that the two original sounds have the same length. This is not applied to the attack portion of the envelopes because those are crucial to the sound's unique perceptual identity. In order to determine the attack and release portions of the percussive sounds, the authors make use of the Amplitude Centroid Trajectory (ACT) and the reverberation time of the sound (the point at which it has decayed by 60 dB). The ACT model is based on an evaluation of the spectral centroid, which essentially determines the primary energy peak in the frequency spectrum. As shown in Fig. 5, the end of the attack and the start of the decay is determined to be where the slope of the spectral centroid changes direction.

It is explained that linear interpolation in the time domain between two sets of original audio samples tends not to be very effective as a high-quality morphing technique, because one can hear the original sounds in the morphed result. This happens most when the two sounds are noticeably different from each other. However, in this case it is claimed to be appropriate because the drum sounds concerned often have a similar pitch to each other and a very noisy spectrum, which makes the approach more suitable than alternatives based on additive synthesis. In order to confirm this, the authors attempt some subjective and objective comparisons of the various alternatives, namely linear interpolation on its own, the new approach including a preprocessing stage, and sinusoid-plus-noise modeling.

In the subjective tests listeners were asked to estimate the interpolation factor between the two original sounds (in other words, how much of each was contained in the morphed result), along with the quality (defined in terms of naturalness). The estimated interpolation factor was found to be very close to the physical interpolation factor when using the new approach involving preprocessing, whereas with conventional interpolation the relationship was nothing like linear (a perceived factor of 0.8 related to a physical factor of around 0.5, for example). The morphed sounds created using sinusoidal modeling were perceived as highly unnatural, which confirmed the opinion that this method was unsuitable for percussive sounds. A multidimensional scaling experiment in which linearly interpolated sounds were compared in pairs for their similarity demonstrated that release time seemed to be the most important feature in the perception of these hybrid percussive sounds.

Fig. 3. Basic operation of the wavelet transform-based sound hybridization approach (courtesy Gabrielli and Squartini)
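The two computational ingredients above, a spectral-centroid trajectory with a slope-change attack detector and the plain linear interpolation, can be sketched as follows. The ACT slope rule is applied here in a deliberately simplified form (first sign change of the slope), with illustrative frame sizes; it is not the authors' exact algorithm:

```python
import numpy as np

def centroid_trajectory(x, win_len=512, hop=256):
    """Spectral centroid (in FFT-bin units) for each short-time frame."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    bins = np.arange(win_len // 2 + 1)
    cents = np.empty(n_frames)
    for i in range(n_frames):
        mag = np.abs(np.fft.rfft(x[i*hop : i*hop+win_len] * win))
        cents[i] = (bins * mag).sum() / max(mag.sum(), 1e-12)
    return cents

def attack_end(cents):
    """Simplified ACT rule: the attack ends at the first frame where the
    centroid slope changes sign (rising to falling, or vice versa)."""
    slope = np.diff(cents)
    sign_change = np.nonzero(np.diff(np.sign(slope)))[0]
    return int(sign_change[0]) + 1 if len(sign_change) else len(cents) - 1

def linear_morph(x, y, alpha):
    """Plain time-domain interpolation between equal-length sounds."""
    return (1 - alpha) * x + alpha * y

fs = 8000
t = np.arange(4096) / fs
snare = np.random.default_rng(2).standard_normal(4096) * np.exp(-t * 20)
tom = np.sin(2 * np.pi * 150 * t) * np.exp(-t * 10)
cents = centroid_trajectory(snare)
print("attack ends around frame", attack_end(cents))
hybrid = linear_morph(snare, tom, 0.5)   # interpolation factor 0.5
```

In the paper's full scheme, the release portions (everything after the detected attack) would be time-scaled to equal length before this interpolation step, which is what restores the near-linear relationship between physical and perceived interpolation factors.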
CONCLUSION
Audio identification and classification systems, as well as sound-morphing algorithms, depend heavily on time-frequency processing to achieve high-quality results. The emphasis of current research seems to be on the choice of transform used when moving between domains and the precise ways in which the transfer functions and frequency scalings are determined in order to achieve the most successful results. Formal psychoacoustic experiments are increasingly used to evaluate the outcomes of the TF processing algorithms used in these fields, which lends another dimension to what has hitherto been a field primarily concerned with the signal processing challenges.
Fig. 4. Block diagram of a morphing process for percussive sounds, showing frequency-domain processes in yellow and time-domain processes in blue (Figs. 4 and 5 courtesy Primavera et al.).
Fig. 5. Time evolution of the spectral centroid (shown in red) for a percussive signal waveform (shown in blue). The estimated boundary between attack and release portions is shown in green.
Editor's note: Purchase a CD-ROM with all AES 45th Conference papers at www.aes.org/publications/conferences/. To purchase individual papers go to www.aes.org/e-lib/.