Adjusting the spectral envelope evolution of transposed ...kronland/Dafx10/Article.pdf · Adjusting the spectral envelope evolution of transposed sounds with Gabor mask prototypes

Adjusting the spectral envelope evolutionof transposed sounds with Gabor mask

prototypes

Adrien Sirdey Olivier Derrien Richard Kronland-Martinet

Laboratoire de Mecanique et d’AcoustiqueCentre National de la Recherche Scientifique (CNRS)

31, Chemin Joseph Aiguier13402 Marseille Cedex, France

Corresponding author: [email protected]

Abstract

Audio-samplers often require to modify thepitch of recorded sounds in order to generatescales or chords. This article tackles the useof Gabor masks and their capacity to improvethe perceptual realism of transposed notes ob-tained through the classical phase-vocoder al-gorithm. Gabor masks can be seen as op-erators that allows the modification of time-dependent spectral content of sounds by modi-fying their time-frequency representation. Thegoal here is to restore a distribution of energythat is more in line with the physics of thestructure that generated the original sound.The Gabor mask is elaborated using an esti-mation of the spectral enveloppe evolution inthe time-frequency plane, and then applied tothe modified Gabor transform. This operationturns the modified Gabor transform into an-other one which respects the estimated spec-tral enveloppe evolution, and therefore leadsto a note that is more perceptually convincing.

I Introduction

The aim of this study is to improve the making of dig-ital samplers, by extending the scale range whereinnotes can be extrapolated from a single recording.With a classical phase-vocoder transposition, the colorof the sound gets perceptually deteriorated as soon asthe transposition range exceeds between 1.5 to 2 tones.Among the different causes of this phenomena lies theunmodified relationship between the amplitude anddamping of the partials implied by the phase-vocoderalgorithm, which is contradictory with the physics ofvibrating structures. Basically, the relative amplitudesof partials is a characteristic of the structure that gen-erated the sound and the damping coefficient usually

increases with frequency, whereas applying the phase-vocoder transposition algorithm results in a simple dis-placement of the partials along the frequency axis,without any modification of amplitude or damping. Itis shown here that suitable Gabor masks prototypescan be used in order to restore a distribution of energythat is more in line with the physics of the sound.

After a review of the theoretical aspects regardingthe phase-vocoder algorithm and the Gabor transform,the result of a traditional phase-vocoder is comparedto a real harmonic sound at the same pitch. This com-parison is made by use of Gabor transforms and Gabormasks, through which some of the phase-vocoder lim-itations are highlighted in a next section. These ob-servations justify the use of ‘mask prototypes’ whichapplications and results are presented in the last sec-tion.

II Theoretical background

II.1 Gabor transforms and Gabor masks

Most of the mathematical results concerning the Ga-bor transform presented here are deeply reviewed in[1]. The theoretical fundaments of Gabor masks havebeen described in [2], and their application to audiosounds has already been investigated in [3].

The Gabor transform is a discrete version of theshort time Fourier transform. Considering a signalx(t) and a Gabor frame f = {g, τ, ν} where g(t) is thetime-window and (τ, ν) are the time-frequency latticeparameters, the analysis operator is defined as:

Cfx(m,n) = c(m,n) (1)

=∫ +∞

−∞x(t)e−2iπmνtg(t− nτ)dt

where g is the complexe conjugate of g, m is the thediscrete frequency index and n the discrete time index.(1) can also be written in terms of a scalar productof x(t) with the so-called Gabor atoms MmνTnτg =e2iπmνtg(t− nτ):

Cfx(m,n) = 〈x,MmνTnτg〉L2

where M is the frequency-modulation operator and Tis the time-translation operator.

The synthesis operator is given by:

Dfc(t) =∑m∈Z

∑n∈Z

c(m,n)MmνTnτg (2)

It is shown in [1] that suitable choices on g, τ and ν im-plies the relation x(t) = DfCfx(t), which means thatthe Gabor transform can be inverted. The modulus oftwo Gabor transforms are presented on figure (2a) and(2d).

Gabor multipliers are operators which modify sig-nals through a point-wise multiplication on their Gabortransform. Given any m ∈ `2(Z) named Gabor mask,the associated Gabor multiplier Mm is defined as :

Mmx(t) = DfmCfx(t)

Given two signals x1 and x2 called respectively thesource signal and the target signal, if those two sig-nals share close supports in the time-frequency plane(see [3] or [4] for more details), one can search for aGabor multiplier which allows to pass from x1 to x2

by minimizing the cost function :

Φ(m) = ‖x2(t)−DfCfx1(t)‖2 + λ‖m− 1‖2

where the Lagrange parameter λ ∈ R+ is introducedin order to control the norm of m. It is shown in [3]that under the approximation that Df is an isometry,such a Gabor mask is given by :

m =Cfx1Cfx2 + λ

|Cfx1|2 + λ(3)

As λ increases, the mask values get closer to one ; as λdecreases, (3) becomes more and more equivalent to apoint-wise division in the time-frequency plane. Figure(2c) shows the modulus of the mask calculated betweenthe two Gabor transforms displayed on figure (2d) and(2b).

Thus, knowing two signals, Gabor masks providean analysis tool that highlights the differences betweentheir respective Gabor transforms. But they can alsobe seen as operators that allow the transformation ofa Gabor transform into another one. However, theirestimation requires the knowledge of both the sourceand target signals. In section IV an a prori Gabor maskamplitude profile is elaborated (the so-called mask pro-totype) by only considering the expected behavior ofthe target sound.

II.2 The Phase-Vocoder Transposition Al-gorithm

The phase-vocoder theory is usually presented usingthe short-time discrete Fourier transform, but can aswell be described using the Gabor transform formalismsuch as follows.

The classical phase-vocoder transposition algo-rithm combines a re-sampling that brings the origi-nal sound to the desired frequency, with the so-called‘time-scaling’ procedure that brings the sound backto its original duration. Considering a harmonic sig-nal x1, ω1 its fundamental frequency, and ω2 the fre-quency at which x1 is to be transposed, defining theratio r = ω2/ω1, the phase-vocoder transposition algo-rithm runs as follows :

1. Re-sampling x2(t) = x1(rt)

2. Analysis with the Gabor frame fa = {ga, τa, ν}

c2(m,n) = Cfa x2(t)

3. Time-scaling by a factor r

This procedure corresponds to a change of Ga-bor frame. In the synthesis Gabor frame fs ={gs, τs, ν}, the time-scale is dilated/compressed,and the frequency scale is unchanged. So, thenew window and time-step are respectively gs =ga(t/r) and τs = rτa. The aim of the time-scaling procedure is to compute an estimation ofthe Gabor coefficients c2(m,n) corresponding tothe transposed sound x2(t) at the original tempo.One assume that the amplitudes are unchanged,while the phases are modified according to a so-called horizontal phase-coherence relationship.

c2(m,n) = |c2(m,n)|eiϕc2 (m,n)

For each frequency bin m, un-wrapping the phaseof c2 allows to estimate an instantaneous fre-quency ωi which is used to compute the cor-rected phase ϕc2(m,n) at time n as a functionof ϕc2(m,n− 1) :

ϕc2(m,n) = ϕc2(m,n− 1) + ωi(m,n)rτa

Since the determination of the phase ϕc2(m,n) isa recursive process, it is necessary to set its valueat a certain instant. Here, the original phase hasbeen kept unchanged at the instant of attack, ac-cording to a ‘phase-locking’ procedure (see [6]).Since the considered sounds are recordings of sin-gle notes, they contain only one intensity peak,and thus no complex peak detection algorithmneeds to be used.

4. Inverse Gabor transform with the frame fs :

x2(t) = Dfsc2(m,n)

(a) Gabor transform modulus (b) Fitted spectral enveloppe evolution

Figure 1: Gabor transforms modulus of a real C (65 Hz) ([5]) played at the piano and the spectral enveloppeevolution fitted with its amplitude peaks.

III Phase-Vocoder Drawbacks andlimitations

III.1 Artefacts

First of all, it is important to note that in general, amodified Gabor transform is not the Gabor transformof any signal. Here, the estimated Gabor transformc2(m,n) computed with the phase-vocoder algorithmis generally not the Gabor transform of the synthesizedsignal Cfs

x2(t). This is due to the strict interdepen-dence conditions between the Gabor transform coeffi-cients of a signal in L2(R) (see [7] for details). As aconsequence, the use of c2(m,n) in the reverse Gabortransform process generates reconstruction artefacts.Other phase-vocoder drawbacks, such as aliasing, at-tack smoothing due to loss of vertical phase coherencehave already been deeply reviewed (see for instance[6]).

III.2 Limitations of an exclusive signal pro-cessing approach

Although the artefacts named in III.1 are a complexproblem in the implementation of phase-vocoder algo-rithms, several improvements have been proposed toovercome them. In the present study are only takeninto account the physical aspects that the pitch-shiftingalgorithm doesn’t cover, for it is only based on signalprocessing considerations. In fact, an underlying hy-pothesis behind the phase-vocoder transposition is thatrelationship between the partials (especially in termsof amplitude and damping) is the same for each pitch.This is obviously not the case for real-life harmonicsounds produced by physical strutures (e.g. musical

instruments). Figures (2d), (2b) provide an illustra-tion of a typical issue when transposing a note to ahigher pitch by use of the phase-vocoder: the high fre-quency energy-content of the transposed note is muchricher than the real one. This phenomena is one of thecauses of the sound-color modification trough phase-vododer transposition.

IV Gabor Mask Prototypes

IV.1 Elaboration of a Gabor mask proto-type using the spectral enveloppe evolu-tion

It is proposed to modify the Gabor transform of thetransposed note so that the damping law of the par-tials remains unchanged. To do so, a mask prototype iselaborated by computing the ratio between an estima-tion of the spectral enveloppe that the ideal transposedsound should have, and the spectral enveloppe of theactual transposed sound. The mask prototype is thenapplied to the time-stretched Gabor transform. Theprotocol is summarized on figure (4).

The basic hypothesis of the whole process presentedin this paragraph is to consider that an approximationof a Gabor mask between two Gabor transforms can begiven by the ratio of their respective spectral enveloppeevolutions. Keeping the same notations as in sectionII.2, and calling c3(m,n) the ideal Gabor transform ofthe transposed sound, a mask prototype mp is soughtas:

mp(m,n) =Env{c3}(m,n)Env{c2}(m,n)

(4)

where Env{} denotes the spectral enveloppe evolution

Time (s)

Freq

uenc

y (H

z)

do1_pp_21

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.80

1000

2000

3000

4000

5000

−90

−80

−70

−60

−50

−40

−30

−20

(a) Gabor transform modulus of a piano C (65 Hz), [5]

Time (s)

Freq

uenc

y (H

z)

do1_pp_21_quarte

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.80

1000

2000

3000

4000

5000

−90

−80

−70

−60

−50

−40

−30

−20

(b) F (87 Hz) obtained by transposing the C (2a), [9]

Time (s)

Freq

uenc

y (H

z)

Mask_do1_pp_21−do1_quarte_pp_21

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.80

1000

2000

3000

4000

5000

−30

−20

−10

0

10

20

30

(c) Modulus of the Gabor mask calculated between the F (2d)and the F obtained by transposing the C (2b)

Time (s)

Freq

uenc

y (H

z)

fa1_pp_21

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.80

1000

2000

3000

4000

5000

−90

−80

−70

−60

−50

−40

−30

−20

(d) Gabor transform modulus of a piano F (87 Hz), [8]

Time (s)

Freq

uenc

y (H

z)

Maskprt_do1_pp_21_quarte

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.80

1000

2000

3000

4000

5000

−20

−18

−16

−14

−12

−10

−8

−6

−4

−2

0

(e) Mask prototype

Time (s)

Freq

uenc

y (H

z)

do1_pp_21_quarte_maskprt

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.80

1000

2000

3000

4000

5000

−90

−80

−70

−60

−50

−40

−30

−20

(f) Result of the point-wise multiplication between the transposedF (2b) and the mask prototype (2e), [10]

Figure 2: Mask prototype (2e) elaborated considering the transposition of a piano C a quart higher, and theresult of its application (2f) to the modified Gabor transform (2b). Magnitude scales are in dB.

Time (s)

Freq

uenc

y (H

z)

m10

0 0.5 1 1.5 20

0.5

1

1.5

2

x 104

−90

−80

−70

−60

−50

−40

−30

−20

(a) Gabor transform modulus of a metal sound, [11]

Time (s)

Freq

uenc

y (H

z)

m10_sept_maj

0 0.5 1 1.5 20

0.5

1

1.5

2

x 104

−90

−80

−70

−60

−50

−40

(b) Transposition of the metal sound a major seventh higher, [12]

Time (s)

Freq

uenc

y (H

z)

Maskprt_m10_sept_maj

0 0.5 1 1.5 20

0.5

1

1.5

2

x 104

−20

−18

−16

−14

−12

−10

−8

−6

−4

−2

0

(c) Mask prototype

Time (s)

Freq

uenc

y (H

z)

m10_sept_maj_maskprt

0 0.5 1 1.5 20

0.5

1

1.5

2

x 104

−90

−80

−70

−60

−50

−40

(d) Result of the application of the mask prototype, [13]

Figure 3: Mask prototype (3c) elaborated considering the transposition of a metal sound a seventh higher andthe result of its application (3d) to the modified Gabor transform (3b). Magnitude scales are in dB.

x1(t)

��Resampling

x2(t)

��Gabor transform

c2(m,n)

��

// Estimation of thespectral enveloppe evolution

Env{c2}��

Time stretching

c2(m,n)

��

Mask prototypeelaboration

BCmp(m,n)

ooApplication of themask prototype

mpc2(m,n)

��Inverse

Gabor transform

Df [mpc2(m,n)](t)

��

Figure 4: Transposition process

expressed in the time-frequency plane. Given a conve-nient spectral enveloppe model, Env{c2}(m,n) can beextracted by observing c2, whereas Env{c3}(m,n) is tobe extrapolated from the spectral enveloppe evolutionEnv{Cfs

x1} of the source sound x1(t), for it containsinformation on the physical behaviour that is wantedto be kept unchanged through transposition.

In order to compute Env{Cfsx1}(m,n), it is pos-

sible to write the Gabor transform of x1(t) in fs, andthen apply a fitting algorithm over the Gabor trans-form modulus. But it is also possible to avoid thecomputational cost of another Gabor transform by us-ing the specific relationship between the analysis andthe synthesis frame, which allows to write:

Cfs [x1(t)](m,n) = rCfa [x1(rt)](rm, n)= rCfa [x2(t)](rm, n) (5)

Note that the notation rm can be misleading, forrm generally belongs to R whereas the time and fre-quency index (m,n) of a Gabor transform belongto Z2. But in the practical situations considered here,Cfa [x2(t)](rm, n) is never to be computed and this no-tation is only a calculus intermediary used to apply asimilar equality to the spectral enveloppe evolutions.

Indeed (5) implies that:

Env{Cfsx1}(m,n) = rEnv{Cfa

x2}(rm, n)= rEnv{c2}(rm, n)

And since the time-stretching procedure doesn’t mod-ify the modulus of the Gabor transform coefficients,one finally has:

Env{Cfsx1}(m,n) = rEnv{c2}(rm, n) (6)

In the following, an explicit model of spectral enveloppeevolution is given to Env{}. It is assumed that theenergy of the manipulated sounds present a loglineardecrease in frequency, an exponential decrease in time,and that the time-damping coefficient increases linearlywith frequency. These assumptions are coherent withthe free response of oscillating systems (such as the pi-ano) in a linear approximation. Such hypothesis implythat the spectral enveloppe evolution of c2(m,n) canbe written has :

Env{c2}(m,n) = keαn+(βn+γ)m (7)

The amplitude parameter k, and the damping param-eters α, β, γ are estimated by use of a least-squaremethod applied over the peaks of the time-streched Ga-bor transform modulus |c2|. α will be called the time-damping parameter, γ the frequency-damping param-eter, and β the compound-damping parameter. Note

that the estimation of these three damping parametersshould logically lead to negative values. Equation (6)then leads to:

Env{Cfsx1}(m,n) = rkeαn+(rβn+rγ)m

This shows that the transposition process presentedon figure (4) leads to a Gabor transform c2(m,n)that presents a frequency- and compound-damping be-haviour modified by a factor r in comparison to thesource sound, and that the temporal damping param-eter remains unchanged. Thus the ideal enveloppeEnv{c3}(m,n) can be set to :

Env{c3}(m,n) = keαn+(rβn+rγ)m

i.e. an enveloppe that has the amplitude parameterk of Env{c2}, but the damping parameters α, rβ andrγ of Env{Cfs

x1}(m,n). The mask prototype (4) cantherefore be written as:

mp = e(βn+γ)(r−1)m

The values of the mask prototype before the attackand below the fundamental frequency after transposi-tion are set to one.

It is important to note that:

• Such masks only concern amplitude modification,which means that they leave the phase of themodified Gabor transform unchanged. This ismotivated by the fact that a simple model forphase information can not be found.

• The model described above will only be used fortransposition to higher tones, i.e with r > 1.Applying this model to a transposition to lowertones would lead to a diverging mask prototypeas m and n grow. Furthermore, applying thismask would lead to an amplification of energyon non-harmonic portions of the time-frequencyplane.

IV.2 Applications

IV.2.1 Piano notes

The method described in section (IV.1) has first beenapplied to piano sounds. The result of the transposi-tion has been compared to a real recorded note. Figure(2) provides support for the analysis of the method forthe transposition of a C (65 Hz) a quart higher, atthe frequency of an F (87 Hz). Comparing the Gabortransforms of the real C and real F (resp. figure (2a)and figure (2d)), one can see that most of the energyis contained in the same frequency range, roughly be-tween 3500 Hz. However, the Gabor transform of theF obtained by transposing the C (figure (2b)) showsthat some energy is present up to 4600 Hz. Indeed, by

listening both F, one can note that the transposed Fsounds more reedy that the natural one.

Important information can also be retrieved observ-ing the Gabor mask modulus between the real F andthe transposed one (figure (2c)). First, the highest val-ues of the mask are located along the frequency chan-nels corresponding to the fundamental frequency of theF (87 Hz). This is due to the fact that, for the C, thefundamental is less energetic than the first harmonic(see figure (1a)), whereas for the F, the highest eneryis located on the fundamental. One can also note thatthe Gabor mask modulus contains values lower thanone between 0 and 4500 Hz at approximately 0.15 s.This indicates that the real F and the transposed Fwere not perfectly aligned in time. But the main rel-evant observation for the present work is that most ofthe values of the Gabor mask modulus are lower thanone, especially from 2000 Hz onwards. This is coher-ent with the observation of the two Gabor transformsmade in the previous paragraph, and provides an em-pirical justification for the mask prototypes presentedin this paper.

The mask prototype presented on figure (2e) is elab-orated by fitting the amplitude-peaks of the Gabortransform modulus of the transposed F presented onfigure (2c). As expected, the mask values get smalleras frequency and time grow, and its application to theGabor transform of the transposed F (figure (2f)) di-minishes the energy at high frequencies. The result-ing sound files are available at www.lma.cnrs-mrs.fr/

~kronland/DAFx10/. One can note that although the‘masked’ note is more perceptually convincing than thedirectly transposed one, it lacks some low frequencycontent that is present in the real F. This is probablydue to the fact that the C as a low energy on the fun-damental, which was already mentioned above. Thischaracteristic is not compatible with the spectral en-veloppe model used to elaborate the mask prototype.This example allows to point out the benefits of themethod as well as its limitations: the mask prototypescan only be used to erase an excess of energy in thetime-frequency plane, but cannot correct a lack of en-ergy.

IV.2.2 Metal sound

The method described above is not only suitable fornotes obtained with a musical instrument. It can alsobe applied to any sound for which the damping modelis consistent. In the following, the transposition ofa sound obtained by impacting a metal plate with adrumstick is considered. Although the resulting soundis not harmonic, a pseudo chromatic scale can be ob-tained by use of the transposition algorithm. The sev-eral Gabor transform modulus involved in the transpo-sition of this metal sound a major seventh higher arepresented on figure (3). On figure (3a), one can observe

that the original sound contains a high concentrationof partials up to 8500 Hz. After transposition, thesepartials are displaced up to 17000 Hz, which signifi-cantly changes the sound color. The Gabor transformon figure (3d), obtained with the mask prototype dis-played on figure (3c), exhibits an energy distributionthat is much more coherent with the source sound. Thecorresponding audio files are available at [11], [12] and[13].

V Conclusion and Perspectives

It has been shown here that a presumption on the phys-ical model that describes the transposed sounds canbe efficiently used to improve the realism of the phase-vocoder transposition, by use of Gabor mask proto-types.

The spectral enveloppe estimation being based ona peak detection algorithm and a fitting algorithm, itis necessary, in order to compute it, to set the valueof several parameters such as the minimal amplitudeof the peaks or their maximal thickness. These choicesare of importance for they influence the shape of themask prototype, and consequently the color of the re-sulting sound. A deep investigation on the influenceof the underlying algorithms parameters over the re-sulting sound color would provide more stability to thetransposition algorithm, as well as more flexibility.

For more complex sounds, which spectral enveloppeevolution cannot be described by the model (7) usedin this paper, a more sophisticated spectral-enveloppemodel could be developed. This again, would allowto use the transposition algorithm in a wider range ofsituations.

VI Acknowledgment

The authors would like to thank Benoıt Djerigian forproviding the piano sounds database, and Mitsuko Ara-maki for providing an environmental sound database.The Linear Time/Frequency Toolbox written by PeterSoendergaard has also been of great use for the numer-ical implementations of the protocol.

References

[1] Karlheinz Grochenig. Foundations of Time-Frequency Analysis. Birkhauser, 2001.

[2] Monika Dorfler and Bruno Torresani. On the time-frequency representation of operators and gener-alized Gabor multiplier approximations. Preprintsent to Elsevier, 2007.

[3] Philippe Depalle, Richard Kronland Martinet,and Bruno Torresani. Time-frequency multipli-ers for sound synthesis. In SPIE, editor, WaveletXII SPIE annual Symposium Wavelet XII, volume6701 of Proceedings of the SPIE, pages 670118–1– 670118–15, San Diego USA, 2007. OR 20.

[4] Monika Dorfler and Bruno Torresani. Spread-ing function representation of operators and Ga-bor multiplier approximation. http://hal.archives-ouvertes.fr/hal-00146274/en/.

[5] http://www.lma.cnrs-mrs.fr/~kronland/Dafx10/

piano_pp_21_65_Do1.wav.

[6] Jean Laroche and Mark Dolson. Improved phasevocoder time-scale modification of audio. IEEETransactions on Speech and Audio Processing,Volume ASSP-32, n◦ 2: 236–242, April 1984.

[7] Daniel W. Griffin and Jae S. Lim. Signal estima-tion from modified short-time fourier transform.IEEE Transactions on Speech and Audio Process-ing, Volume 7, n◦ 3: 323–332, May 1999.


piano_pp_21_87_Fa1.wav.


do1_pp_21_quarte.wav.


do1_pp_21_quarte_maskprt.wav.


m10_expe.wav.


m10_sept_maj.wav.


m10_sept_maj_maskprt.wav.

Documents

Adjusting the spectral envelope evolution of transposed ...kronland/Dafx10/Article.pdf · Adjusting the spectral envelope evolution of transposed sounds with Gabor mask prototypes