Objective analysis of the singing voice as a training aid

ORIGINAL ARTICLE

T. SANGIORGI1, C. MANFREDI2 & P. BRUSCAGLIONI1

1Dept. of Physics, Università degli Studi di Firenze and Istituto Nazionale per la Fisica della Materia, Firenze, Italy, 2Dept. of Electronics and Telecommunications, Università degli Studi di Firenze, Firenze, Italy

Abstract

A new tool for robust tracking of fundamental frequency is proposed, along with an objective measure of the main singing voice parameters, such as vibrato rate, vibrato extent, and vocal intonation. High-resolution Power Spectral Density estimation is implemented, based on AutoRegressive models of suitable order, allowing reliable formant tracking also in vocalizations characterized by highly varying values. The proposed techniques are applied to about 1000 vocalizations, coming from both professional and non-professional singers, and show better performance as compared to classical Fourier-based approaches. If properly implemented, and with a user-friendly interface, the new tool would allow real-time analysis of singing voice. Hence, it could be of help in giving non-professional singers and singing teachers reliable measures of possible improvements during and after training.

Key words: AR models, formants, fundamental frequency, objective parameter evaluation, parametric PSD, singing voice analysis, vibrato extent, vibrato rate

Introduction

This paper aims to contribute to the objective analysis of the singing voice, in order to give non-professional singers either an aid to improve their vocal capabilities or a criterion to prevent a wrong vocal attitude (positioning, posture, etc.) that could even cause vocal pathologies.

The need for objective singing voice analysis arises from the fact that learning to sing is mainly based on how the singer and the teacher perceive the performance. It is still mainly up to the singing teacher to evaluate the quality of a performance: several effective visual aids are currently available (1–6), but few of them are supported by robust objective means to evaluate singing capability and/or improvements. Specifically, robust, high-resolution and adaptive techniques are required for this application. The aim of the present work is to provide such techniques, which could also be successfully integrated into already existing tools.

Our study involved both professional and non-professional male and female singers (40% male and 60% female), aged between 20 and 45 years, who self-declared to be in good health. About 15% of them were smokers.

Professionals (baritone, tenor, contralto, mezzo-soprano, soprano) are Western opera singers and singing teachers in music academies (Firenze and Perugia, Italy) and in private institutes in the Tuscany region, Italy. Non-professionals are members of the Firenze University chorus, with non-homogeneous singing technique skills, ranging from complete beginners to almost-professional singers.

Our study represents the first step within a wider

project that consists in evaluating and monitoring

the development of singing capabilities of the

members of the Firenze University chorus.

For comparison, the most important Western opera singing techniques were considered, such as: sustained vowels at different frequencies, vibrato (periodic fundamental frequency modulation of about two semitones), trillo (periodic fundamental frequency modulation of about four semitones), glissando (uniform ascending or descending scale), and messa di voce (increasing and lowering the intensity of a sustained vowel at constant pitch). About 1000 recordings were analysed. Notice that some of these techniques, such as trillo and messa di voce, require high capabilities. Hence, they were only performed by highly skilled singers.

Correspondence: Claudia Manfredi, Department of Electronics and Telecommunications, Faculty of Engineering, Università degli Studi di Firenze, Via S. Marta 3, 50139 Firenze, Italy. Fax: +39-055-494569. E-mail: [email protected]

Logopedics Phoniatrics Vocology. 2005; 30: 136–146

ISSN 1401-5439 print/ISSN 1651-2022 online © 2005 Taylor & Francis

DOI: 10.1080/14015430500294064

By analysing professional singers, as well as

commercial recordings of famous singers, and on

the basis of the results proposed in literature, we

have evaluated some key parameters, along with

their range of variation, that could objectively

characterize a well-performed vocal exercise.

Specifically, to determine the quality of a vibrato vocalism, vibrato rate (i.e., the rate of the vibrato modulation, expressed in cycles/s), vibrato extent and vocal intonation should be evaluated along with their standard deviations. Similarly, in a glissando, the range of frequencies reached with a specific vocal register without voice breaks should be measured. Finally, by analysing a sustained vowel or messa di voce without vibrato, it should be pointed out how properly the singer is capable of ‘hitting the note’ and ‘keeping it’.

Moreover, in all the cited techniques, it should be determined how appropriate the vocal articulation was. This has been obtained by means of robust techniques for spectral analysis (pitch and formant tracking), capable of dealing with signal parameters that vary as strongly as those found in this study.

Analysis techniques

The main features of the singing voice comprise the fundamental frequency (F0) of vocal fold oscillation, which is directly related to pitch, along with its modulation in time and frequency, and the formants, i.e., the resonance frequencies of the vocal tract, along with their energy.

Due to high signal variability, robust analysis techniques were implemented, capable of following the fast and large fundamental frequency variations typical of some vocalizations. Moreover, parametric techniques for high-resolution formant estimation were applied, based on AutoRegressive (AR) models of suitable order, linked to the singer’s sex and to the signal sampling frequency Fs. With AR models, the signal s(n) at time instant n is described by a linear combination of its p past values, plus a noise term e(n):

$$ s(n) = a_1 s(n-1) + a_2 s(n-2) + \ldots + a_p s(n-p) + e(n) \qquad (1) $$

In Equation 1, the unknowns are the model order p and the parameters a_i, with i = 1, ..., p. This problem can be solved with an approach based on system identification theory (7).

Classical techniques based on Fourier transform

were also considered, to compare results in terms of

robustness and resolution capability, especially when

applied to short data frames.

Fundamental frequency estimation

The fundamental frequency F0 is estimated here by

means of a two-step procedure. The choice of the

techniques adopted in each step results from a

detailed comparative analysis of pitch extraction

methods (8).

Simple Inverse Filter Tracking (SIFT) is applied first, on signal time windows of short and fixed length M. The window length is chosen as M = 3Fs/Fmin, where Fs is the signal sampling frequency and Fmin is the minimum allowed F0 value for the signal under consideration (here: Fmin = 50 Hz, corresponding to very low male pitch). A short time window is required, due to the possibly high non-stationarity of the signal under study.
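As a worked example of the window length rule, assuming the sampling frequency Fs = 44.1 kHz reported later for the recordings (the choice of this value for the first step is an assumption of the example):

$$ M = \frac{3 F_s}{F_{min}} = \frac{3 \times 44100}{50} = 2646 \ \text{samples} \approx 60 \ \text{ms} $$

That is, the fixed window spans three periods of the lowest allowed pitch (20 ms each).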

The SIFT approach follows the basic strategy of pre-whitening the speech signal followed by autocorrelation, the pre-whitening step involving the use of Linear Prediction (LP) based Inverse Filtering (IF). To create an IF, a low-order LP is usually selected (order p ≈ 4), since no more than two formants are expected in the low-passed signal frame. However, for highly varying signals such as those under study, an adaptive choice for the filter order is proposed here, based on Singular Value Decomposition (SVD) of matrices whose entries come from sampled speech data frames, properly organized (9,10). Notice that SVD requires selecting the ‘size’ p of the signal subspace, i.e., the minimum number of eigenvectors spanning the ‘meaningful’ data. To this aim, a variable threshold is defined according to the Dynamic Mean Evaluation (DME) criterion, which relies on the geometric distance between ‘large’ and ‘small’ singular values and was found more parsimonious than classical approaches (11). The DME is applied to the decreasing sequence of singular values σ_i². Typically, with DME, 2 ≤ p ≤ 6 during the utterance, due to changing signal characteristics: the larger the estimated p, the more varying the signal.

From the first step, a first raw F0 tracking is obtained, along with its range of variation [Fl, Fh]. The second step gives a more accurate F0 estimation and allows defining the optimum and possibly varying window length (time window, TW), also used for formant estimation, as will be explained in the next section.

F0 is now adaptively estimated in the frequency range [Fl, Fh] obtained in the previous step. Moreover, data frames are overlapped for 3/4 of their length, in order to track both F0 and formant values more accurately. This step applies the Continuous Wavelet Transform (CWT) and the Average Magnitude Difference Function (AMDF) on short time windows of varying length (three pitch periods, inversely proportional to the previously estimated local F0). To perform F0 estimation, the signal is band-pass filtered (50 Hz–1000 Hz, or more) with a proper Continuous Wavelet Transform (Mexican hat) (12–14) and its periodicity is extracted by means of the AMDF approach. The AMDF analysis, which is carried out directly on signals in the time domain, can be used to detect fast and slow variations of the fundamental frequency F0 (15). The AMDF was chosen instead of the autocorrelation sequence (AS) because of the non-stationarity and amplitude modulation of the signals under study, which were shown to often cause misestimation of the true signal periodicity with the AS (8). For fast and abrupt F0 changes, this procedure was shown to increase robustness in F0 estimation, giving enhanced results with respect to standard ones (16,17).

Summarizing, the full procedure for F0 estimation is the following:

• The signal is band-pass filtered in the range 50 Hz–1000 Hz (i.e., the F0 range for the singing voices under study); this range can be increased, if required.

• On each time window of length 3Fs/Fmin, SVD and SIFT are applied, and a first rough F0 estimate is obtained, along with its range of variation Fl ≤ F0 ≤ Fh.

• Inside [Fl, Fh], on each TW of varying length TW = 3Fs/F0 (F0 estimated in the previous step), a coefficient matrix CWT(h,s) is obtained, where h is the shift parameter and s is the scale parameter.

• From the coefficient matrix CWT(h,s) the optimum scale value, ŝ, is selected as the one corresponding to the maximum entry, which represents the best fit of the wavelet to the data.

• The AMDF technique is applied to CWT(h,ŝ), thus obtaining the estimate of F0 as F0 = Fs/hmin, where hmin is the lag at which the AMDF reaches its minimum.

These steps are described in detail in (16,18,19).
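The last step of the procedure (F0 = Fs/hmin, with hmin the lag of the AMDF minimum) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the SVD/SIFT and CWT stages are skipped, the AMDF is applied directly to a synthetic sinusoid, and the function name `amdf_f0`, the 8 kHz sampling rate and the 150–300 Hz search range are assumptions chosen for the example.

```python
import math

def amdf_f0(frame, fs, f_lo, f_hi):
    """Estimate F0 of one frame as fs / lag, where lag minimises the AMDF.

    Only lags corresponding to the search range [f_lo, f_hi] are examined,
    mirroring the restriction of the second step to the range [Fl, Fh].
    """
    lag_min = int(fs / f_hi)
    lag_max = int(fs / f_lo)
    best_lag, best_val = None, float("inf")
    for lag in range(lag_min, lag_max + 1):
        n = len(frame) - lag
        if n <= 0:
            break
        # Average magnitude difference between the frame and its shifted copy.
        val = sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n
        if val < best_val:
            best_lag, best_val = lag, val
    return fs / best_lag

# Synthetic test signal (assumed values): a 200 Hz sinusoid sampled at 8 kHz.
fs = 8000
f0_true = 200.0
n_samples = int(3 * fs / 150.0)  # three periods of the lowest frequency searched
frame = [math.sin(2 * math.pi * f0_true * n / fs) for n in range(n_samples)]

f0_est = amdf_f0(frame, fs, f_lo=150.0, f_hi=300.0)
print(f"estimated F0: {f0_est:.1f} Hz")
```

Restricting the lag search to the range found in the first step is what prevents the AMDF from locking onto a multiple of the true period, where it also reaches a minimum.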

Vibrato estimation

The vibrato singing technique consists of an almost periodic F0 modulation. In professional singers, the vibrato profile often shows a sinusoidal, triangular or generally well-defined periodic shape. The mechanism of vibrato production is not yet completely known. Fisher (20) explained it considering the presence of two phenomena, glottal wave and breathing wave.

According to the literature (6), three parameters

can be evaluated for vibrato characterization:

1. Vibrato rate (VRate), which represents the number of F0 oscillations per unit time (cycles/s). In this work, it has been evaluated as the mean of the reciprocals of the time differences between subsequent maxima:

$$ \mathrm{VRate} = \frac{1}{N} \sum_{i=1}^{N-1} \left| \frac{1}{t_{i+1}^{Max} - t_{i}^{Max}} \right| \qquad (2) $$

where t_i^Max is the time of the i-th maximum and N is the number of maxima.

2. Vibrato extent (VExtent), which is the difference in frequency between a maximum and a minimum within a cycle (Hz). Here, it has been obtained as the mean of the differences:

$$ \mathrm{VExtent} = \frac{\sum_{i=1}^{N} \left( f_{i}^{Max} - f_{i}^{Min} \right)}{N} \qquad (3) $$

where f_i^Max and f_i^Min are the frequency values of the i-th maximum and minimum respectively, in each vibrato cycle.

3. Vocal intonation (MF0), which is the trend of the mean of the maximum and minimum frequency in the i-th cycle (21):

$$ \mathrm{MF0}_i = \frac{f_{i}^{Max} + f_{i}^{Min}}{2} \qquad (4) $$

This parameter has received less attention in

literature with respect to the other two, and hence

has not been deeply investigated in this study.

For each parameter, standard deviation was also

evaluated, as a measure of the quality of the

performance.

The proposed algorithm exploits the F0 values obtained as described in the previous section. Notice that F0 is evaluated as the mean value over three 50% overlapped pitch periods, each of about 5–7 ms duration, hence on time windows of 15–20 ms at most. A vibrato cycle is about ten times longer, hence the averaging effect is negligible.

Evaluating VRate, VExtent and MF0 required setting up a routine capable of correctly finding absolute maxima and minima, i.e., the maximum and minimum value in each vibrato cycle. The implemented routine should also be capable of dealing with both professional and non-professional singers. In fact, an irregular vibrato profile gives rise to maxima and minima with varying amplitude in each cycle; therefore, a robust criterion to discriminate between absolute and relative maxima or minima is required.

As shown in Figure 1, concerning professional and non-professional baritone vocalizations (sustained /a/ vowel), while the vocalization by professionals presents a regular F0 profile (Figure 1a), non-professionals and, especially, amateurs (Figure 1b) exhibit an irregular profile.

Notice also that a threshold criterion, such as F0

mean value, to separate the set of maxima from that

of minima, cannot be used, due to a possibly

irregular vibrato profile. A threshold criterion cannot

be used in trillo technique either, due to its peculiar

profile (Figure 1c).

Taking into account the above-mentioned difficulties, the new procedure for absolute maxima identification is as follows:

• All F0 values are possible candidates as the absolute maximum in the cycle. A routine compares the value of the F0 candidate to that of a selected set of values around the candidate.

• If the value of the candidate is the largest one in the set, then it is selected as the absolute maximum within that cycle, otherwise the next value becomes the new candidate.

• The size of the set of points around the candidate value (the frame length) needs to be adequately chosen, to prevent the presence of two maxima pertaining to different vibrato cycles. It corresponds to about 70 ms, i.e., the time range roughly corresponding to half of a cycle, for a vibrato rate of 7 cycles/s. However, in a few cases, where the maxima and minima identification is unclear, this value could be manually set after visual inspection (see Figure 2a and 2b).

Figure 1. Normalized amplitude and F0 [Hz] versus time [s]. a: vibrato, professional baritone (sustained /a/ with vibrato, mean F0 = 154.1127 Hz); b: vibrato, non-professional baritone (mean F0 = 191.8141 Hz); c: trillo, professional baritone (sustained /e/ with trillo, mean F0 = 258.3974 Hz).

The procedure for finding absolute minima follows the same guidelines.
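The extrema search and the three vibrato parameters can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' routine: the 70 ms frame is taken as centred on the candidate, candidates without a full frame around them are skipped, each maximum is paired with the first minimum that follows it, and the rate is averaged over the N − 1 intervals between maxima so that a perfectly regular vibrato returns its nominal rate. The names `find_extrema` and `vibrato_parameters` and the synthetic F0 track are hypothetical.

```python
import math

def find_extrema(f0, half_win, kind="max"):
    """Indices whose value is strictly the largest (or smallest) within
    +/- half_win samples, i.e. the absolute extremum of the local frame."""
    sign = 1.0 if kind == "max" else -1.0
    found = []
    for i in range(half_win, len(f0) - half_win):
        candidate = sign * f0[i]
        neighbours = [sign * f0[j]
                      for j in range(i - half_win, i + half_win + 1) if j != i]
        if all(candidate > v for v in neighbours):
            found.append(i)
    return found

def vibrato_parameters(f0, frame_rate, frame_len_s=0.070):
    """Return (VRate in cycles/s, VExtent in Hz, list of per-cycle MF0 in Hz)."""
    half_win = int(round(0.5 * frame_len_s * frame_rate))
    maxima = find_extrema(f0, half_win, "max")
    minima = find_extrema(f0, half_win, "min")

    # Vibrato rate: mean reciprocal of the time between subsequent maxima.
    t_max = [i / frame_rate for i in maxima]
    rates = [1.0 / (t_max[k + 1] - t_max[k]) for k in range(len(t_max) - 1)]
    vrate = sum(rates) / len(rates)

    # Pair each maximum with the first minimum that follows it (one cycle).
    pairs = []
    for i in maxima:
        later = [j for j in minima if j > i]
        if later:
            pairs.append((f0[i], f0[later[0]]))

    vextent = sum(hi - lo for hi, lo in pairs) / len(pairs)
    mf0 = [(hi + lo) / 2.0 for hi, lo in pairs]
    return vrate, vextent, mf0

# Synthetic F0 track (assumed values): 150 Hz mean, +/-5 Hz sinusoidal
# vibrato at 6.25 cycles/s, one F0 value every 5 ms, 2 s long.
frame_rate = 200.0
f0_track = [150.0 + 5.0 * math.sin(2 * math.pi * 6.25 * n / frame_rate)
            for n in range(400)]

vrate, vextent, mf0 = vibrato_parameters(f0_track, frame_rate)
print(f"VRate = {vrate:.2f} cycles/s, VExtent = {vextent:.2f} Hz, "
      f"mean MF0 = {sum(mf0) / len(mf0):.2f} Hz")
```

On an irregular profile the frame length would need the same care as described in the text: too long a frame merges neighbouring cycles, too short a frame promotes relative maxima to absolute ones.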

Maxima and minima evaluation, along with vocal

intonation tracking, are presented in Figure 2.

Specifically, Figure 2a and 2c are relative to the

professional baritone (see Figure 1a), while Figure

2b, 2d concern the non-professional baritone (see

Figure 1b).

Figure 2 clearly shows maxima and minima identification, as well as the trend of vocal intonation, also for irregular vibrato.

Figure 2. a and b: maxima and minima identification on the F0 [Hz] versus time [s] profile (* = maxima; o = minima); c and d: vocal intonation.

Formant estimation

Two methods have been used to detect the resonances of the vocal tract, called formants: a parametric approach, based on AR models for the vocal tract filter, and a classical one based on the Fourier transform (FT). One of the main advantages of parametric spectral analysis over classical approaches is its high-resolution capability, as the model extrapolates data outside the analysed window. Hence, better results are achieved with respect to classical spectral estimators, where the spectral resolution is limited by data windowing and side lobes (7). Parametric formant estimation relies on a vocal tract model made up of an interconnected series of p cylindrical coaxial lossless cavities of different length and diameter. The resonances (formants) can be recovered from the maxima of the Power Spectral Density (PSD), given by:

$$ \mathrm{PSD}(f) = \frac{T}{\left| 1 - \sum_{k=1}^{p} a_k e^{-j 2 \pi f k T} \right|^{2}} \qquad (5) $$

where T is the sampling period and a_k, k = 1, ..., p, are the coefficients of the AR model of order p that describes the vocal tract (15,22). The formant tracking procedure was tested on high-pitched synthesized signals (14,23), quite similar to the ones under study here. Robustness to noise and resolution capabilities were tested, also in the case of almost non-stationary signals, with very good results as compared to non-parametric approaches.

Notice that AR spectral estimators are sensitive to order selection: in case of an overestimated model order p, formant splitting may occur, while underestimation smooths the spectrum and causes misallocation of spectral peaks. Many criteria have been defined for finding the best model order p, including both the estimated variance σ² and the model complexity p in one set of statistics. Such criteria are characterized by loss functions for which a minimum can be achieved. However, they were shown to be almost unreliable for short data frames, due to long-term convergence properties. In this paper, the relation p ≈ Fs (Fs = signal sampling frequency, in kHz) was found to be the best one for obtaining a sufficiently detailed spectrum.

This relation comes from the physical constraint Fs = pc/2L, where L is the length of the vocal tract and c is the speed of sound (22). In fact, in order to adequately represent the vocal tract model by means of the polynomial A(z), its ‘memory’ (i.e., its order) must be equal to twice the time required for sound waves to travel from the glottis to the lips, that is, 2L/c. For the adult male, L ≈ 17 cm; hence, with c = 34 cm/ms, the necessary memory amounts to 1 ms. Thus, with a sampling frequency Fs = 10 kHz, the filter order p must be at least 10. The higher the Fs, the higher the value of p. For female voices, L being smaller (about 14 cm), a lower value of p should be selected, around 3/4 Fs. This choice was added to the formant estimation technique.
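The arithmetic behind the order rule can be checked with a few lines of Python. The function name `ar_model_order` and the rounding to the nearest integer are assumptions of this sketch; the vocal tract lengths and the speed of sound are the values quoted in the text.

```python
def ar_model_order(fs_hz, vocal_tract_cm, c_cm_per_ms=34.0):
    """Model order from the constraint Fs = p*c/(2L), i.e. p = Fs * 2L / c.

    The result is rounded to the nearest integer (an assumption of this sketch).
    """
    memory_ms = 2.0 * vocal_tract_cm / c_cm_per_ms  # round trip glottis-lips, in ms
    return round(memory_ms * fs_hz / 1000.0)

# Adult male vocal tract (about 17 cm), values from the text.
p_male_10k = ar_model_order(10_000, 17.0)
p_male_44k = ar_model_order(44_100, 17.0)
# Female vocal tract (about 14 cm).
p_female_44k = ar_model_order(44_100, 14.0)

print(p_male_10k, p_male_44k, p_female_44k)
```

With L = 14 cm the constraint gives a memory of 28/34 ≈ 0.82 ms, which the text rounds down to the rule of thumb of about 3/4 Fs.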

Notice that the choice of a model order near or equal to Fs prevents spectral smoothing and consequently loss of spectral peaks. This approach has already been proved effective in many applications, with enhanced results as far as resolution is concerned (11,19), and is of utmost importance in the present study, which requires exact formant evaluation. For example, spectral resolution is of great relevance in Western opera singing, as male singers use a characteristic vocal articulation to realize a cluster of the third, fourth and fifth resonances, which allows them to increase loudness and reach a specific vocal timbre (6).

Among possible choices, the proposed approach performs the minimization of the average of the forward and backward squared prediction errors over the available data. This approach, commonly named the ‘modified covariance method’, was chosen as it was shown to give the best results as far as reduction of spectral line splitting and bias of the frequency estimate are concerned (7,24).
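A minimal Python sketch of this idea is given below: the AR coefficients are obtained by jointly minimizing the forward and backward squared prediction errors (solved here through the normal equations), and the spectrum is then evaluated from the fitted model. It is an illustration, not the authors' code. It assumes the sign convention in which the coefficients multiply past samples with a plus sign, so the denominator of the PSD is |1 − Σ a_k e^{−j2πfkT}|²; the helper names, the order p = 2 and the noisy 1 kHz test sinusoid are hypothetical choices for the example.

```python
import cmath
import math
import random

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ar_forward_backward(s, p):
    """AR coefficients minimising forward plus backward squared prediction errors."""
    N = len(s)
    R = [[0.0] * p for _ in range(p)]
    r = [0.0] * p
    for n in range(p, N):            # forward: predict s[n] from s[n-1..n-p]
        for i in range(1, p + 1):
            r[i - 1] += s[n] * s[n - i]
            for j in range(1, p + 1):
                R[i - 1][j - 1] += s[n - i] * s[n - j]
    for n in range(0, N - p):        # backward: predict s[n] from s[n+1..n+p]
        for i in range(1, p + 1):
            r[i - 1] += s[n] * s[n + i]
            for j in range(1, p + 1):
                R[i - 1][j - 1] += s[n + i] * s[n + j]
    return solve_linear(R, r)

def ar_psd(a, f, fs):
    """AR spectrum at frequency f (Hz), up to the noise-variance scale factor."""
    T = 1.0 / fs
    den = 1.0 - sum(ak * cmath.exp(-2j * math.pi * f * (k + 1) * T)
                    for k, ak in enumerate(a))
    return T / abs(den) ** 2

# Test signal (assumed values): 1 kHz sinusoid at 8 kHz with weak white noise.
fs = 8000.0
rng = random.Random(0)
s = [math.sin(2 * math.pi * 1000.0 * n / fs) + 0.01 * rng.gauss(0.0, 1.0)
     for n in range(240)]

a = ar_forward_backward(s, p=2)
freqs = [5.0 * k for k in range(801)]          # 0 .. 4000 Hz in 5 Hz steps
peak_hz = max(freqs, key=lambda f: ar_psd(a, f, fs))
print(f"a = {a}, spectral peak near {peak_hz:.0f} Hz")
```

A 240-sample frame is only 30 ms long, yet the model-based spectrum still localizes the resonance on a fine frequency grid, which is the high-resolution property the text relies on for short frames.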

Experimental results

Our study concerns five professional singers: one baritone for male voices, one contralto, one mezzo-soprano and two sopranos for female voices. The baritone and the sopranos were trained in lyric repertoires, the others in baroque music.

Non-professional singers come from the Firenze University Choir, which is made up of about 80 singers, with non-homogeneous training, ranging from complete beginners to almost expert singers. About 1000 recordings were obtained from 20 non-professional singers (2 bassos, 3 baritones, 3 tenors, 2 contraltos, 3 mezzo-sopranos, 7 sopranos). For each singer, about 40 sung tokens were recorded, corresponding to different vocalizations and different vowels.

Computations were carried out in the Matlab R12 development environment. Processing time is low for most signals (2–3 s length): specifically, about 1 min for F0 and related parameters, and another 1 min for spectrogram, formants and PSD on a standard PC. The longer the signal, the longer the processing time. However, implementing the software in C++ or Assembler language should allow for real-time processing.

In this section, some general remarks are made concerning the singing voice features obtained with the proposed approach, when applied to both professional and non-professional singers. The reported results are relative to the whole data set under study. Moreover, two examples are given to compare results relative to a non-professional and an almost professional tenor singer, as far as the vibrato technique is concerned. Objective parameters and plots make it possible to quantify the better performance of the second singer with respect to the first one.

Firstly, for professional singers, the features of a properly performed vocal technique were analysed. This is important, as such features could be considered as a reference set for non-professionals. These results were also compared to those proposed in the literature.

As previously mentioned, main singing techniques

comprise: sustained vowels at different frequencies,

vibrato, trillo, glissando, and messa di voce.

To evaluate the quality of the sustained vowel and messa di voce techniques, F0 mean was considered to represent the height of the note (i.e., its frequency). Two more parameters were also considered: F0 standard deviation (σ) and the difference between the F0 maximum and the F0 minimum in the vocalization (Δ):

$$ \Delta = F0_{max} - F0_{min} \qquad (6) $$

For a well-performed sustained vowel, it was found that σ < 1 Hz and Δ < 5 Hz.
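These two checks are simple to compute from an F0 track, as the following Python sketch shows. The function name `sustained_vowel_stats`, the use of the population standard deviation and the two short example tracks are assumptions made for illustration; only the thresholds σ < 1 Hz and Δ < 5 Hz come from the text.

```python
import statistics

def sustained_vowel_stats(f0_track):
    """Return (mean F0, sigma, delta, flag) for a sustained vowel F0 track in Hz.

    sigma is the population standard deviation (an assumption of this sketch);
    delta is F0max - F0min as in Equation 6.
    """
    mean_f0 = statistics.mean(f0_track)
    sigma = statistics.pstdev(f0_track)
    delta = max(f0_track) - min(f0_track)
    well_performed = sigma < 1.0 and delta < 5.0
    return mean_f0, sigma, delta, well_performed

# Hypothetical F0 tracks (Hz): a steady note and an unsteady one.
steady = [220.0, 220.5, 219.5, 220.2, 219.8]
unsteady = [220.0, 223.0, 217.0, 224.0, 216.0]

steady_stats = sustained_vowel_stats(steady)
unsteady_stats = sustained_vowel_stats(unsteady)
print(steady_stats)
print(unsteady_stats)
```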

In the glissando technique, attention was focused

on the vocal extension, and in particular on the

frequency range that a singer was capable of

performing in a specific vocal register without voice

breaks.

On the contrary, non-professionals, and especially amateurs, often exhibited a limited vocal extension, voice breaks and an improper use of vocal registers. This means that they were almost unable to perform a wide range of frequencies with the register required by the musical repertory. Specifically, in ascending glissandos performed by non-professional tenors, voice breaks were found between head voice and falsetto in the frequency range 340 Hz–400 Hz. Moreover, in Western opera, falsetto is generally not used for frequencies below 450 Hz.

As already said, the main parameters used to characterize vibrato technique are Vibrato Rate (VRate), Vibrato Extent (VExtent) and Vocal Intonation (MF0), along with their standard deviations (Equations (2)–(4)). According to the literature (6), a good quality vibrato has a VRate value generally within the range 5.5–7.5 cycles/s, and a VExtent value of less than 2 semitones. However, VRate is a parameter that seems to vary according to age, sex, and emotional status of the singer (25). Recent studies (21) have shown how vibrato rate could also vary according to the frequency of the note performed. The professional singers that were analysed showed VRate values varying between 4.5 and 7 cycles/s, with a standard deviation σVR < 0.36 cycles/s, and VExtent < 2 semitones, with a standard deviation σVE < 5 Hz.

These values were also found in the analysis of

messa di voce executed with vibrato by professionals,

where vibrato was present in the central part of the

vocalization.

Notice that most professionals and non-professionals analysed with our approach presented a vibrato rate which varied according to the frequency of the note performed, in agreement with Sundberg’s results (21).

Professionals declared that changing the vibrato rate helped them ‘to free their voice’ in the mechanism of vibrato production.

In non-professionals, vibrato rate could also vary according to the kind of vowel performed. This is due to the fact that non-professionals, especially amateurs, have not yet reached a high level of technique, and consequently they do not have enough self-confidence in performing. Finally, notice that a few amateurs could not perform the vibrato technique on all the seven vowel sounds of the Italian language.

Moreover, non-professionals, especially male amateurs, showed a vibrato extent of less than one semitone. This feature could be due to the fact that choir members are taught to use a smaller vibrato in order to avoid strong interference with other singers.

Some remarks can also be made concerning a

possible relationship between vibrato waveform and

vocal register, a subject which is still not completely

investigated and explained in literature. With our

analysis, a link between these features was found in

some vocalizations of the Italian vowel /a/ for a

professional mezzo-soprano. In fact, performing a

vibrato in chest register led to a triangular-shaped

waveform, while producing a vibrato in mixed

register led to a sinusoidal-shaped waveform. This

seems to be connected to the mechanism of vibrato

production and should be further investigated.

Finally, as far as the trillo technique is concerned, we remark that, according to the literature, a good quality trillo should have VRate values in the range 5.5–7.5 cycles/s and VExtent less than 4 semitones (6).

Professional singers were free to perform different

trillo at several frequencies. By analysing trillo

technique in a professional baritone, commonly

three different parts of the trillo could be found: an

introduction, a ‘body’, and an end (Figure 1c).

These three parts showed peculiar features. During the introduction, there is a lower vibrato rate, which seems useful to emphasize the following part of the vocalization, but with the same vibrato extent as in the body of the trillo. The body and the end have similar vibrato rates, but different vibrato extent, which is smaller in the end.

The professionals examined with our technique showed VRate values between 4.5 and 7 cycles/s, with a standard deviation σVR < 0.37 cycles/s, and VExtent < 4 semitones, with a standard deviation σVE < 5 Hz, in agreement with the ranges previously indicated.

The following examples concern the vocalizations of two tenors that show different vocal skills. Specifically, the first one is a one-year trained singer while the second is a three-year trained singer, who also attends individual singing lessons. The goal is to appreciate in an objective way the different vocal capabilities through the analysis of a vibrato sung token on the Italian vowel /a/. To this aim, the two singers were selected such that they have comparable F0 mean values (around 350 Hz, for the sustained /a/ vibrato sung token).

For each vocalization, the results concerning F0 estimation and formant tracking (labelled F1, F2, F3, etc.) are presented. Specifically, the first plot shows both the audio signal (Fs = 44.1 kHz, 16 bit resolution), and F0 tracking, as obtained with the proposed approach.

The second plot shows the signal spectrogram

(frequency versus intensity versus time), obtained

using FT with a Hann window of length equal to TW

(given by the second F0 estimation step) and 1/3

overlap. Formant trajectories, as obtained by the

proposed high-resolution AR parametric technique,

are overlapped on the spectrogram.

The third picture concerns the Power Spectral Density (PSD). The PSD has been estimated both by the classical Fast Fourier Transform (FFT) and by the parametric AR-PSD (Equation (5)), with a model order p = Fs, according to the considerations made in the section devoted to formant estimation. Plots are overlapped, in order to show the high-resolution capability of the parametric approach.

Notice that the PSD plot represents the mean (averaged over all the signal frames) relative to the maximum value of the PSD, PSDmax, i.e.:

$$ \mathrm{PSD} = 10 \log_{10} \frac{\mathrm{Mean}(\mathrm{PSD}(\mathrm{frame}))}{\mathrm{PSD}_{max}} \qquad (7) $$

and hence it appears smoothed. However, PSD maxima are clearly found with the proposed approach, along with their energy. Notice also that ‘local’ PSD plots are obtained on each signal frame during computations, and can be individually inspected, if required.

Figure 3 concerns the first tenor (singer with only one year of training), and is relative to a vibrato vocalization.

As shown in Figure 3a, the vibrato profile is only approximately sinusoidal. The following results pertain to this vocalization:

F0Mean = 345.01 Hz; σMF0 = 0.60 Hz

VRate = 6.25 cycles/s; σVR = 0.46 cycles/s

VExtent = 10.75 Hz; σVE = 3.41 Hz

The singer shows a vibrato rate that is within the expected range of values, but a σVR that is higher than the standard deviation associated with a good quality performance. The irregularity of the vibrato is also perceived at listening. Moreover, the vibrato extent is much smaller than the reference value (about 2 semitones). As previously pointed out, this could be due to the fact that choir members are taught to use a smaller vibrato to avoid interfering too much with other voices.
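To compare an extent measured in Hz with a reference given in semitones, the two can be put on the same scale. The Python sketch below converts the measured values to semitones, assuming that the extent is split symmetrically around the mean F0 (an assumption of this example; the function name `extent_in_semitones` is hypothetical).

```python
import math

def extent_in_semitones(f0_mean_hz, extent_hz):
    """Peak-to-peak vibrato extent in semitones, assuming the modulation is
    symmetric (in Hz) around the mean F0."""
    f_hi = f0_mean_hz + extent_hz / 2.0
    f_lo = f0_mean_hz - extent_hz / 2.0
    return 12.0 * math.log2(f_hi / f_lo)

# Values reported for the first tenor: F0Mean = 345.01 Hz, VExtent = 10.75 Hz.
extent_st = extent_in_semitones(345.01, 10.75)
print(f"VExtent = {extent_st:.2f} semitones")
```

Under this assumption the measured extent is roughly half a semitone, well below the reference of about 2 semitones.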

Though non-professional, the singer performs with good vocal intonation, as indicated by the low value of σMF0.

Figure 3b represents the signal spectrogram, i.e., the distribution of the signal energy versus time and frequency. The time window length is the optimal one, as obtained with F0 estimation, and corresponds to three pitch periods. Decreasing the time window length lowers the frequency resolution, while increasing it has the reverse effect, giving better clarity of the harmonics. It is evident that the range of frequencies up to about 1500 Hz is a high-energy range.
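The window settings described above can be sketched as a short-time Fourier analysis. The code assumes a Hann window of three pitch periods (3 Fs/F0 samples, as in the figure titles) and 1/3 overlap between successive frames. The function name and the small constant added before taking the logarithm are illustrative choices, not taken from the paper.

```python
import numpy as np

def spectrogram(x, fs, f0, periods=3, overlap=1.0 / 3.0):
    """Hann-windowed spectrogram with a window of `periods` pitch periods.

    Returns frame-centre times (s), bin frequencies (Hz) and power in dB
    with shape (n_frames, n_freqs).
    """
    x = np.asarray(x, dtype=float)
    n = int(round(periods * fs / f0))        # window length in samples
    hop = n - int(round(overlap * n))        # step between successive frames
    if len(x) < n:
        raise ValueError("signal shorter than one analysis window")
    window = np.hanning(n)
    starts = np.arange(0, len(x) - n + 1, hop)
    frames = np.stack([x[s:s + n] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    times = (starts + n / 2.0) / fs
    freqs = np.fft.rfftfreq(n, 1.0 / fs)     # bin spacing is fs / n = f0 / periods
    return times, freqs, 10.0 * np.log10(power + 1e-12)
```

For F0 of about 345 Hz at Fs = 44.1 kHz the window is 383 samples long (about 8.7 ms) and the frequency bins are about 115 Hz apart. A longer window narrows the bins at the cost of time resolution.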

However, from the analysis of the spectrogram alone, it is rather difficult to locate the formants. Hence, formant trajectories, as obtained with parametric AR models, are superimposed on the spectrogram. The proposed approach succeeded in resolving formants F3 and F4, even though this is quite difficult, as they are very close to each other due to clustering. The fifth formant (F5) is not involved in the cluster and is not clearly tracked in the picture, due to its low energy.
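One common way of reading resonance candidates from an AR model is to look at the poles of the prediction polynomial: the pole angle gives the frequency and the pole radius gives the bandwidth. The paper does not state that this is how its formant trajectories are obtained, so the sketch below is an assumption, as are the function name and the bandwidth threshold.

```python
import numpy as np

def formant_candidates(a, fs, max_bandwidth_hz=400.0):
    """Resonance frequencies and bandwidths from AR coefficients.

    a: coefficients a[1..p] of A(z) = 1 + sum_k a[k] z^-k.
    Returns (frequencies, bandwidths) in Hz, sorted by frequency, keeping
    only poles narrower than max_bandwidth_hz (an illustrative threshold).
    """
    poles = np.roots(np.concatenate(([1.0], np.asarray(a, dtype=float))))
    poles = poles[np.imag(poles) > 0]                  # one pole per conjugate pair
    freqs = np.angle(poles) * fs / (2.0 * np.pi)       # pole angle -> frequency
    bandwidths = -np.log(np.abs(poles)) * fs / np.pi   # pole radius -> bandwidth
    keep = bandwidths < max_bandwidth_hz
    order = np.argsort(freqs[keep])
    return freqs[keep][order], bandwidths[keep][order]
```

Two resonances as close as F3 and F4 in Figure 3 appear as two distinct pole pairs, provided that the model order is high enough to allocate a pair to each.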

The PSD plots, evaluated both with the FT and with an AR model of order 44 (corresponding to Fs), are represented in Figure 3c. While the FT is only capable of detecting the harmonics of the spectrum, the high-resolution AR technique makes it easier to find possible formant positions, even those very close in frequency.

Specifically, the first (F1), the second (F2) and the fourth (F4) formants are clearly detected at about 800 Hz, 1200 Hz and 3200 Hz, respectively. The third formant, F3, corresponds to a lower energy level with respect to the others. It is located around 2700 Hz and forms a cluster with the fourth formant (within the range 2500–3200 Hz). However, in spite of the cluster production, the energy level associated with F3 and, especially, F4 is much lower than the energy associated with



the first two formants. In male singing voices, this corresponds to improper vocal articulation that leads not only to lower loudness, but also to voice characteristics that are judged by teachers as not suitable for the lyrical repertoire.

In conclusion, the first singer has not yet developed a good vocal technique. He cannot easily handle the mechanism of vibrato production, he is not yet able to achieve proper vocal articulation, and he thus requires further training.

Figure 4 concerns a singer who has had three years of training, for a sung sustained /a/ with vibrato.

As shown in Figure 4a, the vibrato profile is now almost sinusoidal. The following objective parameters were obtained for this vocalization:

F0Mean = 340.51 Hz; σMF0 = 1.17 Hz

VRate = 5.45 cycles/s; σVR = 0.27 cycles/s

VExtent = 13.49 Hz; σVE = 2.16 Hz

The singer has very good values for vibrato rate.

However, in this vocalization the vibrato extent is

lower than the reference value (6).

The signal spectrogram (Figure 4b) shows two

high-level energy ranges of frequencies, the first one

lying between 500 Hz and 1500 Hz, and the second

between 3000 Hz and 4500 Hz. In both of them the

harmonics modulation caused by vibrato production

is clearly visible. Notice that, once again, the spectrogram alone does not allow precise formant identification. Formant trajectories are superimposed on the spectrogram for clarity.

Figure 3. Non-professional tenor, vibrato for sustained /a/ vowel. a: normalized signal amplitude and F0 tracking (mean F0 = 345.01 Hz); b: spectrogram and formant tracking (grey scale in dB; time window length = 3Fs/F0; Fs = 44.1 kHz); c: mean PSD in dB (dashed: FFT; solid: AR(44)).

Figure 4c shows the PSD with formants, as obtained by the FT and the AR parametric technique, along with their energy. Specifically, the first five formants, F1–F5, are clearly found and resolved (at about 800 Hz, 1200 Hz, 3100 Hz, 3500 Hz and 4000 Hz). Moreover, resonances F3–F5 give rise to a very high energy cluster, whose energy level is comparable to the energy level of the first two resonances.
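A comparison of this kind can be quantified by integrating the PSD over two frequency bands and taking the difference of the levels in dB. The sketch below is illustrative only: the band edges are borrowed from the high-energy ranges seen in the spectrogram of Figure 4b, the function names are hypothetical, and a uniform frequency grid is assumed.

```python
import numpy as np

def band_level_db(freqs, psd, f_lo, f_hi):
    """Level (dB) of the PSD integrated over [f_lo, f_hi); uniform grid assumed."""
    freqs = np.asarray(freqs, dtype=float)
    psd = np.asarray(psd, dtype=float)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    df = freqs[1] - freqs[0]
    return 10.0 * np.log10(np.sum(psd[mask]) * df)

def cluster_vs_low_formants_db(freqs, psd,
                               low_band=(500.0, 1500.0),
                               high_band=(3000.0, 4500.0)):
    """Level of the high-frequency cluster relative to the F1-F2 region, in dB."""
    return (band_level_db(freqs, psd, *high_band)
            - band_level_db(freqs, psd, *low_band))
```

A value close to 0 dB or above would indicate a cluster whose energy is comparable to that of the first two resonances. A strongly negative value would correspond to the weak F3-F4 region seen for the first singer.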

This formant strategy leads to the well-known 'singing formant' (6,26,27). This high-energy cluster is widely used by male singers to gain loudness and to achieve the Western opera voice timbre. The trained singer showed very good vocal technique in other vocalizations as well. In conclusion, these results confirm that he has achieved excellent vocal articulation.

Conclusions

In this work, objective indices for singing voice analysis are proposed. Due to high signal variability, robust analysis techniques are implemented, capable of following the fast and large fundamental frequency variations typical of some vocalizations. Moreover, vibrato rate and extent are evaluated, in order to give the singer useful information about the level achieved, as compared with that of professional singers.

Figure 4. Tenor with three years of training, vibrato for sustained /a/ vowel. a: normalized signal amplitude and F0 tracking (mean F0 = 340.51 Hz); b: spectrogram and formant tracking (grey scale in dB; time window length = 3Fs/F0; Fs = 44.1 kHz); c: mean PSD in dB (dashed: FFT; solid: AR(44)).

Parametric techniques for high-resolution formant estimation are applied, based on AutoRegressive (AR) models of suitable order, linked to the singer's sex and the signal sampling frequency Fs. This allows

accurate testing for possible singer formant development and formant tuning, as well as quantification of their energy. Fundamental frequency and formant tracking, along with PSD and spectrogram plots, give easily readable results. Sustained vowels, vibrato, trillo, glissando and messa di voce were analysed, from baritone, tenor, contralto, soprano and mezzo-soprano singers. About 1000 recordings were considered, most coming from the Firenze University Chorus. A database will be developed and constantly updated, as a reference for non-professional singers. The database would allow each singer to follow her/his improvement and compare results with a reference set obtained from professional singers.

Future work will be devoted to building a user-friendly interface, as well as to adding new meaningful parameters for assessing voice quality improvement, which could be useful for both trainers and singers.

Acknowledgements

The authors would like to acknowledge both the Firenze University Chorus and the professional singers for their kind cooperation in providing a wide set of vocalizations for this study.

References

1. Howard DM, Welch GF, Brereton J, Himonides E, DeCosta M, Williams J, et al. WinSingad: A real-time display for the singing studio. Logoped Phoniatr Vocol. 2004;3:135–44.

2. Howard DM, Welch GF. VOXed Project. Available from URL: www.voxed.org.

3. Hunt AD, Howard DM, Morrison G, Worsdall J. Real time interfaces for speech and singing. Proceedings of 26th Euromicro Conference, IEEE Computer Society, Maastricht. 2000. 2, 356–61.

4. Howard DM. SINGAD: A visual feedback system for children's voice pitch development. In: White P, editor. Child voice. Stockholm: KTH Voice Research Center; 2000. p. 45–62.

5. Garner PE, Howard DM. Real time display of voice source characteristics. Logoped Phoniatr Vocol. 1999;24:19–25.

6. Sundberg J. The Science of the Singing Voice. DeKalb, Illinois: Northern Illinois University Press; 1987.

7. Marple SL. Digital spectral analysis with applications. Englewood Cliffs, NJ, USA: Prentice Hall; 1987.

8. Manfredi C, D'Aniello M, Bruscaglioni P, Ismaelli A. A Comparative Analysis of Fundamental Frequency Estimation Methods with Application to Pathological Voices. Med Eng Phys. 2000;2:135–47.

9. Rao BD, Arun KS. Model based processing of signals: a state space approach. Proc of the IEEE. 1992;80:283–309.

10. Ephraim Y, Van Trees HL. A signal subspace approach for speech enhancement. IEEE Trans Speech Audio Process. 1995;3:251–66.

11. Fort A, Ismaelli A, Manfredi C, Bruscaglioni P. Parametric and non Parametric Estimation of Speech Formants: Application to Infant Cry. Med Eng Phys. 1996;8:677–91.

12. Mallat SG. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Machine Intelligence. 1989;7:674–93.

13. Daubechies I. The wavelet transform, time-frequency localisation and signal analysis. IEEE Trans on Info Theory. 1990;5:961–1005.

14. Kadambe S, Boudreaux-Bartels GF. Application of the Wavelet transform for pitch detection of speech signals. IEEE Trans Inf Theory. 1992;2:917–24.

15. Deller JR, Proakis JG, Hansen JHL. Discrete-time processing of speech signals. New York: Macmillan Pub. Co; 1993.

16. Manfredi C. Adaptive Noise Energy Estimation in Pathological Speech Signals. IEEE Trans Biomed Eng. 2000;47:1538–42.

17. Manfredi C, Peretti G. A new insight into post-surgical objective voice quality evaluation. Application to thyroplastic medialisation. IEEE Trans Biomed Eng. 2005 (in press).

18. Manfredi C, Peretti G, Bocchi L, Bruscaglioni P. Tracking disphonic voice parameters: application to unilateral vocal cord paralysis. Proc. Irish Signals and System Conf., Treaty Press Ltd. Limerick, Ireland, June 30–July 2, 2003: 142–7.

19. Manfredi C, Bruscaglioni P. Pitch and noise estimation in pathological speech signals. Proc. World Multiconf. on Systemics, Cybernetics and Informatics, IIIS Orlando, FL, USA, July 21–25, 2001: 388–93.

20. Fischer PM. Die Stimme des Sängers. Stuttgart: Metzler; 1993.

21. Bretos J, Sundberg J. Measurements of vibrato parameters in long sustained crescendo notes as sung by ten sopranos. J of Voice. 2003;17:343–53.

22. Markel JD, Gray AH. Linear prediction of speech. Berlin, DE: Springer-Verlag; 1982.

23. Fort A, Manfredi C. Acoustic analysis of new-born infant cry signals. Med Eng Phys. 1998;20:432–42.

24. Manfredi C, D'Aniello M, Bruscaglioni P. A simple subspace approach for speech denoising. Logoped Phoniatr Vocol. 2001;4:179–92.

25. Shipp T, Leanderson R, Sundberg J. Some acoustic characteristics of vocal vibrato. Int J of Research in Choral Singing. 1980;4:18–25.

26. Fussi F, Magnani S. L'Arte Vocale. Bologna: Omega Eds; 1994.

27. Fussi F. La Voce del Cantante: Saggi di Foniatria Artistica. Bologna: Omega Eds; 2000.

