ORIGINAL ARTICLE
Objective analysis of the singing voice as a training aid
T. SANGIORGI1, C. MANFREDI2 & P. BRUSCAGLIONI1
1Dept. of Physics, Universita degli Studi di Firenze and Istituto Nazionale per la Fisica della Materia, Firenze, Italy, 2Dept. of
Electronics and Telecommunications, Universita degli Studi di Firenze, Firenze, Italy
Abstract
A new tool for robust tracking of fundamental frequency is proposed, along with an objective measure of main singing voice parameters, such as vibrato rate, vibrato extent, and vocal intonation. High-resolution Power Spectral Density estimation is implemented, based on AutoRegressive models of suitable order, allowing reliable formant tracking also in vocalizations characterized by highly varying values. The proposed techniques are applied to about 1000 vocalizations, coming from both professional and non-professional singers, and show better performance as compared to classical Fourier-based approaches. If properly implemented, and with a user-friendly interface, the new tool would allow real-time analysis of singing voice. Hence, it could be of help in giving non-professional singers and singing teachers reliable measures of possible improvements during and after training.
Key words: AR models, formants, fundamental frequency, objective parameter evaluation, parametric PSD, singing voice
analysis, vibrato extent, vibrato rate
Introduction
This paper aims at contributing to the objective
analysis of singing voice, in order to give non-
professional singers either an aid to improve their
voice capabilities or a criterion to prevent a wrong
vocal attitude (positioning, posture, etc.) that could
even cause vocal pathologies.
The need for objective singing voice analysis arises
from the fact that learning to sing is mainly based on
what the singer and the teacher perceive
during a performance. It is still mainly up to
the singing teacher to evaluate the quality of a
performance: although several effective visual
aids are available at present (1–6), few of them are
supported by robust objective means to evaluate
singing capability and/or improvements. Specifically,
robust, high-resolution and adaptive techniques are
required for this application. The aim of the present
work is to provide such techniques, which could also
be successfully integrated into already existing tools.
Our study involved both professional and
non-professional male and female singers (40% male and
60% female), with ages ranging between 20 and 45
years, who declared themselves to be in good health. About
15% of them were smokers.
Professionals (baritone, tenor, contralto, mezzo-
soprano, soprano) are Western opera singers and
singing teachers in music academies (Firenze and
Perugia, Italy) and in private institutes in the
Tuscany region, Italy. Non-professionals are mem-
bers of the Firenze University chorus, exhibiting
non-homogeneous singing technique skills, rang-
ing from the very beginner to almost-professional
singers.
Our study represents the first step within a wider
project that consists in evaluating and monitoring
the development of singing capabilities of the
members of the Firenze University chorus.
For comparison, the most important Western
opera singing techniques were considered, such as:
sustained vowels at different frequencies, vibrato
(periodic fundamental frequency modulation of
about two semitones), trillo (periodic fundamental
frequency modulation of about four semitones),
glissando (uniform ascending or descending scale),
messa di voce (increasing and lowering the intensity
of a sustained vowel at constant pitch). About 1000
recordings were analysed. Notice that some of these
techniques, such as trillo and messa di voce, require
high capabilities. Hence, they were only performed
by highly skilled singers.
Correspondence: Claudia Manfredi, Department of Electronics and Telecommunications, Faculty of Engineering, Universita degli Studi di Firenze, Via S.
Marta 3, 50139 Firenze, Italy. Fax: +39-055-494569. E-mail: [email protected]
Logopedics Phoniatrics Vocology. 2005; 30: 136–146
ISSN 1401-5439 print/ISSN 1651-2022 online © 2005 Taylor & Francis
DOI: 10.1080/14015430500294064
Logoped Phoniatr Vocol. Downloaded from informahealthcare.com by UB Mainz on 10/25/14. For personal use only.
By analysing professional singers, as well as
commercial recordings of famous singers, and on
the basis of the results proposed in literature, we
have evaluated some key parameters, along with
their range of variation, that could objectively
characterize a well-performed vocal exercise.
Specifically, to determine the quality of a vibrato
vocalism, vibrato rate (i.e., the rate of the vibrato
modulation, expressed in cycles/s), vibrato extent, and
vocal intonation should be evaluated, along with their
standard deviations. Similarly, in a glissando, the
range of frequencies reached with a specific vocal
register without voice breaks should be measured.
Finally, by analysing a sustained vowel or messa di
voce without vibrato, it should be pointed out how
properly the singer is capable of ‘hitting the note’
and ‘keeping it’.
Moreover, in all the cited techniques, it should be
determined how appropriate the vocal articulation
was. This has been obtained by means of robust
techniques for spectral analysis (pitch and formant
tracking), capable of dealing with highly varying
signal parameters such as those encountered in this study.
Analysis techniques
The main features of singing voice comprise funda-
mental frequency (F0) of vocal fold oscillation,
which is directly related to pitch, along with its
modulation in time and frequency, and formants,
i.e., the resonance frequencies of the vocal tract,
along with their energy.
Due to high signal variability, robust analysis
techniques were implemented, capable of following
the fast and large fundamental frequency variations
typical of some vocalizations. Moreover, parametric
techniques for high-resolution formant estimation
were applied, based on AutoRegressive (AR) models
of suitable order, linked to singer’s sex and to the
signal sampling frequency Fs. With AR models, the
signal s(n) at time instant n is described by a linear
combination of its p past values, plus a noise term
e(n):
$$s(n) = a_1 s(n-1) + a_2 s(n-2) + \dots + a_p s(n-p) + e(n) \tag{1}$$
In Equation 1, the unknowns are the model order
p and the parameters ai, with i = 1, ..., p. This
problem can be solved with the approach based on
system identification theory (7).
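As a minimal illustration of Equation 1, the sketch below assumes the sign convention s(n) = a1 s(n-1) + ... + ap s(n-p) + e(n) and checks it on a pure tone, which obeys an exact order-2 recursion with a1 = 2cos(ω) and a2 = -1. The function names and the test tone are hypothetical and are not part of the proposed tool.

```python
import math

def ar_predict(past, coeffs):
    """One-step prediction of Equation 1 without the noise term.

    past[-1] is s(n-1), past[-2] is s(n-2), and so on;
    coeffs is [a1, a2, ..., ap] (plus-sign convention, an assumption).
    """
    return sum(a * past[-k] for k, a in enumerate(coeffs, start=1))

def prediction_residual(signal, coeffs):
    """Residual e(n) = s(n) - sum_k a_k s(n-k), for n >= p."""
    p = len(coeffs)
    return [signal[n] - ar_predict(signal[:n], coeffs)
            for n in range(p, len(signal))]

# Hypothetical test signal: a 220 Hz tone sampled at 8 kHz.
fs = 8000.0
omega = 2.0 * math.pi * 220.0 / fs
tone = [math.sin(omega * n) for n in range(200)]

# A sinusoid satisfies s(n) = 2 cos(omega) s(n-1) - s(n-2) exactly.
coeffs = [2.0 * math.cos(omega), -1.0]
max_abs_residual = max(abs(e) for e in prediction_residual(tone, coeffs))
```

For real voice frames the coefficients are not known in advance and must be estimated from the data, which is the identification problem referred to above.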
Classical techniques based on Fourier transform
were also considered, to compare results in terms of
robustness and resolution capability, especially when
applied to short data frames.
Fundamental frequency estimation
The fundamental frequency F0 is estimated here by
means of a two-step procedure. The choice of the
techniques adopted in each step results from a
detailed comparative analysis of pitch extraction
methods (8).
Simple Inverse Filter Tracking (SIFT) is applied
first, on signal time windows of short and fixed
length M. The window length is chosen as M = 3Fs/Fmin,
where Fs is the signal sampling frequency, and
Fmin is the minimum allowed F0 value for the signal
under consideration (here: Fmin = 50 Hz,
corresponding to very low male pitch). A short time
window is required, due to possible high non-
stationarity of the signal under study.
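The window-length rule M = 3Fs/Fmin can be checked with a few lines of arithmetic. The sketch below uses the 44.1 kHz sampling frequency reported later for the recordings; the function name is hypothetical.

```python
def sift_window_length(fs_hz, fmin_hz=50.0, periods=3):
    """Window length in samples: `periods` pitch periods at the lowest allowed F0."""
    return int(round(periods * fs_hz / fmin_hz))

fs_hz = 44100.0
m = sift_window_length(fs_hz)       # samples
duration_ms = 1000.0 * m / fs_hz    # corresponding duration in ms
```

With these values the window spans three periods of a 50 Hz tone, i.e. 60 ms.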
The SIFT approach follows the basic strategy of
pre-whitening the speech signal followed by auto-
correlation, the pre-whitening step involving the use
of Linear Prediction (LP) based Inverse Filtering
(IF). To create an IF, a low-order LP is usually
selected (order p ≈ 4), since no more than two
formants are expected in the low-passed signal
frame. However, for highly varying signals such as
those under study, an adaptive choice for the filter
order is proposed here, based on Singular Value
Decomposition (SVD) of matrices whose entries
come from sampled speech data frames, properly
organized (9,10). Notice that SVD requires selecting
the ‘size’ p of the signal subspace, i.e., the minimum
number of eigenvectors spanning the ‘meaningful’
data. To this aim, a variable threshold is defined
by means of the Dynamic Mean Evaluation (DME)
criterion, which relies on the geometric distance between
‘large’ and ‘small’ singular values and was found to be
more parsimonious than classical approaches (11).
The DME is applied to the decreasing sequence of
squared singular values σi². Typically, with DME, 2 ≤ p ≤ 6
during the utterance, due to changing signal
characteristics: the larger the estimated p, the more
varying the signal.
From the first step, a first raw F0 tracking is
obtained, along with its range of variation [Fl,Fh].
The second step gives a more accurate F0 estima-
tion and allows defining the optimum and possibly
varying window length (time window, TW), also
used for formant estimation, as will be explained in
the next section.
F0 is now adaptively estimated in the frequency
range [Fl, Fh] obtained in the previous step. Moreover,
data frames overlap by 3/4 of their length, in
order to track both F0 and formant values more
accurately. This step applies the Continuous Wavelet
Transform (CWT) and the Average Magnitude Difference
Function (AMDF) on short time windows of
varying length (three pitch periods, hence inversely
proportional to the previously estimated local F0). To
perform F0 estimation, the signal is band-pass
filtered (50–1000 Hz, or more) with a proper
CWT (Mexican hat) (12–14), and its periodicity is
extracted by means of the AMDF approach. The AMDF
analysis, which is carried out directly on the signal
in the time domain, can be used to detect fast and
slow variations of the fundamental frequency F0 (15).
The AMDF was chosen instead of the autocorrelation sequence
(AS) because of the non-stationarity and amplitude
modulation of the signals under study, which were
shown to often cause misestimation of the true
signal periodicity with the AS (8). For fast and abrupt
F0 changes, this procedure was shown to increase
robustness in F0 estimation, giving enhanced results
with respect to standard ones (16,17).
Summarizing, the full procedure for F0 estimation
is the following:
• The signal is band-pass filtered in the range
50–1000 Hz (i.e., the F0 range for the singing
voices under study); this range can be increased,
if required.
• On each time window of length 3Fs/F0min, SVD
and SIFT are applied, and a first rough F0
estimate is obtained, along with its range of
variation Fl ≤ F0 ≤ Fh.
• Inside [Fl, Fh], on each TW of varying length
TW = 3Fs/F0 (F0 estimated in the previous
step), a coefficient matrix CWT(h,s) is
obtained, where h is the shift parameter and s is
the scale parameter.
• From the coefficient matrix CWT(h,s), the
optimum scale value, ŝ, is selected as the one
corresponding to the maximum entry, which
represents the best fit of the wavelet to the data.
• The AMDF technique is applied to CWT(h,ŝ),
thus obtaining the estimate of F0 as F0 = Fs/hmin,
where hmin is the lag of the AMDF minimum.
These steps are described in detail in (16,18,19).
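The last step of the procedure, F0 = Fs/hmin, can be sketched as follows. This is a simplified illustration only: it applies the AMDF directly to a synthetic frame and assumes that the search range [Fl, Fh] is already known, leaving out the SVD, SIFT and CWT stages. All names and test values are hypothetical.

```python
import math

def amdf(frame, lag):
    """Average magnitude difference between the frame and its copy shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def estimate_f0_amdf(frame, fs, f_low, f_high):
    """Return Fs / h_min, with h_min the lag of the AMDF minimum inside [Fl, Fh]."""
    lag_min = max(1, int(fs / f_high))
    lag_max = min(len(frame) - 1, math.ceil(fs / f_low))
    h_min = min(range(lag_min, lag_max + 1), key=lambda lag: amdf(frame, lag))
    return fs / h_min

# Synthetic frame: 200 Hz tone plus one harmonic, three pitch periods long.
fs = 8000.0
frame = [math.sin(2 * math.pi * 200.0 * n / fs)
         + 0.5 * math.sin(2 * math.pi * 400.0 * n / fs)
         for n in range(120)]
f0_est = estimate_f0_amdf(frame, fs, f_low=150.0, f_high=300.0)
```

Restricting the lag search to [Fl, Fh] is what keeps the minimum at the true period rather than at one of its multiples or sub-multiples.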
Vibrato estimation
Vibrato singing technique consists of an almost
periodic F0 modulation. In professional singers,
vibrato profile often shows a sinusoidal, triangular
or a generally well defined periodic shape. The
mechanism of vibrato production is not yet comple-
tely known. Fisher (20) explained it considering the
presence of two phenomena, glottal wave and
breathing wave.
According to the literature (6), three parameters
can be evaluated for vibrato characterization:
1. Vibrato rate (VRate), which represents the
number of F0 oscillations per unit time (cycles/s). In
this work, it has been evaluated as the mean of the
reciprocal time differences between
subsequent maxima:

$$\mathrm{VRate} = \frac{1}{N}\sum_{i=1}^{N-1}\left|\frac{1}{t^{Max}_{i+1} - t^{Max}_{i}}\right| \tag{2}$$

where $t^{Max}_{i}$ is the time of the i-th maximum and N is the number of
maxima.
2. Vibrato extent (VExtent), which is the difference
in frequency between a maximum and a
minimum within a cycle (Hz). Here, it has been
obtained as the mean of the differences:

$$\mathrm{VExtent} = \frac{\sum_{i=1}^{N}\left(f^{Max}_{i} - f^{Min}_{i}\right)}{N} \tag{3}$$

where $f^{Max}_{i}$ and $f^{Min}_{i}$ are the frequency values of
the i-th maximum and minimum respectively,
in each vibrato cycle.
3. Vocal intonation (MF0), which is the trend of
the mean of the maximum and minimum
frequencies in the i-th cycle (21):

$$\mathrm{MF0}_i = \frac{f^{Max}_{i} + f^{Min}_{i}}{2} \tag{4}$$
This parameter has received less attention in
literature with respect to the other two, and hence
has not been deeply investigated in this study.
For each parameter, standard deviation was also
evaluated, as a measure of the quality of the
performance.
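The three parameters and their standard deviations can be computed from the cycle maxima and minima in a few lines. The sketch below rests on several assumptions that are not stated in the text: the rate is averaged over the intervals between consecutive maxima, the intonation of each cycle is taken as the midpoint between its maximum and minimum, and the population standard deviation is used. The function name and the data are hypothetical.

```python
import statistics

def vibrato_parameters(t_max, f_max, f_min):
    """t_max: times of the cycle maxima [s]; f_max, f_min: F0 extremes per cycle [Hz]."""
    # Instantaneous rate from each pair of consecutive maxima (cycles/s).
    rates = [1.0 / abs(t_next - t_now) for t_now, t_next in zip(t_max, t_max[1:])]
    # Per-cycle extent (Hz) and per-cycle intonation (midpoint, Hz).
    extents = [hi - lo for hi, lo in zip(f_max, f_min)]
    intonation = [(hi + lo) / 2.0 for hi, lo in zip(f_max, f_min)]
    return {
        "VRate": statistics.mean(rates),
        "sigma_VR": statistics.pstdev(rates),
        "VExtent": statistics.mean(extents),
        "sigma_VE": statistics.pstdev(extents),
        "MF0": intonation,
        "sigma_MF0": statistics.pstdev(intonation),
    }

# Hypothetical vibrato with one maximum every 0.2 s.
params = vibrato_parameters(
    t_max=[0.10, 0.30, 0.50, 0.70],
    f_max=[352.0, 351.0, 353.0, 352.0],
    f_min=[340.0, 341.0, 339.0, 340.0],
)
```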
The proposed algorithm exploits the F0 values
obtained as described in the previous section. Notice
that F0 is evaluated as the mean value over three
50% overlapped pitch periods, each of about 5–7 ms
duration, hence on time windows of 15–20 ms at
most. A vibrato cycle is about ten times longer,
hence the averaging effect is negligible.
Evaluating VRate, VExtent and MF0 required
setting up a routine capable of correctly finding
absolute maxima and minima, i.e., the maximum
and minimum value in each vibrato cycle. The
implemented routine should also be capable of
dealing with both professional and non-professional
singers. In fact, the irregularity of the vibrato profile
produces maxima and minima with varying amplitude
in each cycle; therefore a robust criterion to
discriminate between absolute and relative maxima
or minima is required.
As shown in Figure 1, concerning professional and
non-professional baritone vocalizations (sustained /
a/ vowel), while the vocalization by professionals
presents a regular F0 profile (Figure 1a), non-
professionals and, especially, amateurs (Figure 1b)
exhibit an irregular profile.
Notice also that a threshold criterion, such as F0
mean value, to separate the set of maxima from that
of minima, cannot be used, due to a possibly
irregular vibrato profile. A threshold criterion cannot
be used in trillo technique either, due to its peculiar
profile (Figure 1c).
Taking into account the above mentioned difficul-
ties, the new procedure for absolute maxima identi-
fication is as follows:
• All F0 values are possible candidates as the
absolute maximum in the cycle. A routine
compares the value of the F0 candidate to
that of a selected set of values around the
candidate.
• If the value of the candidate is the largest one in
the set, then it is selected as the absolute
maximum within that cycle; otherwise the
next value becomes the new candidate.
• The size of the set of points around the
candidate value (the frame length) needs to be
adequately chosen, to prevent the presence of
two maxima pertaining to different vibrato
cycles. It corresponds to about 70 ms, i.e.,
the time range roughly corresponding to half of
a cycle for a vibrato rate of 7 cycles/s. However,
in a few cases, where the maxima and minima
identification is unclear, this value could be
manually set after visual inspection (see
Figure 2a and 2b).

Figure 1. Normalized amplitude and F0 [Hz] versus time [s]. a: sustained /a/ with vibrato, professional baritone (mean F0 = 154.1127 Hz); b: sustained /a/ with vibrato, non-professional baritone (mean F0 = 191.8141 Hz); c: sustained /e/ with trillo, professional baritone (mean F0 = 258.3974 Hz).
The procedure for finding absolute minima fol-
lows the same guidelines.
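The candidate test described above can be sketched as a sliding comparison over the F0 track. The example assumes that the 70 ms frame is centred on the candidate and that the F0 track is uniformly sampled; the function name, the 200 Hz track rate and the synthetic irregular vibrato are hypothetical.

```python
import math

def find_cycle_maxima(f0_track, track_rate_hz, frame_length_s=0.070):
    """Indices whose F0 value is the unique largest one in the surrounding frame."""
    half = max(1, round(frame_length_s * track_rate_hz / 2.0))
    maxima = []
    # Candidates too close to the ends lack a complete frame and are skipped.
    for i in range(half, len(f0_track) - half):
        frame = f0_track[i - half:i + half + 1]
        candidate = f0_track[i]
        if candidate == max(frame) and frame.count(candidate) == 1:
            maxima.append(i)
    return maxima

# Synthetic irregular vibrato: 6 cycles/s modulation plus a small fast ripple
# that creates relative maxima inside each cycle.
track_rate = 200.0
track = [155.0 + 5.0 * math.sin(2 * math.pi * 6.0 * n / track_rate)
         + 0.8 * math.sin(2 * math.pi * 40.0 * n / track_rate)
         for n in range(200)]
maxima = find_cycle_maxima(track, track_rate)
spacings = [b - a for a, b in zip(maxima, maxima[1:])]
```

Minima can be found with the same routine applied to the negated track, in line with the remark that both searches follow the same guidelines.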
Maxima and minima evaluation, along with vocal
intonation tracking, are presented in Figure 2.
Specifically, Figure 2a and 2c are relative to the
professional baritone (see Figure 1a), while Figure
2b, 2d concern the non-professional baritone (see
Figure 1b).
Figure 2 clearly shows maxima and minima
identification, as well as the trend of vocal intona-
tion, also for irregular vibrato.
Figure 2. F0 [Hz] versus time [s]. a and b: maxima (*) and minima (o) identification; c and d: vocal intonation.
Formant estimation
Two methods have been used to detect the resonances
of the vocal tract, called formants: a parametric
approach, based on AR models for the vocal
tract filter, and a classical one based on the Fourier
transform (FT). One of the main advantages of
parametric spectral analysis over classical approaches
consists in its high-resolution capability, as the
model extrapolates data outside the analysed win-
dow. Hence, better results are achieved with respect
to classical spectral estimators, where the spectral
resolution is limited by data windowing and side
lobes (7). Parametric formant estimation relies on a
vocal tract model made up by an interconnected
series of p cylindrical coaxial lossless cavities of
different length and diameter. The resonances (for-
mants) can be recovered from the maxima of the
Power Spectral Density (PSD), given by:
$$\mathrm{PSD}(f) = \frac{T}{\left|1 - \sum_{k=1}^{p} a_k e^{-j 2\pi f k T}\right|^{2}} \tag{5}$$
where T is the sampling period and ak, k = 1, ..., p, are
the coefficients of the AR model of order p that
describes the vocal tract (15,22). The formant
tracking procedure was tested on high-pitched
synthesized signals (14,23), quite similar to the
ones under study here. Robustness to noise and
resolution capabilities were tested, also in case of
almost non-stationary signals, with very good results
as compared to non-parametric approaches.
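Once the AR coefficients are available, Equation 5 can be evaluated on a frequency grid and its maxima taken as formant candidates. The sketch below assumes the plus-sign convention s(n) = a1 s(n-1) + ... + ap s(n-p) + e(n), so that the denominator is |1 - Σ ak exp(-j2πfkT)|². It uses a hypothetical order-2 model with a single resonance placed near 800 Hz instead of coefficients estimated from voice data.

```python
import cmath
import math

def ar_psd(coeffs, fs_hz, freqs_hz):
    """Evaluate PSD(f) = T / |1 - sum_k a_k exp(-j 2 pi f k T)|^2 on a grid."""
    t = 1.0 / fs_hz
    psd = []
    for f in freqs_hz:
        denom = 1.0 - sum(a * cmath.exp(-2j * math.pi * f * k * t)
                          for k, a in enumerate(coeffs, start=1))
        psd.append(t / abs(denom) ** 2)
    return psd

def spectral_peaks(freqs_hz, psd):
    """Frequencies of the strict local maxima of the PSD."""
    return [freqs_hz[i] for i in range(1, len(psd) - 1)
            if psd[i - 1] < psd[i] > psd[i + 1]]

# Hypothetical resonator: pole radius 0.98 at 800 Hz, Fs = 10 kHz.
fs_hz = 10000.0
radius = 0.98
theta = 2.0 * math.pi * 800.0 / fs_hz
coeffs = [2.0 * radius * math.cos(theta), -radius ** 2]

freqs = [10.0 * i for i in range(501)]   # 0 Hz to 5 kHz in 10 Hz steps
psd = ar_psd(coeffs, fs_hz, freqs)
peaks = spectral_peaks(freqs, psd)
```

A higher-order model fitted to a voice frame yields several such maxima, and the frequency grid step sets how finely close formants can be separated when reading them off the curve.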
Notice that AR spectral estimators are sensitive to
order selection: if the model order p is overestimated,
formant splitting may occur, while underestimation
smooths the spectrum and causes misallocation
of spectral peaks. Many criteria have been
defined for finding the best model order p, including
both the estimated variance σ² and the model
complexity p in one set of statistics. Such criteria
are characterized by loss functions for which a
minimum can be achieved. However, they were
shown to be rather unreliable for short data frames,
due to long-term convergence properties. In this
paper, the relation p ≈ Fs (Fs = signal sampling
frequency, in kHz) was found the best one for
obtaining a sufficiently detailed spectrum.
This relation comes from the physical constraint
Fs = pc/2L, where L is the length of the vocal tract
and c is the speed of sound (22). In fact, in order to
adequately represent the vocal tract model by means
of the polynomial A(z), its ‘memory’ (i.e., its order)
must be equal to twice the time required for sound
waves to travel from the glottis to the lips, that is,
2L/c. For the adult male, L ≈ 17 cm; hence, with
c = 34 cm/ms, the necessary memory amounts to 1
ms. Thus, with a sampling frequency Fs = 10 kHz,
the filter order p must be at least 10. The higher the
Fs, the higher the value of p. For female voices, L
being smaller (about 14 cm), a lower value of p
should be selected, around 3/4 Fs. This choice was
added to the formant estimation technique.
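The order rule follows from simple arithmetic, p = 2L·Fs/c, as the sketch below shows. The function name is hypothetical, and the result is returned as a real number so that the rounding choice is left to the user.

```python
def minimum_ar_order(fs_khz, tract_length_cm, c_cm_per_ms=34.0):
    """Vocal tract 'memory' 2L/c (in ms) expressed in samples at Fs (in kHz)."""
    memory_ms = 2.0 * tract_length_cm / c_cm_per_ms
    return memory_ms * fs_khz

p_male_10k = minimum_ar_order(10.0, 17.0)     # adult male, Fs = 10 kHz
p_female_10k = minimum_ar_order(10.0, 14.0)   # female voice, Fs = 10 kHz
p_male_44k = minimum_ar_order(44.1, 17.0)     # adult male, Fs = 44.1 kHz
```

For L = 17 cm the result equals Fs in kHz, which is the relation p ≈ Fs; for L = 14 cm it is somewhat lower, of the same order as the 3/4 Fs rule quoted above.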
Notice that the choice of a model order near or
equal to Fs prevents spectral smoothing and,
consequently, loss of spectral peaks. This approach has
already been proved effective in many applications,
with enhanced results as far as resolution is con-
cerned (11,19), and is of utmost importance in the
present study, that requires exact formant evalua-
tion. For example, spectral resolution is of great
relevance in Western opera singing, as male singers
use a characteristic vocal articulation to realize a
cluster of the third, fourth and fifth resonances,
which allows them to increase loudness and reach a
specific vocal timbre (6).
Among possible choices, the proposed approach
performs the minimization of the average of the
forward and backward squared prediction errors
over the available data. This approach, commonly
named the ‘modified covariance method’, was
shown to give the best results as far as reduction of
spectral line splitting and bias of the frequency
estimate are concerned (7,24).
Experimental results
Our study concerns five professional singers: one
baritone for male voices, one contralto, one mezzo-
soprano and two sopranos for female voices. The
baritone and the sopranos were trained in lyric
repertoires, the others in baroque music.
Non-professional singers come from the Firenze
University Choir, which is made up of about 80
singers, with non-homogeneous training, ranging
from the very beginner to the almost expert. About
1000 recordings were obtained from 20 non-profes-
sional singers (2 bassos, 3 baritones, 3 tenors, 2
contraltos, 3 mezzo-sopranos, 7 sopranos). For each
singer, about 40 sung tokens were recorded, corre-
sponding to different vocalizations and different
vowels.
Computations were carried out under the
Matlab R12 development environment. Processing
time is low for most signals (2–3 s length):
specifically, about 1 min for F0 and related parameters,
and another 1 min for spectrogram, formants and
PSD on a standard PC. The longer the signal, the
longer the processing time. However, implementing
the software in C++ or Assembler language
should allow for real-time processing.
In this section, some general remarks are made,
concerning the singing voice features obtained with
the proposed approach, when applied to both
professional and non-professional singers. The re-
ported results are relative to the whole data set under
study. Moreover, two examples are given to compare
results relative to a non-professional and an almost
professional tenor singer, as far as the vibrato
technique is concerned. Objective parameters and
plots allow quantifying the better
performance of the second singer with respect to
the first one.
Firstly, for professional singers, the features of a
properly performed vocal technique were analysed.
This is in fact of importance, as such features could
be considered as a reference set for non-profes-
sionals. These results were also compared to those
proposed in the literature.
As previously mentioned, main singing techniques
comprise: sustained vowels at different frequencies,
vibrato, trillo, glissando, and messa di voce.
To evaluate the quality of the sustained vowel and
messa di voce techniques, F0 mean was considered
to represent the height of the note (i.e., its
frequency). Two more parameters were also
considered: F0 standard deviation (σ) and the difference
between the F0 maximum and the F0 minimum in
the vocalization (Δ):
$$\Delta = F0_{max} - F0_{min} \tag{6}$$
For a well-performed sustained vowel, it was
found that σ < 1 Hz and Δ < 5 Hz.
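The two thresholds can be turned into a simple check on an F0 track. In the sketch below the population standard deviation is used, which is an assumption since the estimator is not specified; the function names and the two example tracks are hypothetical.

```python
import statistics

def sustained_note_stats(f0_track_hz):
    """Return (F0 mean, sigma, delta) for a sustained note, all in Hz."""
    mean_f0 = statistics.mean(f0_track_hz)
    sigma = statistics.pstdev(f0_track_hz)
    delta = max(f0_track_hz) - min(f0_track_hz)   # Equation 6
    return mean_f0, sigma, delta

def is_well_sustained(f0_track_hz, sigma_max_hz=1.0, delta_max_hz=5.0):
    """Apply the reference limits sigma < 1 Hz and delta < 5 Hz."""
    _, sigma, delta = sustained_note_stats(f0_track_hz)
    return sigma < sigma_max_hz and delta < delta_max_hz

steady = [220.0, 220.4, 219.7, 220.1, 219.9]     # small fluctuations
drifting = [220.0, 222.0, 224.0, 226.0, 228.0]   # pitch drifts upwards
steady_ok = is_well_sustained(steady)
drifting_ok = is_well_sustained(drifting)
```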
In the glissando technique, attention was focused
on the vocal extension, and in particular on the
frequency range that a singer was capable of
performing in a specific vocal register without voice
breaks.
On the contrary, non-professionals, and especially
amateurs, often exhibited a limited vocal extension,
voice breaks and an improper use of vocal registers.
This means that they were almost unable to perform
a wide range of frequencies with the register required
by the musical repertory. Specifically, in ascending
glissandos performed by non-professional tenors,
voice breaks were found between head voice and
falsetto in the frequency range 340–400 Hz.
Moreover, in Western opera, falsetto is generally
not used for frequencies below 450 Hz.
As already said, the main parameters used to
characterize vibrato technique are Vibrato Rate
(VRate), Vibrato Extent (VExtent), and Vocal Intonation
(MF0), along with their standard deviations
(Equations (2)–(4)). According to the literature (6), a
good quality vibrato has a VRate value generally
pertaining to the range 5.5–7.5 cycles/s, and a
VExtent value of less than 2 semitones. However,
VRate is a parameter that seems to vary according to
age, sex, and emotional status of the singer (25).
Recent studies (21) have shown how vibrato rate
could also vary according to the frequency of the
note performed. The professional singers that were
analysed showed VRate values varying between
4.5 and 7 cycles/s, with a standard deviation σVR <
0.36 cycles/s, and a VExtent < 2 semitones, with a
standard deviation σVE < 5 Hz.
These values were also found in the analysis of
messa di voce executed with vibrato by professionals,
where vibrato was present in the central part of the
vocalization.
Notice that most professionals and non-profes-
sionals analysed with our approach presented a
vibrato rate which varied according to the frequency
of the note performed, in agreement with Sundberg’s
results (21).
Professionals declared that changing the vibrato
rate helped them ‘to free their voice’ in the mechan-
ism of vibrato production.
In non-professionals, vibrato rate could also
vary according to the kind of vowel performed.
This is due to the fact that non-professionals,
especially amateurs, have not yet reached a high
level of technique, and consequently they have not
enough self-confidence in performing. Finally,
notice that a few amateurs could not perform the vibrato
technique on all the seven vowel sounds of the Italian
language.
Moreover, non-professionals, especially male
amateurs, showed a vibrato extent less than one
semitone. This feature could be due to the fact that
choir members are taught to use a smaller vibrato in
order to avoid strong interference with other singers.
Some remarks can also be made concerning a
possible relationship between vibrato waveform and
vocal register, a subject which is still not completely
investigated and explained in literature. With our
analysis, a link between these features was found in
some vocalizations of the Italian vowel /a/ for a
professional mezzo-soprano. In fact, performing a
vibrato in chest register led to a triangular-shaped
waveform, while producing a vibrato in mixed
register led to a sinusoidal-shaped waveform. This
seems to be connected to the mechanism of vibrato
production and should be further investigated.
Finally, as far as trillo technique is concerned, we
remark that, according to the literature, a good
quality trillo should have VRate values in the
range 5.5–7.5 cycles/s and VExtent less than 4
semitones (6).
Professional singers were free to perform different
trillo at several frequencies. By analysing trillo
technique in a professional baritone, commonly
three different parts of the trillo could be found: an
introduction, a ‘body’, and an end (Figure 1c).
These three parts showed peculiar features. Dur-
ing the introduction, there is a lower vibrato rate,
which seems useful to emphasize the following part
of the vocalization, but with the same vibrato extent
as in the body of trillo. The body and the end have
similar vibrato rates, but different vibrato extent,
which is smaller in the end.
The professionals examined with our technique
showed VRate values between 4.5 and 7 cycles/s, with a
standard deviation σVR < 0.37 cycles/s, and a
VExtent < 4 semitones, with a standard deviation
σVE < 5 Hz, in agreement with the ranges previously
indicated.
The following examples concern the vocalizations
of two tenors that show different vocal skills.
Specifically, the first one is a one-year trained singer
while the second is a three-year trained singer, who
also attends individual singing lessons. The goal is to
appreciate in an objective way the different vocal
capabilities through the analysis of a vibrato sung
token on Italian vowel /a/. To this aim, the two
singers were selected such that they have comparable
F0 mean values (around 350 Hz, for sustained /a/
vibrato sung token).
For each vocalization, the results concerning F0
estimation and formant tracking (labelled F1, F2,
F3, etc.) are presented. Specifically, the first plot
shows both the audio signal (Fs = 44.1 kHz, 16 bit
resolution), and F0 tracking, as obtained with the
proposed approach.
The second plot shows the signal spectrogram
(frequency versus intensity versus time), obtained
using FT with a Hann window of length equal to TW
(given by the second F0 estimation step) and 1/3
overlap. Formant trajectories, as obtained by the
proposed high-resolution AR parametric technique,
are overlapped on the spectrogram.
The third picture concerns Power Spectral Den-
sity (PSD). PSD has been estimated both by classical
Fast Fourier Transform (FFT) and parametric AR-PSD
(Equation (5)), with a model order p = Fs,
according to the considerations made in the section
devoted to formant estimation. Plots are overlapped,
in order to show the high-resolution capability of the
parametric approach.
Notice that the PSD plot represents the mean
(averaged over all the signal frames) relative to the
maximum value of the PSD, PSDmax, i.e.:
$$\mathrm{PSD} = 10\log_{10}\frac{\mathrm{Mean}(\mathrm{PSD}(\mathrm{frame}))}{\mathrm{PSD}_{max}} \tag{7}$$
and hence it appears smoothed. However, PSD
maxima are clearly found with the proposed ap-
proach, along with their energy. Notice also that
‘local’ PSD plots are obtained on each signal frame
during computations, and can be individually in-
spected, if required.
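The averaging and normalization of Equation 7 can be sketched as follows. The code assumes that PSDmax is the maximum of the frame-averaged PSD, so that the plotted curve peaks at 0 dB; the text does not say whether the maximum is taken before or after averaging. The function name and the toy data are hypothetical.

```python
import math

def mean_psd_db(frame_psds):
    """Average strictly positive per-frame PSDs bin by bin, then express them in dB re the maximum."""
    n_frames = len(frame_psds)
    mean_psd = [sum(column) / n_frames for column in zip(*frame_psds)]
    psd_max = max(mean_psd)   # assumption: maximum of the averaged PSD
    return [10.0 * math.log10(value / psd_max) for value in mean_psd]

# Two hypothetical frames with three frequency bins each.
frames = [[1.0, 4.0, 2.0],
          [3.0, 4.0, 2.0]]
psd_db = mean_psd_db(frames)
```

Averaging over frames explains the smoothed appearance of the plot: a peak that moves from frame to frame is spread over neighbouring bins.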
Figure 3 concerns the first tenor (singer with only
one year of training), and refers to a vibrato
vocalization.
As shown in Figure 3a, the vibrato profile is only
approximately sinusoidal. The following results per-
tain to this vocalization:
F0Mean = 345.01 Hz; σMF0 = 0.60 Hz;
VRate = 6.25 cycles/s; σVR = 0.46 cycles/s;
VExtent = 10.75 Hz; σVE = 3.41 Hz
The singer shows a vibrato rate that is within the
expected range of values, but a σVR that is higher
than the standard deviation associated with a good
quality performance. The irregularity of the vibrato
is also perceived at listening. Moreover, the vibrato
extent is much smaller than the reference
value (about 2 semitones). As previously pointed
out, this could be due to the fact that choir members
are taught to use a smaller vibrato to avoid
interfering too much with other voices.
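Since the measured extent is given in Hz while the reference value is given in semitones, a conversion helps the comparison. The sketch below assumes that the extent is symmetric about the mean F0, which is an approximation; the function name is hypothetical and the input values are those reported for this vocalization.

```python
import math

def extent_in_semitones(mean_f0_hz, extent_hz):
    """Peak-to-peak extent in semitones, assuming symmetry about the mean F0."""
    f_high = mean_f0_hz + extent_hz / 2.0
    f_low = mean_f0_hz - extent_hz / 2.0
    return 12.0 * math.log2(f_high / f_low)

extent_st = extent_in_semitones(345.01, 10.75)
```

Under this assumption the extent of the first tenor comes to roughly half a semitone, well below the reference value of about 2 semitones.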
Though non-professional, the singer performs
with good vocal intonation, as indicated by the low
value of σMF0.
Figure 3b represents the signal spectrogram, i.e.,
the distribution of the signal energy versus time and frequency. The
time window length is the optimal one, as obtained
with F0 estimation, and corresponds to three pitch
periods. Decreasing the time window length causes
lowering of frequency resolution, while increasing it
causes the reverse effect, with better clarity of the
harmonics. It is evident that the range of frequencies
up to about 1500 Hz is a high-energy range.
However, from the analysis of the spectrogram
alone, it is rather difficult to locate the
formants. Hence, formant trajectories, as obtained
with parametric AR models, are superimposed on
the spectrogram. The proposed approach succeeded
in resolving formants F3 and F4, even though this is
quite difficult, as they are very close to each other
because of clustering. The fifth formant (F5) is not
involved in the cluster and is not clearly tracked in
the picture, due to its low energy.
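The trade-off between window length and frequency resolution mentioned above can be made concrete with a short calculation. The sketch takes the window as three pitch periods (3Fs/F0 samples) with Fs = 44.1 kHz and the mean F0 reported for Figure 3. The function name and the use of Fs/N as an approximate resolution are illustrative assumptions.

```python
def analysis_window(fs_hz, f0_hz, n_periods=3):
    """Window spanning n_periods pitch periods and its approximate resolution."""
    n_samples = round(n_periods * fs_hz / f0_hz)
    duration_ms = 1000.0 * n_samples / fs_hz
    resolution_hz = fs_hz / n_samples  # approximate DFT bin spacing
    return n_samples, duration_ms, resolution_hz

n_samples, duration_ms, resolution_hz = analysis_window(44100.0, 345.01)
print(n_samples, round(duration_ms, 2), round(resolution_hz, 1))
```

For this vocalization the window is a few hundred samples long, so the bin spacing is about one third of F0: enough to separate neighbouring harmonics, but too coarse to separate formants that lie close together.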
The PSD plots, evaluated with both the FT and an AR
model of order 44 (matching the sampling frequency
Fs expressed in kHz), are shown in Figure 3c. While
the FT only resolves the harmonics of the spectrum,
the high-resolution AR technique makes it easier to
locate possible formant positions, even those very
close in frequency.
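A parametric AR spectrum of this kind can be sketched in a few lines. The example below uses the autocorrelation (Yule-Walker) method solved with the Levinson-Durbin recursion. This is an assumption, since this section does not say which AR estimator the authors used. The test signal is synthetic noise shaped by a single resonance at 1000 Hz, and the order is set to the sampling frequency in kHz, mirroring the AR(44) choice for 44.1 kHz.

```python
import cmath
import math
import random

def autocorrelation(x, max_lag):
    """Biased autocorrelation estimate for lags 0..max_lag (mean removed)."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    return [sum(xc[i] * xc[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns (a, err) with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def ar_psd(a, err, fs_hz, n_freqs=512):
    """Evaluate err / |A(f)|^2 on a grid from 0 Hz up to (not including) fs/2."""
    freqs, psd = [], []
    for m in range(n_freqs):
        f = m * (fs_hz / 2.0) / n_freqs
        w = 2.0 * math.pi * f / fs_hz
        A = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        freqs.append(f)
        psd.append(err / abs(A) ** 2)
    return freqs, psd

# Synthetic signal: white noise shaped by one resonance at 1000 Hz.
random.seed(0)
fs = 8000.0
radius, f_res = 0.98, 1000.0
theta = 2.0 * math.pi * f_res / fs
c1, c2 = 2.0 * radius * math.cos(theta), -radius ** 2
x = [0.0, 0.0]
for _ in range(4000):
    x.append(c1 * x[-1] + c2 * x[-2] + random.gauss(0.0, 1.0))

order = round(fs / 1000.0)  # 8 here; assumption mirroring AR(44) at 44.1 kHz
r = autocorrelation(x, order)
a, err = levinson_durbin(r, order)
freqs, psd = ar_psd(a, err, fs)
peak_hz = freqs[psd.index(max(psd))]
print(f"AR({order}) PSD peak near {peak_hz:.0f} Hz")
```

Because the AR spectrum is a smooth all-pole envelope rather than a set of harmonic lines, its peaks are natural formant candidates, which is why it is easier to read than the FT curve.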
Specifically, the first (F1), the second (F2) and the
fourth (F4) formants are clearly detected at about
800 Hz, 1200 Hz and 3200 Hz, respectively. The
third formant, F3, corresponds to a lower energy
level with respect to the others. It is located around
2700 Hz and forms a cluster with the fourth
formant (within the range 2500-3200 Hz).
However, in spite of the cluster, the
energy level associated with F3 and, especially, F4
is much lower than the energy associated with
the first two formants. In male singing voices, this
corresponds to improper vocal articulation that leads
not only to lower loudness, but also to voice
characteristics that are judged by teachers as not
suitable for the lyrical repertoire.
In conclusion, the first singer has not yet devel-
oped a good vocal technique. He cannot easily
handle the mechanism of vibrato production, and
he is not yet able to realize a proper vocal articula-
tion and thus requires further training.
Figure 4 concerns a singer who has had three years
of training, for a sung sustained /a/ with vibrato.
As shown in Figure 4a, the vibrato profile is
now almost sinusoidal. The following objective
parameters were obtained for this vocalization:
F0Mean = 340.51 Hz; σMF0 = 1.17 Hz
VRate = 5.45 cycles/s; σVR = 0.27 cycles/s
VExtent = 13.49 Hz; σVE = 2.16 Hz
The singer has very good values for vibrato rate.
However, in this vocalization the vibrato extent is
lower than the reference value (6).
The signal spectrogram (Figure 4b) shows two
high-level energy ranges of frequencies, the first one
lying between 500 Hz and 1500 Hz, and the second
between 3000 Hz and 4500 Hz. In both of them the
harmonics modulation caused by vibrato production
is clearly visible. Notice that, once again, the
spectrogram alone does not allow a precise formant
[Figure 3 panels. (a) Sustained /a/ with vibrato: normalized amplitude and F0 [Hz] versus time [s]; mean F0 = 345.0112 Hz. (b) Spectrogram and formants (*): frequency [Hz] versus time [s]; time window length = 3Fs/F0; sampling frequency Fs = 44.1 kHz. (c) Mean PSD [dB] versus frequency [Hz]; dashed: FFT; solid: AR(44).]
Figure 3. Non-professional tenor, vibrato for sustained /a/ vowel. a: signal amplitude and F0 tracking; b: spectrogram and formant tracking
(grey scale in dB); c: PSD mean (dashed: FFT; solid: AR).
identification. Formant trajectories are superim-
posed on the spectrogram for clarity.
Figure 4c shows the PSD with formants, as
obtained by FT and the AR parametric technique,
along with their energy. Specifically, the first five
formants, F1-F5, are clearly found and resolved
(at about 800 Hz, 1200 Hz, 3100 Hz, 3500 Hz,
4000 Hz). Moreover, resonances F3-F5 give rise to
a very high energy cluster, whose energy level is
comparable to the energy level of the first two
resonances.
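One way to put a number on "comparable energy level" is to compare mean PSD levels in two frequency bands. The sketch below is only illustrative: the band edges are borrowed from the high-energy ranges noted on the spectrogram (500-1500 Hz and 3000-4500 Hz), the PSD values are toy data, and the function names are hypothetical rather than part of the proposed tool.

```python
import math

def band_level_db(freqs_hz, psd, band_hz):
    """Mean PSD level in dB over the bins that fall inside band_hz."""
    f_lo, f_hi = band_hz
    values = [p for f, p in zip(freqs_hz, psd) if f_lo <= f <= f_hi]
    return 10.0 * math.log10(sum(values) / len(values))

def cluster_level_db(freqs_hz, psd,
                     low_band=(500.0, 1500.0), high_band=(3000.0, 4500.0)):
    """Level of the high-frequency cluster relative to the low band, in dB."""
    return (band_level_db(freqs_hz, psd, high_band)
            - band_level_db(freqs_hz, psd, low_band))

def toy_psd(freqs_hz, cluster_power):
    """Flat toy spectrum: unit power in the low band, cluster_power in the
    high band, and a low floor elsewhere (linear power units)."""
    out = []
    for f in freqs_hz:
        if 500.0 <= f <= 1500.0:
            out.append(1.0)
        elif 3000.0 <= f <= 4500.0:
            out.append(cluster_power)
        else:
            out.append(1e-4)
    return out

freqs = [100.0 * k for k in range(61)]  # 0 to 6000 Hz in 100 Hz steps
strong = cluster_level_db(freqs, toy_psd(freqs, 0.5))
weak = cluster_level_db(freqs, toy_psd(freqs, 0.01))
print(f"strong cluster: {strong:.1f} dB, weak cluster: {weak:.1f} dB")
```

A value close to 0 dB corresponds to a cluster as energetic as the first two resonances, while a strongly negative value corresponds to the weak F3/F4 region found for the first singer.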
This formant strategy leads to the well-known
‘singing formant’ (6,26,27). This high-energy cluster
is widely used by male singers to gain loudness and
to achieve the Western opera voice timbre. The
trained singer also showed very good vocal technique
in other vocalizations. In conclusion, these
results confirm that he has achieved an excellent
vocal articulation.
Conclusions
In this work, objective indices for singing voice
analysis are proposed. Due to high signal variability,
robust analysis techniques are implemented, capable
of following the fast and large fundamental frequency
variations typical of some vocalizations. Moreover,
vibrato rate and extent are evaluated, in order to give
the singer useful information concerning the level
achieved, as compared to that of professional
singers.
Parametric techniques for high-resolution formant
estimation are applied, based on AutoRegressive
(AR) models of suitable order, linked to singer’s
[Figure 4 panels. (a) Sustained /a/ with vibrato: normalized amplitude and F0 [Hz] versus time [s]; mean F0 = 340.5123 Hz. (b) Spectrogram and formants (*): frequency [Hz] versus time [s]; time window length = 3Fs/F0; sampling frequency Fs = 44.1 kHz. (c) Mean PSD [dB] versus frequency [Hz]; dashed: FFT; solid: AR(44).]
Figure 4. Three-years trained tenor, vibrato for sustained /a/ vowel. a: signal amplitude and F0 tracking; b: spectrogram and formant
tracking (grey scale in dB); c: PSD mean (dashed: FFT; solid: AR).
sex and signal sampling frequency Fs. This allows
accurate testing for possible singer formant develop-
ment and formant tuning production, as well as
quantifying their energy. Fundamental frequency
and formant tracking, along with PSD and spectro-
gram plots, give easily readable results. Sustained
vowels, vibrato, trillo, glissando and messa di voce
were analysed, from baritone, tenor, contralto,
soprano and mezzo-soprano singers. About 1000
recordings were considered, most coming from the
Firenze University Chorus. A database will be
developed and constantly updated, as a reference
for non-professional singers. The database would
allow each singer to follow her/his improvement and
compare results with a reference set, obtained from
professional singers.
Future work will be devoted to building a user-
friendly interface, as well as adding new meaning-
ful parameters to assess voice quality improve-
ment, that could be useful both for trainers and
singers.
Acknowledgements
The authors would like to thank both the
Firenze University Chorus and the professional
singers for their kind cooperation in providing a wide
set of vocalizations for this study.
References
1. Howard DM, Welch GF, Brereton J, Himonides E, DeCosta
M, Williams J, et al. WinSingad: A real-time display for the
singing studio. Logoped Phoniatr Vocol. 2004;3:135-44.
2. Howard DM, Welch GF. VOXed Project. Available from
URL: www.voxed.org.
3. Hunt AD, Howard DM, Morrison G, Worsdall J. Real time
interfaces for speech and singing. Proceedings of 26th
Euromicro Conference, IEEE Computer Society, Maastricht.
2000. 2, 356-61.
4. Howard DM. SINGAD: A visual feedback system for
children’s voice pitch development. In: White P, editor. Child
voice. Stockholm: KTH Voice Research Center; 2000. p. 45-62.
5. Garner PE, Howard DM. Real time display of voice source
characteristics. Logoped Phoniatr Vocol. 1999;24:19-25.
6. Sundberg J. The Science of the Singing Voice. DeKalb, Illinois:
Northern Illinois University Press; 1987.
7. Marple SL. Digital spectral analysis with applications.
Englewood Cliffs, NJ, USA: Prentice Hall; 1987.
8. Manfredi C, D’Aniello M, Bruscaglioni P, Ismaelli A. A
Comparative Analysis of Fundamental Frequency Estimation
Methods with Application to Pathological Voices. Med Eng
Phys. 2000;2:135-47.
9. Rao BD, Arun KS. Model based processing of signals: a state
space approach. Proc of the IEEE. 1992;80:283-309.
10. Ephraim Y, Van Trees HL. A signal subspace approach for
speech enhancement. IEEE Trans Speech Audio Process.
1995;3:251-66.
11. Fort A, Ismaelli A, Manfredi C, Bruscaglioni P. Parametric
and non Parametric Estimation of Speech Formants:
Application to Infant Cry. Med Eng Phys. 1996;8:677-91.
12. Mallat SG. A theory for multiresolution signal
decomposition: the wavelet representation. IEEE Trans Pattern Anal
Machine Intelligence. 1989;7:674-93.
13. Daubechies I. The wavelet transform, time-frequency
localisation and signal analysis. IEEE Trans on Info Theory.
1990;5:961-1005.
14. Kadambe S, Boudreaux-Bartels GF. Application of the
Wavelet transform for pitch detection of speech signals.
IEEE Trans Inf Theory. 1992;2:917-24.
15. Deller JR, Proakis JG, Hansen JHL. Discrete-time processing
of speech signals. New York: Macmillan Pub. Co; 1993.
16. Manfredi C. Adaptive Noise Energy Estimation in
Pathological Speech Signals. IEEE Trans Biomed Eng. 2000;47:1538-42.
17. Manfredi C, Peretti G. A new insight into post-surgical
objective voice quality evaluation. Application to thyroplastic
medialisation. IEEE Trans Biomed Eng. 2005 (in press).
18. Manfredi C, Peretti G, Bocchi L, Bruscaglioni P. Tracking
disphonic voice parameters: application to unilateral vocal
cord paralysis. Proc. Irish Signals and System Conf., Treaty
Press Ltd. Limerick, Ireland, June 30-July 2, 2003: 142-7.
19. Manfredi C, Bruscaglioni P. Pitch and noise estimation in
pathological speech signals. Proc. World Multiconf. on
Systemics, Cybernetics and Informatics, IIIS Orlando, FL,
USA, July 21-25, 2001: 388-93.
20. Fischer PM. Die Stimme des Sängers. Stuttgart: Metzler;
1993.
21. Bretos J, Sundberg J. Measurements of vibrato parameters in
long sustained crescendo notes as sung by ten sopranos. J of
Voice. 2003;17:343-53.
22. Markel JD, Gray AH. Linear prediction of speech. Berlin,
DE: Springer-Verlag; 1982.
23. Fort A, Manfredi C. Acoustic analysis of new-born infant cry
signals. Med Eng Phys. 1998;20:432-42.
24. Manfredi C, D’Aniello M, Bruscaglioni P. A simple subspace
approach for speech denoising. Logoped Phoniatr Vocol.
2001;4:179-92.
25. Shipp T, Leanderson R, Sundberg J. Some acoustic
characteristics of vocal vibrato. Int J of Research in Choral Singing.
1980;4:18-25.
26. Fussi F, Magnani S. L’Arte Vocale. Bologna: Omega Eds;
1994.
27. Fussi F. La Voce del Cantante: Saggi di Foniatria Artistica.
Bologna: Omega Eds; 2000.