CS 551/651: Structure of Spoken Language
Lecture 9: The Source-Filter Model of Speech Production
John-Paul Hosom, Fall 2008
The Source-Filter Model
One more model of speech… proposed in 1848 by Johannes Müller, developed by Gunnar Fant circa 1970. Also called the “Acoustic Theory of Speech Production”.
The Source-Filter Model provides a static description of speech; speech dynamics are dealt with in models of coarticulation.
According to this model, speech is defined by three parts:
1. A sound source: vibration of the vocal folds, air turbulence, or plosion
2. A tube through which the source passes: the vocal tract
3. Radiation of sound from the mouth
These 3 components are assumed to be independent.
We will discuss these three parts separately
The Source-Filter Model: Sound Source
Voiced Sound Source:
• produced by vibration of the vocal folds
• several models exist that describe the flow of air through the vocal folds
• each model describes the increase in air flow as the glottis opens, the decrease in air flow as it closes, and no air flow while the glottis remains closed during pressure buildup
• in the spectral domain, the shape is approximately flat at very low frequencies, and has a –12 dB/octave slope at higher frequencies
Models: Rosenberg, Fant (LF model), Fujisaki (FL model), Klatt
[Figure: glottal pulse waveform; vertical axis: air pressure (Pa); horizontal axis: time (msec); labeled phases: glottis opening, glottal closure, glottis opening]
The Source-Filter Model: Sound Source
Voiced Sound Source:
• models are of “glottal flow”
• glottal flow is the same as volume velocity, V, in units of m³/s
• volume velocity per unit area, or V/unit area, is in units of m/s, and is called the point velocity, v
• acoustic pressure, p, in Pascals, equals impedance Z times v: p = Z v
• impedance is constant for a given glottis and vocal tract
• therefore, acoustic pressure is directly proportional to glottal flow, and so the vertical axis of these models can be considered either glottal flow, volume velocity, or acoustic pressure (in micropascals)
The Source-Filter Model: Sound Source
All models have the following parameters:
• pitch period = 1/F0 = T0
• open quotient (OQ)
• skew (SK)
These three parameters are used in a function that describes how the sound pressure changes over time within one pitch period.
[Figure: one pitch period of the glottal waveform, marked with T0, OQ, and SK; labeled phases: glottis opening, glottal closure, glottis opening]
OQ is measured relative to T0; SK is measured relative to OQ.
The Source-Filter Model: Sound Source
The Rosenberg model:
(from http://www.physik3.gwdg.de/~micha/aachen98/aachen98.html)
gR(t) is a glottal pulse with amplitude A and duration T; gR(t) has three phases: the opening phase until time TO, the closing phase until time TC, and the closed phase with length T-(TO+TC).
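The three phases can be sketched in a few lines of Python. This is a minimal sketch that assumes one commonly used trigonometric form of the Rosenberg pulse (a raised-cosine opening and a quarter-cosine closing); the function names, the sampling rate, and the example durations are illustrative and are not taken from the cited page.

```python
import math

def rosenberg_pulse(t, amp, t_open, t_close):
    """Glottal flow at time t (seconds) within one pitch period.

    Assumed form: raised cosine during the opening phase (duration t_open),
    quarter cosine during the closing phase (duration t_close), zero after.
    """
    if 0.0 <= t <= t_open:
        return 0.5 * amp * (1.0 - math.cos(math.pi * t / t_open))
    if t_open < t <= t_open + t_close:
        return amp * math.cos(0.5 * math.pi * (t - t_open) / t_close)
    return 0.0  # closed phase, length T - (TO + TC)

def rosenberg_period(amp, t_open, t_close, period, fs):
    """Sample one full pitch period of length `period` seconds at rate fs."""
    n_samples = int(round(period * fs))
    return [rosenberg_pulse(i / fs, amp, t_open, t_close) for i in range(n_samples)]

# Illustrative values only: F0 = 100 Hz, 4 ms opening, 2 ms closing.
pulse = rosenberg_period(amp=1.0, t_open=0.004, t_close=0.002, period=0.010, fs=16000)
```

Repeating `pulse` end to end gives a voiced source at F0 = 1/T; changing `t_open` and `t_close` relative to the period changes how long and how symmetric the open phase is.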
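[Figure: Rosenberg glottal pulse, with time-axis labels TO, TC, and T]
[Figure: waveform with amplitude label Ei and time-axis labels 0, Ti, Tp, Te, Tc, Ta]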
The Source-Filter Model: Sound Source
The Liljencrants-Fant (LF) Model:
(from http://www.ims.uni-stuttgart.de/phonetik/EGG/page13.htm)
• uses sin() and exp() functions to create a smooth trajectory
• many parameters allow detailed control of shape
The Fujisaki-Ljungqvist (FL) Model:
• similar to LF, but allows negative flow during closed phase
• simpler polynomial functions
The Source-Filter Model: Sound Source
Unvoiced Sound Source:
• produced by pushing air through constriction in mouth
• a simple model: noise that decreases at –6 dB/octave
Plosive Sound Source:
• produced by pressure buildup, then release of constriction
• a very simple model: approximately a step function
[Figure: step function; axes: amplitude vs. time]
The Source-Filter Model: Vocal Tract Filter
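The vocal tract can be modeled as a series of connected tubes with different lengths and diameters:
[Figure: six connected tubes with cross-sectional areas A1 to A6; the fourth tube is marked with length l4 and diameter d4]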
Life can be made much simpler if we start with only two tubes for approximating different vowels:
[Figure: two-tube approximations (areas A1 and A2) for the vowels /iy/, /aa/, /uw/, and /ah/]
The Source-Filter Model: Vocal Tract Filter
An electrical-engineering analogy can be drawn between the tubes and a transmission line.
From this analogy, the formant frequencies (frequencies of standing waves) occur when
$$\frac{A_1}{A_2}\tan(\beta l_2) = \cot(\beta l_1)$$
where
$$\beta = \frac{2\pi f}{c}, \qquad c = 340\ \text{m/s}$$
(from Flanagan, p. 70-71)
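The two-tube resonance condition has no closed-form solution in general, but it is easy to solve numerically. The sketch below assumes the condition (A1/A2) tan(β l2) = cot(β l1) with β = 2πf/c, and assumes that tube 1 is the glottis end and tube 2 the lip end. To avoid the singularities of tan and cot, both sides are multiplied through, giving the equivalent continuous condition A1 sin(β l1) sin(β l2) − A2 cos(β l1) cos(β l2) = 0. The function name and the example tube dimensions are hypothetical.

```python
import math

SPEED_OF_SOUND = 340.0  # m/s, the value used on the slide

def two_tube_formants(a1, l1, a2, l2, f_max=4000.0, step=1.0):
    """Resonances (Hz) of a two-tube model, found by grid scan plus bisection.

    a1, a2: cross-sectional areas (only their ratio matters).
    l1, l2: tube lengths in meters.
    """
    def g(freq):
        beta = 2.0 * math.pi * freq / SPEED_OF_SOUND
        return (a1 * math.sin(beta * l1) * math.sin(beta * l2)
                - a2 * math.cos(beta * l1) * math.cos(beta * l2))

    formants = []
    for k in range(1, int(f_max / step)):
        lo, hi = k * step, (k + 1) * step
        g_lo, g_hi = g(lo), g(hi)
        if g_lo == 0.0:
            formants.append(lo)          # grid point is exactly a root
        elif g_lo * g_hi < 0.0:          # sign change: refine by bisection
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                g_mid = g(mid)
                if g_lo * g_mid <= 0.0:
                    hi = mid
                else:
                    lo, g_lo = mid, g_mid
            formants.append(0.5 * (lo + hi))
    return formants

# Sanity check: equal areas reduce to a single uniform 17 cm tube.
uniform = two_tube_formants(a1=1.0, l1=0.085, a2=1.0, l2=0.085)

# Hypothetical /aa/-like shape: narrow back cavity, wide front cavity.
aa_like = two_tube_formants(a1=1.0, l1=0.09, a2=7.0, l2=0.08)
print(uniform[:3], aa_like[:3])
```

With equal areas the condition collapses to cos(β(l1 + l2)) = 0, so the first list should reproduce the single-tube resonances; the second set of dimensions is only meant to show how a constriction shifts them.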
The Source-Filter Model: Vocal Tract Filter
In the simplest case of a single tube, the formants are located at
$$F_i = \frac{(2i-1)\,c}{4l}$$
and if l = 17 cm (the typical length of the male vocal tract), then
$$F_1 = \frac{(2-1)\cdot 34000}{4\cdot 17} = 500$$
$$F_2 = \frac{(4-1)\cdot 34000}{4\cdot 17} = 1500$$
etc.
So, for a neutral vowel (no constriction in the vocal tract), formants occur at 500, 1500, 2500, … Hz.
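The single-tube arithmetic can be checked directly. This small sketch evaluates F_i = (2i − 1)c / (4l) with c = 34000 cm/s; the function name and the second tube length are illustrative.

```python
def uniform_tube_formants(length_cm, n_formants=3, c_cm_per_s=34000.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * i - 1) * c_cm_per_s / (4.0 * length_cm)
            for i in range(1, n_formants + 1)]

neutral = uniform_tube_formants(17.0)   # 17 cm vocal tract from the slide
shorter = uniform_tube_formants(14.0)   # illustrative shorter tract
print(neutral, shorter)
```

Because every formant scales with 1/l, a shorter tube raises all formants by the same factor.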
The Source-Filter Model: Vocal Tract Filter
The two-tube model can be expanded to multiple tubes; the math becomes ugly, but results are more realistic.
The Source-Filter Model: Bandwidths
In these cases, it has been assumed that the tubes have hard surfaces, which causes the resonant frequencies (formants) to have strong energy only at their center frequencies:
(energy is put into the system via the source, but no energy is lost)
In reality, the resonant energies decay over time; energy is absorbed by:
• viscosity (caused by friction of air against vocal-tract walls)
• heat conduction (at the vocal-tract walls)
• soft surfaces of vocal-tract walls
These effects cause bandwidth to increase with frequency.
The Source-Filter Model: Radiation
A final effect of the speech-production process is radiation of sound from the lips.
As sound radiates from a source, its energy decreases.
The decrease in energy is not the same for all frequencies; this effect can be modeled as a +6 dB/octave increase in energy:
$$y_n = x_n - x_{n-1}$$
which, coincidentally, is the same equation as pre-emphasis with a=1.0, and also corresponds to a differentiation operation.
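The +6 dB/octave behavior can be verified numerically. The sketch below assumes the radiation effect is modeled by the first-difference filter y[n] = x[n] − x[n−1] (pre-emphasis with a = 1.0), whose magnitude response is 2·|sin(πf/fs)|. The function name and the sampling rate are illustrative.

```python
import math

def first_difference_gain_db(freq, fs):
    """Gain in dB of y[n] = x[n] - x[n-1] at frequency freq (Hz)."""
    return 20.0 * math.log10(2.0 * abs(math.sin(math.pi * freq / fs)))

fs = 16000.0
octaves = [100.0, 200.0, 400.0, 800.0]
gains = [first_difference_gain_db(f, fs) for f in octaves]
# Gain increase per octave; close to 6 dB while freq is well below fs/2.
slopes = [gains[i + 1] - gains[i] for i in range(len(gains) - 1)]
print(slopes)
```

The slope flattens as the frequency approaches fs/2, so the +6 dB/octave description is a low-frequency approximation of this filter.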
The Source-Filter Model: Radiation
The derivative effect of radiation from the lips can be moved to the glottal-source model:
[Figure: glottal flow and glottal flow derivative over one pitch period, marked with T0, OQ, and SK]
The Source-Filter Model: Radiation
The derivative effect of radiation from the lips can also be moved to the models of frication and plosion:
Unvoiced Sound Source:
• a very simple model: random (white) noise
Plosive Sound Source:
• a very simple model: an impulse function
[Figure: impulse function; axes: amplitude vs. time]
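Why the plosive source becomes an impulse can be seen by applying a discrete derivative to a step. The sketch below assumes the derivative is approximated by a first difference; the helper name and the signal length are illustrative. The same reasoning applies to frication: a +6 dB/octave tilt applied to noise that falls at –6 dB/octave leaves an approximately flat (white) spectrum.

```python
def first_difference(x):
    """Discrete derivative: y[n] = x[n] - x[n-1], with x[-1] taken as 0."""
    previous = 0.0
    out = []
    for value in x:
        out.append(value - previous)
        previous = value
    return out

step = [0.0] * 5 + [1.0] * 5          # plosive source before radiation
impulse = first_difference(step)      # plosive source with radiation folded in
print(impulse)
```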
The Source-Filter Model: Complete Picture
glottal source (harmonics)
vocal tract filter (envelope)
radiation (log scale)
final speech signal
The Source-Filter Model: Estimating Parameters
The vocal-tract parameters (formants) can be estimated using LPC analysis, with the order of LPC analysis equal to 2×NF, where NF is the expected number of formants. In practice, LPC estimation of formants is not very accurate because of the slope of the spectrum and irregularities in the spectrum.
Once the formants are determined, they can then be inverted, and the original signal filtered with the inverted formants to obtain the source + radiation (first derivative of glottal flow) signal. This is called inverse filtering.
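The two steps, LPC estimation and inverse filtering, can be sketched on a synthetic signal. This is a minimal sketch under strong assumptions: the "speech" is the impulse response of a single two-pole resonator (one formant, so LPC order p = 2×NF = 2), LPC uses the autocorrelation method with the Levinson-Durbin recursion, and the formant is read directly from the second-order polynomial. All names and the values F = 500 Hz, B = 60 Hz, fs = 8000 Hz are illustrative. Real speech would additionally need windowing, a higher order, and polynomial root finding.

```python
import math

def autocorrelation(x, max_lag):
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return a = [1, a1, ..., ap] so that e[n] = sum_j a[j] * x[n-j]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def inverse_filter(x, a):
    """Filter x with the LPC polynomial A(z) to estimate the source."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

# Synthetic one-formant signal: a unit impulse through a two-pole resonator.
fs, f_true, b_true = 8000.0, 500.0, 60.0
pole_radius = math.exp(-math.pi * b_true / fs)
b1 = 2.0 * pole_radius * math.cos(2.0 * math.pi * f_true / fs)
b2 = -pole_radius * pole_radius
signal, y1, y2 = [], 0.0, 0.0
for n in range(2000):
    yn = (1.0 if n == 0 else 0.0) + b1 * y1 + b2 * y2
    signal.append(yn)
    y2, y1 = y1, yn

a, _ = levinson_durbin(autocorrelation(signal, 2), 2)

# For order 2, the pole radius and angle follow directly from a1 and a2.
radius = math.sqrt(a[2])
f_est = math.acos(-a[1] / (2.0 * radius)) * fs / (2.0 * math.pi)
b_est = -math.log(radius) * fs / math.pi

residual = inverse_filter(signal, a)   # should be close to the unit impulse
print(round(f_est, 2), round(b_est, 2))
```

On this idealized signal the estimates land very close to the true values and the residual collapses to the original impulse; on real speech the residual is an estimate of the glottal flow derivative and is much less clean.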
The Source-Filter Model: Filtering
Formants can be modeled by a “damped sinusoid”, which has the following representations:
$$x(t) = A e^{-\alpha t} \sin(2\pi f_c t)$$
[Equation: spectrum S(f) of the damped sinusoid, in terms of A, f_c, and α; not recoverable from the extracted text]
where S(f) is the spectrum at frequency value f, A is overall amplitude, f_c is the center frequency of the damped sine wave, and α is a damping factor [Olive, p. 48, 58]. Or, given formant and sampling frequency, compute IIR filter coefficients:
$$r = e^{-\pi B_F / f_s}$$
$$a_2 = -r^2$$
$$a_1 = 2r\cos(2\pi F / f_s)$$
$$a_0 = 1 - (a_1 + a_2)$$
$$y_n = a_0 x_n + a_1 y_{n-1} + a_2 y_{n-2}$$
where F is the formant frequency, B_F is the formant bandwidth, and f_s is the sampling frequency.
(from Klatt, 1980)
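The IIR form is short enough to implement directly. The sketch below assumes the Klatt-style second-order resonator r = exp(−π·BF/fs), a2 = −r², a1 = 2r·cos(2πF/fs), a0 = 1 − (a1 + a2), with the difference equation y[n] = a0·x[n] + a1·y[n−1] + a2·y[n−2]. The function names and the example values are illustrative.

```python
import math

def resonator_coeffs(formant_hz, bandwidth_hz, fs):
    """Coefficients (a0, a1, a2) of a two-pole formant resonator."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    a2 = -r * r
    a1 = 2.0 * r * math.cos(2.0 * math.pi * formant_hz / fs)
    a0 = 1.0 - (a1 + a2)      # normalizes the gain at 0 Hz to 1
    return a0, a1, a2

def resonate(x, a0, a1, a2):
    """Apply y[n] = a0*x[n] + a1*y[n-1] + a2*y[n-2] to the sequence x."""
    y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = a0 * xn + a1 * y1 + a2 * y2
        out.append(yn)
        y2, y1 = y1, yn
    return out

fs, formant, bandwidth = 10000.0, 500.0, 60.0
a0, a1, a2 = resonator_coeffs(formant, bandwidth, fs)

impulse_response = resonate([1.0] + [0.0] * 199, a0, a1, a2)
step_response = resonate([1.0] * 3000, a0, a1, a2)
```

The impulse response of this filter is itself a sampled damped sinusoid: it oscillates at F and its envelope shrinks by the factor r on every sample, which corresponds to a damping factor α = π·BF in the continuous-time form.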
The Source-Filter Model
A course project that studies the source-filter model might be interesting…
1. Implement LPC, extract formant values and bandwidths of different vowels; how do envelope and formant values change with different orders of LPC (values of p)?
2. Do LPC analysis, then inverse filter the signal to extract the glottal source waveform. Does it look the way it should?
3. Construct two-tube models, predict formant frequencies of all vowels.
If you’re more comfortable with programming, signal processing, etc.