
CS 551/651: Structure of Spoken Language

Lecture 9: The Source-Filter Model of Speech Production

John-Paul Hosom, Fall 2008

The Source-Filter Model

One more model of speech… proposed in 1848 by Johannes Müller, developed by Gunnar Fant circa 1970. Also called the “Acoustic Theory of Speech Production”.

The Source-Filter Model provides a static description of speech; speech dynamics are dealt with in models of coarticulation.

According to this model, speech is defined by three parts:

1. A sound source: vibration of the vocal folds, air turbulence, or plosion

2. A tube through which the source passes: the vocal tract

3. Radiation of sound from the mouth

These 3 components are assumed to be independent.

We will discuss these three parts separately.

The Source-Filter Model: Sound Source

Voiced Sound Source:

• produced by vibration of the vocal folds
• several models exist that describe the flow of air through the vocal folds
• each model describes the increase in air flow as the glottis opens, decrease in air flow as it closes, and no air flow as glottis remains closed during pressure buildup.
• in spectral domain, shape is approximately flat at very low frequencies, and has –12 dB/octave slope at higher freq.

Models: Rosenberg, Fant (LF model), Fujisaki (FL model), Klatt

[Figure: glottal waveform showing glottis opening, glottal closure, and glottis opening; axes: air pressure (Pa) vs. time (msec)]


Voiced Sound Source:

• models are of “glottal flow”
• glottal flow is the same as volume velocity, V, in units of m³/s
• volume velocity per unit area, or V/unit area, is in units of m/s, and is called the point velocity, v
• acoustic pressure, p, in Pascals, equals impedance Z times v: p = Z v
• impedance is constant for a given glottis and vocal tract
• therefore, acoustic pressure is directly proportional to glottal flow, and so the vertical axis of these models can be considered either glottal flow, volume velocity, or acoustic pressure (in micro Pascals).


All models have the following parameters:
• pitch period = 1/F0 = T0
• open quotient (OQ)
• skew (SK)

These three parameters are used in a function that describes how the sound pressure changes over time within one pitch period.

[Figure: glottal waveform over one pitch period T0, with glottis opening, glottal closure, and glottis opening marked, and OQ and SK indicated]

OQ measured relative to T0; SK measured relative to OQ


The Rosenberg model:

(from http://www.physik3.gwdg.de/~micha/aachen98/aachen98.html)

gR(t) is a glottal pulse with amplitude A and duration T. gR(t) has three phases: the opening phase until time TO, the closing phase until time TC, and the closed phase with length T-(TO+TC).

[Figure: Rosenberg glottal pulse; time axis marked TO, TC, T]
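As a rough illustration of the three phases, the sketch below generates one period of a Rosenberg-style pulse. It assumes the commonly cited trigonometric form (a raised-cosine opening phase and a quarter-cosine closing phase), which is not spelled out on the slide, and it treats TO and TC as the durations of the opening and closing phases. The function name `rosenberg_pulse` and the sample values are illustrative only.

```python
import math

def rosenberg_pulse(amplitude, t_open, t_close, period, fs):
    """Return one period of a Rosenberg-style glottal pulse sampled at fs Hz.

    Assumed trigonometric form (not given in the slides):
      opening phase, 0 <= t <= TO:     A/2 * (1 - cos(pi * t / TO))
      closing phase, TO < t <= TO+TC:  A * cos(pi * (t - TO) / (2 * TC))
      closed phase,  TO+TC < t < T:    0
    """
    n_samples = int(round(period * fs))
    pulse = []
    for n in range(n_samples):
        t = n / fs
        if t <= t_open:
            g = 0.5 * amplitude * (1.0 - math.cos(math.pi * t / t_open))
        elif t <= t_open + t_close:
            g = amplitude * math.cos(math.pi * (t - t_open) / (2.0 * t_close))
        else:
            g = 0.0
        pulse.append(g)
    return pulse

# Illustrative values: F0 = 100 Hz (T = 10 ms), opening 4 ms, closing 2 ms.
pulse = rosenberg_pulse(amplitude=1.0, t_open=0.004, t_close=0.002,
                        period=0.010, fs=16000)
print(len(pulse), max(pulse))
```

Lengthening `t_close` relative to `t_open` makes the pulse more symmetric; shortening it gives the abrupt closure that is typical of the glottal-flow sketches above.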

The Liljencrants-Fant (LF) Model:

(from http://www.ims.uni-stuttgart.de/phonetik/EGG/page13.htm)

[Figure: LF-model waveform; labels Ei, 0, Ti, Tp, Te, Tc, Ta]

• uses sin() and exp() functions to create smooth trajectory
• many parameters allow detailed control of shape

The Fujisaki-Ljungqvist (FL) Model:

• similar to LF, but allows negative flow during closed phase
• simpler polynomial functions


Unvoiced Sound Source:

• produced by pushing air through constriction in mouth

• a simple model: noise that decreases at –6 dB/octave

Plosive Sound Source:

• produced by pressure buildup, then release of constriction

• a very simple model: approximately a step function

[Figure: plosive source as a step function; axes: amplitude vs. time]

The Source-Filter Model: Vocal Tract Filter

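The vocal tract can be modeled as a series of connected tubes with different lengths and diameters:

[Figure: six connected tubes with cross-sectional areas A1 to A6; the fourth tube is labeled with length l4 and diameter d4]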

Life can be made much simpler if we start with only two tubes for approximating different vowels:

[Figure: two-tube configurations, each with areas A1 and A2, for the vowels /iy/, /aa/, /uw/, and /ah/]


An electrical-engineering analogy can be drawn between the tubes and a transmission line.

From this analogy, the formant frequencies (frequencies of standing waves) occur when

$\frac{A_1}{A_2} \tan(\beta l_2) = \cot(\beta l_1)$

where

$\beta = \frac{2\pi f}{c}, \quad c = 340\ \text{m/s}$

(from Flanagan, p. 70-71)
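The two-tube condition has no closed-form solution in general, so one way to explore it is numerically. The sketch below assumes the condition (A1/A2) tan(βl2) = cot(βl1) with β = 2πf/c, with tube 1 taken as the back (glottis-side) tube and tube 2 as the front (lip-side) tube. It scans for sign changes and refines each root by bisection. The function name and the tube dimensions are made up for illustration, not taken from the lecture.

```python
import math

C = 340.0  # speed of sound in m/s (value used in the text)

def two_tube_formants(a1, l1, a2, l2, f_max=4000.0, step=1.0):
    """Find frequencies (Hz) below f_max that satisfy the two-tube condition.

    a1, l1: area and length of the back (glottis-side) tube (assumed labeling)
    a2, l2: area and length of the front (lip-side) tube
    Lengths are in metres; areas only matter through their ratio.
    """
    def h(f):
        # (A1/A2) tan(beta*l2) = cot(beta*l1), multiplied through by
        # A2 * cos(beta*l2) * sin(beta*l1) to avoid the poles of tan and cot.
        beta = 2.0 * math.pi * f / C
        return (a1 * math.sin(beta * l1) * math.sin(beta * l2)
                - a2 * math.cos(beta * l1) * math.cos(beta * l2))

    formants = []
    f_lo = 0.5 * step  # off-grid start so round-number roots are bracketed
    while f_lo < f_max:
        f_hi = f_lo + step
        if h(f_lo) * h(f_hi) < 0.0:  # sign change: a root lies in between
            lo, hi = f_lo, f_hi
            for _ in range(50):  # plain bisection
                mid = 0.5 * (lo + hi)
                if h(lo) * h(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            formants.append(0.5 * (lo + hi))
        f_lo = f_hi
    return formants

# Equal areas: the two tubes act as one uniform 17 cm tube.
uniform = two_tube_formants(a1=1.0, l1=0.085, a2=1.0, l2=0.085)
# Hypothetical narrow back tube and wide front tube (roughly /aa/-like).
back_constricted = two_tube_formants(a1=1.0, l1=0.09, a2=7.0, l2=0.08)
print([round(f) for f in uniform])
print([round(f) for f in back_constricted])
```

With equal areas the condition reduces to cos(β(l1 + l2)) = 0, so the roots should fall at odd multiples of c/(4(l1 + l2)), which gives a quick sanity check on the solver.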


In the simplest case of a single tube, the formants are located at

$F_i = \frac{(2i-1)\,c}{4\,l}$

and if l = 17 cm (the typical length of the male vocal tract), then

$F_1 = \frac{(2-1) \cdot 34000}{4 \cdot 17} = 500$

$F_2 = \frac{(4-1) \cdot 34000}{4 \cdot 17} = 1500$

etc.

So, for a neutral vowel (no constriction in the vocal tract), formants occur at 500, 1500, 2500, … Hz.
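The single-tube formula is easy to evaluate directly. The short sketch below uses c = 34000 cm/s, as in the worked numbers above; the function name and the second tube length are illustrative assumptions.

```python
def single_tube_formants(length_cm, count=3, c_cm_per_s=34000.0):
    """Evaluate F_i = (2i - 1) * c / (4 * l) for i = 1..count."""
    return [(2 * i - 1) * c_cm_per_s / (4.0 * length_cm)
            for i in range(1, count + 1)]

neutral = single_tube_formants(17.0)   # the 17 cm tube from the text
shorter = single_tube_formants(14.5)   # a shorter tract (hypothetical length)
print(neutral)
print(shorter)
```

Because every formant scales with 1/l, a shorter tube raises all formants by the same factor.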




The two-tube model can be expanded to multiple tubes; the math becomes ugly, but results are more realistic.

The Source-Filter Model: Bandwidths

In these cases, it has been assumed that the tubes have hard surfaces, which causes the resonant frequencies (formants) to have strong energy only at their center frequencies (energy is put into the system via the source, but no energy is lost).

In reality, the resonant energies decay over time; energy is absorbed by:

• viscosity (caused by friction of air against vocal-tract walls)
• heat conduction (at the vocal-tract walls)
• soft surfaces of vocal-tract walls

These effects cause bandwidth to increase with frequency.

The Source-Filter Model: Radiation

A final effect of the speech-production process is radiation of sound from the lips.

As sound radiates from a source, its energy decreases.

The decrease in energy is not the same for all frequencies; this effect can be modeled as a +6 dB/octave increase in energy:

$y_n = x_n - x_{n-1}$

which, coincidentally, is the same equation as pre-emphasis ($y_n = x_n - a\,x_{n-1}$) with a=1.0, and also corresponds to a differentiation operation.
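To see where the +6 dB/octave figure comes from, the sketch below assumes the usual first-difference form of pre-emphasis, y[n] = x[n] - a·x[n-1], and evaluates its magnitude response at a few frequencies one octave apart. The function names and the 16 kHz sampling rate are illustrative assumptions.

```python
import math

def first_difference(x, a=1.0):
    """Apply y[n] = x[n] - a * x[n-1], taking x[-1] as 0."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - a * prev)
        prev = sample
    return y

def gain_db(freq_hz, fs, a=1.0):
    """Magnitude response of the filter above, in dB, at freq_hz."""
    w = 2.0 * math.pi * freq_hz / fs
    magnitude = math.sqrt(1.0 - 2.0 * a * math.cos(w) + a * a)
    return 20.0 * math.log10(magnitude)

fs = 16000  # assumed sampling rate in Hz
for f in (100, 200, 400, 800):
    print(f, round(gain_db(f, fs), 2))

octave_step = gain_db(200, fs) - gain_db(100, fs)
differenced = first_difference([1, 2, 4, 7])
```

Well below the Nyquist frequency the gain is nearly proportional to frequency, so each doubling of frequency adds close to 6 dB; the approximation loosens as the frequency approaches fs/2.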


The derivative effect of radiation from the lips can be moved to the glottal-source model:

[Figure: glottal flow and glottal flow derivative over one pitch period, with T0, OQ, and SK marked]


The derivative effect of radiation from the lips can also be moved to the models of frication and plosion:

Unvoiced Sound Source:

• a very simple model: random (white) noise

Plosive Sound Source:

• a very simple model: an impulse function

[Figure: plosive source as an impulse; axes: amplitude vs. time]

The Source-Filter Model: Complete Picture

glottal source (harmonics)

vocal tract filter (envelope)

radiation (log scale)

final speech signal

The Source-Filter Model: Estimating Parameters

The vocal-tract parameters (formants) can be estimated using LPC analysis, with the order of LPC analysis equal to 2×NF, where NF is the expected number of formants. In practice, LPC estimation of formants is not very accurate because of slope of spectrum and irregularities in the spectrum.

Once the formants are determined, they can then be inverted, and the original signal filtered with the inverted formants to obtain the source + radiation (first derivative of glottal flow) signal. This is called inverse filtering.
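The two steps above (LPC analysis, then inverse filtering) can be sketched on a synthetic signal. The example assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to do LPC and not necessarily the one intended in the lecture. The test signal, a single resonance at 500 Hz with 80 Hz bandwidth, and all function names are hypothetical. With NF = 1 the order is p = 2, so the pole pair can be read directly from the coefficients.

```python
import math

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns a with a[0] = 1 such that
    e[n] = x[n] + a[1] x[n-1] + ... + a[order] x[n-order]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a

def inverse_filter(x, a):
    """Apply A(z): e[n] = sum_k a[k] * x[n-k]."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

# Synthetic test signal (hypothetical values): impulse response of one
# resonance at 500 Hz with 80 Hz bandwidth, sampled at 8 kHz.
fs, true_f, true_bw = 8000.0, 500.0, 80.0
pole_r = math.exp(-math.pi * true_bw / fs)
theta = 2.0 * math.pi * true_f / fs
x = [1.0, 2.0 * pole_r * math.cos(theta)]
for n in range(2, 2000):
    x.append(2.0 * pole_r * math.cos(theta) * x[n - 1] - pole_r ** 2 * x[n - 2])

order = 2 * 1  # 2 x NF, with NF = 1 expected formant
a = levinson_durbin(autocorrelation(x, order), order)

# For one complex pole pair, A(z) = 1 - 2 r cos(theta) z^-1 + r^2 z^-2.
r_est = math.sqrt(a[2])
est_f = math.acos(-a[1] / (2.0 * r_est)) * fs / (2.0 * math.pi)
est_bw = -math.log(r_est) * fs / math.pi

residual = inverse_filter(x, a)  # should be close to the impulse source
print(round(est_f, 1), round(est_bw, 1))
```

For more than one formant (p > 2), the roots of A(z) would have to be found with a polynomial root finder, which is omitted here. Real speech will also not fit the all-pole model this cleanly, consistent with the accuracy caveat above.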

The Source-Filter Model: Filtering

Formants can be modeled by a “damped sinusoid”, which has the following representations:

$x(t) = A e^{-\alpha t} \sin(2\pi f_c t)$

[Equation: S(f), the spectrum of the damped sinusoid, as given in Olive]

where S(f) is the spectrum at frequency value f, A is overall amplitude, f_c is the center frequency of the damped sine wave, and α is a damping factor [Olive, p. 48, 58]. Or, given formant and sampling frequency, compute IIR filter coefficients:

$y_n = a_0 x_n + a_1 y_{n-1} + a_2 y_{n-2}$

$a_0 = 1 - (a_1 + a_2)$

$a_1 = 2 r \cos(2\pi F / f_s)$

$a_2 = -r^2$

$r = e^{-\pi \, BF / f_s}$

where F = formant frequency, BF = formant bandwidth, and f_s = sampling frequency.

(from Klatt, 1980)
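A minimal sketch of such a second-order resonator is shown below. It assumes the coefficient definitions r = exp(-π·BF/f_s), a2 = -r², a1 = 2r·cos(2πF/f_s), a0 = 1 - (a1 + a2), and the recursion y[n] = a0·x[n] + a1·y[n-1] + a2·y[n-2]. The function names and the formant, bandwidth, and sampling-rate values are illustrative assumptions.

```python
import math

def resonator_coefficients(formant_hz, bandwidth_hz, fs):
    """Coefficients (a0, a1, a2) of a two-pole formant resonator."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    a2 = -r * r
    a1 = 2.0 * r * math.cos(2.0 * math.pi * formant_hz / fs)
    a0 = 1.0 - (a1 + a2)  # normalizes the gain at 0 Hz to 1
    return a0, a1, a2

def resonator_filter(x, coeffs):
    """Run y[n] = a0 x[n] + a1 y[n-1] + a2 y[n-2] over the sequence x."""
    a0, a1, a2 = coeffs
    y1 = y2 = 0.0  # y[n-1] and y[n-2], initially zero
    out = []
    for sample in x:
        y = a0 * sample + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

# Illustrative values: a 500 Hz formant, 60 Hz bandwidth, 10 kHz sampling.
coeffs = resonator_coefficients(formant_hz=500.0, bandwidth_hz=60.0, fs=10000.0)
impulse = [1.0] + [0.0] * 199
response = resonator_filter(impulse, coeffs)          # a damped sinusoid
step_response = resonator_filter([1.0] * 3000, coeffs)
print(coeffs)
```

The impulse response is a sampled damped sinusoid, which connects this filter form to the time-domain representation above. Several such resonators, one per formant, can be run in series to approximate a full vocal-tract filter.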

The Source-Filter Model

A course project that studies the source-filter model might be interesting…

1. Implement LPC, extract formant values and bandwidths of different vowels; how do envelope and formant values change with different orders of LPC (values of p)?

2. Do LPC analysis, then inverse filter the signal to extract the glottal source waveform. Does it look the way it should?

3. Construct two-tube models, predict formant frequencies of all vowels.

If you’re more comfortable with programming, signal processing, etc.