



Speech synthesis models: a review

Models of speech production and how these are used in text-to-speech conversion are reviewed. In the first part of the paper the foundation is laid for an explanation of present day speech synthesisers, and their limitations, through a phonetic description of speech production. The paper then presents a theoretical model of speech production which is the basis of most synthesisers. Next, a number of speech synthesisers are surveyed and their relative merits and shortcomings are considered. The paper ends with a brief look at techniques that are being used in place of the more traditional models of speech production in text-to-speech (TTS) systems and considers possible areas of progress in the future.

by A. Breen

1 Introduction Speech synthesis is becoming an increasingly common part of everyday life. Originally, a major use was seen to be in aids for the handicapped, for example reading machines for the blind. Recently a large number of applications have appeared in the area of telecommunications. As speech synthesis technology progresses, the number of applications will steadily increase; however, many applications are awaiting the time when text-to-speech (TTS) systems can produce natural sounding speech. Because of the importance of TTS systems, this article couches the description of speech production models within a framework of text-to-speech synthesis.

The process of converting unrestricted text into speech is shown schematically in Fig. 1. The first two stages convert the text into a symbolic linguistic/phonological description. In theory the linguistic component considers the semantic (meaning), pragmatic (knowledge) and syntactic (structure) elements of the text;

however, owing to the complexity of the analyses most TTS systems have only a limited, if any, semantic and pragmatic component and rely mainly on syntactic analysis. The phonological component converts the set of orthographic symbols (letters) into a set of distinctive features or sounds (phonemes) depending on the phonological model. The phoneme is the most popular form of phonological representation used in TTS systems.

The set of phonemes of a language can be described as representing 'the smallest segments of sounds that can be distinguished by their contrast within words'. Table 1 gives a list of the phonemes of British English. The last two stages convert this abstract symbolic description into an acoustic signal.

To move from a symbolic representation to speech requires a model of speech production. It is this aspect of the synthesis process which is covered by this paper.
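The staged conversion just described can be sketched as a skeleton pipeline. This is an illustrative sketch only: the function names, the toy one-word lexicon and the fixed 80 ms duration are assumptions for the example, not part of any system discussed in this paper.

```python
# Illustrative skeleton of the TTS pipeline described in the text.
# All names and rules here are hypothetical stand-ins.

def normalise(text: str) -> str:
    """Text normalisation: expand symbols such as currency amounts."""
    return text.replace("£20", "twenty pounds")  # e.g. £20 -> 'twenty pounds'

def to_phonemes(words: list[str]) -> list[str]:
    """Letter-to-phoneme conversion (toy lookup for one word)."""
    lexicon = {"man": ["m", "a", "n"]}  # SAM-PA-style symbols
    return [p for w in words for p in lexicon.get(w, [])]

def assign_durations(phonemes: list[str]) -> list[tuple[str, int]]:
    """Attach a (crude) fixed duration in milliseconds to each phoneme."""
    return [(p, 80) for p in phonemes]

def synthesise(segments) -> bytes:
    """Phonetic-to-acoustic transformation would happen here."""
    return b""  # placeholder for the acoustic signal

text = normalise("man")
segments = assign_durations(to_phonemes(text.split()))
print(segments)  # [('m', 80), ('a', 80), ('n', 80)]
```

The last step, `synthesise`, is where the models of speech production reviewed in this paper do their work.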

Attempts at making mechanical speaking machines date back to the Renaissance. One of the most successful devices of that period was produced by Wolfgang von Kempelen. Briefly, it consisted of a bellows which fed a reed connected to a leather cylinder which acted as a vocal tract. Air was expelled from the bellows using the left hand and the shape of the leather vocal tract was modified by the right hand. Mechanical speaking machines were replaced with the advent of electrical technology by electrical devices, one notable early version of which was the Voder (Voice DEmonstratoR) (Fig. 2), first demonstrated at the world fair of 1939 in New York. The Voder worked by exciting a set of fixed filters, which acted as resonators. The resonators chosen to produce a particular sound were controlled by a human operator. However, it was only with the formulation of a

Table 1 SAM-PA (speech assessment methodology phonetic alphabet) machine-readable phonetic symbols for transcribing English

Consonants: p b t d k g m n N f v T D s z S Z h r l w j tS dZ

Vowels and diphthongs: i I E a A O U V 3 @ eI aI OI @U aU I@ e@ U@

ELECTRONICS & COMMUNICATION ENGINEERING JOURNAL FEBRUARY 1992 19


model of speech production that speech synthesisers as they are today appeared.

2 The human speech production system

All speech synthesisers assume an underlying model of speech production. Such models can be best appreciated if described in parallel with a brief description of the human mechanism of speech production and the characteristics of the speech signal.

Articulatory phonetics
Articulatory phonetics attempts to describe the production of the linguistically important sounds of a language in terms of the vocal

Fig. 1 Block diagram showing the process of text-to-speech conversion. Unrestricted text is converted into an abstract linguistic description, which in turn is converted into synthetic speech via a model of speech production. The stages shown are:

TEXT NORMALISATION: expands dates, times, amounts of money etc.; e.g. £20 would be expanded to 'twenty pounds'

ORTHOGRAPHY CONVERTED INTO A SYMBOLIC LINGUISTIC DESCRIPTION: semantic, pragmatic and syntactic analysis of text; word stress assignment; letter-to-phoneme conversion; phonetic description of each phoneme; phoneme durations calculated

PRODUCTION OF SYNTHETIC SPEECH: phonetic-to-acoustic transformation

Fig. 2 Schematic diagram of the Voder (labelled: mouth, chambers, vocal cords, resonance control, pedal, keyboard)

Fig. 3 The articulators


Page 3: synthesis models: a review a - Politechnika Śląskamer.chemia.polsl.pl/biometrologia/materialy/speech/... · 2007-10-05 · Speech synthesis models: a review a Models of speech production

organs. Fig. 3 shows diagrammatically the cross-section of a vocal tract with the vocal organs labelled. Most speech sounds are normally produced on a pulmonic egressive air stream; in other words, air is expelled from the lungs through muscular action.

Air leaving the lungs passes through a body of interlocking cartilage called the larynx (Fig. 4). The two major components of the larynx are the thyroid cartilage and the cricoid cartilage. In men the thyroid cartilage is set at a slight angle, the front of which is commonly called the 'Adam's apple'. Within the thyroid and cricoid cartilages are the vocal folds (Fig. 5), which consist of two fleshy membranes attached by the vocalis muscle to the inside wall of the thyroid cartilage. The tension of the vocal folds is mainly governed by the vocalis muscles which form the body of the vocal folds. The front of the vocal folds are brought together and attached to the thyroid cartilage, while at the back they are attached to a pair of small cartilages called the arytenoids.

During normal breathing the vocal folds are abducted (held apart), allowing air to pass freely through the gap between the two folds (termed the glottis). During voiced speech (phonation) the vocal folds are repeatedly brought together and forced apart. The tension in the folds is adjusted by tilting the thyroid cartilage, which is used to control the fundamental period of vocal fold oscillation.

A fundamental period can be divided simplistically into two portions, an open-glottis cycle (open phase) and a closed-glottis cycle. During the closed portion of the glottal cycle a pressure difference builds up between the pressure in the lungs and trachea and the external atmospheric pressure. The subglottal pressure on the folds forces them to move apart. Once they start to move apart air passes through the glottis. The particle velocity of the air through the glottis is high and a Bernoulli force is induced which, in conjunction with the muscular tension in the folds, tends to draw the folds back together, eventually closing the glottis. This procedure is repeated over and over again. The theory is called the myoelastic-aerodynamic theory of phonation. The time between each closure of the vocal folds is called the fundamental period T0, the reciprocal of which is the fundamental frequency F0. Pitch and fundamental frequency are not

Fig. 4 The larynx (labelled: laryngeal ventricle, vocalis muscle)

synonymous but are often used interchangeably owing to their close correspondence. Fundamental pitch is a tonal sensation as perceived by a human listener


whereas fundamental frequency is a property of the physical system. In contrast, voiceless sounds are produced when the vocal folds are sufficiently abducted to allow air to

Fig. 5 Superior view of the larynx (labelled: glottis, thyroid cartilage, thyroarytenoid muscle, vocalis muscle, conus elasticus, arytenoid cartilage, cricoarytenoid muscle)



Fig. 6 Position of the vocal organs in the alveolar fricative in 'six' (labelled: alveolar ridge, voicing)

pass relatively unimpeded through the glottis.

Voiced sounds are produced at the larynx; voiceless sounds, however, are normally produced above the larynx, at some point of constriction within the vocal tract. For example, a voiceless fricative sound such as /s/ in 'six' is produced by turbulent air flow past a constriction made between the blade of the tongue and the roof of the mouth at the alveolar ridge (Fig. 6).

The sound pressure wave above the larynx is modified by the vocal tract in one of two ways:

(a) by modifying the spectral distribution of the energy in the sound wave
(b) by generating sound within the vocal tract.

Fig. 7 Schematic representation of the production of the word 'man', showing how voicing, velum and lips alter during the production of the word (traces labelled: voicing; velum up/down; lips open/closed)

In articulatory phonetics the consonant sounds of a language are described using three variables:

• voice (described above)
• place
• manner.

For example, consider a nasal consonant such as /m/ in 'map'; the manner of articulation describes how the sound is produced. For nasals, the oral cavity is closed completely and the velum lowered so that the sound may escape through the nasal cavity. The velum is a small flap of skin which divides the oral cavity from the nasal cavity (Fig. 3). The type of nasal consonant produced depends on the place of articulation (for nasal sounds the place of articulation describes where the oral cavity was occluded). In this example the place of articulation is at the lips, and so a bilabial nasal is produced. Fig. 7 shows this process schematically for the example 'man'.

Vowel sounds are described by the following:

• the position of the highest part of the tongue in the vocal tract (e.g. front or back)
• the distance of the highest part of the tongue from the top of the mouth
• whether the lips are spread (e.g. the vowel /i/ in the word 'tea') or rounded (e.g. the vowel /U/ as in the word 'food').

Acoustic phonetics
The sounds of a language can be described in terms of their articulation, as above, or through an analysis of the speech signal.

Frequency domain analyses: A commonly employed method of displaying the spectral information in the speech signal uses a Fourier transform taken over a short time interval. In Fig. 8 a number of spectral slices show the effect of different manners of articulation on the signal spectrum. An alternative way of representing this type of information is the spectrogram. Spectrograms use colour or grey scales to display the intensities of a number of spectral slices. As such they are useful for displaying spectral variation with change in articulation (Figs. 9 and 10). Spectrograms show clearly how the resonances of the vocal tract change



with changes in articulation and the marked difference in production of voiced and voiceless speech.
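The short-time Fourier analysis behind these displays can be sketched in a few lines of NumPy. The 256-sample Hamming window and 64-sample hop below are arbitrary illustrative choices, not values from the paper: short windows give wide-band spectrograms (resonances visible), long windows give narrow-band ones (harmonics visible).

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=64):
    """Magnitude spectra of overlapping, Hamming-windowed frames.

    Each row is one spectral slice of the kind shown in Fig. 8;
    stacking the rows over time gives a spectrogram.
    """
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy pulse-like 'voiced' signal: 100 Hz square wave, sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
s = np.sign(np.sin(2 * np.pi * 100 * t)) * np.exp(-5 * t)
S = spectrogram(s)
print(S.shape)  # (number of slices, frame_len // 2 + 1)
```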

Time domain analyses: In addition to the speech signal many researchers wish to examine the voiced excitation signal. There are a number of methods for doing this. One method, called inverse filtering, attempts to remove the effects of the vocal tract to leave just the differential volume velocity flow through the glottis (Fig. 11). To achieve good results using inverse filtering is both time consuming and difficult. Another, simpler method of examining the voiced excitation uses a device called an electrolaryngograph. This device measures the amount of conductivity across the larynx, and so can be used to estimate accurately when the vocal folds are in contact. Fig. 12 shows an example of a laryngographic trace. Using such a trace, it is possible to estimate accurately the fundamental period and, somewhat less accurately, the point of glottal opening, from which a number of useful measures can be obtained.
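Estimating the fundamental period from a laryngographic trace amounts to measuring the time between successive glottal closures. The sketch below runs a simple threshold-crossing detector on a synthetic square-wave 'Lx' trace; both the trace and the detector are illustrative assumptions, since practical Lx processing is more careful.

```python
import numpy as np

def closure_instants(lx, fs, threshold=0.5):
    """Times (s) where the trace rises through `threshold` —
    a crude stand-in for glottal closure detection."""
    above = lx > threshold
    return np.flatnonzero(above[1:] & ~above[:-1]) / fs

fs = 10_000
t = np.arange(fs) / fs                                   # 1 s of signal
lx = (np.sin(2 * np.pi * 125 * t) > 0.3).astype(float)   # toy 125 Hz Lx trace
closures = closure_instants(lx, fs)
periods = np.diff(closures)        # successive fundamental periods T0
f0 = 1.0 / periods.mean()          # F0 is the reciprocal of T0
print(round(f0))                   # ~125
```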

From the above, it is clear that speech production can be described using the articulators or by analyses of the speech signal. The description adopted during the design of a speech synthesiser greatly affects the implementation of the model of speech production.

3 Modelling the speech production system

Most present-day synthesisers are based on a theory known as the source-filter description of speech production. This theory states that the speech signal can be viewed as being produced by a sound source or sources exciting a linear system (the vocal tract). The four possible combinations of sound source are:

(a) no source (silence)
(b) voiced source only
(c) mixed voiced source and noise source
(d) noise source only (one or several).

In many implementations of this theory the third combination of sources is omitted. The decomposition of speech into a source-filter description can best be explained by a simple example (Fig. 13). Consider the portion of a vowel during the production of which both the voicing and articulation are held constant. The source consists of pulsed air flow through the glottis at the fundamental period. This can be expressed as a

harmonic spectrum |S(f)| by Fourier transformation of the time domain glottal waveform. Similarly, the magnitude spectrum of the vocal tract transfer function can be expressed as |T(f)| and the radiation characteristic of sound propagating from the head by |R(f)|. According to the source-filter theory of speech production, the spectrum of the

Fig. 8 Three spectral slices taken from (a) the voiceless fricative /s/; (b) the vowel /E/; (c) the nasal consonant /n/. Notice how the different manners of articulation exhibit clearly different spectral characteristics (axes: amplitude, dB, against frequency, kHz, over 0-5 kHz)



Fig. 9 Wide-band spectrogram of the phrase 'four seven two' spoken by a female speaker (vertical axis: frequency, 1-9 kHz). The spectrogram clearly shows the resonance structure in the speech signal. Vocal tract resonances appear as brightly coloured bands of yellow and orange. The fine red lines running vertically through the resonances are called 'striations'. These lines occur during voiced speech at the points of excitation

Fig. 10 Narrow-band spectrogram of the phrase 'four seven two' spoken by a female speaker (vertical axis: frequency, 1-9 kHz). This type of spectrogram shows clearly the harmonic structure of the voiced speech. The harmonics appear as parallel lines running across the spectrogram. Notice how the structure of the vocal tract resonances is less visible

Fig. 11 Example of four cycles of inverse filtered speech

speech pressure waveform a short distance in front of the mouth can be expressed using eqn. 1:

|P(f)| = |S(f)| |T(f)| |R(f)|    (1)

Fant states that 'this is a process of synthesis which can be materialised and controlled in any detail in a speaking machine'. However, the realisation of such a speaking machine has proved to be more difficult than this simple statement suggests.

The peaks in |T(f)| are referred to as formants. Formants are labelled F1, F2, F3, etc., in the order in which they occur in the frequency scale. It is clear from Fig. 13 that the formant structure visible in the final speech signal is a combination of the resonances (peaks) of the vocal tract transfer function and the harmonic structure of the source. When the harmonics of a speech signal are widely separated, formant peaks may fall between harmonics, leading to the erroneous visual impression that the formant structure of the signal is related to the fundamental frequency. The theory states that to a first order approximation the source is independent of the vocal tract and thus formants can only change position as a result of changes in articulation. However, to produce good quality synthetic speech, it has proved necessary to modify this simple assumption to accommodate the effects of source-tract interactions.
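Eqn. 1 in the time domain corresponds to filtering a source waveform through resonators. The sketch below drives two second-order resonators with an impulse train at F0 = 100 Hz and approximates the radiation characteristic by differentiation; the formant frequencies (700 Hz and 1100 Hz) and bandwidths are assumed values for illustration only.

```python
import numpy as np

def resonator(x, f, bw, fs):
    """Second-order all-pole digital resonator modelling one formant."""
    r = np.exp(-np.pi * bw / fs)          # pole radius from bandwidth
    theta = 2 * np.pi * f / fs            # pole angle from centre frequency
    a1, a2 = 2 * r * np.cos(theta), -r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n] + a1 * y[n - 1] + a2 * y[n - 2]
    return y

fs, f0 = 8000, 100
source = np.zeros(fs // 10)               # 100 ms of signal
source[::fs // f0] = 1.0                  # S: impulse train at F0 = 100 Hz
vowel = resonator(source, 700, 90, fs)    # T: assumed F1 ~ 700 Hz
vowel = resonator(vowel, 1100, 110, fs)   # T: assumed F2 ~ 1100 Hz
speech = np.diff(vowel, prepend=0.0)      # R: radiation ~ differentiation
print(len(speech))  # 800 samples
```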

Models of the vocal tract transfer function

The vocal tract has been modelled in two ways. Models based on articulation attempt to describe the vocal tract using a number of tubes of differing areas to represent the shape of the tract at various points along its length. The pressure and volume velocities of the sound wave at the boundaries of these tubes can be adequately described up to 5 kHz using a transmission line analogue of plane wave propagation through each section. Underlying such articulatory models is an articulatory description of speech production. Terminal analogue models, on the other hand, make no attempt to parameterise vocal tract areas. Instead they combine linear-filter complex conjugate pole pairs to model the major resonance peaks in the vocal tract, and occasionally zeros (antiresonators) to model troughs in the vocal tract transfer function. Spectral troughs occur when the nasal cavity is coupled in with the oral cavity (e.g. during the



Fig. 12 Three cycles of (modal) voice (top) with corresponding laryngographic trace (bottom); horizontal axis: time, ms

production of nasal sounds). Underlying terminal analogue models is an acoustic description of speech production. There are two main problems with the articulatory approach to speech synthesis. First, the data required to develop such models is difficult to collect and, secondly, the models produced are computationally very expensive. Because of these problems, most TTS systems use synthesisers that attempt to model the dominant resonances of the vocal tract (formant synthesisers). Because formant models only use

the information contained in the acoustic speech signal, the data required to control such synthesisers is more readily available. Also, terminal analogue models are much simpler models of speech production. They do not necessarily attempt to represent the

Fig. 13 Block diagram representing the source-filter theory of speech production



Fig. 14 (a) Cascade of filter sections; (b) filter sections combined in parallel

speech signal faithfully; rather they represent only the perceptually significant aspects of it and as such require fewer control signals.

Because of the above, real-time speech synthesis is possible using

such models with the aid of digital signal-processing hardware. There have been a number of formant synthesiser designs, all of which agree on the underlying theory. However, there is much less

Table 2 Dynamically variable control signals used in the Holmes parallel formant synthesiser

Symbol and description (min. input 1, max. input 63 unless noted):

1  ALF  - low-frequency amplitude control, nasal formant, dB
2  A1   - amplitude of F1 formant, dB
3  A2   - amplitude of F2 formant, dB
4  A3   - amplitude of F3 formant, dB
5  AHF  - amplitude of high-frequency formant, dB
6  FN   - frequency of nasal formant, Hz
7  F1   - frequency of F1 formant, Hz
8  F2   - frequency of F2 formant, Hz
9  F3   - frequency of F3 formant, Hz
   (formant frequencies in the range 100 Hz to 3825 Hz, quantised to 256 values each)
10 F0   - quantised log fundamental frequency
11 VMIX - degree of voicing (voiceless to fully voiced)
12 TMS  - glottal pulse mark/space ratio

The parameters ALF and AHF are amplitude controls for the low- and high-frequency resonators, respectively. The parameters VMIX and TMS are used to control the degree of voicing and the voiced excitation mark/space ratio. Changing the TMS control value is similar in effect to changing the length of the open phase of the voiced source. The values 1 to 63 represent the range of legal input to the formant synthesiser.

agreement on exactly how such a model should be implemented.

The problem is that, when implementing a synthesiser, compromises must be made in how well various aspects of the speech signal are modelled. Two solutions to this problem have been proposed: speech synthesisers which combine their filter sections in a cascade or in a parallel arrangement.

The two approaches are shown pictorially in Fig. 14. Both methods have advantages and disadvantages. Basically, cascade designs are better at modelling vowel sounds as they model more closely the vocal tract with no nasal coupling. Also, as the resonators are cascaded, individual formant amplitude controls are not needed in the production of vowels. On the other hand, parallel designs are less sensitive to internally generated quantisation noise and, since individual formant amplitude controls are required, there is greater flexibility in the type of sounds that can be produced. This is a particular advantage when modelling obstruent sounds such as plosives, nasals and fricatives. Because of the need to model non-vowel sounds, cascade models normally contain extra parallel branches and as a result tend to be more complex in design. Figs. 15a, b and c show block diagrams of three of the most widely used formant synthesisers. The OVE synthesiser (Fig. 15a) is basically a cascade model with extra branches to accommodate nasal and fricative sounds, whereas the Holmes synthesiser (Fig. 15b) is a purely parallel design. The Klatt synthesiser (Fig. 15c) may be operated in either cascade or parallel mode; normally it is used as a cascade design.
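The cascade/parallel distinction can be seen directly in the frequency responses: cascaded sections multiply, so relative formant amplitudes emerge automatically, whereas parallel sections add and therefore need explicit amplitude controls. The resonator formula below is a standard two-pole section; the formant frequencies, bandwidths and amplitudes are illustrative assumptions, not values from any of the synthesisers above.

```python
import numpy as np

def formant_response(freqs, f, bw, fs):
    """Magnitude frequency response of one two-pole resonator section."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * f / fs
    z = np.exp(2j * np.pi * freqs / fs)
    return 1.0 / np.abs(1 - 2 * r * np.cos(theta) / z + r * r / z**2)

fs = 10_000
freqs = np.linspace(0, 5000, 501)
f1 = formant_response(freqs, 500, 80, fs)    # assumed F1 section
f2 = formant_response(freqs, 1500, 100, fs)  # assumed F2 section

# cascade: responses multiply, relative amplitudes come for free
cascade = f1 * f2
# parallel: responses add, each branch needs its own amplitude control
A1, A2 = 1.0, 0.5
parallel = A1 * f1 + A2 * f2

# both arrangements show peaks near 500 Hz and 1500 Hz
print(freqs[np.argmax(cascade)], freqs[np.argmax(parallel)])
```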

Table 2 shows the dynamically variable control signals used in one particular formant synthesiser. In this synthesiser, the control signals are updated every 10 ms.

The terminal analogue models described above are all based on a frequency domain analysis of the speech signal. However, a related class of speech synthesisers based on linear prediction (LP) analysis derive their transfer function from the time domain. In the analysis of a speech signal, linear prediction attempts to model the speech production system as described in Section 2. The model used is a further simplification of the source-filter description of speech production. The LP model combines the spectral



Fig. 15 (a) OVE II synthesiser comprising three separate branches, for the production of (top) vowels, (middle) nasals, aspiration and (bottom) fricatives. AO, AH, AN and AC are variable amplitude controls; F1-F3 are variable formant resonator frequency controls, and the fricative branch has variable resonator and antiresonator frequency controls. (b) Block diagram of the Holmes parallel formant synthesiser. ALF, AHF, A1-A3 are amplitude controls; FN, F1-F3 are variable resonator frequency controls. (c) Block diagram of the Klatt cascade/parallel formant synthesiser. AH, AF, AVS, AV, AN, AB, A1-A6 are amplitude controls. RGP, RGS, RNP, RNZ, R1-R6 are variable resonator and antiresonator frequency controls.



Fig. 16 Example of four cycles of inverse filtered speech (differentiated glottal flow) superimposed on which is an example (red line) of the type of contour produced by models of glottal flow

characteristics of the source, the vocal tract and the radiation at the lips within the filter transfer function. In this model the sampled speech signal s_n is considered to be the output of a system driven by an input u_n, such that the following relation holds:

s_n = - Σ_{k=1}^{p} a_k s_{n-k} + G Σ_{l=0}^{q} b_l u_{n-l},   b_0 = 1    (2)

where a_k (1 ≤ k ≤ p), b_l (1 ≤ l ≤ q) and G (the gain) are the parameters of the system. Eqn. 2 states that the output s_n is a linear combination of its past outputs and present and past inputs. This equation can be expressed in the Z-domain as follows:

H(z) = G (1 + Σ_{l=1}^{q} b_l z^{-l}) / (1 + Σ_{k=1}^{p} a_k z^{-k})    (3)

Eqn. 3 implies a model which has both poles and zeros; however, in most implementations an all-pole model is used in preference to the pole-zero model, as efficient algorithms for determining the filter coefficients exist. Using an all-pole model is acceptable because the type of spectral contributions produced by zeros (local dips in the spectra and changes in the spectral balance) can be approximated by an all-pole model and are considered perceptually less important. Linear prediction models are used


extensively in speech coding as a method of data reduction, in speech analysis for feature extraction and recently as speech production models in TTS systems.
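The all-pole coefficients a_k are commonly estimated by the autocorrelation method with the Levinson-Durbin recursion, a standard algorithm although the paper does not detail it. A sketch on a toy second-order autoregressive signal:

```python
import numpy as np

def lpc(signal, order):
    """All-pole LP coefficients via the autocorrelation method
    and the Levinson-Durbin recursion.

    Returns a with a[0] = 1, matching the denominator 1 + sum a_k z^-k.
    """
    n = len(signal)
    r = np.array([signal[:n - k] @ signal[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                      # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1 - k * k                    # residual prediction error
    return a

# Synthesise a signal with known all-pole coefficients [1, -1.3, 0.7]
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
s = np.zeros_like(e)
for n in range(2, len(s)):
    s[n] = 1.3 * s[n - 1] - 0.7 * s[n - 2] + e[n]

a = lpc(s, 2)
print(np.round(a, 2))  # close to [1, -1.3, 0.7]
```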

Voiced source models
There have been a number of vocal fold models, varying in complexity from simple one- or two-mass models to 16-mass systems that are capable of modelling vertical shear effects and lateral mucosal effects. Recently, finite element analysis has been used to model glottal flow. In these models the air flow through the glottis is assumed to be a viscous fluid flow and the vocal folds are assumed to be viscoelastic bodies. Such numerical methods of voiced source modelling are complex and computationally expensive. The models described above have all attempted to model the mechanics of the vocal folds. Other methods (Fig. 16) have attempted to model the glottal flow without reference to the larynx physiology, basing the models on examinations of the glottal flow derived from inverse filtering. It is these models which have been applied most recently to formant synthesisers.
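A classic example of such a direct glottal-flow parameterisation is a Rosenberg-style polynomial pulse. The sketch below is illustrative: the 60% open phase and the rising/falling split within it are assumed values, not taken from the models cited in the text.

```python
import numpy as np

def rosenberg_pulse(n_samples, open_frac=0.6, rise_frac=0.67):
    """One period of a Rosenberg-style glottal volume-velocity pulse.

    Polynomial rise then faster fall during the open phase;
    zero flow during the closed phase.
    """
    g = np.zeros(n_samples)
    n_open = int(open_frac * n_samples)     # end of open phase
    n1 = int(rise_frac * n_open)            # end of rising portion
    t = np.arange(n1) / n1
    g[:n1] = 3 * t**2 - 2 * t**3            # smooth rise in flow
    t = np.arange(n_open - n1) / (n_open - n1)
    g[n1:n_open] = 1 - t**2                 # faster fall to closure
    return g

period = 80                                 # 100 Hz at fs = 8 kHz
pulse = rosenberg_pulse(period)
excitation = np.tile(pulse, 10)             # ten pitch periods
d_glottal = np.diff(excitation, prepend=0.0)  # differentiated flow, cf. Fig. 16
```

Feeding `d_glottal` to a formant filter in place of a simple impulse train is the kind of refinement the next section discusses.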

4 Voiced excitation effects
Originally, formant synthesisers used relatively simple models of the voiced source which were assumed to be independent of the vocal tract in accordance with the source-filter description of speech production. Any interaction between the voiced excitation and the vocal tract was considered to be a secondary effect and unlikely to affect significantly

.he quality of the synthetic speech. For example, the Holmes synthesiser used a single doubly iifferentiated instance of a volume Jelocity flow waveform to drive the -esonators. The stored waveform ,vas compressed or expanded in .ime to accommodate different undamental periods. This source node1 retained the phase Iharacteristics of a glottal pulse but xoduced a flat excitation spectrum above 1000 Hz. Information on the -equired spectral slope for the source was incorporated into the Formant amplitude control signals. Other synthesisers, such as the Klatt, used simplified models of the voiced source, which did not retain the phase characteristics of the signal but approximated a desired spectral slope. It was shown many years ago that formant synthesisers with only simple models of the voiced excitation could produce high-quality male speech,l8 but it is only recently, with the inclusion of better source models and a greater understanding of how the voiced source contributes to the naturalness of speech, that high- quality female speech has been achieved." Voiced-source effects can be grouped into three broad classes as follows:

voicing-mode effects
glottal pulse timing effects
source-tract interactions.

In voicing-mode effects consideration is given to the behaviour of the vocal folds during phonation. Typically the vocal folds exhibit three distinct types of vibration. These include:

(a) modal voice (Fig. 12), the most common form of vocal fold vibration. Simplistically, modal voice can be described as consisting of two clearly defined phases during a typical larynx cycle: an open phase, in which the vocal folds are apart, lasting typically for 60% of the cycle; and a closed phase, in which the vocal folds are together, completely closing the glottis.

(b) breathy voice (Fig. 17). During breathy voice the vocal folds may never completely occlude the glottis during a glottal cycle. This results in a breathy quality persisting throughout the larynx cycle and a reduction in the amplitude of the higher harmonics. Breathy voice may occur at both the beginning and end of phonation and is more common in female speakers, particularly at high

ELECTRONICS & COMMUNICATION ENGINEERING JOURNAL FEBRUARY 1992



fundamental frequency values.

(c) creak (Fig. 18), often called vocal fry or laryngealised voice (pressed voice). This type of vocal fold behaviour is associated with a very short open phase and a low fundamental frequency.

Glottal pulse timing effects are characterised by a cycle-by-cycle variation in the fundamental period, often called ‘jitter’, and a cycle-by-cycle variation in the glottal-pulse amplitude, often called ‘shimmer’. It has been observed that variations in the fundamental period are perceptually significant and lead to a perception of roughness in the voice.
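Both measures are easily computed from cycle-by-cycle measurements. The sketch below uses one common mean-absolute-difference definition expressed as a percentage of the mean; exact definitions of jitter and shimmer vary across the literature, so treat this formula as an assumption.

```python
import numpy as np

def jitter_shimmer(periods_ms, amplitudes):
    """Mean absolute cycle-to-cycle variation of the fundamental
    period (jitter) and of the glottal-pulse amplitude (shimmer),
    each expressed as a percentage of its mean value."""
    periods = np.asarray(periods_ms, dtype=float)
    amps = np.asarray(amplitudes, dtype=float)
    jitter = 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = 100.0 * np.mean(np.abs(np.diff(amps))) / np.mean(amps)
    return jitter, shimmer
```

A perfectly periodic pulse train gives zero for both measures; adding a small random perturbation to the period sequence produces the roughness percept described above.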

Source-tract interactions consist of the following effects:

(i) ripple in the voiced source waveform. During the larynx cycle a standing wave is generated in the lower pharyngeal portion of the vocal tract. However, the transglottal pressure is not constant over a larynx cycle. This change in pressure interacts nonlinearly with the standing wave, causing ripples to appear in the voiced source. The standing wave may also cause a change in the mechanical behaviour of the vocal folds when the first formant frequency is close to an integer multiple of the fundamental frequency.

(ii) formant damping. The coupling of the sub-glottal cavity to the supra-glottal cavity during the open phase effectively changes the impedance of the source driving the vocal tract, which causes heavy formant damping (particularly in the first formant) and shifts in formant frequencies.

All three of the synthesisers mentioned in the last Section have recently had their models of voiced excitation improved to include some, if not all, of the effects described above.

5 Phoneme-to-speech conversion

The set of control signals used to drive a formant synthesiser may be produced in two ways: by analysis or by rule. In synthesis-by-analysis, often called 'copy synthesis', an attempt is made to derive a set of suitable synthesiser control signal values through an intensive analysis of the original speech. As previously suggested, synthetic speech produced in this manner can be of a very high quality. The success of copy synthesis is, to a large extent, due to the fact that very few assumptions are made about the relative importance to the naturalness of speech of particular acoustic events. In effect the only assumptions used are those made during the design of the synthesiser. In contrast, in synthesis-by-rule an attempt is made to produce natural sounding synthetic speech from some abstract description.

17 Five cycles of breathy voice (top) with corresponding laryngographic trace (bottom)

Converting from this abstract description to a set of synthesiser control signals requires a model of speech perception. If the model is wrong or overly simplistic the naturalness of the synthetic speech is affected. In a TTS system the model of speech perception is normally embodied in a number of different elements; four of the most important of these are listed below:

An intonation model. Intonation


18 Example of creak (top) with corresponding laryngographic trace (bottom). During creak the vocal folds vibrate in a biphasic pattern: this pattern is clearly visible in the bottom trace as a full closure and partial closure during each larynx cycle

ELECTRONICS Cy. COMMUNICATION ENGINEERING JOURNAL FEBRUARY 1992 29

Page 12: synthesis models: a review a - Politechnika Śląskamer.chemia.polsl.pl/biometrologia/materialy/speech/... · 2007-10-05 · Speech synthesis models: a review a Models of speech production

may be defined as the pattern of pitch changes in an utterance. In English, intonation is used as a cue to signal syllabic stress, and also to convey a great deal of information, such as sarcasm, questioning and assertion, as well as emotions such as happiness and sadness. In tone languages (e.g. Chinese) intonation is also used to change the meaning of words.

A model of phoneme durations. Phoneme durations are another important cue to syllable stress and to the overall rhythm of an utterance. The duration of a phoneme is greatly affected by speaking rate, the surrounding phonemes and whether the phoneme is positioned at a syllable, word or phrase boundary.
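A widely cited way of capturing such effects is the Klatt duration rule used in MITalk-style systems, in which each segment has an inherent duration and an incompressible minimum, and each applicable rule scales the stretchable part by a percentage. The sketch below is a simplified rendering of that idea, not the published rule set; the example values are invented.

```python
def phoneme_duration(inherent_ms, minimum_ms, rule_percentages):
    """Klatt-style duration computation: each applicable rule shortens
    or lengthens the stretchable part of the segment by a percentage,
    while the minimum duration is incompressible.
    rule_percentages: e.g. [85, 60] applies two shortening rules."""
    prcnt = 100.0
    for p in rule_percentages:
        prcnt = prcnt * p / 100.0
    return (inherent_ms - minimum_ms) * prcnt / 100.0 + minimum_ms
```

For example, a vowel with an inherent duration of 140 ms and a minimum of 60 ms shortened by one 50% rule comes out at 100 ms, so repeated shortening rules asymptotically approach the minimum rather than collapsing the segment.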

A model of voiced source effects. As observed in the previous Section, accurate modelling of these effects is important to the naturalness of synthetic speech. This is particularly true for female synthetic speech.

A set of phoneme production rules. To convert a string of phonemes into a set of formant synthesiser parameter tracks requires a model which describes how phonemes interact with changing articulation. Many present day TTS systems (e.g. the MITalk system²⁰) use a method based on the pioneering work of Holmes, Mattingly and Shearme.²¹ In its broadest form this method uses a set of tables to store phonetic information about the phonemes. The tables contain, among other things, target (expected) formant values for the phonemes. During the production of speech, rules are used to interpolate between these target values. The model used to design these rules must accommodate important effects observed in real speech, such as coarticulation (the modification of a sound due to the articulation of its neighbours) and changes in the spectral characteristics of a sound due to speaking style. The complexity of these rules varies considerably between TTS systems.
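The table-plus-interpolation idea can be sketched as follows. The target values below are rough textbook-style figures invented for illustration, and real rule sets use transition durations and boundary values rather than simple midpoint-to-midpoint straight lines.

```python
import numpy as np

# Illustrative target table (values in Hz are rough invented figures,
# not taken from any published rule set).
TARGETS = {
    'a': {'F1': 700, 'F2': 1200, 'F3': 2600},
    'i': {'F1': 300, 'F2': 2200, 'F3': 2900},
    's': {'F1': 400, 'F2': 1700, 'F3': 2500},
}

def formant_track(phonemes, durations_ms, frame_ms=10, formant='F2'):
    """Piecewise-linear interpolation between per-phoneme formant
    targets: a much-simplified version of table-driven synthesis in
    which each phoneme reaches its target at its temporal midpoint."""
    times, values, t0 = [], [], 0.0
    for ph, dur in zip(phonemes, durations_ms):
        times.append(t0 + dur / 2.0)       # target hit at the midpoint
        values.append(TARGETS[ph][formant])
        t0 += dur
    frames = np.arange(0.0, t0, frame_ms)  # one value per control frame
    return np.interp(frames, times, values)
```

Gradual transitions between targets give a crude stand-in for coarticulation; real rules additionally vary the transition shape with phoneme class and speaking style.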

6 Concatenation methods of synthesis

So far this article has only considered TTS systems which use a model of speech production as part of the synthesis process. The complexity of producing adequate phoneme-to-speech production rules has led a number of researchers to investigate alternative methods of speech synthesis that do not require an explicit model of production.

Concatenation methods generate synthetic speech by concatenating segments of encoded speech and so do not require production models or interpolation rules: only phoneme duration and fundamental frequency need to be specified. However, because no production model or interpolation rules are used, examples must be collected for as many phonetic environments as possible. Ultimately, concatenation systems are limited by the number of environments that can be realistically stored, and as such are theoretically inferior to methods which employ models of speech production and so can extrapolate from a very limited set of abstract symbols.

The coding strategies used differ in complexity, ranging from stored speech samples to models of production such as linear prediction. All methods, whatever the coding scheme, have one thing in common: the spectral characteristics of the vocal tract (and in many instances the excitation) are encoded within the basic speech units. A natural choice of unit would appear to be the syllable; however, there are over 10 000 syllables in English, making it an unrealistic candidate. Smaller units have therefore been adopted, such as triphones, which consist of three phonemes, and more commonly diphones, which consist of the transitions from the centre of one phoneme to the centre of another phoneme. There are approximately 2000 diphone units in English. Such units are attractive because they represent a compromise between the number of units required and the amount of coarticulatory information included. However, diphones still suffer from a lack of coarticulatory information unless taken from an environment similar to that of the synthetic phrase. In an effort to reduce both coarticulatory effects and segment mismatches at unit boundaries, many researchers are investigating ways of ensuring that speech segments are extracted from appropriate contexts. Methods of incorporating contextual information into the unit selection procedure include clustering techniques²² and the use of databases containing annotated words, phrases and sentences.²³ In these methods the units selected from a database or cluster tree are those which match the desired unit and most closely fit the context requirements of the unit in the synthetic phrase.
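To make the diphone scheme concrete, the sketch below lists the units needed for a phrase and joins stored waveforms with a short linear crossfade at each boundary. It is a toy illustration: the inventory format, unit naming and crossfade are all assumptions, and a real system would additionally impose the specified phoneme durations and fundamental frequency on the units.

```python
import numpy as np

def phrase_to_diphones(phonemes):
    """Split a phoneme sequence into the diphone units needed to
    synthesise it, with '_' marking the silence at each end."""
    seq = ['_'] + list(phonemes) + ['_']
    return [a + '-' + b for a, b in zip(seq, seq[1:])]

def concatenate(units, inventory, overlap=64):
    """Join stored diphone waveforms, applying a short linear
    crossfade at each boundary to reduce segment mismatches."""
    out = inventory[units[0]].astype(float)
    fade_in = np.linspace(0.0, 1.0, overlap)
    for name in units[1:]:
        nxt = inventory[name].astype(float)
        # Fade out the end of the accumulated signal while fading in
        # the start of the next unit, then append the remainder
        out[-overlap:] = out[-overlap:] * fade_in[::-1] + nxt[:overlap] * fade_in
        out = np.concatenate([out, nxt[overlap:]])
    return out
```

Because the crossfade only smooths amplitude, spectral mismatches between units taken from dissimilar contexts remain audible, which is exactly the motivation for the context-sensitive unit selection methods described above.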

7 Conclusions

Over the past two decades there have been major advances in our understanding of how to model speech production. This paper has reviewed a number of speech synthesis models which, given the right parameters, can produce highly natural sounding male and female synthetic speech. Unfortunately, the methods presently used to derive such parameters from text cannot achieve a high level of naturalness. In the short term, the lack of success in developing phoneme-to-speech conversion rules has been offset to some extent by advances in methods of concatenation. Presently, systems based on concatenation can offer a higher level of segmental naturalness than those employing a speech synthesiser. The hope is that, in the longer term, methods for driving formant or articulatory based synthesisers will have reached a level of sophistication capable of producing highly natural sounding speech.

Acknowledgment

To my colleagues at British Telecommunications: Maggie Gaved, Tim Gillott, Martin Hall, Andrew Lowry, Stephen Macgregor and Ian Payton.

References

1 KLATT, D. H.: 'Review of text-to-speech conversion for English', J. Acoust. Soc. Am., 1987, 82, (3), pp. 737-793
2 CARLSON, R., GRANSTROM, B., and HUNNICUTT, S.: 'Multilingual text-to-speech development and applications', Advances in Speech, Hearing and Language Processing, 1990, 1, pp. 269-29
3 LADEFOGED, P.: 'A course in phonetics' (Harcourt Brace Jovanovich, 1982)
4 FOURCIN, A. J., HARLAND, G., BARRY, W., and HAZAN, V.: 'Speech input and output assessment. Multilingual methods and standards' (Ellis Horwood, Chichester, 1989)
5 FANT, G.: 'Acoustic theory of speech production' (Mouton, 's-Gravenhage, The Netherlands, 1960)
6 DUDLEY, H., and TARNOCZY, H.: 'The speaking machine of Wolfgang von Kempelen', J. Acoust. Soc. Am., 1950, 22, (1), pp. 151-166
7 DUDLEY, H., RIESZ, R. R., and WATKINS, S. A.: 'A synthetic speaker', J. Franklin Inst., 1939, 227, pp. 739-764
8 CHAN, D. S. F., and BROOKS, D. M.: 'Variability of excitation parameters derived from robust closed phase glottal inverse filtering'. European Conf. on Speech Communication and Technology, Paris, 1989, pp. 199-202
9 FOURCIN, A. J.: 'First applications of a new laryngograph', Medical & Biological Illustration, 1971, 21, pp. 172-182
10 HOLMES, J. N.: 'Formant synthesisers: cascade or parallel?', Speech Communication, 1983, 2, pp. 251-273
11 LILJENCRANTS, J.: 'The OVE III speech synthesiser', IEEE Trans., 1968, AU-16, (1), pp. 137-140
12 HOLMES, J. N.: 'A parallel-formant synthesiser for machine voice output', Chap. 7 in FALLSIDE, F., and WOODS, W. A.: 'Computer speech processing' (Prentice Hall, 1985), pp. 163-187
13 KLATT, D. H.: 'Software for a cascade/parallel formant synthesiser', J. Acoust. Soc. Am., 1980, 67, pp. 971-995
14 MAKHOUL, J.: 'Linear prediction: a tutorial review', Proc. IEEE, 1975, 63, pp. 561-580
15 FLANAGAN, J. L., and ISHIZAKA, K.: 'Computer model to characterise the air volume displaced by the vibrating vocal folds', J. Acoust. Soc. Am., 1978, 63, pp. 1559-1565
16 IIJIMA, H., MIKI, N., and NAGAI, N.: 'Glottal flow analysis based on a finite element simulation of a two-dimensional unsteady viscous fluid'. Proc. Int. Conf. on Spoken Language Processing, Japan, 1990, 1, pp. 77-80
17 FANT, G., LILJENCRANTS, J., and LIN, Q.: 'A four-parameter model of glottal flow'. STL-QPSR 4 (Royal Institute of Technology Speech Transmission Laboratory, Stockholm, Quarterly Progress Report), 1985, pp. 1-13
18 HOLMES, J. N.: 'The influence of glottal waveform on the naturalness of speech from a parallel formant synthesiser', IEEE Trans., 1973, AU-21, pp. 298-305
19 KLATT, D. H., and KLATT, L. C.: 'Analysis, synthesis, and perception of voice quality variations among female and male talkers', J. Acoust. Soc. Am., 1990, 87, (2), pp. 820-857
20 ALLEN, J., HUNNICUTT, M., and

Book Reviews

Computer communication networks
Gill Waters (Ed.)
McGraw-Hill 1991, 375pp., £35
ISBN 0 077073258

During the past ten years financial pressures, increased industrial awareness and demand for wider access have forced British universities to take greater commercial advantage of their inherent strengths. One such area is the provision of taught courses for commercial consumption without the requirement of registering for any formal qualification. Indeed, one of the few commercial sectors to be relatively unaffected by the current recession is that of industrial training courses, and in particular courses related to communications and personal computer usage and awareness. It seems that British industry has, at least in certain fields, finally become aware of the need to train its staff, both to provide a more relevantly educated workforce and to improve their commercial effectiveness.

Given this backdrop it is no surprise, therefore, that the better departments in the stronger universities are exploiting this demand and are offering a variety of introductory courses in the field of communications. The next step in this exploitation is to take the notes of such courses, to edit them appropriately and finally to release them as a book. The book 'Computer communication networks' is such a text and is taken from a series of short courses run by the

Department of Electronic Systems Engineering at the University of Essex. As such it is the fourth publication in the ‘Essex series in telecommunication and information systems’.

The text is based upon fifteen chapters written by nine authors from both academic and industrial backgrounds. The topics covered are computer networks, point-to-point and broadcast communication techniques, wide-area network design issues, open systems interconnection, proprietary network architectures, local-area networks, network interconnection, secure communications, distributed computing systems, network management, voice/data integration and, finally, advances in networking. Both the strength and the weakness of the book lie in its scope, there being a bit of most things in its 375 pages. This is definitely a book which should be used as a basic introduction. But beware: it is a little uneven, with some areas (for example security) being too detailed, so that the book's balance is lost.

There are two serious flaws in the content: the information in some areas is either badly out of date or incomplete, and there is little in the way of actual case study or experience detail. An example of the former is the paucity of information on topics such as internetworking, systems integration, network operating systems and how protocols really work. As for the latter, this could only have been rectified by the inclusion of some end users as authors. The other

KLATT, D.: 'From text to speech: the MITalk system' (Cambridge University Press, 1987)
21 HOLMES, J. N., MATTINGLY, I. G., and SHEARME, J. N.: 'Speech synthesis by rule', Language & Speech, 1964, 7, pp. 127-143
22 NAKAJIMA, S., and HAMADA, H.: 'Automatic generation of synthesis units based on context oriented clustering'. Proc. ICASSP 88, 1988, 1, pp. 659-662
23 HIROKAWA, T.: 'Speech synthesis using a waveform dictionary'. Proc. Euro. Conf. Speech Com., 1989, pp. 140-143

© IEE: 1992. First received 6th September and in revised form 22nd November 1991

Dr. Breen is with BT Laboratories, Martlesham Heath, Ipswich IP5 7RE, UK.

worry for potential purchasers is the overlap with the standard texts of either Halsall or Tanenbaum or Stallings. Only Stallings has completed updating his text, but new editions from the others will be with us some time in 1992. There is, surprisingly, no real concern here as the use of several different authors makes this a good complement to the other texts - it should still fulfil this rôle with the arrival of the new editions.

This book is a product of its history, from which it derives both its weaknesses and strengths. As the book of a course it acts as an excellent substitute for those who cannot attend the course or those who want a more formal presentation. It is also a welcome addition for those of us who need as much help as possible in understanding the intricacies of communications, but take care to remember that this is only the starting point.

C. SMYTHE

Common channel signalling
Richard J. Manterfield
Peter Peregrinus 1991, 221pp., £38
ISBN 0 863412408

Targeted at a wide readership profile - from the novice through to experts already in the field of common channel signalling - this book's 214 text pages offer 11 chapters. Commencing in chapter 1 with principles of signalling systems, the book goes on to offer an interesting and concise tour of existing proprietary channel-
