Festival Speech Synthesis System


    ABSTRACT

    The Festival Speech Synthesis System was developed at the Centre for Speech

    Technology Research (CSTR) at the University of Edinburgh in the late 1990s. It offers a free,

    portable, language-independent, run-time speech synthesis engine for various

    platforms under various APIs. Festival is just the engine that we will use in the process

    of building voices, both as a run-time engine for the voices we build and as a tool in

    the building process itself.

    Festival offers a general framework for building speech synthesis systems as well as

    including examples of various modules. As a whole it offers full text to speech

    through a number of APIs: from the shell level, through a Scheme command interpreter, as a

    C++ library, from Java, and an Emacs interface. Festival is multi-lingual (currently

    English (British and American), and Spanish) though English is the most advanced.

    Other groups release new languages for the system.


    CONTENTS

    1. INTRODUCTION
    2. TEXT-TO-SPEECH SYSTEMS
    3. GENERAL ANATOMY OF A SYNTHESIZER
    4. CORE SYSTEM
    5. ARCHITECTURE
       5.1 String processing
       5.2 Multi-level data structures
       5.3 Relations
       5.4 Items & Features
       5.5 Utterances
    6. TEXT ANALYSIS
    7. LINGUISTIC/PROSODIC ANALYSIS
       7.1 Lexicons
       7.2 Lexicons and addenda
       7.3 Letter-to-sound rules
       7.4 Post-lexical rules
    8. WAVEFORM GENERATION
    9. DIPHONE SYNTHESIS
       9.1 Recording the diphones
       9.2 Labeling the diphones
    10. CONCLUSION


    INTRODUCTION

    The idea that a machine could generate speech has been with us for some time, but

    the realization of such machines has only really been practical within the last 50

    years. Even more recently, it's in the last 20 years or so that we've seen practical

    examples of text-to-speech systems that can say any text they're given -- though it might be "wrong."

    The creation of synthetic speech covers a whole range of processes, and though often

    they are all lumped under the general term text-to-speech, a good deal of work has

    gone into generating speech from sequences of speech sounds; this would be a

    speech-sound (phoneme) to audio waveform synthesis, rather than going all the way

    from text to phonemes (speech sounds), and then to sound.

    One of the first practical applications of speech synthesis was in 1936 when the U.K.

    Telephone Company introduced a speaking clock. It used optical storage for the

    phrases, words, and part-words ("noun," "verb," and so on) which were appropriately

    concatenated to form complete sentences.

    While speech and language were already important parts of daily life before the

    invention of the computer, the equipment and technology that has developed over the

    last several years has made it possible to have machines that speak, read, or even

    carry out dialogs. A number of vendors provide both recognition and speech

    technology, and there are several telephone-based systems that do interesting things.

    The Festival Speech Synthesis System is a free, portable, language-independent, run-

    time speech synthesis engine for various platforms under various APIs.


    TEXT-TO-SPEECH SYSTEMS

    Text-to-speech systems have an enormous range of applications.

    Their first real use was in reading systems for the blind, where a system would read

    some text from a book and convert it into speech. These early systems of course

    sounded very mechanical, but their adoption by blind people was hardly surprising as

    the other options of reading Braille or having a real person do the reading were often

    not possible. Today, quite sophisticated systems exist that facilitate human-computer

    interaction for the blind, in which the TTS can help the user navigate around a

    windowing system.

    The mainstream adoption of TTS has been severely limited by its quality. Apart from

    users who have little choice (as in the case of blind people), people's reaction to old-

    style TTS is not particularly positive. While people may be somewhat impressed and

    quite happy to listen to a few sentences, in general the novelty of this soon wears off.

    In recent years, the considerable advances in quality have changed the situation such

    that TTS systems are more common in a number of applications. Probably the main

    use of TTS today is in call-centre automation, where a user calls to pay an electricity

    bill or book some travel and conducts the entire transaction through an automatic

    dialogue system. Beyond this, TTS systems have been used for reading news stories,

    weather reports, travel directions and a wide variety of other applications.

    Other than engineering aspects of text-to-speech, it is worth commenting that research

    in this field has contributed an enormous amount to our general understanding of

    language. Often this has been in the form of negative evidence, meaning that when

    a theory thought to be true was implemented in a TTS system it was shown to be

    false; in fact as we shall see, many linguistic theories have fallen when rigorously

    tested in speech systems. More positively, TTS systems have made good testing

    grounds for many models and theories, and TTS systems are certainly interesting in

    their own terms, without reference to application or use.


    General Anatomy of a Synthesizer

    Within Festival we can identify three basic parts of the TTS process:

    Text analysis:

    From raw text to identified words and basic utterances.

    Linguistic analysis:

    Finding pronunciations of the words and assigning prosodic structure to them: phrasing, intonation and durations.

    Waveform generation:

    From a fully specified form (pronunciation and prosody) generate a waveform.

    These partitions are not absolute, but they are a good way of chunking the problem.

    Of course, different waveform generation techniques may need different types of

    information. Pronunciation may not always use standard phones, and intonation need

    not necessarily mean an F0 contour. For the most part, at least along the path that is

    likely to generate a working voice, rather than the more research-oriented techniques

    described, the above three sections will be fairly cleanly adhered to.

    There is another part to TTS which is normally not mentioned; we will mention it here

    as it is the most important aspect of Festival that makes the building of new voices

    possible -- the system architecture. Festival provides a basic utterance structure, a

    language to manipulate it, and methods for construction and deletion; it also interacts

    with your audio system in an efficient way, spooling audio files while the rest of the

    synthesis process can continue. With the Edinburgh Speech Tools, it offers basic

    analysis tools (pitch trackers, classification tree builders, waveform I/O etc) and a

    simple but powerful scripting language. All of these functions make it so that you

    may get on with the task of building a voice, rather than worrying about the

    underlying software too much.
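    Since Festival exposes this pipeline at the shell level and through its Scheme command
    interpreter, the whole process can be exercised directly. A minimal usage sketch with
    standard Festival commands (the voice name assumes the corresponding diphone voice is
    installed):

      $ echo "Hello world." | festival --tts       # speak piped text from the shell
      $ text2wave message.txt -o message.wav       # batch synthesis to a waveform file
      $ festival                                   # start the interactive Scheme interpreter
      festival> (voice_kal_diphone)                ; select an American English diphone voice
      festival> (SayText "Welcome to the Festival Speech Synthesis System.")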


    Core system

    In Festival we make a firm distinction between the core system and the modules

    which actually perform speech synthesis tasks. The core system, which includes the

    architecture, is written completely in C++ and doesn't change. Modules, on the other

    hand, can be written in C++ or Scheme and can be added or taken away from the

    system with minimal disruption (adding or taking away a C++ module requires a re-

    linking of course, but not a major recompilation). The decision of which language to

    write a module in is largely a matter of choice. Because of the interpreter aspect, it is

    usually easiest to develop modules in Scheme and, for efficiency, re-write them
    in C++ after they are stable. Some types of programming (e.g. arrays in C++,

    recursion in Scheme) are more natural in one language than the other. Finally,

    depending on personal experience, some programmers simply prefer one to the other.
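    For example, a helper that is still being experimented with can simply be defined in the
    running interpreter, with no relinking of the C++ core; the function below is only a
    trivial sketch, not one of Festival's own modules.

      ;; Count the word items of an utterance, typed or loaded into the interpreter.
      (define (count_words utt)
        (length (utt.relation.items utt 'Word)))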


    Architecture

    Speech synthesis systems require ways of storing the various types of

    linguistic information produced in the process of converting the input format

    (e.g. text) into speech. There are a number of design considerations which

    builders of synthesis architectures must take into account. We have listed

    these as follows:

    Simple linguistic objects such as words, phones, syllables and phrases need to

    be represented.

    A co-indexing mechanism is needed whereby, given a word, one can find the

    phones that comprise this word.

    Changes should be localized. For example, a change in the syllabication

    algorithm should only require changing those parts of the program directly

    involved with syllabication - other modules should not be affected.

    There should be no redundancy or duplication of information in the system.

    For instance, it is common to want to know the start and end times of

    linguistic entities. The end time of a word will be the same as the end time of

    the last phone in that word. If this information is stored separately for phone

    and word, problems will occur if one value is changed as the other will then

    become out of date.

    By their very nature, multi-lingual systems must support a wide variety of

    linguistic theories. Hence the architecture should not be tied to any particular

    linguistic theory or formalism.

    The architecture should be fast and efficient as the synthesizer is intended for

    real-time use.

    Most importantly, the real purpose of the architecture is to allow

    speech synthesis algorithms to be written as easily as possible. It is therefore

    important that the architecture should be unobtrusive and provide the sorts of


    structures and information that synthesis algorithms need. All programs must deal

    with infrastructure issues such as data storage, file I/O, memory allocation etc. If left

    unchecked, algorithms can easily become bogged down with this sort of code, which

    becomes intertwined with the actual algorithm itself. A good architecture will abstract

    the infrastructure to such an extent that these aspects are hidden, so that synthesis

    algorithms can be easily written and read without other issues getting in the way. The

    interface should be easy to use, and make the writing of synthesis algorithms easier

    and quicker than by purely ad-hoc methods.

    String processing

    Many early synthesis systems used what has been referred to as a string re-writing

    mechanism as their central data structure. In this formalism, the linguistic

    representation of an utterance is stored as a string. Initially, the string contains text,

    which is then re-written or embellished with extra symbols as processing takes place.

    Systems such as MITalk [1] and the CSTR Alvey synthesizer [5] used this method.

    There are many shortcomings in this formalism which have been previously

    recognized. The main problem with the string formalism is that it soon becomes

    unwieldy for anything apart from the most trivial of tasks. Often the string becomes
    very complex, with words, phrase symbols, stress symbols, phones etc all mixed in

    together. There are two main ways in which modules process such strings. Modules

    can work on this string directly, but the interpretation of the symbols often gets in the

    way of the algorithm itself. Alternatively, a module can parse the string into an

    internal format and let the algorithms use that. Although this may simplify the writing

    of the algorithms themselves, this is a very unwieldy approach as it means the string

    has to be parsed every time a module is called. Moreover, this often leads to each

    module having an individual internal data structure, which is unattractive from a programming point of view as new structures and techniques have to be learnt to

    understand the workings of any new module.

    To lessen these problems, information is often deleted from the string so as to keep

    only what is perceived as essential information. For instance, after the grapheme to


    phoneme conversion, the orthographic form of the word may be deleted from the

    string. This can have unfortunate consequences in that information which might

    potentially be of use to a module may have been deleted by an earlier module.

    Multi-level data structures

    In recognition of these shortcomings, many current systems have abandoned string

    based processing and now use multi-level data structures (MLDS). The most famous

    of these systems, known as Delta, was developed by Hertz [6], but many other

    systems, such as Chatr [3], the Bell labs system [7], Polyglot [4], and early versions of

    Festival [2] use similar formalisms. In the multi-layered formalism, different types of

    linguistic information are held in separate streams which are linear lists or arrays of

    linguistic items. For example we may have a word stream, a phone stream and a

    syllable stream. Some systems have a fixed set of these while others allow an

    arbitrary number. Often algorithms need to know what phones are related to a given

    word, and hence the streams must be co-indexed. There are two main types of co-

    indexing. In Delta, streams are aligned by the edges of items. To find the phones in a

    word, one goes to the beginning of the word, traces the edge down to the phone

    stream, and then progresses along the phone stream until the edge relating to the end

    of the word is found. An alternative strategy is to align by the centers of items. In this

    case, a word contains a set of links to the phones that are related to it, and the phones

    in a word can be found by following these links. While multi-level data structures are

    far preferable to string structures they still have serious drawbacks. The main

    drawback stems from the fact that this forces all information to be represented by

    linear structures: other types of structure, specifically trees, are very hard to represent.

    Partly because of this, the number of streams in a system can become considerable

    which leads to difficulties in co-indexing the items in streams. With the Delta co-

    indexing, it is often the case that for a given item in one stream, there is no

    corresponding item in other streams. For example, a pause is represented by an item

    in the phone stream, but there is no equivalent item in the syllable or word streams.

    Thus a hole must be created in these streams to ensure the co-indexing works.


    With a large number of streams, the number of holes can become considerable which

    makes processing awkward. In the centre-linking paradigm, the hole problem is absent,

    but because each item must be explicitly linked to items in other streams, the number

    of links can become very large. Furthermore, if a new stream is added, one would

    have to link every existing item to the items in the new stream to ensure full

    connectivity. As this isn't possible in practice, the streams are often left partially

    connected. This can cause confusion in writing modules, as one may be unsure of what

    streams are linked to what. Another more subtle problem occurs in multi-level

    structures due to a lack of clarity as to what an item really represents. Often (not

    always) each item has a single value, typically a name. While it is obvious that a

    suitable phone name could be something like /h/ or /e/ and a word "hello", what should
    the name of a syllable be? Commonly, syllables are regarded as organisational units,

    which serve to group together phones. As such, they don't have a distinct name. One

    could group together the names of the phones and make that the syllable name (e.g.

    "/h e l/" for the first syllable of "hello"), but this is redundant, somewhat artificial and

    likely to cause errors if the phone representation is changed.

    Relations

    We use the term relation to represent our generalization of the stream concept. A

    relation is a data structure which is used to organize linguistic items into linguistic

    structures such as trees and lists. A relation is a set of named links connecting a set of

    nodes. Lists (streams in the MLDS sense) are linear relations where each node has a

    previous and next link. Nodes in trees have up and down links also. Nodes are purely

    positional units, and contain no information apart from their links. In addition to

    having links to other nodes, each node has a single link to an item, which contains the

    linguistic information.
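    From Festival's Scheme interpreter, relations and their items can be inspected directly.
    A minimal sketch, using the access functions documented in the Festival manual, that
    walks the Word relation of a freshly synthesized utterance:

      ;; List the word items of an utterance by walking its Word relation.
      (set! utt (SayText "hello world"))
      (mapcar
       (lambda (w) (item.name w))            ; the name of each word item
       (utt.relation.items utt 'Word))       ; all items in the Word relation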


    Utterances, Relations, Items

    Items & Features

    Items consist of a bundle of features and a set of named links to nodes in relations.

    Items can be linked into any number of relations - in the above example words were

    linked into 2 relations, but in principle any number of relations is possible.

    The information in the items themselves is represented by features, which are stored as a

    list of key-value pairs. Feature values are commonly numbers or strings, but may also

    take complex objects as values if necessary. An item can have arbitrarily many

    features, including zero: syllable items often have no features at all, for example. As

    well as simple values, features can also take functions as values. Function features are

    a powerful facility which can help to greatly reduce the amount of redundant


    information in an utterance structure. Consider the case of a simple phone

    segmentation of an utterance, which comprises a contiguous list of named segments

    each with timing information. If the phones are properly contiguous, only one timing

    value for each phone is needed to fully represent all the timing information of the

    segmentation. If the end point is used, the start time of an item will be equal to the end

    point of the previous item and the duration will be equal to the start subtracted from

    the end. However, it is often useful to have access to the start times and duration of

    the phone also. Without the use of function features, there are two choices. Either the

    end information alone can be stored and calculations can be done on the fly to

    compute start times and durations, or else the start and durations can be written in as

    additional features to the item. The first solution is unattractive as this can make

    algorithm writing unwieldy and overly complicated. While the second solution may

    make algorithm writing easier, it involves effectively copying information, which

    can lead to out of date information being present. Function features provide a neat

    solution to this problem.
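    As an illustration, timing information in the Segment relation can be read through
    features and feature functions from the Scheme interpreter. The sketch below assumes
    the pathname conventions of the Festival manual (e.g. "p." for the previous item) and
    the standard segment_duration feature function; the exact set of feature functions
    available depends on the version.

      ;; For each segment: phone name, stored end time, previous segment's end
      ;; (i.e. this segment's start), and a duration feature function.
      (set! utt (SayText "hello"))
      (mapcar
       (lambda (seg)
         (list (item.name seg)
               (item.feat seg "end")
               (item.feat seg "p.end")
               (item.feat seg "segment_duration")))
       (utt.relation.items utt 'Segment))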

    Although the function feature implementation is very

    efficient in Festival, it can still sometimes be expensive to constantly evaluate these

    functions in the middle of a time critical loop. A global evaluation facility is therefore

    provided which can evaluate all the feature functions in items in a relation and re-

    write the features with their evaluation. After the loop, these can be discarded and the

    feature functions used again, thus ensuring little chance of out of date information

    being present.

    Utterances

    Utterance structures are collections of relations. The best way to visualize an

    utterance is to think of it as containing an unordered set of items, each of which is made

    up from a set of features. Relations are then structures comprising nodes, each of

    which indexes into an item. An utterance structure simply collects these together to

    form a single object.
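    In practice an utterance can be built, synthesized and inspected from the Scheme
    interpreter. A short sketch (the relations listed by utt.relationnames depend on which
    synthesis modules have run):

      (set! utt1 (Utterance Text "The quick brown fox."))
      (utt.synth utt1)                 ; run the full text-to-speech pipeline on it
      (utt.relationnames utt1)         ; e.g. (Token Word Phrase Syllable Segment ...)
      (utt.play utt1)                  ; play the generated waveform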


    Text Analysis

    Text analysis is the task of identifying the words in the text. By words, we mean
    tokens for which there is a well-defined method of finding their pronunciation, i.e.

    from a lexicon, or using letter-to-sound rules. The first task in text analysis is to make

    chunks out of the input text -- tokenizing it. In Festival, at this stage, we also chunk

    the text into more reasonably sized utterances. An utterance structure is used to hold

    the information for what might most simply be described as a sentence. We use the

    term loosely, as it need not be anything syntactic in the traditional linguistic sense,

    though it most often has prosodic boundaries or edge effects. Separating a text into

    utterances is important, as it allows synthesis to work bit by bit, allowing the
    waveform of the first utterance to be available more quickly than if the whole file

    was processed as one. Otherwise, one would simply play an entire recorded utterance

    -- which is not nearly as flexible and in some domains is even impossible.

    Utterance chunking is an externally specifiable

    part of Festival, as it may vary from language to language. For many languages,

    tokens are white-space separated and utterances can, to a first approximation, be

    separated after full stops (periods), question marks, or exclamation points. Further

    complications, such as abbreviations, other end punctuation (such as the upside-down

    question mark in Spanish), blank lines and so on, make the definition harder. For

    languages such as Japanese and Chinese, where white space is not normally used to

    separate what we would term words, a different strategy must be used, though both

    these languages still use punctuation that can be used to identify utterance boundaries,

    and word segmentation can be a second process.

    Apart from chunking, text analysis also does text normalization.

    There are many tokens which appear in text that do not have a direct relationship to

    their pronunciation. Numbers are perhaps the most obvious example. Consider the

    following sentence

    On May 5 1996, the university bought 1996 computers.

    In English, tokens consisting solely of digits have a number of different forms of

    pronunciation. The 5 above is pronounced "fifth", an ordinal, because it is the day in a


    month. The first 1996 is pronounced "nineteen ninety six" because it is a year, and

    the second 1996 is pronounced "one thousand nine hundred and ninety six"

    (British English) as it is a quantity. Two problems turn up here: the non-trivial

    relationship of tokens to words, and homographs, where the same token may have

    alternate pronunciations in different contexts. In Festival, homograph disambiguation

    is considered as part of text analysis. In addition to numbers, there are many other

    symbols which have internal structure that require special processing -- such as

    money, times, addresses, etc. All of these can be dealt with in Festival by what is

    termed token-to-word rules. These are language-specific (and sometimes text-mode-

    specific).
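    In Festival these token-to-word rules are ordinary Scheme functions that map a token
    into a list of words. The sketch below follows the style of the examples in the Festival
    manual; the hook name, its argument list and the builtin_english_token_to_words
    fall-through are assumptions that may differ between versions.

      ;; Sketch of a token-to-word rule: expand "$N" tokens into "N dollars",
      ;; and fall back to the default English behaviour for everything else.
      (define (my_token_to_words token name)
        (cond
         ((string-matches name "\\$[0-9]+")
          (append
           (builtin_english_token_to_words token (string-after name "$"))
           (list "dollars")))
         (t
          (builtin_english_token_to_words token name))))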

    Linguistic/Prosodic Analysis

    In this stage the pronunciations of the words are found and prosodic structure

    (phrasing, intonation and durations) is assigned to them. It is possible to

    find the pronunciation of a word either by a lexicon (a large list of words and their

    pronunciations) or by some method of letter to sound rules.

    Lexicons

    After we have a set of words to be spoken, we have to decide what the sounds should

    be -- what phonemes, or basic speech sounds, are spoken. Each language and dialect

    has a phoneme set associated with it, and the choice of this inventory is still not

    agreed upon; different theories posit different feature geometries. Given a set of units,

    we can, once again, train models from them, but it is up to linguistics (and practice) to

    help us find good levels of structure and the units at each.

    Lexicons and addenda

    The basic assumption in Festival is that you will have a large lexicon, tens of

    thousands of entries, that is used as a standard part of an implementation of a voice.

    Letter-to-sound rules are used as a backup when a word is not explicitly listed. This

    view is based on how English is best dealt with. However, this is a very flexible view;


    an explicit lexicon isn't necessary in Festival, and it may be possible to do much of

    the work in letter-to-sound rules. This is how we have implemented Spanish.

    However, even when there is a strong relationship between the letters in a word and

    their pronunciation, we still find a lexicon useful. For Spanish we still use the lexicon

    for symbols such as "$" and "%", individual letters, as well as irregular pronunciations. In

    addition to a large lexicon, Festival also supports a smaller list called the addenda; this is

    primarily provided to allow specific applications and users to add entries that aren't in

    the existing lexicon.
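    Lexicon and addenda entries can be manipulated from Scheme. The sketch below shows the
    entry format (word, part of speech, and syllables marked for stress) in the style of the
    Festival lexicon documentation; the word and the phone names are only examples and
    assume an American English phone set.

      ;; Add an entry to the current lexicon's addenda, then look it up.
      (lex.add.entry
       '("cepstra" n (((k eh p) 1) ((s t r ah) 0))))
      (lex.lookup "cepstra" nil)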

    Letter to sound rules

    These are context-sensitive rules. They can be generated by hand or automatically. In

    Festival there is a letter-to-sound rule system that allows rules to be written, but we

    also provide a method for building rule sets automatically which will often be more

    useful. The choice of using hand-written or automatically trained rules depends on the

    language you are dealing with and the relationship it has between its orthography and

    its phone set.
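    For the hand-written case, rule sets are defined in Scheme as named letter sets plus
    context rules. The sketch below only illustrates the general shape; the rule syntax shown
    and the rules themselves are illustrative assumptions rather than a usable English rule
    set (see the Festival manual for the exact format).

      ;; Illustrative letter-to-sound rule set (not linguistically complete).
      (lts.ruleset
       my_lts
       ;; letter sets that can be used as contexts
       ( (V a e i o u) )
       ;; rules: left context [ letters ] right context = phones
       (
        ( # [ p h ] = f )      ; "ph" at the start of a word -> /f/
        ( [ c ] V = s )        ; "c" before a vowel -> /s/ (illustration only)
        ( [ c ] = k )          ; any other "c" -> /k/
       ))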

    Post-lexical rules

    In fluent speech word boundaries are often degraded in a way that causes co-

    articulation across boundaries. A lexical entry should normally provide
    pronunciations as if the word were being spoken in isolation. It is only once the word has

    been inserted into the context in which it is going to be spoken that co-articulatory

    effects can be applied. Post-lexical rules are a general set of rules which can modify the

    segment relation (or any other part of the utterance for that matter), after the basic

    pronunciations have been found. In Festival post-lexical rules are defined as functions

    which will be applied to the utterance after intonational accents have been assigned.
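    Such a rule is simply a Scheme function over the utterance. The sketch below assumes the
    feature pathname conventions and the item.set_feat call described in the Festival manual;
    the rule itself (reducing the vowel of "the" in running speech) is only illustrative, and
    how the function is registered depends on the voice.

      ;; Post-lexical rule sketch: reduce "iy" to "ax" in the word "the".
      (define (reduce_the utt)
        (mapcar
         (lambda (seg)
           (if (and (string-equal (item.name seg) "iy")
                    (string-equal
                     (item.feat seg "R:SylStructure.parent.parent.name") "the"))
               (item.set_feat seg "name" "ax")))
         (utt.relation.items utt 'Segment))
        utt)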


    Waveform Generation

    Traditionally we can identify three different methods that are used for creating
    waveforms from such descriptions.

    Articulatory synthesis is a method where the human vocal tract, tongue, lips, etc are

    modeled in a computer. This method can produce very natural sounding samples of

    speech but at present it requires too much computational power to be used in practical

    systems.

    Formant synthesis is a technique where the speech signal is broken down into

    overlapping parts, such as formants (hence the name), voicing, aspiration, nasality etc.
    The output of individual generators is then composed from streams of parameters

    describing each sub-part of the speech. With carefully tuned parameters the

    quality

    of the speech can be close to that of natural speech, showing that this encoding is

    sufficient to represent human speech. However automatic prediction of these

    parameters is harder and the typical quality of synthesizers based on this technology

    can be very understandable but still has a distinct non-human or robotic quality.

    Formant synthesizers are best typified by MITalk, the work of Dennis Klatt, and what was

    later commercialized as DECTalk. DECTalk is at present probably the most familiar

    form of speech synthesis to the world. Dr Stephen Hawking, Lucasian Professor of

    Mathematics at Cambridge University, lost the use of his voice and uses a formant

    synthesizer for his speech.

    Concatenative synthesis is the third form which we will spend the most time on. Here

    waveforms are created by concatenating parts of natural speech recorded from humans.
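    At run time, using a concatenative (diphone) voice looks the same as using any other
    Festival voice: select it and synthesize. A brief sketch with standard Festival commands
    (the voice function is only present if that diphone voice is installed):

      (voice_kal_diphone)                          ; an American English diphone voice
      (set! utt (utt.synth (Utterance Text "Concatenative synthesis example.")))
      (utt.save.wave utt "example.wav" 'riff)      ; write the generated waveform to a file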


    Diphone Synthesis

    A diphone is the unit spanning an adjacent pair of phones, from the middle of one
    phone to the middle of the next, so that it captures the transition between them.

    Here the main tasks involved are: combining phones into a diphone list, recording the
    diphones, and labeling the diphones.

    Recording the diphones

    The object of recording diphones is to get as uniform a set of pronunciations as

    possible. Your speaker should be relaxed, and not be suffering from a cold, a cough, or a

    hangover. If something goes wrong with the recording and some of the examples need

    to be re-recorded, it is important that the speaker's voice is as similar as possible to the

    original recording; waiting for another cold to come along is not reasonable (though

    some may argue that the same hangover can easily be induced). Also, to try to keep

    the voice potentially repeatable it is wise to record at the same time of day, morning is

    a good idea. The points on speaker selection and recording in the previous section

    should also be borne in mind.

    Labeling the diphones

    Labeling nonsense words is much easier than labeling continuous speech, whether it is

    by hand or automatically. With nonsense words, it is completely defined which

    phones are present, and they are (hopefully) clearly articulated.


    CONCLUSION

    Festival offers a general framework for building speech synthesis systems as well as

    including examples of various modules. As a whole it offers full text to speech

    through a number of APIs: from the shell level, through a Scheme command interpreter, as a

    C++ library, from Java, and an Emacs interface. Festival is multi-lingual (currently

    English (British and American), and Spanish) though English is the most advanced.

    Festival is constantly being improved and added to as part of our continuing research in

    speech synthesis at CSTR.

    REFERENCES

    http://www.cstr.ed.ac.uk/projects/festival/

    http://www.speech.cs.cmu.edu/festival/manual-1.4.1/festival_1.html