Festival Speech Synthesis System
ABSTRACT
The Festival Speech Synthesis System was developed at the Centre for Speech
Technology Research at the University of Edinburgh in the late 1990s. It offers a free,
portable, language-independent, run-time speech synthesis engine for various
platforms under various APIs. Festival is just the engine that we will use in the process
of building voices, both as a run-time engine for the voices we build and as a tool in
the building process itself.
Festival offers a general framework for building speech synthesis systems as well as
including examples of various modules. As a whole it offers full text-to-speech
through a number of APIs: from shell level, through a Scheme command interpreter, as a
C++ library, from Java, and from an Emacs interface. Festival is multi-lingual
(currently English, both British and American, and Spanish), though English is the
most advanced. Other groups release new languages for the system.
CONTENTS
1. INTRODUCTION
2. TEXT-TO-SPEECH SYSTEMS
3. GENERAL ANATOMY OF A SYNTHESIZER
4. CORE SYSTEM
5. ARCHITECTURE
   5.1 String processing
   5.2 Multi-level data structures
   5.3 Relations
   5.4 Items & Features
   5.5 Utterances
6. TEXT ANALYSIS
7. LINGUISTIC/PROSODIC ANALYSIS
   7.1 Lexicons
   7.2 Lexicons and addenda
   7.3 Letter-to-sound rules
   7.4 Post-lexical rules
8. WAVEFORM GENERATION
9. DIPHONE SYNTHESIS
   9.1 Recording diphones
   9.2 Labeling diphones
10. CONCLUSION
INTRODUCTION
The idea that a machine could generate speech has been with us for some time, but
the realization of such machines has only really been practical within the last 50
years. Even more recently, it is in the last 20 years or so that we have seen practical
examples of text-to-speech systems that can say any text they are given, though the
result might be "wrong."
The creation of synthetic speech covers a whole range of processes, and though they
are often all lumped under the general term text-to-speech, a good deal of work has
gone into generating speech from sequences of speech sounds; this would be
speech-sound (phoneme) to audio waveform synthesis, rather than going all the way
from text to phonemes (speech sounds), and then to sound.
One of the first practical applications of speech synthesis was in 1936, when the U.K.
Telephone Company introduced a speaking clock. It used optical storage for the
phrases, words, and part-words, which were appropriately concatenated to form
complete sentences.
While speech and language were already important parts of daily life before the
invention of the computer, the equipment and technology that have developed over the
last several years have made it possible to have machines that speak, read, or even
carry out dialogs. A number of vendors provide both recognition and speech
technology, and there are several telephone-based systems that do interesting things.
The Festival Speech Synthesis System is a free, portable, language-independent,
run-time speech synthesis engine for various platforms under various APIs.
TEXT-TO-SPEECH SYSTEMS
Text-to-speech systems have an enormous range of applications.
Their first real use was in reading systems for the blind, where a system would read
some text from a book and convert it into speech. These early systems of course
sounded very mechanical, but their adoption by blind people was hardly surprising as
the other options of reading Braille or having a real person do the reading were often
not possible. Today, quite sophisticated systems exist that facilitate human-computer
interaction for the blind, in which TTS can help the user navigate around a
windowing system.
The mainstream adoption of TTS has been severely limited by its quality. Apart from
users who have little choice (as is the case with blind people), people's reaction to
old-style TTS is not particularly positive. While people may be somewhat impressed and
quite happy to listen to a few sentences, in general the novelty soon wears off.
In recent years, considerable advances in quality have changed the situation such
that TTS systems are more common in a number of applications. Probably the main
use of TTS today is in call-centre automation, where a user calls to pay an electricity
bill or book some travel and conducts the entire transaction through an automatic
dialogue system. Beyond this, TTS systems have been used for reading news stories,
weather reports, travel directions, and a wide variety of other applications.
Beyond the engineering aspects of text-to-speech, it is worth noting that research
in this field has contributed an enormous amount to our general understanding of
language. Often this has been in the form of negative evidence, meaning that when
a theory thought to be true was implemented in a TTS system, it was shown to be
false; in fact, as we shall see, many linguistic theories have fallen when rigorously
tested in speech systems. More positively, TTS systems have made good testing
grounds for many models and theories, and TTS systems are certainly interesting in
their own terms, without reference to application or use.
General Anatomy of a Synthesizer
Within Festival we can identify three basic parts of the TTS process:
Text analysis:
From raw text to identified words and basic utterances.
Linguistic analysis:
Finding pronunciations of the words and assigning prosodic structure to them: phrasing, intonation, and durations.
Waveform generation:
From a fully specified form (pronunciation and prosody) generate a waveform.
These partitions are not absolute, but they are a good way of chunking the problem.
Of course, different waveform generation techniques may need different types of
information. Pronunciation may not always use standard phones, and intonation need
not necessarily mean an F0 contour. For the most part, at least along the path which is
likely to generate a working voice (rather than the more research-oriented techniques),
the above three sections will be fairly cleanly adhered to.
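The three-stage division above can be sketched as a simple pipeline. This is an illustrative toy, not Festival's API: the function names are our own, letters stand in for phones, and a phone count stands in for a real waveform.

```python
# Toy sketch of the three-stage TTS pipeline; all names are hypothetical.
def text_analysis(raw_text):
    """From raw text to identified words (here: naive tokenization)."""
    return raw_text.replace(".", "").split()

def linguistic_analysis(words):
    """Assign a placeholder pronunciation: letters stand in for phones."""
    return [(w, list(w.lower())) for w in words]

def waveform_generation(pronunciations):
    """Stand-in for synthesis: return a phone count instead of samples."""
    return sum(len(phones) for _, phones in pronunciations)

words = text_analysis("Hello world.")
samples = waveform_generation(linguistic_analysis(words))
```

In the real system each stage reads and enriches a shared utterance structure rather than passing lists, as described in the architecture section below.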
There is another part to TTS which is normally not mentioned; we mention it here
because it is the most important aspect of Festival that makes building new voices
possible: the system architecture. Festival provides a basic utterance structure, a
language to manipulate it, and methods for construction and deletion; it also interacts
with your audio system in an efficient way, spooling audio files while the rest of the
synthesis process continues. With the Edinburgh Speech Tools, it offers basic
analysis tools (pitch trackers, classification tree builders, waveform I/O, etc.) and a
simple but powerful scripting language. All of these functions let you get on with the
task of building a voice, rather than worrying too much about the underlying
software.
Core system
In Festival we make a firm distinction between the core system and the modules
which actually perform speech synthesis tasks. The core system, which includes the
architecture, is written completely in C++ and doesn't change. Modules, on the other
hand, can be written in C++ or Scheme and can be added to or taken away from the
system with minimal disruption (adding or taking away a C++ module requires a
re-linking, of course, but not a major recompilation). The decision of which language to
write a module in is largely a matter of choice. Because of the interpreter aspect, it is
usually easiest to develop modules in Scheme and, for efficiency, perhaps re-write them
in C++ after they are stable. Some types of programming (e.g. arrays in C++,
recursion in Scheme) are more natural in one language than the other. Finally,
depending on personal experience, some programmers simply prefer one to the other.
Architecture
Speech synthesis systems require ways of storing the various types of
linguistic information produced in the process of converting the input format
(e.g. text) into speech. There are a number of design considerations which
builders of synthesis architectures must take into account. We have listed
these as follows:
Simple linguistic objects such as words, phones, syllables and phrases need to
be represented.
A co-indexing mechanism is needed whereby, given a word, one can find the
phones that comprise this word.
Changes should be localized. For example, a change in the syllabication
algorithm should only require changing those parts of the program directly
involved with syllabication - other modules should not be affected.
There should be no redundancy or duplication of information in the system.
For instance, it is common to want to know the start and end times of
linguistic entities. The end time of a word will be the same as the end time of
the last phone in that word. If this information is stored separately for phone
and word, problems will occur if one value is changed as the other will then
become out of date.
By their very nature, multi-lingual systems must support a wide variety of
linguistic theories. Hence the architecture should not be tied to any particular
linguistic theory or formalism.
The architecture should be fast and efficient, as the synthesizer is intended for
real-time use.
Most importantly, the real purpose of the architecture is to allow
speech synthesis algorithms to be written as easily as possible. It is therefore
important that the architecture should be unobtrusive and provide the sorts of
structures and information that synthesis algorithms need. All programs must deal
with infrastructure issues such as data storage, file I/O, memory allocation, etc. If left
unchecked, algorithms can easily become bogged down with this sort of code, which
becomes intertwined with the actual algorithm itself. A good architecture will abstract
the infrastructure to such an extent that these aspects are hidden, so that synthesis
algorithms can be easily written and read without other issues getting in the way. The
interface should be easy to use, and make the writing of synthesis algorithms easier
and quicker than purely ad-hoc methods.
String processing
Many early synthesis systems used what has been referred to as a string re-writing
mechanism as their central data structure. In this formalism, the linguistic
representation of an utterance is stored as a string. Initially, the string contains text,
which is then re-written or embellished with extra symbols as processing takes place.
Systems such as MITalk [1] and the CSTR Alvey synthesizer [5] used this method.
There are many shortcomings in this formalism which have been previously
recognized. The main problem with the string formalism is that it soon becomes
unwieldy for anything but the most trivial of tasks. Often the string becomes very
complex, with words, phrase symbols, stress symbols, phones, etc. all mixed in
together. There are two main ways in which modules process such strings. Modules
can work on the string directly, but the interpretation of the symbols often gets in the
way of the algorithm itself. Alternatively, a module can parse the string into an
internal format and let the algorithms use that. Although this may simplify the writing
of the algorithms themselves, it is a very unwieldy approach as it means the string
has to be parsed every time a module is called. Moreover, this often leads to each
module having an individual internal data structure, which is unattractive from a
programming point of view, as new structures and techniques have to be learnt to
understand the workings of any new module.
To lessen these problems, information is often deleted from the string so as to keep
only what is perceived as essential information. For instance, after the grapheme to
phoneme conversion, the orthographic form of the word may be deleted from the
string. This can have unfortunate consequences in that information which might
potentially be of use to a module may have been deleted by an earlier module.
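The string re-writing style described above can be caricatured in a few lines. The bracket notation and phone symbols here are invented for illustration and are not taken from MITalk or the Alvey synthesizer:

```python
# One flat string is progressively embellished by each processing pass.
utt = "hello world"                                # input text
utt = "{hello} {world}"                            # pass 1: mark word boundaries
utt = "{hello: hh ax l ow1} {world: w er1 l d}"    # pass 2: add phones + stress

# Every later module must re-parse the string to recover what it needs:
phones = [seg.split(":")[1].strip(" }") for seg in utt.split("{")[1:]]
```

Even this tiny example shows the problem: the parsing code is entangled with the symbol conventions, so any change to the notation breaks every module that reads the string.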
Multi-level data structures
In recognition of these shortcomings, many current systems have abandoned
string-based processing and now use multi-level data structures (MLDS). The most famous
of these systems, known as Delta, was developed by Hertz [6], but many other
systems, such as Chatr [3], the Bell Labs system [7], Polyglot [4], and early versions of
Festival [2], use similar formalisms. In the multi-layered formalism, different types of
linguistic information are held in separate streams, which are linear lists or arrays of
linguistic items. For example, we may have a word stream, a phone stream and a
syllable stream. Some systems have a fixed set of these, while others allow an
arbitrary number. Often algorithms need to know what phones are related to a given
word, and hence the streams must be co-indexed. There are two main types of
co-indexing. In Delta, streams are aligned by the edges of items. To find the phones in a
word, one goes to the beginning of the word, traces the edge down to the phone
stream, and then progresses along the phone stream until the edge relating to the end
of the word is found. An alternative strategy is to align by the centers of items. In this
case, a word contains a set of links to the phones that are related to it, and the phones
in a word can be found by following these links. While multi-level data structures are
far preferable to string structures, they still have serious drawbacks. The main
drawback stems from the fact that this forces all information to be represented by
linear structures: other types of structure, specifically trees, are very hard to represent.
Partly because of this, the number of streams in a system can become considerable,
which leads to difficulties in co-indexing the items in streams. With the Delta
co-indexing, it is often the case that for a given item in one stream, there is no
corresponding item in other streams. For example, a pause is represented by an item
in the phone stream, but there is no equivalent item in the syllable or word streams.
Thus a hole must be created in these streams to ensure the co-indexing works.
With a large number of streams, the number of holes can become considerable, which
makes processing awkward. In the centre-linking paradigm, the hole problem is absent,
but because each item must be explicitly linked to items in other streams, the number
of links can become very large. Furthermore, if a new stream is added, one would
have to link every existing item to the items in the new stream to ensure full
connectivity. As this isn't possible in practice, the streams are often left partially
connected. This can cause confusion in writing modules, as one may be unsure of which
streams are linked to which. Another, more subtle problem occurs in multi-level
structures due to a lack of clarity as to what an item really represents. Often (though not
always) each item has a single value, typically a name. While it is obvious that a
suitable phone name could be something like /h/ or /e/ and a word "hello", what should
the name of a syllable be? Commonly, syllables are regarded as organisational units
which serve to group together phones. As such, they don't have a distinct name. One
could group together the names of the phones and make that the syllable name (e.g.
/h e l/ for the first syllable of "hello"), but this is redundant, somewhat artificial and
likely to cause errors if the phone representation is changed.
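The centre-linking variant of an MLDS can be sketched as follows; the stream contents, field names, and phone symbols are illustrative only:

```python
# Separate streams for phones and words; words are centre-linked to
# phones by storing explicit indices into the phone stream.
phones = ["hh", "ax", "l", "ow", "w", "er", "l", "d"]
words = [
    {"name": "hello", "phone_ids": [0, 1, 2, 3]},
    {"name": "world", "phone_ids": [4, 5, 6, 7]},
]

def phones_of(word):
    """Find a word's phones by following its explicit links."""
    return [phones[i] for i in word["phone_ids"]]
```

Adding a new stream (say, syllables) would require linking every existing item to it, which is exactly the connectivity problem noted above.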
Relations
We use the term relation to represent our generalization of the stream concept. A
relation is a data structure which is used to organize linguistic items into linguistic
structures such as trees and lists. A relation is a set of named links connecting a set of
nodes. Lists (streams in the MLDS) are linear relations where each node has a
previous and next link. Nodes in trees also have up and down links. Nodes are purely
positional units, and contain no information apart from their links. In addition to
having links to other nodes, each node has a single link to an item, which contains the
linguistic information.
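A minimal sketch of the node/item separation, with dictionaries standing in for Festival's C++ classes (the field names are our own, not Festival's API):

```python
# Items hold linguistic content; nodes hold only links.
items = [{"name": "hello"}, {"name": "world"}]
nodes = [
    {"item": 0, "prev": None, "next": 1},  # a linear relation: a list
    {"item": 1, "prev": 0, "next": None},
]

def walk(nodes, items):
    """Traverse a linear relation from its first node."""
    order, i = [], 0
    while i is not None:
        order.append(items[nodes[i]["item"]]["name"])
        i = nodes[i]["next"]
    return order
```

A tree relation would simply give nodes additional up and down links; the items are untouched, which is why one item can sit in several relations at once.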
(Figure: Utterances, Relations and Items)
Items & Features
Items consist of a bundle of features and a set of named links to nodes in relations.
Items can be linked into any number of relations: in the figure above, words were
linked into two relations, but in principle any number of relations is possible.
The information in items is represented by features, which are stored as a
list of key-value pairs. Feature values are commonly numbers or strings, but may also
take complex objects as values if necessary. An item can have arbitrarily many
features, including zero: syllable items often have no features at all, for example. As
well as simple values, features can also take functions as values. Function features are
a powerful facility which can greatly reduce the amount of redundant
information in an utterance structure. Consider the case of a simple phone
segmentation of an utterance, which comprises a contiguous list of named segments,
each with timing information. If the phones are properly contiguous, only one timing
value for each phone is needed to fully represent all the timing information of the
segmentation. If the end point is used, the start time of an item will be equal to the end
point of the previous item, and the duration will be equal to the start subtracted from
the end. However, it is often useful to have access to the start time and duration of
the phone as well. Without the use of function features, there are two choices. Either the
end information alone can be stored and calculations done on the fly to
compute start times and durations, or the start and duration can be written in as
additional features of the item. The first solution is unattractive as it can make
algorithm writing unwieldy and overly complicated. While the second solution may
make algorithm writing easier, it involves effectively copying information, which
can lead to out-of-date information being present. Function features provide a neat
solution to this problem.
Although the function feature implementation is very
efficient in Festival, it can still sometimes be expensive to constantly evaluate these
functions in the middle of a time-critical loop. A global evaluation facility is therefore
provided which can evaluate all the feature functions of the items in a relation and
re-write the features with their evaluated values. After the loop, these can be discarded
and the feature functions used again, thus ensuring little chance of out-of-date
information being present.
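The end-time example can be made concrete. In this sketch, plain Python functions stand in for Festival's function features: only end times are stored, and start and duration are always derived, so they cannot go stale. The segment names and times are invented.

```python
# Only end times are stored; start/duration are computed on demand.
segments = [
    {"name": "hh", "end": 0.05},
    {"name": "ax", "end": 0.12},
    {"name": "l",  "end": 0.20},
]

def start(i):
    """Start of segment i = end of the previous segment (0.0 for the first)."""
    return 0.0 if i == 0 else segments[i - 1]["end"]

def duration(i):
    """Duration = end minus derived start; always consistent with 'end'."""
    return segments[i]["end"] - start(i)
```

If a module later changes an end time, every derived start and duration is automatically correct, which is exactly the redundancy problem function features are meant to solve.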
Utterances
Utterance structures are collections of relations. The best way to visualize an
utterance is to think of it as containing an unordered set of items, each of which is made
up of a set of features. Relations are then structures comprising nodes, each of
which indexes into an item. An utterance structure simply collects these together to
form a single object.
Text Analysis
Text analysis is the task of identifying the words in the text. By words, we mean
tokens for which there is a well-defined method of finding their pronunciation, i.e.
from a lexicon or using letter-to-sound rules. The first task in text analysis is to make
chunks out of the input text: tokenizing it. In Festival, at this stage, we also chunk
the text into more reasonably sized utterances. An utterance structure is used to hold
the information for what might most simply be described as a sentence. We use the
term loosely, as it need not be anything syntactic in the traditional linguistic sense,
though it most often has prosodic boundaries or edge effects. Separating a text into
utterances is important, as it allows synthesis to work bit by bit, allowing the
waveform of the first utterance to be available more quickly than if the whole file
were processed as one. Otherwise, one would simply play an entire recorded utterance,
which is not nearly as flexible and in some domains is even impossible.
Utterance chunking is an externally specifiable
part of Festival, as it may vary from language to language. For many languages,
tokens are white-space separated, and utterances can, to a first approximation, be
separated after full stops (periods), question marks, or exclamation points. Further
complications, such as abbreviations, other end punctuation (such as the upside-down
question mark in Spanish), blank lines and so on, make the definition harder. For
languages such as Japanese and Chinese, where white space is not normally used to
separate what we would term words, a different strategy must be used, though both
these languages still use punctuation that can be used to identify utterance boundaries,
and word segmentation can be a second process.
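A first-approximation chunker for white-space languages, as described above, fits in a few lines. The regular expression is our own, not Festival's, and deliberately ignores abbreviations and Spanish inverted punctuation:

```python
import re

def chunk_utterances(text):
    """Split after ., ? or ! followed by whitespace; a first approximation."""
    return [u.strip() for u in re.split(r"(?<=[.?!])\s+", text) if u.strip()]

utts = chunk_utterances("Hello there. How are you? Fine!")
```

The abbreviation complication shows up immediately: this naive rule wrongly splits "Dr. Smith left." after "Dr.", which is why real chunking needs the extra rules mentioned above.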
Apart from chunking, text analysis also does text normalization.
There are many tokens which appear in text that do not have a direct relationship to
their pronunciation. Numbers are perhaps the most obvious example. Consider the
following sentence:

On May 5 1996, the university bought 1996 computers.

In English, tokens consisting solely of digits have a number of different forms of
pronunciation. The 5 above is pronounced "fifth", an ordinal, because it is the day in a
month. The first 1996 is pronounced "nineteen ninety six" because it is a year, and
the second 1996 is pronounced "one thousand nine hundred and ninety six"
(British English) as it is a quantity. Two problems turn up here: the non-trivial
relationship of tokens to words, and homographs, where the same token may have
alternate pronunciations in different contexts. In Festival, homograph disambiguation
is considered part of text analysis. In addition to numbers, there are many other
symbols which have internal structure and require special processing, such as
money, times, addresses, etc. All of these can be dealt with in Festival by what are
termed token-to-word rules. These are language-specific (and sometimes text-mode
specific).
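A toy token-to-word rule for the sentence above, with the contexts supplied by hand. A real system must of course predict the context; the rule table is illustrative, not Festival's rule format:

```python
# The same token expands to different word sequences depending on context.
def expand(token, context):
    if token == "5" and context == "day-of-month":
        return ["fifth"]                                  # ordinal reading
    if token == "1996" and context == "year":
        return ["nineteen", "ninety", "six"]              # year reading
    if token == "1996" and context == "quantity":         # British English
        return ["one", "thousand", "nine", "hundred", "and", "ninety", "six"]
    return [token]  # default: pass the token through unchanged
```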
Linguistic/Prosodic Analysis
In this stage, finding the pronunciations of the words and assigning prosodic structure
to them (phrasing, intonation and durations) actually takes place. It is possible to
find the pronunciation of a word either from a lexicon (a large list of words and their
pronunciations) or by some method of letter-to-sound rules.
Lexicons
After we have a set of words to be spoken, we have to decide what the sounds should
be -- what phonemes, or basic speech sounds, are spoken. Each language and dialect
has a phoneme set associated with it, and the choice of this inventory is still not
agreed upon; different theories posit different feature geometries. Given a set of units,
we can, once again, train models from them, but it is up to linguistics (and practice) to
help us find good levels of structure and the units at each.
Lexicons and addenda
The basic assumption in Festival is that you will have a large lexicon, of tens of
thousands of entries, that is used as a standard part of the implementation of a voice.
Letter-to-sound rules are used as a backup when a word is not explicitly listed. This
view is based on how English is best dealt with. However, this is a very flexible view:
an explicit lexicon isn't necessary in Festival, and it may be possible to do much of
the work in letter-to-sound rules. This is how we have implemented Spanish.
However, even when there is a strong relationship between the letters in a word and
their pronunciation, we still find a lexicon useful. For Spanish, we still use the lexicon
for symbols such as "$" and "%", individual letters, and irregular pronunciations. In
addition to a large lexicon, Festival also supports a smaller list called the addenda;
this is primarily provided to allow specific applications and users to add entries that
aren't in the existing lexicon.
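The lookup order implied above can be sketched like this. The entries and phone symbols are invented; only the order (addenda first, then the main lexicon, then letter-to-sound rules as backup) reflects the text:

```python
lexicon = {"hello": ["hh", "ax", "l", "ow"]}                      # large standard lexicon
addenda = {"cstr": ["s", "iy", "eh", "s", "t", "iy", "aa", "r"]}  # user/app additions

def pronounce(word, lts_rules):
    """Look up a pronunciation: addenda, then lexicon, then LTS backup."""
    w = word.lower()
    if w in addenda:
        return addenda[w]
    if w in lexicon:
        return lexicon[w]
    return lts_rules(w)  # backup: letter-to-sound rules
```

For example, `pronounce("Hello", list)` hits the lexicon, while an unknown word falls through to whatever letter-to-sound function is supplied.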
Letter to sound rules
These are context-sensitive rules, which can be generated by hand or automatically. In
Festival there is a letter-to-sound rule system that allows rules to be written by hand,
but we also provide a method for building rule sets automatically, which will often be
more useful. The choice between hand-written and automatically trained rules depends
on the language you are dealing with and the relationship between its orthography and
its phone set.
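A single hand-written context-sensitive rule might look like this. The rule ("c" sounds as /s/ before e, i or y, otherwise /k/) is a standard textbook example; it is not written in Festival's actual rule format:

```python
def to_phones(word):
    """Apply one toy context-sensitive rule; other letters pass through."""
    out = []
    for i, ch in enumerate(word):
        if ch == "c":
            nxt = word[i + 1] if i + 1 < len(word) else ""
            out.append("s" if nxt in ("e", "i", "y") else "k")  # context decides
        else:
            out.append(ch)  # identity for letters without a rule
    return out
```

A real rule set contains hundreds of such rules, with contexts on both sides of the letter, which is why automatic training is often preferable.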
Post-lexical rules
In fluent speech, word boundaries are often degraded in a way that causes
co-articulation across boundaries. A lexical entry should normally provide
pronunciations as if the word were being spoken in isolation. Only once the word has
been inserted into the context in which it is going to be spoken can co-articulatory
effects be applied. Post-lexical rules are a general set of rules which can modify the
segment relation (or any other part of the utterance, for that matter) after the basic
pronunciations have been found. In Festival, post-lexical rules are defined as functions
which are applied to the utterance after intonational accents have been assigned.
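As a sketch of what such a rule might do: the rule below (deleting a word-final /t/ before a consonant) is a rough stand-in for real co-articulation modelling, and the phone symbols are illustrative, not Festival's.

```python
VOWELS = {"a", "e", "i", "o", "u"}

def post_lexical(words):
    """Apply one toy rule across word boundaries; words = phone lists."""
    out = []
    for i, phones in enumerate(words):
        phones = list(phones)
        nxt = words[i + 1] if i + 1 < len(words) else None
        if phones and phones[-1] == "t" and nxt and nxt[0] not in VOWELS:
            phones.pop()  # word-final /t/ dropped in fluent speech
        out.append(phones)
    return out
```

Crucially, the rule can only fire once the words sit next to each other in the utterance, which is why it must run after lexical lookup.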
Waveform Generation
Traditionally we can identify three different methods that are used for creating
waveforms from such descriptions.

Articulatory synthesis is a method where the human vocal tract, tongue, lips, etc.
are modeled in a computer. This method can produce very natural-sounding samples of
speech, but at present it requires too much computational power to be used in
practical systems.
Formant synthesis is a technique where the speech signal is broken down into
overlapping parts, such as formants (hence the name), voicing, aspiration, nasality,
etc. The output of individual generators is then composed from streams of parameters
describing each sub-part of the speech. With carefully tuned parameters, the quality
of the speech can be close to that of natural speech, showing that this encoding is
sufficient to represent human speech. However, automatic prediction of these
parameters is harder, and the typical output of synthesizers based on this technology
can be very understandable but still have a distinctly non-human or robotic quality.
Formant synthesizers are best typified by MITalk, the work of Dennis Klatt, and what
was later commercialized as DECtalk. DECtalk is at present probably the most familiar
form of speech synthesis in the world. Dr Stephen Hawking, Lucasian Professor of
Mathematics at Cambridge University, lost the use of his voice and uses a formant
synthesizer for his speech.
Concatenative synthesis is the third form, and the one on which we will spend the most
time. Here waveforms are created by concatenating parts of natural speech recorded
from humans.
Diphone Synthesis
A diphone is an adjacent pair of phones.
Here the main tasks included are:
Combining phones
Recording diphones
Labeling the diphones
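Given the definition above, mapping a phone sequence onto the diphone names to fetch from the recorded inventory is a one-liner. The hyphenated naming convention and the "pau" silence phone here are illustrative, not Festival's own:

```python
def to_diphones(phones):
    """Each adjacent pair of phones names one diphone unit."""
    return [f"{a}-{b}" for a, b in zip(phones, phones[1:])]

units = to_diphones(["pau", "hh", "ax", "l", "ow", "pau"])
```

Synthesis then concatenates the recorded waveform for each such unit, joining in the stable middle of each phone rather than at phone boundaries.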
Recording the diphones
The object of recording diphones is to get as uniform a set of pronunciations as
possible. Your speaker should be relaxed, and not suffering from a cold, a cough, or a
hangover. If something goes wrong with the recording and some of the examples need
to be re-recorded, it is important that the speaker's voice is as similar as possible to
the original recording; waiting for another cold to come along is not reasonable
(though some may argue that the same hangover can easily be induced). To keep
the voice potentially repeatable, it is also wise to record at the same time of day;
morning is a good idea. The points on speaker selection and recording in the previous
section should also be borne in mind.
Labeling the diphones
Labeling nonsense words is much easier than labeling continuous speech, whether it is
by hand or automatically. With nonsense words, it is completely defined which
phones are there and they are (hopefully) clearly articulated.
CONCLUSION
Festival offers a general framework for building speech synthesis systems as well as
including examples of various modules. As a whole it offers full text-to-speech
through a number of APIs: from shell level, through a Scheme command interpreter, as a
C++ library, from Java, and from an Emacs interface. Festival is multi-lingual
(currently English, both British and American, and Spanish), though English is the
most advanced. Festival is constantly being improved and added to as part of our
continuing research in speech synthesis at CSTR.
REFERENCES
http://www.cstr.ed.ac.uk/projects/festival/
http://www.speech.cs.cmu.edu/festival/manual-1.4.1/festival_1.html