Festival Speech Synthesis System


    ABSTRACT

    The Festival Speech Synthesis System was developed at the Centre for Speech

    Technology Research (CSTR) at the University of Edinburgh in the late 1990s. It offers a free,

    portable, language-independent, run-time speech synthesis engine for various

    platforms under various APIs. Festival is just the engine that we will use in the process

    of building voices, both as a run-time engine for the voices we build and as a tool in

    the building process itself.

    Festival offers a general framework for building speech synthesis systems as well as

    including examples of various modules. As a whole it offers full text to speech

    through a number of APIs: from the shell level, through a Scheme command interpreter, as a

    C++ library, from Java, and an Emacs interface. Festival is multi-lingual (currently

    English (British and American), and Spanish) though English is the most advanced.

    Other groups release new languages for the system.


    CONTENTS

    1. INTRODUCTION
    2. TEXT-TO-SPEECH SYSTEMS
    3. GENERAL ANATOMY OF A SYNTHESIZER
    4. CORE SYSTEM
    5. ARCHITECTURE
       5.1 String processing
       5.2 Multi-level data structures
       5.3 Relations
       5.4 Items & Features
       5.5 Utterances
    6. TEXT ANALYSIS
    7. LINGUISTIC/PROSODIC ANALYSIS
       7.1 Lexicons
       7.2 Lexicons and addenda
       7.3 Letter-to-sound rules
       7.4 Post-lexical rules
    8. WAVEFORM GENERATION
    9. DIPHONE SYNTHESIS
       9.1 Recording the diphones
       9.2 Labeling the diphones
    10. CONCLUSION


    INTRODUCTION

    The idea that a machine could generate speech has been with us for some time, but

    the realization of such machines has only really been practical within the last 50

    years. Even more recently, it's in the last 20 years or so that we've seen practical

    examples of text-to-speech systems that can say any text they're given -- though it might be "wrong."

    The creation of synthetic speech covers a whole range of processes, and though often

    they are all lumped under the general term text-to-speech, a good deal of work has

    gone into generating speech from sequences of speech sounds; this would be a

    speech-sound (phoneme) to audio waveform synthesis, rather than going all the way

    from text to phonemes (speech sounds), and then to sound.

    One of the first practical applications of speech synthesis was in 1936 when the U.K.

    Telephone Company introduced a speaking clock. It used optical storage for the

    phrases, words, and part-words ("noun," "verb," and so on) which were appropriately

    concatenated to form complete sentences.

    While speech and language were already important parts of daily life before the

    invention of the computer, the equipment and technology that has developed over the

    last several years has made it possible to have machines that speak, read, or even

    carry out dialogs. A number of vendors provide both recognition and speech

    technology, and there are several telephone-based systems that do interesting things.

    The Festival Speech Synthesis System is a free, portable, language-independent, run-

    time speech synthesis engine for various platforms under various APIs.


    TEXT-TO-SPEECH SYSTEMS

    Text-to-speech systems have an enormous range of applications.

    Their first real use was in reading systems for the blind, where a system would read

    some text from a book and convert it into speech. These early systems of course

    sounded very mechanical, but their adoption by blind people was hardly surprising as

    the other options of reading Braille or having a real person do the reading were often

    not possible. Today, quite sophisticated systems exist that facilitate human-computer

    interaction for the blind, in which the TTS can help the user navigate around a

    windowing system.

    The mainstream adoption of TTS has been severely limited by its quality. Apart from

    users who have little choice (as in the case of blind people), people's reaction to old-

    style TTS is not particularly positive. While people may be somewhat impressed and

    quite happy to listen to a few sentences, in general the novelty of this soon wears off.

    In recent years, the considerable advances in quality have changed the situation such

    that TTS systems are more common in a number of applications. Probably the main

    use of TTS today is in call-centre automation, where a user calls to pay an electricity

    bill or book some travel and conducts the entire transaction through an automatic

    dialogue system. Beyond this, TTS systems have been used for reading news stories,

    weather reports, travel directions and a wide variety of other applications.

    Other than engineering aspects of text-to-speech, it is worth commenting that research

    in this field has contributed an enormous amount to our general understanding of

    language. Often this has been in the form of negative evidence, meaning that when

    a theory thought to be true was implemented in a TTS system it was shown to be

    false; in fact as we shall see, many linguistic theories have fallen when rigorously

    tested in speech systems. More positively, TTS systems have made good testing

    grounds for many models and theories, and TTS systems are certainly interesting in

    their own terms, without reference to application or use.


    General Anatomy of a Synthesizer

    Within Festival we can identify three basic parts of the TTS process:

    Text analysis:

    From raw text to identified words and basic utterances.

    Linguistic analysis:

    Finding pronunciations of the words and assigning prosodic structure to them: phrasing, intonation and durations.

    Waveform generation:

    From a fully specified form (pronunciation and prosody) generate a waveform.

    These partitions are not absolute, but they are a good way of chunking the problem.

    Of course, different waveform generation techniques may need different types of

    information. Pronunciation may not always use standard phones, and intonation need

    not necessarily mean an F0 contour. For the most part, at least along the path that is

    likely to generate a working voice, rather than the more research-oriented techniques

    described, the above three sections will be fairly cleanly adhered to.

    There is another part to TTS which is normally not mentioned; we will mention it here

    as it is the most important aspect of Festival that makes the building of new voices

    possible -- the system architecture. Festival provides a basic utterance structure, a

    language to manipulate it, and methods for construction and deletion; it also interacts

    with your audio system in an efficient way, spooling audio files while the rest of the

    synthesis process can continue. With the Edinburgh Speech Tools, it offers basic

    analysis tools (pitch trackers, classification tree builders, waveform I/O etc) and a

    simple but powerful scripting language. All of these functions make it so that you

    may get on with the task of building a voice, rather than worrying about the

    underlying software too much.
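    Since Festival exposes this pipeline at the shell level and through its Scheme command
    interpreter, the whole process can be exercised directly. A minimal usage sketch with
    standard Festival commands (the voice name assumes the corresponding diphone voice is
    installed):

      $ echo "Hello world." | festival --tts       # speak piped text from the shell
      $ text2wave message.txt -o message.wav       # batch synthesis to a waveform file
      $ festival                                   # start the interactive Scheme interpreter
      festival> (voice_kal_diphone)                ; select an American English diphone voice
      festival> (SayText "Welcome to the Festival Speech Synthesis System.")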


    Core system

    In Festival we make a firm distinction between the core system and the modules

    which actually perform speech synthesis tasks. The core system, which includes the

    architecture, is written completely in C++ and doesn't change. Modules, on the other

    hand, can be written in C++ or Scheme and can be added or taken away from the

    system with minimal disruption (adding or taking away a C++ module requires a re-

    linking of course, but not a major recompilation). The decision of which language to

    write a module in is largely a matter of choice. Because of the interpreter aspect, it is

    usually easiest to develop modules in Scheme and, for efficiency, re-write them
    in C++ after they are stable. Some types of programming (e.g. arrays in C++,

    recursion in Scheme) are more natural in one language than the other. Finally,

    depending on personal experience, some programmers simply prefer one to the other.
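    For example, a helper that is still being experimented with can simply be defined in the
    running interpreter, with no relinking of the C++ core; the function below is only a
    trivial sketch, not one of Festival's own modules.

      ;; Count the word items of an utterance, typed or loaded into the interpreter.
      (define (count_words utt)
        (length (utt.relation.items utt 'Word)))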


    Architecture

    Speech synthesis systems require ways of storing the various types of

    linguistic information produced in the process of converting the input format

    (e.g. text) into speech. There are a number of design considerations which

    builders of synthesis architectures must take into account. We have listed

    these as follows:

    Simple linguistic objects such as words, phones, syllables and phrases need to

    be represented.

    A co-indexing mechanism is needed whereby, given a word, one can find the

    phones that comprise this word.

    Changes should be localized. For example, a change in the syllabication

    algorithm should only require changing those parts of the program directly

    involved with syllabication - other modules should not be affected.

    There should be no redundancy or duplication of information in the system.

    For instance, it is common to want to know the start and end times of

    linguistic entities. The end time of a word will be the same as the end time of

    the last phone in that word. If this information is stored separately for phone

    and word, problems will occur if one value is changed as the other will then

    become out of date.

    By their very nature, multi-lingual systems must support a wide variety of

    linguistic theories. Hence the architecture should not be tied to any particular

    linguistic theory or formalism.

    The architecture should be fast and efficient as the synthesizer is intended for

    real-time use.

    Most importantly, the real purpose of the architecture is to allow

    speech synthesis algorithms to be written as easily as possible. It is therefore

    important that the architecture should be unobtrusive and provide the sorts of


    structures and information that synthesis algorithms need. All programs must deal

    with infrastructure issues such as data storage, file I/O, memory allocation etc. If left

    unchecked, algorithms can easily become bogged down with this sort of code, which

    becomes intertwined with the actual algorithm itself. A good architecture will abstract

    the infrastructure to such an extent that these aspects are hidden, so that synthesis

    algorithms can be easily written and read without other issues getting in the way. The

    interface should be easy to use, and make the writing of synthesis algorithms easier

    and quicker than by purely ad-hoc methods.

    String processing

    Many early synthesis systems used what has been referred to as a string re-writing

    mechanism as their central data structure. In this formalism, the linguistic

    representation of an utterance is stored as a string. Initially, the string contains text,

    which is then re-written or embellished with extra symbols as processing takes place.

    Systems such as MITalk [1] and the CSTR Alvey synthesizer [5] used this method.

    There are many shortcomings in this formalism which have been previously

    recognized. The main problem with the string formalism is that it soon becomes

    unwieldy for anything apart from the most trivial of tasks. Often the string becomes
    very complex, with words, phrase symbols, stress symbols, phones etc all mixed in

    together. There are two main ways in which modules process such strings. Modules

    can work on this string directly, but the interpretation of the symbols often gets in the

    way of the algorithm itself. Alternatively, a module can parse the string into an

    internal format and let the algorithms use that. Although this may simplify the writing

    of the algorithms themselves, this is a very unwieldy approach as it means the string

    has to be parsed every time a module is called. Moreover, this often leads to each

    module having an individual internal data structure, which is unattractive from a programming point of view as new structures and techniques have to be learnt to

    understand the workings of any new module.

    To lessen these problems, information is often deleted from the string so as to keep

    only what is perceived as essential information. For instance, after the grapheme to


    phoneme conversion, the orthographic form of the word may be deleted from the

    string. This can have unfortunate consequences in that information which might

    potentially be of use to a module may have been deleted by an earlier module.

    Multi-level data structures

    In recognition of these shortcomings, many current systems have abandoned string

    based processing and now use multi-level data structures (MLDS). The most famous

    of these systems, known as Delta, was developed by Hertz [6], but many other

    systems, such as Chatr [3], the Bell labs system [7], Polyglot [4], and early versions of

    Festival [2] use similar formalisms. In the multi-layered formalism, different types of

    linguistic information are held in separate streams which are linear lists or arrays of

    linguistic items. For example we may have a word stream, a phone stream and a

    syllable stream. Some systems have a fixed set of these while others allow an

    arbitrary number. Often algorithms need to know what phones are related to a given

    word, and hence the streams must be co-indexed. There are two main types of co-

    indexing. In Delta, streams are aligned by the edges of items. To find the phones in a

    word, one goes to the beginning of the word, traces the edge down to the phone

    stream, and then progresses along the phone stream until the edge relating to the end

    of the word is found. An alternative strategy is to align by the centers of items. In this

    case, a word contains a set of links to the phones that are related to it, and the phones

    in a word can be found by following these links. While multi-level data structures are

    far preferable to string structures they still have serious drawbacks. The main

    drawback stems from the fact that this forces all information to be represented by

    linear structures: other types of structure, specifically trees, are very hard to represent.

    Partly because of this, the number of streams in a system can become considerable

    which leads to difficulties in co-indexing the items in streams. With the Delta co-

    indexing, it is often the case that for a given item in one stream, there is no

    corresponding item in other streams. For example, a pause is represented by an item

    in the phone stream, but there is no equivalent item in the syllable or word streams.

    Thus a hole must be created in these streams to ensure the co-indexing works.


    With a large number of streams, the number of holes can become considerable which

    makes processing awkward. In the centre-linking paradigm, the hole problem is absent,

    but because each item must be explicitly linked to items in other streams, the number

    of links can become very large. Furthermore, if a new stream is added, one would

    have to link every existing item to the items in the new stream to ensure full

    connectivity. As this isn't possible in practice, the streams are often left partially

    connected. This can cause confusion in writing modules, as one may be unsure of what

    streams are linked to what. Another more subtle problem occurs in multi-level

    structures due to a lack of clarity as to what an item really represents. Often (not

    always) each item has a single value, typically a name. While it is obvious that a

    suitable phone name could be something like /h/ or /e/ and a word "hello", what should
    the name of a syllable be? Commonly, syllables are regarded as organisational units,

    which serve to group together phones. As such, they don't have a distinct name. One

    could group together the names of the phones and make that the syllable name (e.g.

    "/h e l/" for the first syllable of "hello"), but this is redundant, somewhat artificial and

    likely to cause errors if the phone representation is changed.

    Relations

    We use the term relation to represent our generalization of the stream concept. A

    relation is a data structure which is used to organize linguistic items into linguistic

    structures such as trees and lists. A relation is a set of named links connecting a set of

    nodes. Lists (streams in the MLDS sense) are linear relations where each node has a

    previous and next link. Nodes in trees have up and down links also. Nodes are purely

    positional units, and contain no information apart from their links. In addition to

    having links to other nodes, each node has a single link to an item, which contains the

    linguistic information.
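    From Festival's Scheme interpreter, relations and their items can be inspected directly.
    A minimal sketch, using the access functions documented in the Festival manual, that
    walks the Word relation of a freshly synthesized utterance:

      ;; List the word items of an utterance by walking its Word relation.
      (set! utt (SayText "hello world"))
      (mapcar
       (lambda (w) (item.name w))            ; the name of each word item
       (utt.relation.items utt 'Word))       ; all items in the Word relation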


    Utterances, Relations, Items

    Items & Features

    Items consist of a bundle of features and a set of named links to nodes in relations.

    Items can be linked into any number of relations - in the above example words were

    linked into 2 relations, but in principle any number of relations is possible.

    The information in the items themselves is represented by features, which are stored as a

    list of key-value pairs. Feature values are commonly numbers or strings, but may also

    take complex objects as values if necessary. An item can have arbitrarily many

    features, including zero: syllable items often have no features at all, for example. As

    well as simple values, features can also take functions as values. Function features are

    a powerful facility which can help to greatly reduce the amount of redundant


    information in an utterance structure. Consider the case of a simple phone

    segmentation of an utterance, which comprises a contiguous list of named segments

    each with timing information. If the phones are properly contiguous, only one timing

    value for each phone is needed to fully represent all the timing information of the

    segmentation. If the end point is used, the start time of an item will be equal to the end

    point of the previous item and the duration will be equal to the start subtracted from

    the end. However, it is often useful to have access to the start times and duration of

    the phone also. Without the use of function features, there are two choices. Either the

    end information alone can be stored and calculations can be done on the fly to

    compute start times and durations, or else the start and durations can be written in as

    additional features to the item. The first solution is unattractive as this can make

    algorithm writing unwieldy and overly complicated. While the second solution may

    make algorithm writing easier, it involves effectively copying information, which

    can lead to out of date information being present. Function features provide a neat

    solution to this problem.
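    As an illustration, timing information in the Segment relation can be read through
    features and feature functions from the Scheme interpreter. The sketch below assumes
    the pathname conventions of the Festival manual (e.g. "p." for the previous item) and
    the standard segment_duration feature function; the exact set of feature functions
    available depends on the version.

      ;; For each segment: phone name, stored end time, previous segment's end
      ;; (i.e. this segment's start), and a duration feature function.
      (set! utt (SayText "hello"))
      (mapcar
       (lambda (seg)
         (list (item.name seg)
               (item.feat seg "end")
               (item.feat seg "p.end")
               (item.feat seg "segment_duration")))
       (utt.relation.items utt 'Segment))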

    Although the function feature implementation is very

    efficient in Festival, it can still sometimes be expensive to constantly evaluate these

    functions in the middle of a time critical loop. A global evaluation facility is therefore

    provided which can evaluate all the feature functions in items in a relation and re-

    write the features with their evaluation. After the loop, these can be discarded and the

    feature functions used again, thus ensuring little chance of out of date information

    being present.

    Utterances

    Utterance structures are collections of relations. The best way to visualize an

    utterance is to think of it as containing an unordered set of items, each of which is made

    up from a set of features. Relations are then structures comprising nodes, each of

    which indexes into an item. An utterance structure simply collects these together to

    form a single object.
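    In practice an utterance can be built, synthesized and inspected from the Scheme
    interpreter. A short sketch (the relations listed by utt.relationnames depend on which
    synthesis modules have run):

      (set! utt1 (Utterance Text "The quick brown fox."))
      (utt.synth utt1)                 ; run the full text-to-speech pipeline on it
      (utt.relationnames utt1)         ; e.g. (Token Word Phrase Syllable Segment ...)
      (utt.play utt1)                  ; play the generated waveform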


    Text Analysis

    Text analysis is the task of identifying the words in the text. By words, we mean
    tokens for which there is a well-defined method of finding their pronunciation, i.e.

    from a lexicon, or using letter-to-sound rules. The first task in text analysis is to make

    chunks out of the input text -- tokenizing it. In Festival, at this stage, we also chunk

    the text into more reasonably sized utterances. An utterance structure is used to hold

    the information for what might most simply be described as a sentence. We use the

    term loosely, as it need not be anything syntactic in the traditional linguistic sense,

    though it most often has prosodic boundaries or edge effects. Separating a text into

    utterances is important, as it allows synthesis to work bit by bit, allowing the
    waveform of the first utterance to be available more quickly than if the whole file

    was processed as one. Otherwise, one would simply play an entire recorded utterance

    -- which is not nearly as flexible and in some domains is even impossible.

    Utterance chunking is an externally specifiable

    part of Festival, as it may vary from language to language. For many languages,

    tokens are white-space separated and utterances can, to a first approximation, be

    separated after full stops (periods), question marks, or exclamation points. Further

    complications, such as abbreviations, other end punctuation (such as the upside-down

    question mark in Spanish), blank lines and so on, make the definition harder. For

    languages such as Japanese and Chinese, where white space is not normally used to

    separate what we would term words, a different strategy must be used, though both

    these languages still use punctuation that can be used to identify utterance boundaries,

    and word segmentation can be a second process.

    Apart from chunking, text analysis also does text normalization.

    There are many tokens which appear in text that do not have a direct relationship to

    their pronunciation. Numbers are perhaps the most obvious example. Consider the

    following sentence

    On May 5 1996, the university bought 1996 computers.

    In English, tokens consisting solely of digits have a number of different forms of

    pronunciation. The 5 above is pronounced "fifth", an ordinal, because it is the day in a


    month. The first 1996 is pronounced "nineteen ninety six" because it is a year, and

    the second 1996 is pronounced "one thousand nine hundred and ninety six"

    (British English) as it is a quantity. Two problems turn up here: the non-trivial

    relationship of tokens to words, and homographs, where the same token may have

    alternate pronunciations in different contexts. In Festival, homograph disambiguation

    is considered as part of text analysis. In addition to numbers, there are many other

    symbols which have internal structure that require special processing -- such as

    money, times, addresses, etc. All of these can be dealt with in Festival by what is

    termed token-to-word rules. These are language-specific (and sometimes text-mode-

    specific).
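    In Festival these token-to-word rules are ordinary Scheme functions that map a token
    into a list of words. The sketch below follows the style of the examples in the Festival
    manual; the hook name, its argument list and the builtin_english_token_to_words
    fall-through are assumptions that may differ between versions.

      ;; Sketch of a token-to-word rule: expand "$N" tokens into "N dollars",
      ;; and fall back to the default English behaviour for everything else.
      (define (my_token_to_words token name)
        (cond
         ((string-matches name "\\$[0-9]+")
          (append
           (builtin_english_token_to_words token (string-after name "$"))
           (list "dollars")))
         (t
          (builtin_english_token_to_words token name))))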

    Linguistic/Prosodic Analysis

    In this stage the pronunciations of the words are found and prosodic structure

    (phrasing, intonation and durations) is assigned to them. It is possible to

    find the pronunciation of a word either by a lexicon (a large list of words and their

    pronunciations) or by some method of letter to sound rules.

    Lexicons

    After we have a set of words to be spoken, we have to decide what the sounds should

    be -- what phonemes, or basic speech sounds, are spoken. Each language and dialect

    has a phoneme set associated with it, and the choice of this inventory is still not

    agreed upon; different theories posit different feature geometries. Given a set of units,

    we can, once again, train models from them, but it is up to linguistics (and practice) to

    help us find good levels of structure and the units at each.

    Lexicons and addenda

    The basic assumption in Festival is that you will have a large lexicon, tens of

    thousands of entries, that is used as a standard part of an implementation of a voice.

    Letter-to-sound rules are used as a backup when a word is not explicitly listed. This

    view is based on how English is best dealt with. However, this is a very flexible view;


    an explicit lexicon isn't necessary in Festival, and it may be possible to do much of

    the work in letter-to-sound rules. This is how we have implemented Spanish.

    However, even when there is a strong relationship between the letters in a word and

    their pronunciation, we still find a lexicon useful. For Spanish we still use the lexicon

    for symbols such as "$" and "%", individual letters, as well as irregular pronunciations. In

    addition to a large lexicon, Festival also supports a smaller list called the addenda; this is

    primarily provided to allow specific applications and users to add entries that aren't in

    the existing lexicon.
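    Lexicon and addenda entries can be manipulated from Scheme. The sketch below shows the
    entry format (word, part of speech, and syllables marked for stress) in the style of the
    Festival lexicon documentation; the word and the phone names are only examples and
    assume an American English phone set.

      ;; Add an entry to the current lexicon's addenda, then look it up.
      (lex.add.entry
       '("cepstra" n (((k eh p) 1) ((s t r ah) 0))))
      (lex.lookup "cepstra" nil)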

    Letter to sound rules

    These are context-sensitive rules. They can be generated by hand or automatically. In

    Festival there is a letter-to-sound rule system that allows rules to be written, but we

    also provide a method for building rule sets automatically which will often be more

    useful. The choice of using hand-written or automatically trained rules depends on the

    language you are dealing with and the relationship it has between its orthography and

    its phone set.
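    For the hand-written case, rule sets are defined in Scheme as named letter sets plus
    context rules. The sketch below only illustrates the general shape; the rule syntax shown
    and the rules themselves are illustrative assumptions rather than a usable English rule
    set (see the Festival manual for the exact format).

      ;; Illustrative letter-to-sound rule set (not linguistically complete).
      (lts.ruleset
       my_lts
       ;; letter sets that can be used as contexts
       ( (V a e i o u) )
       ;; rules: left context [ letters ] right context = phones
       (
        ( # [ p h ] = f )      ; "ph" at the start of a word -> /f/
        ( [ c ] V = s )        ; "c" before a vowel -> /s/ (illustration only)
        ( [ c ] = k )          ; any other "c" -> /k/
       ))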

    Post-lexical rules

    In fluent speech word boundaries are often degraded in a way that causes co-

    articulation across boundaries. A lexical entry should normally provide
    pronunciations as if the word were being spoken in isolation. It is only once the word has

    been inserted into the context in which it is going to be spoken that co-articulatory

    effects can be applied. Post-lexical rules are a general set of rules which can modify the

    segment relation (or any other part of the utterance for that matter), after the basic

    pronunciations have been found. In Festival post-lexical rules are defined as functions

    which will be applied to the utterance after intonational accents have been assigned.
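    Such a rule is simply a Scheme function over the utterance. The sketch below assumes the
    feature pathname conventions and the item.set_feat call described in the Festival manual;
    the rule itself (reducing the vowel of "the" in running speech) is only illustrative, and
    how the function is registered depends on the voice.

      ;; Post-lexical rule sketch: reduce "iy" to "ax" in the word "the".
      (define (reduce_the utt)
        (mapcar
         (lambda (seg)
           (if (and (string-equal (item.name seg) "iy")
                    (string-equal
                     (item.feat seg "R:SylStructure.parent.parent.name") "the"))
               (item.set_feat seg "name" "ax")))
         (utt.relation.items utt 'Segment))
        utt)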


    Waveform Generation

    Traditionally we can identify three different methods that are used for creating
    waveforms from such descriptions.

    Articulatory synthesis is a method where the human vocal tract, tongue, lips, etc are

    modeled in a computer. This method can produce very natural sounding samples of

    speech but at present it requires too much computational power to be used in practical

    systems.

    Formant synthesis is a technique where the speech signal is broken down into

    overlapping parts, such as formants (hence the name), voicing, aspiration, nasality etc.
    The output of individual generators is then composed from streams of parameters

    describing each sub-part of the speech. With carefully tuned parameters the

    quality

    of the speech can be close to that of natural speech, showing that this encoding is

    sufficient to represent human speech. However automatic prediction of these

    parameters is harder and the typical quality of synthesizers based on this technology

    can be very understandable but still has a distinct non-human or robotic quality.

    Formant synthesizers are best typified by MITalk, the work of Dennis Klatt, and what was

    later commercialized as DECTalk. DECTalk is at present probably the most familiar

    form of speech synthesis to the world. Dr Stephen Hawking, Lucasian Professor of

    Mathematics at Cambridge University, lost the use of his voice and uses a formant

    synthesizer for his speech.

    Concatenative synthesis is the third form which we will spend the most time on. Here

    waveforms are created by concatenating parts of natural speech recorded from humans.
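    At run time, using a concatenative (diphone) voice looks the same as using any other
    Festival voice: select it and synthesize. A brief sketch with standard Festival commands
    (the voice function is only present if that diphone voice is installed):

      (voice_kal_diphone)                          ; an American English diphone voice
      (set! utt (utt.synth (Utterance Text "Concatenative synthesis example.")))
      (utt.save.wave utt "example.wav" 'riff)      ; write the generated waveform to a file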


    Diphone Synthesis

    A diphone is the unit spanning an adjacent pair of phones, from the middle of one
    phone to the middle of the next, so that it captures the transition between them.

    Here the main tasks involved are: combining phones into a diphone list, recording the
    diphones, and labeling the diphones.

    Recording the diphones

    The object of recording diphones is to get as uniform a set of pronunciations as

    possible. Your speaker should be relaxed, and not be suffering from a cold, a cough, or a

    hangover. If something goes wrong with the recording and some of the examples need

    to be re-recorded, it is important that the speaker's voice is as similar as possible to the

    original recording; waiting for another cold to come along is not reasonable (though

    some may argue that the same hangover can easily be induced). Also, to try to keep

    the voice potentially repeatable it is wise to record at the same time of day, morning is

    a good idea. The points on speaker selection and recording in the previous section

    should also be borne in mind.

    Labeling the diphones

    Labeling nonsense words is much easier than labeling continuous speech, whether it is

    by hand or automatically. With nonsense words, it is completely defined which

    phones are present, and they are (hopefully) clearly articulated.


    CONCLUSION

    Festival offers a general framework for building speech synthesis systems as well as

    including examples of various modules. As a whole it offers full text to speech

    through a number of APIs: from the shell level, through a Scheme command interpreter, as a

    C++ library, from Java, and an Emacs interface. Festival is multi-lingual (currently

    English (British and American), and Spanish) though English is the most advanced.

    Festival is constantly being improved and added to as part of our continuing research in

    speech synthesis at CSTR.

    REFERENCES

    http://www.cstr.ed.ac.uk/projects/festival/

    http://www.speech.cs.cmu.edu/festival/manual-1.4.1/festival_1.html