

Computer Speech and Language 21 (2007) 325–349

www.elsevier.com/locate/csl


A fuzzy decision tree-based duration model for Standard Yorùbá text-to-speech synthesis

Ọdẹ́túnjí A. Ọdẹ́jọbí a,b,1, Shun Ha Sylvia Wong a,*, Anthony J. Beaumont a

a Computer Science, Aston University, Aston Triangle, Birmingham B4 7ET, UK
b Room 109, Computer Buildings, Computer Science and Engineering Department, Ọbáfẹ́mi Awólọ́wọ̀ University, Ilé-Ifẹ̀, Nigeria

Received 13 April 2005; received in revised form 13 June 2006; accepted 17 June 2006
Available online 10 August 2006

* Corresponding author. Tel.: +44 121 204 3473; fax: +44 121 204 3681. E-mail addresses: [email protected] (Ọ.A. Ọdẹ́jọbí), [email protected] (S.H.S. Wong), [email protected] (A.J. Beaumont).
1 Supported by the Commonwealth Scholarship Commission, UK.

0885-2308/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.csl.2006.06.005

Abstract

In this paper, we present syllable-based duration modelling in the context of a prosody model for Standard Yorùbá (SY) text-to-speech (TTS) synthesis applications. Our prosody model is conceptualised around a modular holistic framework. This framework is implemented using the Relational Tree (R-Tree) techniques. An important feature of our R-Tree framework is its flexibility, in that it facilitates the independent implementation of the different dimensions of prosody, i.e. duration, intonation and intensity, using different techniques, and their subsequent integration. We applied the Fuzzy Decision Tree (FDT) technique to model the duration dimension. In order to evaluate the effectiveness of FDT in duration modelling, we have also developed a Classification And Regression Tree (CART) based duration model using the same speech data. Each of these models was integrated into our R-Tree based prosody model.

We performed both quantitative (i.e. Root Mean Square Error (RMSE) and Correlation (Corr)) and qualitative (i.e. intelligibility and naturalness) evaluations on the two duration models. The results show that CART models the training data more accurately than FDT. The FDT model, however, shows a better ability to extrapolate from the training data, since it achieved better accuracy on the test data set. Our qualitative evaluation results show that our FDT model produces synthesised speech that is perceived to be more natural than that of our CART model. In addition, we also observed that the expressiveness of FDT is much better than that of CART. That is because the representation in FDT is not restricted to a set of piece-wise or discrete constant approximations. We therefore conclude that the FDT approach is a practical approach for duration modelling in SY TTS applications.
© 2006 Elsevier Ltd. All rights reserved.

1. Introduction

The main problem with modern text-to-speech (TTS) synthesis systems is the poor quality of the generated speech sound. This poor quality results from the inability of traditional TTS systems to account for speech prosody at an acceptable level of accuracy when compared to what is obtained in natural speech.


Prosody is an inherent super-segmental feature of human speech (Shen et al., 1993; Minematsu et al., 2003) which has generated a lot of research interest in TTS synthesis. That is because natural speech prosody comprises many dimensions and they interact in very complex ways. In addition to this complexity, information about speech prosody is not directly encoded in written text and has to be predicted.

Putatively, there are three dimensions of speech prosody: intonation, duration and intensity. The aim in TTS synthesis is to reproduce these three dimensions of speech as accurately as possible by implementing a model that predicts, from an input text, all prosodic phenomena that are known to be perceptually relevant in each of the identified dimensions. We have previously reported on a modular holistic approach to prosody modelling in the context of TTS for tone languages (Ọdẹ́jọbí et al., 2004a). The prosody modelling framework is implemented using the R-Tree techniques (Ehrich and Foith, 1976). The construction of an R-Tree involves using algorithms based on tone phonological rules to generate an abstract structure, called the skeletal tree (S-Tree), which represents the intonation contour of an utterance. The dimensions of the perceptually significant points on the skeletal tree (S-Tree) are then computed to synthesise the prosody of the target utterance. A major attribute of our modelling paradigm is its flexibility. It enables the implementation and evaluation of the various dimensions of prosody using different techniques independently. This provides a good test-bed for experimenting with various modelling techniques for each individual dimension of speech prosody, with the aim of selecting the best approach. A computational model for realising the intonation dimension has already been developed and demonstrated using the Standard Yorùbá (SY) language (Ọdẹ́jọbí et al., 2004b). That model uses fuzzy logic to compute the numerical values of the peaks and valleys on the intonation contour of an utterance. In this paper, we present the modelling of the duration dimension for Standard Yorùbá TTS.

The duration modelling problem in SY differs from that in non-tone languages like English. SY is a syllable-timed language in which the syllable is the basic perceptual unit of an utterance. We have shown in previous work that the locations of the peak and valley on the tone of an SY syllable encode the most perceptually significant points on the tone f0 curve (Ọdẹ́jọbí et al., 2004b). It is clear from the findings of research on SY (Ladd, 2000) that the timing of such perceptually significant points is central to the intelligibility and semantic interpretation of speech sounds in tone languages. Therefore, the problem in SY duration modelling is not how to account for the duration of segments, but how to accurately align the f0 contours of syllables, through the appropriate timing of the voiced portions of syllables, with the utterance intonation contour. Within our prosody modelling framework, it is necessary to model the acoustic evolution of speech sound by accurately timing the sequence of discrete descriptions of the speech sound. The model depends on numerical data obtained from acoustic analysis of speech. Linguistic data obtained from perceptual experiments are also used to establish the sufficiency or relative potency of individual acoustic cues. Thus, our duration modelling relies on two kinds of data: (i) numerical and (ii) linguistic. The structure and pattern of these data determine the strategy for designing the model.

In order to exploit these two types of data, we applied the Fuzzy Decision Tree (FDT) technique to compute the relative duration of each syllable in an utterance. FDT has been applied to the modelling of various problems, such as power system security assessment (Boyen and Wehenkel, 1999), weather forecasting (Yuan and Shaw, 1995; Dong and Kothari, 2001), as well as software quality models (Pedrycz and Sosnowski, 2001). Suarez and Lutsko (1999) have assessed the performance of the FDT technique on real-world problems such as the classification of diabetes data, breast cancer data and heart disease data, as well as waveform recognition. For example, Mitra et al. (2002) applied FDT to the recognition of vowels produced by a group of male speakers in a Consonant–Vowel–Consonant context. The results of these applications suggest that FDTs are better at extrapolating from training data when compared with binary decision trees such as the Classification And Regression Tree (CART).

A survey of the literature on duration modelling, in the context of prosody modelling for TTS applications, suggests that CART is the most frequently used modelling technique. There is no reported work on the application of FDT to duration modelling. In this paper, we illustrate the application of FDTs to duration modelling and compare the results of our FDT-based model with those of a CART-based duration model. We demonstrate these duration models within the context of our R-Tree based prosody model, using the Standard Yorùbá language as a case study.


In Section 2, we provide a brief overview of the Standard Yorùbá (SY) language. Section 3 provides a description of the data used for creating our duration models. Section 4 gives an overview of the factors affecting duration in SY speech and of our SY duration modelling. Section 5 contains a review of the literature on duration modelling in TTS applications. The FDT and CART based duration models are discussed in Sections 6 and 7, respectively. An evaluation and discussion of our models is provided in Section 8. Section 9 concludes this paper.

2. A brief description of the Standard Yorùbá language

Yorùbá is one of the four major languages spoken in Africa and it has a speaker population of more than 30 million in West Africa alone (Crozier and Blench, 1976; Taylor, 2000). There are many dialects of the language, but all speakers can communicate effectively using Standard Yorùbá (SY). SY is used in language education, the mass media and everyday communication. The present study is based on the SY language.

The SY alphabet has 25 letters, made up of 18 consonants (represented by the graphemes b, d, f, g, gb, h, j, k, l, m, n, p, r, s, ṣ, t, w, y) and seven vowels (a, e, ẹ, i, o, ọ, u). Note that the consonant gb is a digraph, i.e. a consonant written with two letters. There are five nasalised vowels in the language (an, en, in, ọn, un) and two pure syllabic nasals (m, n). SY has three phonologically contrastive tones: High (H), Mid (M) and Low (L). Phonetically, however, there are two additional allotones or tone variants, namely rising (R) and falling (F) (Connell and Ladd, 1990; Akinlabi, 1993). A rising tone occurs when an L tone is followed by an H tone, while a falling tone occurs when an H tone is followed by an L tone. This situation normally arises from the assimilation, elision or deletion of phonological objects as a result of co-articulation phenomena in fluent speech.

A valid SY syllable can be formed from any combination of a consonant and a vowel or a consonant and a nasalised vowel. When each of the 18 consonants is combined with each simple vowel, we obtain a total of 126 CV-type syllables. When each consonant is combined with each nasalised vowel, we obtain a total of 90 CVn-type syllables. SY also has two syllabic nasals, n and m. Table 3 shows the distribution of the components of the phonological structure of SY syllables.
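As a quick consistency check, the sketch below recomputes the syllable inventory from the counts given in this section (18 consonants, 7 oral vowels, 5 nasalised vowels, 2 syllabic nasals, 3 tones). It is an illustrative calculation only, not part of the authors' toolchain.

```python
# Recompute the SY syllable inventory from the counts given in the text.
consonants, oral_vowels, nasal_vowels, syllabic_nasals, tones = 18, 7, 5, 2, 3

cv = consonants * oral_vowels    # CV syllables: 18 x 7 = 126
cvn = consonants * nasal_vowels  # CVn syllables: 18 x 5 = 90
v = oral_vowels                  # V syllables
vn = nasal_vowels                # Vn syllables
n = syllabic_nasals              # syllabic nasal syllables

base_syllables = cv + cvn + v + vn + n   # 230 base syllables (cf. Table 3)
tone_syllables = base_syllables * tones  # 690 tone syllables (cf. Table 3)

print(cv, cvn, base_syllables, tone_syllables)  # 126 90 230 690
```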

Table 1
Segmental phonemes of Standard Yorùbá consonants (grouped by manner of articulation; the places of articulation range over bilabial, alveolar, palato-alveolar, palatal, velar, labio-velar and glottal)

Stops:        b, t, d, k, g, kp, gb
Fricatives:   f, s, ʃ, h
Affricate:    dʒ
Nasals:       m, n
Flap:         r
Lateral:      l
Semi-vowels:  w, j

Table 2
Standard Yorùbá vowel system

Oral vowels:   a, e, ẹ, i, o, ọ, u
Nasal vowels:  an, en, in, ọn, un


It should be noted that although a CVn syllable ends with a consonant, the consonant and its preceding vowel are the orthographic equivalent of a nasalised vowel. There are no closed syllables and no consonant clusters in the SY language. The SY consonant and vowel systems are shown in Tables 1 and 2, respectively. The phonetic attributes of the nucleus are important to the f0 curve of the syllable because, in our view, the tone is anchored to the nucleus of the syllable.

3. Experimental data

There is no language resource developed for SY in the context of speech technology. We therefore developed a speech database for the purpose of this research. We selected four popular SY newspapers and three SY textbooks for creating our text corpus. The newspapers are: (i) Alaroye, (ii) Alalaye, (iii) Iroyin Yoruba, and (iv) Akede Agbaye. The three textbooks are two SY language education textbooks (Bamgboṣe, 1990; Owolabi, 1998) and a book on SY culture (Ogunbọwale, 1966). In addition to these, we composed a short SY story and added it to our SY text corpus. The purpose of composing the story is to add typical dialogue-domain text to the already collected texts. It also allows us to compare the tonal and linguistic distributions in the different domains of SY text. The resulting corpus contains 95 sentences.

The analysis of the text database informed the selection of the text for our speech corpus. Out of the 690 SY tone syllables (cf. Table 3), we selected 456 syllables. These syllables were carefully selected to reflect the coverage of all syllable types in terms of phonetic and phonological distributions. For example, for the CV syllable type, the manner of articulation of the onset is considered. The onset consonants are selected from each manner-of-articulation class, i.e. stops, labio-velars, fricatives, affricates, sonorants or semi-vowels. The selected onset is combined with each vowel type, e.g. close rounded, half-closed front, etc., in order to select the syllable for each class of utterance. The same process is repeated for all the syllable types. The data set adequately represents all SY syllable types (i.e. CV, CVn, Vn, V and N).

Of the 95 sentences in our corpus, 60 sentences were selected for training our duration model. Forty of them are one-phrase sentences and each of the remaining 20 sentences contains two phrases. For the test data set, we have chosen 30 sentences from the remaining 35 sentences in our corpus. Twenty-one of them are one-phrase sentences and the remaining nine are two-phrase sentences.

Table 3
Phonological structure of SY syllables

Tone syllables (690)
    Base syllables (230):
        Onset (18): consonant C
        Rhyme (14):
            Nucleus: vocalic V (7) or non-vocalic N (2)
            Coda: n (1)
    Tones (3): H, M, L

The number within parentheses indicates the total count of the specified unit.

Table 4
Statistics for the characteristics of the training and test data sets

Category         Description        Training set   Test set
Sentences        One-phrase         40 (60%)       21 (67%)
                 Two-phrase         20 (40%)       9 (33%)
Words            Total word count   793            386
Syllable types   CV                 1007           414
                 CVn                368            234
                 Vn                 291            58
                 V                  755            292
                 N                  38             20
Tone types       H tone             984            458
                 L tone             980            305
                 M tone             492            255


The distribution of syllable and tone types in these sentences is shown in Table 4. The occurrence counts for each syllable type in context, in both the training and test data sets, are shown in Fig. 1.

Both the training and test data sets contain semantically well-formed statement sentences, selected to reflect common, everyday use of SY. Within the training data set, the minimum number of syllables per sentence is 6 and the maximum is 24. The H and L tone syllables account for 40% each, while the M tone syllables account for the remaining 20%. For the test data set, the minimum number of syllables per sentence is 4 and the maximum is 19. The H, L and M tone syllables account for 45%, 30% and 25%, respectively. Our training and test data sets contain statement sentences only because research on SY intonation has shown that the mode of the sentence does not affect the intonation (Connell and Ladd, 1990). Since intonation has the closest proximity to sentence mode, we assume that its effects on the other dimensions of prosody, i.e. duration and intensity, if present, will be minimal.

To obtain the timing information of each syllable in context for our duration modelling and evaluation experiments, we recorded and annotated all of these 90 sentences. Of the 456 recorded syllables, 350 form the training set and 106 form the test set. All of the 456 syllables were read by six participants and their voices were recorded as discussed in the following section. However, only the recorded data of one adult male speaker was used for the duration modelling. The other data were used in experiments to determine the factors that affect duration, as well as in the verification of our models.

3.1. Speech data recording and annotation

Six naïve adult native SY speakers, three females and three males, read the speech for the selected syllables and sentences in our training data set. The age of the speakers ranged from 21 to 36 years. An Andrea NC-61 microphone was used for recording on a Pentium 4.2 GHz microcomputer system with an on-board sound card. The recording took place in a quiet laboratory environment. Two freeware software products were used in this experiment: WaveSurfer (Sjölander and Beskow, 2004) and Praat (Boersma and Weenink, 2004).

In all, 2100 syllables and 360 sentences were recorded for the training data set. Each of the speakers read 60 sentences and 350 syllables. For the test set, the recording was carried out using only one male speaker.

To ensure good recordings, the recorded speech signals were inspected for the following defects:

• distortion arising from clipping,
• aliasing, detected via spectral analysis,
• noise effects arising from low signal amplitude as a result of quantisation noise or a poor signal-to-noise ratio (SNR),
• large amplitude variation, and
• transient contamination (due to extraneous noise).

Recorded speech that had any of the above listed defects was discarded and the sound recording repeated.

Fig. 1. Occurrence counts for syllable type in context.


During the speech annotation, the recordings of both syllables and sentences from only one male speaker were annotated. That data was used for the duration modelling. For the other speakers, all syllable recordings were fully annotated and used to determine the duration affecting factors. In addition, some sentence recordings from these speakers were also annotated for verification purposes.

In the annotation of the syllable speech files, only one tier is specified, i.e. the syllable tier. The symbol * is used to annotate syllable boundaries. Each syllable is labelled with its letters, with its associated tone enclosed in parentheses. For example, the syllable ba is labelled as ba(H), where ba are the graphemes of the syllable and H represents the high tone associated with the syllable.
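Labels of this form are simple enough to parse mechanically. The sketch below is a hypothetical helper, not the authors' annotation tooling; the tier string and the parse_tier name are illustrative assumptions.

```python
import re

# Parse a syllable-tier string such as "* ba(H) ta(L) lo(M) *" into
# (graphemes, tone) pairs.  The '*' symbols mark the annotated boundaries.
LABEL = re.compile(r"(?P<graphemes>[^\s()*]+)\((?P<tone>[HML])\)")

def parse_tier(tier: str):
    return [(m.group("graphemes"), m.group("tone")) for m in LABEL.finditer(tier)]

if __name__ == "__main__":
    print(parse_tier("* ba(H) ta(L) lo(M) *"))
    # [('ba', 'H'), ('ta', 'L'), ('lo', 'M')]
```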

Four tiers are specified for the sentence annotation: (i) syllable, (ii) word, (iii) phrase and (iv) sentence. When annotating the sentence speech files, the labelling order is: (i) sentence, (ii) phrase (if more than one), (iii) word and (iv) syllable. This labelling order simplifies the detection of boundaries in smaller linguistic units, since their annotation is guided by the annotation of the larger tiers. For example, after annotating the word tier in a two-syllable word, the beginning of the first syllable and the end of the last syllable can easily be determined. This approach also reduces annotation error, since larger units are much easier to identify physically from the speech signal and perceptually from listening to a replay of the sound segment. Fig. 2 shows an example of the annotation data for the SY sentence "Ó pẹ́ kí ó tó dé, kò tètè lọ." (meaning "He came late, and did not go early.").

Both the spectrogram and the waveform are used to determine syllable and word boundaries. For certain types of syllables in the continuous speech of sentences, we found that the boundaries between syllables were hard to determine. This occurred between V–V pairs or between V–CV pairs where C is a semi-vowel such as /y/ or /w/. In these situations, we employed listening tests, in addition to the spectrographic and waveform characteristics, for syllable boundary detection. Where a boundary was in doubt, we found the earliest reasonable position and the latest reasonable position, then placed the boundary half-way in between. For unvoiced plosives, i.e. /p/, /t/ and /k/, we placed the syllable boundary in the centre of the closure. We note that the voiced plosives show strong segmental effects on the f0 curves of the syllables in which they occur.

Fig. 2. Example annotation data for the sentence "Ó pẹ́ kí ó tó dé, kò tètè lọ." The four tiers (sentence, phrase, word and syllable) are aligned with the waveform and spectrogram; the syllable tier reads Ó(H) pẹ́(H) kí(H) ó(H) tó(H) dé(H), kò(L) tè(L) tè(L) lọ(M).

3.2. Factors affecting syllable duration in SY

An ideal duration model in a TTS system should consider all contextual factors that affect duration and account for the timing of all speech sounds. However, the consensus in the literature is that this task is practically impossible (Campbell and Isard, 1991; Brinckmann and Trouvain, 2003) and that there is a need to simplify the duration modelling problem to a manageable complexity without compromising crucial perceptual information. To establish the factors that affect SY syllable duration, we have conducted informal experiments on 60 recorded sentences and syllables from six native SY speakers (cf. Section 3). In these experiments, we considered nine factors that affect duration.


If we represent the target syllable (i.e. the syllable for which duration is to be computed) as S_tag and the word in which it occurs as W_tag, the factors that we selected for our duration data analysis are:

(1) The position of the syllable in W_tag, S^PoW_tag.
(2) The position of W_tag in the sentence, W^PoS_tag.
(3) The length of W_tag, W^len_tag, calculated as the number of syllables it contains.
(4) The peak of the f0 curve on S_tag, S^f0_tag.
(5) The phonetic structure of the target syllable S_tag, i.e. S^pho_tag.
(6) The phonetic structure of the preceding syllable S_pre, S^pho_pre.
(7) The phonetic structure of the following syllable S_fol, S^pho_fol.
(8) The peak of the f0 curve on the preceding syllable S_pre, S^f0_pre.
(9) The peak of the f0 curve on the following syllable S_fol, S^f0_fol.

The f0 curve on each syllable is approximated by a third-degree polynomial using the stylisation technique of d'Alessandro and Mertens (1995). The results of our experiments show that the tones of the preceding and following syllables, in terms of their peak f0 values, do not affect the duration of the target syllable. Thus, we considered only the first seven factors listed above in our duration modelling.
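The stylisation step can be pictured as fitting a third-degree polynomial to the f0 samples of the voiced portion of a syllable and reading off the peak of the fitted curve. The sketch below only illustrates that idea with NumPy; it is not the d'Alessandro and Mertens (1995) procedure itself, and the sample values are invented.

```python
import numpy as np

def f0_peak(times, f0_values, degree=3):
    """Fit a third-degree polynomial to (time, f0) samples of a syllable's
    voiced portion and return the peak f0 of the fitted (stylised) curve."""
    coeffs = np.polyfit(times, f0_values, degree)
    dense_t = np.linspace(times[0], times[-1], 200)  # evaluate on a fine grid
    curve = np.polyval(coeffs, dense_t)
    return float(curve.max())

# Illustrative (invented) f0 samples for one syllable, in Hz.
t = np.linspace(0.0, 0.18, 10)
f0 = np.array([118, 124, 131, 137, 140, 141, 139, 134, 127, 120], dtype=float)
print(round(f0_peak(t, f0), 1))
```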

4. SY syllable duration modelling

The duration of a syllable spoken in isolation differs from its duration when it occurs in the context of an utterance. This implies that the factors that affect the duration of syllables in the context of fluent speech actually modify the canonical duration. This modification can produce three effects on the duration of the canonical syllable: (i) decrease the duration, i.e. compress the syllable, (ii) leave the duration unchanged, or (iii) increase the duration, i.e. stretch the syllable.

Let g be the scaling factor for the duration of a canonical syllable. The value of g can be calculated using a multiplicative model which simply multiplies the values that each factor contributes to the change in the duration of a canonical syllable. Chen et al. (2003) have shown that such multiplicative models perform better than additive models. However, such a simplistic multiplicative model may introduce further problems if the calculation is not restricted to factors whose contribution is a positive, non-infinite value.

Let L_r and L_c be the realised and the canonical duration of a syllable, respectively. Let L_I be the amount of increase or decrease of the canonical duration of a syllable. If g is defined such that −1 < g ≤ 1.0, we can formulate the equation for computing L_I, given L_c, as:

    L_I = g L_c        (1)

where g denotes the syllable duration modifier. L_r can then be computed using:

    L_r = L_c + L_I    (2)

In Eq. (1), g acts as a scaling factor for the duration of a syllable. If g = 0, the realised syllable duration is the same as the canonical duration of the syllable in question. That is the case when monosyllabic words are spoken in isolation or at the beginning of a short sentence. When g < 0, the realised duration is reduced by the factor specified by g. For example, if g = −0.5, the realised duration is 50% shorter than the canonical duration (compressed to half of its canonical size). Likewise, if g > 0, the canonical duration of the syllable is increased. Our aim is to develop a model that predicts g by establishing a relationship between the set of factors that affect syllable duration and the duration of a syllable in the context of an utterance.
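Eqs. (1) and (2) translate directly into code. The function below is a minimal sketch of that computation; the canonical duration value in the example is invented.

```python
def realised_duration(canonical_ms: float, g: float) -> float:
    """Apply the syllable duration modifier g (Eqs. (1)-(2)):
    L_I = g * L_c and L_r = L_c + L_I, with -1 < g <= 1."""
    if not (-1.0 < g <= 1.0):
        raise ValueError("g must lie in (-1, 1].")
    l_i = g * canonical_ms      # Eq. (1): amount of increase or decrease
    return canonical_ms + l_i   # Eq. (2): realised duration

print(realised_duration(200.0, -0.5))  # 100.0 ms: compressed to half
print(realised_duration(200.0, 0.0))   # 200.0 ms: unchanged
print(realised_duration(200.0, 1.0))   # 400.0 ms: doubled
```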

5. Contemporary approaches to duration modelling

Reported works in the literature (e.g. Allen, 1994; Bellegarda et al., 2001; Huckvale, 2002) all agree that the present knowledge about speech duration, and the state of the art in speech technology in general, is still rudimentary, and that our understanding of duration patterns and the many sources of variability which affect them is still sparse (Möbius, 2003).


Since the acoustic signal is best represented quantitatively, we need a computational model that is capable of capturing and relating the numerical (quantitative) data and the rich qualitative knowledge underlying the linguistic structure of speech timing. To incorporate this knowledge into a TTS duration model, we need to recognise that speech is primarily a linguistic entity and that its acoustic manifestation as waveforms is meant to communicate the embedded linguistic message. That explains the desire for duration models to capture the relationship between linguistic information and the huge range of values assigned to the durational features of speech. An engineering approach to this problem is to design a model which employs a strategy in which durations are related to the perception of the acoustic waveform.

There are two principal classes of methods applied in the design of duration models for modern TTS systems: (i) rule-based methods and (ii) data-driven methods. The data-driven methods can be further divided into two groups: (a) iterative optimisation methods and (b) machine learning methods. The duration models resulting from these methods differ in structure as well as in the manner in which they use duration affecting factors to assign duration to units of utterance.

In rule-driven methods, a set of If-Then rules is designed based on the durational patterns observed in a study of natural speech waveforms (Hohne et al., 1983). These rules are used to modify the duration of segments with the aim of producing a good match between the natural and synthetic speech. Underpinning this approach is the idea that, by experimenting with a number of sentences and speakers, one could hopefully make a major improvement in the predicted duration and hence in the quality of the synthetic speech obtained. The Klatt (1987) duration model is perhaps the most popular rule-driven duration model. This model predicts segmental duration by starting from some intrinsic values. The intrinsic duration is modified by successively applying rules which are intended to reflect contextual factors, such as positional and prosodic factors, to lengthen or shorten the segment.

The Klatt duration model is specified by an equation which takes into account the inherent and minimum duration of a segment, measured in milliseconds. The percentage increase or decrease in the duration of segments is determined by applying If-Then rules. A major weakness of this model is that the rule parameters are determined by a manual trial-and-error process. Manually exploring the effect of mutual interactions among linguistic features of different levels is a highly complex and error-prone process. Moreover, the model does not provide a systematic structure for determining how to include or exclude a factor that affects duration. Hence, the rule inference process usually involves a controlled experiment in which only a limited number of contextual factors are examined. In addition, applying this model in a syllable-based duration model, such as ours, would require that we treat syllables as segments. That would limit the flexibility of our model because syllable-sized durations are generally less variable than segment durations (Keller and Zellner, 1995). Using this approach would also introduce some inaccuracies in the representation of the f0 anchor points, which are crucial to the location of f0 peaks and valleys on the intonation contour of our prosody model.

In iterative optimisation methods, a basic mathematical model that describes the duration pattern is first derived and then optimised using speech duration data. The Sum-Of-Product (SOP) method (van Santen, 1992, 1994) is a typical data-driven iterative optimisation method which applies both addition and multiplication to the computation of speech unit duration. The SOP model has been used in many TTS applications. It is particularly suitable for computing syllable segment duration. The idea underlying the design of this model is that the regularity in the interaction of factors affecting duration can be described by a class of simple arithmetic equations. An SOP model treats the factors that affect duration as independent variables in a formula that computes a dependent variable, i.e. duration. To achieve this, a monotonically increasing transformation function, F(), is used in conjunction with another function, D(), that can be decomposed as a sum of products of single-factor parameters. The strength of the SOP model is that the number of parameters required in the model is sufficiently small, and the arithmetic operations of multiplication and addition are mathematically sufficiently well behaved, that parameters can be estimated even under conditions of severe frequency imbalance (van Santen, 1994; Chung and Huckvale, 2001). However, Bellegarda et al. (2001) have observed that the diagnosis of an N-variable function, on the basis of joint independence, requires the testing of (N − 1) tuples of variables for independence of the Nth.


Such a diagnosis is not always successful because, apart from requiring a considerable effort in generating the model, it has been shown that the sum-of-product function is not a generalised additive function and that the choice of the usual log function for the monotonically increasing transformation function, F(), is probably not optimal (Bellegarda et al., 2001).

In machine learning approaches, the aim is to automatically generate a duration model from a large annotated speech corpus, usually with the aid of statistical methods, such as the Classification And Regression Tree (CART) (Lee and Oh, 1999; Chung, 2002), or automatic machine learning techniques, such as Artificial Neural Networks (ANN) (Fletcher and McVeigh, 1993; Chen et al., 1998; Vainio, 2001), Bayesian models (Goubanova and Taylor, 2000) or Hidden Markov Models (HMM) (Levinson, 1986; Donovan, 1996). CART is perhaps the most popular data-driven method for duration modelling in TTS applications. CARTs are particularly attractive because standard tools for their generation are widely available and, in contrast to other data-driven methods, the computed regression tree is interpretable. An additional strength of CART is the ease with which trees may be built from duration data, and the speed of classification of new data. Lee and Oh (1999) have shown that CART can cope with complex confounding interactions between factors that affect duration because it makes very few assumptions about the structure of the data.

CART embodies a binary branching tree with questions about the influencing factors at the nodes and predicted values at the leaves (Riley, 1992; Breiman et al., 1984). The tree itself contains yes/no questions about features and ultimately provides either a probability distribution, when predicting categorical values (classification tree), or a mean and standard deviation, when predicting continuous values (regression tree). Well-defined techniques can be used to construct an optimal tree from a set of training data. Furthermore, CART-induced trees can easily be converted into rules by viewing all the nodes which lead from the root to a leaf as the antecedent of a rule and the corresponding leaf as the consequence. Therefore, a major strength that the CART approach has over other data-driven methods is that CART output is more readable and often understandable by humans. This feature is particularly important when developing a duration model for a new language, as it makes it possible to iteratively evaluate and improve the model.

However, it is well known that CART is unable to accurately extrapolate from known to unknown contexts (Riley, 1992). Furthermore, due to the way that a CART is structured, CART allows either a single feature or a linear combination of features at each internal node. This makes CART, like other binary decision-tree algorithms, biased towards generality. Another well-known weakness of CART is its inability to handle sparse data.

van Santen (1994) has shown that SOP models can successfully handle the data sparsity problem. Therefore, a model that contains a mixture of probabilistic and prescriptive elements would be better suited to our duration modelling. The Fuzzy Decision Tree (FDT) modelling technique meets this requirement (Janikow, 1998; Huang and Liang, 2002). The major strengths of fuzzy logic algorithms are that they are robust and flexible and that they are able to cope well with interactions of linguistic attributes. Hence, they can easily be tailored to cope with small disjuncts, which are associated with large degrees of attribute interaction (Carvalho and Freitas, 2002). In light of these characteristics of FDT, we hypothesise that FDT could be a suitable technique for duration modelling in the context of TTS. In this paper, we describe our application of FDT to duration modelling.

6. FDT in duration modelling

The motivation for selecting the fuzzy decision tree approach for our duration model is founded upon the hypothesis that the proportionate relationships among the confounding factors that affect duration at various phonological levels can be captured by an appropriately-designed model. Such a model must, on the one hand, establish a relationship between the linguistic levels and a qualitative description of duration phenomena. On the other hand, it must facilitate a transparent link between the qualitative descriptions and the quantitative values that are responsible for the timing of speech waveforms.

The FDT is an appropriate model in this context because it does not impose any arithmetic or multiplicative restrictions (relationships), or any inherent linearity by way of empirical rules. Thus it is able to exploit a very important property of the interaction between factors that affect duration, i.e. these interactions are often regular in the sense that the effects of one factor do not reverse those of another (van Santen, 1994; Campbell, 2000).


An FDT model facilitates the computation of a more globally optimal result because it has the ability to compute the relative effects on the duration of a syllable of all child nodes corresponding to factors affecting duration, before subsequently combining and aggregating them through the defuzzification process.

6.1. Problem formulation

We can formulate the duration modelling problem as a classification/regression problem. This is because a number of independent variables (i.e. the factors that affect duration) are used to compute the duration of a syllable.

Given a set of training samples composed of observed input/output pairs consisting of N labelled examples, {(x_n, y_n); n = 1, 2, ..., N}, our aim is to derive a general model which can be used to compute output values for any new set of inputs. The new inputs may be in the training set or the test set. In the context of duration modelling, the input variables (i.e. the attributes) are the relevant parameters describing the factors affecting the duration of the unit of utterance in focus, e.g. a syllable or phone, and the output is a numerical value specifying the actual duration of, or the modification to, each speech unit in an utterance (i.e. g).

To formulate this problem as a fuzzy classification/regression problem, let U = {u_j}, j = 1, ..., n, represent the universe of objects that describe the factors affecting the duration of a syllable. Each of these n objects is described by a collection of attributes A = {A_1, A_2, ..., A_r}. Each attribute A_k measures some important feature of an object and can be limited to a set of m linguistic terms T(A_k) = {T^k_1, T^k_2, ..., T^k_m}. T(A_k) is the domain of the attribute A_k. Each numerical attribute A_k can be defined as a linguistic variable which takes linguistic values from T(A_k). Each linguistic value T^k_j is also a fuzzy set defined over the range of the numerical values of the variable, i.e. its Universe of Discourse (UoD). The membership function μ_{T^k_j} indicates the degree to which object u's attribute A_k belongs to T^k_j. The membership of a linguistic value can be subjectively assigned, or inferred by a membership function defined over its UoD.

6.2. Fuzzy Decision Tree (FDT) design

The potential of fuzzy decision trees to improve robustness and generalisation in classification is due to their use of fuzzy reasoning. Underlying fuzzy reasoning is the concept of a fuzzy set. A fuzzy set is represented by a membership function which maps numerical data onto the closed interval [0, 1]. While in classical logic the results of the operations of conjunction and implication are unique, in fuzzy logic there is an infinite number of possibilities. When a crisp number from the universe of the class variable is sought and a number of other restrictions on the fuzzy sets and operators are applied, rules can be evaluated individually and then combined. This approach, called local inference, gives a compositionally simple alternative with good approximation characteristics, even when all the necessary conditions are not satisfied.

A fuzzy decision tree gives results within the closed interval [0, 1], as the degree of possibility of an object matching a class. Fuzzy decision trees therefore provide a more robust way to avoid misclassification. Each path of a fuzzy decision tree, from the root to a leaf, forms a decision rule, which can be represented in the form: IF (x_1 IS A_1) AND (x_2 IS A_2) AND ... AND (x_n IS A_n) THEN (class = C_j). In the case of our model, each x_i represents a factor that affects duration and the A_i are the constraints defined over the universe of discourse of that factor. The C_j is the duration scaling class (i.e. Increase or Decrease).
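A single path of the tree behaves like the rule form above: each antecedent term contributes a membership degree, and the degrees are combined by a fuzzy AND. The sketch below uses the minimum operator for the conjunction, one common choice; the paper's aggregation function is not spelled out here, so the operator, the rule contents and the triangular membership functions (used only to keep the sketch short; the model itself uses trapezoids, cf. Section 6.2.1) are illustrative assumptions.

```python
# Evaluate one fuzzy decision rule of the form
#   IF (x1 IS A1) AND ... AND (xn IS An) THEN (class = Cj)
# using min() as the fuzzy AND over the antecedent membership degrees.

def triangular(a, b, c):
    """Return a triangular membership function over [a, c] with peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Illustrative rule: IF word_position IS Final AND word_length IS Short
#                    THEN class = Increase
rule = {
    "antecedent": {
        "word_position": triangular(0.6, 1.0, 1.4),  # "Final"
        "word_length": triangular(-0.4, 0.0, 0.4),   # "Short"
    },
    "consequent": "Increase",
}

def firing_strength(rule, inputs):
    return min(mu(inputs[name]) for name, mu in rule["antecedent"].items())

print(rule["consequent"],
      firing_strength(rule, {"word_position": 0.9, "word_length": 0.1}))
```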

Our aim is to exploit the Fuzzy ID3 algorithm developed by Janikow (1998) to address the problem of duration modelling in the context of SY prosody modelling. The Fuzzy ID3 which we have adopted (Janikow, 1998) differs from traditional ID3 algorithms (e.g. Quinlan, 1986) in that the algorithm does not create a leaf node only if all data belong to the same class, but also does so in the following cases: (i) if the proportion of a data set of a class C_k is greater than or equal to a threshold, (ii) if the number of elements in a data set is less than a threshold, or (iii) if there are no more attributes for classification. More than one class name may be assigned to one leaf node. In addition to these, the fuzzy sets of all attributes are defined depending on the pattern of the data. Each attribute is processed as a linguistic variable using fuzzy restrictions such as X_1 IS Low, X_1 IS Medium, etc. Our FDT duration model implementation follows the steps in the literature (e.g. Yuan and Shaw, 1995; Olaru and Wehenkel, 2003):


(1) Fuzzify the training data.
(2) Build a set of fuzzy decision trees.
(3) Obtain an optimal tree using pruning techniques.
(4) Apply the FDT to predict duration.

In the following subsections, we describe how these steps are applied to the design of our FDT-based duration model.

6.2.1. Fuzzification of the input space

The FDT is an approximation structure that computes the degree of membership of the duration affecting factors to a particular syllable duration scaling class (i.e. Increase or Decrease). There are two types of data in our duration model: categorical and numerical. As shown in Table 5, four of the seven input variables in our duration model are numerical and are treated as continuous variables. The numerical data must be fuzzified into linguistic terms through the fuzzification process. The fuzzy membership functions used to fuzzify the numerical data are derived as follows.

We assume that these variables are factorable, such that fuzzy subsets can be defined over their Universe of Discourse (UoD). Since all the factors are normalised, their UoD is defined over the closed interval [0, 1]. We first partition the UoD for each of the numerical variables into subranges, with each subrange labelled with a linguistic term. For simplicity, we restrict the number of linguistic terms to three for the continuous input variables and to two for the output variable (i.e. Increase and Decrease). We used the trapezoidal function to model our membership functions because it is simple and there are algorithms for deriving and implementing it (Kosko, 1994). In addition, the trapezoidal membership function is frequently used in fuzzy theory to model relatively stable data such as syllable duration. The algorithm for generating the membership functions in our model is described as follows.

Assume that a factor that affects duration, A, has numerical value x. The numerical values of attribute A for all linguistic terms u ∈ U can be represented by X = {x(u), u ∈ U}. We defined the trapezoidal function for each variable as a four-tuple (l_j, ml_j, mr_j, r_j) (Mitaim and Kosko, 2001; Kosko, 1994), where ml_j ≤ mr_j ∈ R. The variables l_j > 0 and r_j > 0 denote the distance of the support of the function to the left and right of ml_j and mr_j, whose centre is m_j = (ml_j + mr_j)/2. The degree to which a crisp value x belongs to the fuzzy set u_j, i.e. μ_{u_j}(x) ∈ [0, 1], is computed using the membership function:

    μ_{u_j}(x) =  1.0 − (ml_j − x)/l_j,   if ml_j − l_j ≤ x ≤ ml_j
                  1.0,                    if ml_j ≤ x ≤ mr_j
                  1.0 − (x − mr_j)/r_j,   if mr_j < x ≤ mr_j + r_j
                  0.0,                    otherwise                      (3)

The graphical representation of Eq. (3) is shown in Fig. 3. The membership functions of each of the four numerical input variables are shown in Fig. 4. Fig. 5 depicts the membership function of the output variable described in Table 6.

Table 5
Factors affecting syllable duration

No.  Affecting factor                                 Type         Values/fuzzy terms
1.   Length of word in which the syllable occurs      Numerical    Short, Medium, Long
2.   Position of the syllable in the word             Numerical    Initial, Medial, Final
3.   Position of the word in the sentence             Numerical    Initial, Medial, Final
4.   Value of the f0 peak of the syllable             Numerical    Low, Mid, High
5.   Structure of the preceding syllable              Categorical  CV, V, CVn, Vn, N
6.   Structure of the target syllable                 Categorical  Blank/pause, CV, V, CVn, Vn, N
7.   Structure of the following syllable              Categorical  Blank/pause, CV, V, CVn, Vn, N
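The sketch below is a direct transcription of Eq. (3), together with one illustrative three-term partition of a normalised factor. The breakpoints chosen for the Initial/Medial/Final terms are assumptions made for the sketch, not the parameters used to build the model.

```python
def trapezoid(x, l, ml, mr, r):
    """Trapezoidal membership of Eq. (3), parameterised by the four-tuple
    (l, ml, mr, r): support extends l to the left of ml and r to the right of mr."""
    if ml - l <= x < ml:
        return 1.0 - (ml - x) / l
    if ml <= x <= mr:
        return 1.0
    if mr < x <= mr + r:
        return 1.0 - (x - mr) / r
    return 0.0

# Illustrative partition of a normalised UoD [0, 1] into three linguistic terms.
TERMS = {
    "Initial": (0.2, 0.0, 0.2, 0.3),   # (l, ml, mr, r) -- assumed breakpoints
    "Medial":  (0.3, 0.4, 0.6, 0.3),
    "Final":   (0.3, 0.8, 1.0, 0.2),
}

def fuzzify(x):
    return {term: round(trapezoid(x, *params), 3) for term, params in TERMS.items()}

print(fuzzify(0.75))   # partial membership in Medial and Final
```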


Fig. 3. Graphical representation of membership function.

Fig. 4. Membership functions of the continuous duration affecting factors. (a) Membership function for word length. (b) Membership function for position of syllable in word. (c) Membership function for position of word in sentence. (d) Membership function for peak f0 values of tone.


6.2.2. Building the FDT

We used the FID3.3 software developed by Janikow (2004) for building the FDT. The variables and parameters required for implementing our FDT-based duration model are defined in Table 7. From our training data set of 60 SY statement sentences (cf. Section 3.2), we generated a set of 250 data items. Each data item corresponds to a syllable in a sentence and comprises the values of the seven factors listed in Table 5. The data set is split into two disjoint parts: 220 data items were used to build our FDT model and the remaining 30 were used for cross-validation. Out of the 220 data items, we first built our FDT using 200 items. To obtain an optimum tree, the resulting tree was then pruned using the remaining 20 data items. The pruning process is described in Section 6.2.3. The algorithm depicted in Fig. 6 implements the tree building process.


Fig. 5. Membership function for the output.

Table 6
Syllable duration predicted

No.  Predicted output                                           Fuzzy restrictions
1.   Degree to which the syllable is stretched or compressed    Increase, Decrease

Table 7
A summary of our FDT variables, functions and parameters

Variable/function/parameter   Description
V_i                 A variable representing one (i.e. the ith) of the duration affecting factors
V^i_p               A fuzzy term p defined for variable V_i (e.g. V^{Word_Length}_{Short})
μ()                 The membership function μ_{V_i}(x) for variable V_i defined over the crisp input u. It determines how well the crisp value for variable V_i satisfies the restriction [V_i IS V^i_p]. E.g. μ^{Word_Length}_{Long}(x) determines the degree to which the value x satisfies the fuzzy restriction [Word_Length IS Long]. The derivation of the membership functions is explained in Section 6.2.1
f1()                An aggregation function that combines the levels of satisfaction of the fuzzy restrictions of the conjunctive antecedent
f2()                A function that propagates the satisfaction of the antecedent to the consequence
X^N_j               The membership of example e_j in node N. It is computed incrementally using μ() and f1()
X^N                 {X^N_j} is the set of memberships in node N for all training examples
D_i                 Fuzzy set for the input variable V_i. E.g. D_i = {Short, Medium, Long} for V_i = Word_Length
|D_i|               Cardinality of the fuzzy set D_i, i.e. the number of linguistic terms defined over V_i. For all of our input variables, |D_i| = 3
P^N_k               Example count for decision V^c_k ∈ D_c in node N
P^N|(u_i unknown)   The total count of examples in node N with unknown values for V_i
P^N|V^i_p           The total count of examples in node N with V_i = V^i_p
I^N|V^i_p           The information content in node N with V_i = V^i_p
V^N                 The set of attributes appearing on the path leading to node N
G^N_i               Information gain, computed as I^N − IS^N_{V_i}


A number of trees were generated by varying the parameters used to run the FID3.3 program. The fuzzy decision tree shown in Fig. 7 illustrates the structure of such trees. Each non-terminal node of the FDT contains: (i) the attribute used to split the node (i.e. Attr) and (ii) the total example count for each decision (i.e. increase and decrease) in the node. The two values in the terminal nodes indicate the example counts for each of the two possible decisions.

The example count X^N_j is computed as the membership of example e_j in node N. It represents the membership in the multidimensional fuzzy set defined by the fuzzy restrictions found in F^N. It is computed incrementally using the functions μ() (cf. Eq. (3)) and f1(), as explained in Table 7.


Fig. 6. FDT building algorithm.


The information gain is used to determine the candidate input factors that will be used to partition the data set. To determine the factor that would create an optimal partition of the data, we compute the weighted information content for each factor affecting duration. To compute the information gain, we first compute the standard information content of a factor, I^N (cf. Table 7). The weighted information content of the factor over the speech data, adjusted for missing values, i.e. IS^N_{V_i}, is also computed. The difference between these two values (i.e. I^N − IS^N_{V_i}) is the information gain for the factor under consideration. The data is partitioned on the factor that has the highest information gain. This partitioning process is repeated until the remaining data items do not yield a unique classification.
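The node-splitting step can be sketched as follows: class counts in a node are membership-weighted example counts, the node's information content is the entropy of those counts, and the gain for a candidate factor is that entropy minus the membership-weighted information content of the children. This mirrors standard (fuzzy) ID3; the handling of missing values and the exact weighting used by FID3.3 are omitted, and the example counts below are invented.

```python
from math import log2

def entropy(counts):
    """Entropy of membership-weighted class counts,
    e.g. {"Increase": 8.4, "Decrease": 3.1}."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)

def information_gain(node_counts, children_counts):
    """Gain of splitting a node on one factor: the node's information content
    minus the weighted information content of the children produced by that
    factor's linguistic terms."""
    i_n = entropy(node_counts)
    total = sum(sum(c.values()) for c in children_counts)
    weighted = sum((sum(c.values()) / total) * entropy(c) for c in children_counts)
    return i_n - weighted

node = {"Increase": 60.0, "Decrease": 40.0}     # invented membership counts
split = [{"Increase": 45.0, "Decrease": 5.0},   # e.g. term "Final"
         {"Increase": 10.0, "Decrease": 20.0},  # e.g. term "Medial"
         {"Increase": 5.0, "Decrease": 15.0}]   # e.g. term "Initial"
print(round(information_gain(node, split), 3))
```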

As shown in Fig. 7, the position of the word in the sentence, W^PoS_tag, produced the highest information gain over the entire data set and it is at the root of the FDT. The path along the Final linguistic term defined over W^PoS_tag leads directly to a terminal node whose example count for decreasing the syllable duration (i.e. Dec = 0.22) is far less than that for increasing the syllable duration (i.e. Inc = 106.21). This shows that the duration of a syllable in the word at the final position of a sentence has a very high degree of increase. This tree pattern confirms the well-known final lengthening phenomenon in SY (Connell and Ladd, 1990). The high degree of increase that the final syllable undergoes caused the partitioning to end for syllables at the final position, as predicted by the FDT. The exact amount of increase that the syllable duration will undergo is computed by the defuzzification process explained in Section 6.2.4. Note that the tree in Fig. 7 is built using only those duration affecting factors with numerical values.


Fig. 7. FDT for the numerical-valued duration affecting factors.


6.2.3. Fuzzy decision tree pruning

Since the FID3.3 program does not incorporate a pruning algorithm, the decision trees generated above vary in size and structure. This influences the performance of both the tree and the fuzzy rules that will be extracted from it. The generated decision trees therefore need to be pruned in order to achieve optimal performance. To evaluate the efficiency of the decision trees, we applied the T-measure developed by Mitra et al. (2002). The criteria underlying the T-measure are:

(1) The shallower the depth of the tree, the better, since it will take less time to reach a decision.
(2) The presence of an unresolved terminal node is undesirable.
(3) The distribution of labelled leaf nodes at different depths affects the performance of the tree. A tree whose frequently-accessed leaf nodes are at shallower depths is more efficient in terms of time.

The T-measure for a decision tree is computed using Eqs. (4) and (5):

    T = (2n − Σ_{i=1}^{N_lnodes} w_i d_i) / (2n − 1)        (4)

    w_i = N_i / N       for a resolved leaf node
    w_i = 2 N_i / N     otherwise                           (5)

where n = 7 is the number of attributes of a pattern, d_i is the depth of a leaf node, N_lnodes is the number of terminal (leaf/unresolved) nodes, N = 200 is the total number of patterns in the training set and N_i is the total number of training patterns that percolate down to the ith leaf node. The value of T lies in the interval [0, 1). A value of 0 for T is undesirable and a value close to 1 signifies a good decision tree. Using this measure, we selected the best decision tree among those generated by the FID3.3 software discussed above. Fig. 8 shows the resulting tree.
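Eqs. (4) and (5) in code: each terminal node contributes its weight w_i times its depth d_i, with unresolved nodes penalised by doubling the weight. The leaf list in the example is invented; n = 7 and N = 200 follow the text.

```python
def t_measure(leaves, n_attributes=7, n_patterns=200):
    """T-measure of Eqs. (4)-(5).  `leaves` is a list of
    (depth, n_patterns_reaching_leaf, resolved) tuples for the terminal nodes."""
    weighted_depth = 0.0
    for depth, n_i, resolved in leaves:
        w_i = n_i / n_patterns if resolved else 2.0 * n_i / n_patterns
        weighted_depth += w_i * depth
    return (2 * n_attributes - weighted_depth) / (2 * n_attributes - 1)

# Invented example: three resolved leaves and one unresolved leaf.
leaves = [(1, 90, True), (2, 60, True), (3, 30, True), (3, 20, False)]
print(round(t_measure(leaves), 3))   # close to 1 indicates a good tree
```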


Fig. 8. Fuzzy decision tree for the duration model.


6.2.4. Applying FDT to duration modelling

The solution provided by the FDT is based on estimates made at all leaf nodes of the tree. The final decision is obtained from a collection of alternative decision paths that branch out from the root node and end at a leaf node of the FDT (Suarez and Lutsko, 1999).

To compute the effect of the duration affecting factors on a given syllable sample e_s, the FDT algorithm evaluates the succession of tests from the root node, following a path that is determined by the results of those tests at each of the internal nodes. Eventually this path leads to one terminal node, say t_l. The degree F^N_l to which the duration sample e_s belongs to the leaf node t_l is then computed. The F^N_l values for all paths that start from the root and end at a leaf node are computed in this manner. The final prediction is made by combining or aggregating these values. For any given vector of factors that affect duration, the value of the predicted duration modifier is equal to the weighted average of the F^N_l values given by each of the leaves. The weight of a given leaf in the average is the degree of membership of the example to the leaf in question. The computation of the final duration modification factor is achieved by the defuzzification process.

FID3.3 provides a number of defuzzification schemes for achieving this goal. These include: (i) best majority class, (ii) centre of gravity, and (iii) maximum majority class. We adopt the best majority class scheme in our model because it produces better accuracy.
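The weighted-average combination described above can be sketched as follows: each leaf contributes its predicted modifier value, weighted by the degree to which the example belongs to that leaf. FID3.3's "best majority class" scheme differs in detail, so this is only an illustration of the aggregation idea; the leaf values and memberships are invented.

```python
def weighted_average_defuzzify(leaf_predictions):
    """Combine leaf-level predictions into one duration modifier.
    `leaf_predictions` is a list of (membership_degree, leaf_value) pairs,
    where leaf_value is the duration modifier g associated with that leaf."""
    total_membership = sum(m for m, _ in leaf_predictions)
    if total_membership == 0.0:
        return 0.0    # no evidence: leave the duration unchanged
    return sum(m * v for m, v in leaf_predictions) / total_membership

# Invented example: three leaves reached with different degrees of membership.
print(round(weighted_average_defuzzify([(0.6, 0.4), (0.3, 0.1), (0.1, -0.2)]), 3))
# -> 0.25: the canonical syllable duration would be stretched by 25%.
```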

7. CART duration model

In order to compare the performance of the FDT-based duration model with the standard CART method, we implemented a CART-based duration model using the Edinburgh University CART building software "Wagon" (Black et al., 1999). The development of a duration model based on CART involves building a tree by training it on the input (i.e. affecting factors) and output (i.e. syllable duration modifier) data collected in respect of speech duration. The tree building algorithm successively divides the feature space to minimise the prediction error in duration values. After the tree construction phase, a relatively large tree T_max is obtained. Some branches of T_max are successively pruned, resulting in a sequence of trees. The best among these trees is selected using a test sample that is independent of the training sample. This results in a tree with optimal performance. The pruning process is done automatically.



Our CART model was built using the same set of training data and test data as in the development of the FDT. The sets are presented in the form {(x_n, y_n); n = 1, 2, . . . , N}, where x_n is the feature vector of the corresponding affecting factors and y_n is the scaling value for syllable duration. The contents of the input file are shown in Fig. 9.

The variables PSylType, TSylType and FSylType correspond to the structure of the preceding, target and following syllables, respectively. The variables NumberOfSyllable, PositionOfSyllable, and PositionOfWord are the number of syllables in the word (word length), the position of the syllable in the word, and the position of the word in the sentence, respectively. The variable DegOfIncren is the dependent variable and it is the degree of stretch or compression of the target syllable. F0Value is the f0 peak of the target syllable. The f0 curve on each citation syllable was stylised using a third degree polynomial (d’Alessandro and Mertens, 1995). The peak of the stylised curve (i.e. the F0Value in our CART input description file) is taken as a numerical value to represent the tone of the syllable. That value (when compared with discrete tone types) gives more information about the tone on the syllable.

The tree building process starts with the tree consisting of only the root node t_1 containing all cases. The task is to find the optimal binary split of the data. For a real-valued feature i, all splits of the form x_i^n < s are tested, where s denotes a predefined threshold value. For an M-valued categorical feature i, the splits have the form x_i ∈ h, where h goes through all subsets of the set of all possible values of feature i. The best split across all features is selected and the data in the root node is split into left and right nodes, i.e. (t_L, t_R). This procedure is applied recursively to all descendants until a stopping condition is fulfilled.
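A minimal sketch of this exhaustive split search for a single real-valued feature is given below. The data, the feature chosen and the use of the sum of squared errors as the split criterion are illustrative assumptions, not the Wagon implementation.

```python
# Hypothetical sketch of searching for the best binary split x_i < s on one
# real-valued feature, scoring thresholds by the sum of squared errors of
# the duration modifiers y in the two resulting partitions.

def sse(ys):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_threshold(xs, ys):
    """Return (threshold, error) of the best split x < s for one feature."""
    best_s, best_err = None, float("inf")
    for s in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        if not left or not right:
            continue
        err = sse(left) + sse(right)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

# Hypothetical data: syllable position in the word vs. duration modifier.
positions = [1, 2, 3, 4, 5, 6]
modifiers = [0.20, 0.15, 0.05, -0.05, -0.20, -0.25]
print(best_threshold(positions, modifiers))  # best split is position < 4
```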

The CART tree is built in an incremental fashion. We set aside some of the training data for cross validation. The tree building process begins with a small stop value of 8. The stop value is the minimum number of samples required in a tree partition before a split is attempted. The stop values are varied during each iteration of the tree building process. During each iteration, the generated tree is pruned back to where it best matches the set-aside data. We have used the stop values 8, 9, 10 and 12 and found that the stop value of 9 gave an optimum tree.
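The stop-value search can be sketched as follows. This is only a hedged analogue: it uses scikit-learn's DecisionTreeRegressor, with min_samples_split standing in for Wagon's stop value, and it selects the value whose tree best fits the held-out data rather than reproducing Wagon's prune-back step.

```python
# Hypothetical analogue of selecting a stop value against set-aside data,
# using scikit-learn in place of Wagon; min_samples_split plays the role
# of the stop value and Wagon's prune-back step is omitted.
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def pick_stop_value(X_train, y_train, X_heldout, y_heldout,
                    stop_values=(8, 9, 10, 12)):
    """Return the stop value whose tree best fits the held-out data."""
    best_stop, best_err = None, float("inf")
    for stop in stop_values:
        tree = DecisionTreeRegressor(min_samples_split=stop, random_state=0)
        tree.fit(X_train, y_train)
        err = mean_squared_error(y_heldout, tree.predict(X_heldout))
        if err < best_err:
            best_stop, best_err = stop, err
    return best_stop

# Tiny synthetic usage: two numeric features -> duration modifier.
X = [[i, i % 3] for i in range(40)]
y = [0.02 * (i % 7) - 0.05 for i in range(40)]
print(pick_stop_value(X[:30], y[:30], X[30:], y[30:]))
```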

We expect our CART-based duration model to predict the value of the scale factor for the duration of a syllable, which is then used to compute the realised duration. The syllable duration is calculated by the equation:

$$\mathrm{Duration}_r = \mathrm{Duration}_c + (\mathrm{Duration}_c \times \mathrm{PrecdScale}) \qquad (6)$$

where Duration_c and Duration_r are the canonical and realised duration, respectively. PrecdScale is the predicted scaling factor for compressing/stretching the syllable duration. For example, if PrecdScale is −0.25, the syllable is reduced by 25% of its original duration. If PrecdScale equals 1.0, the syllable duration is doubled (cf. Eq. (1)). A typical tree for the numerical factors that affect duration, generated by the CART, is shown in Fig. 10. The optimal tree generated by the CART algorithm, comprising all the duration affecting factors, is shown in Fig. 11.
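A one-line illustration of Eq. (6), assuming a hypothetical canonical duration of 180 ms and the scaling of the canonical duration described by the worked examples above:

```python
# Illustration of Eq. (6) with a hypothetical canonical duration.
def realised_duration(canonical_ms, precd_scale):
    """Stretch or compress the canonical duration by the predicted scale factor."""
    return canonical_ms + canonical_ms * precd_scale

print(realised_duration(180.0, -0.25))  # 25% shorter -> 135.0 ms
print(realised_duration(180.0, 1.0))    # doubled     -> 360.0 ms
```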

Fig. 9. CART input description file.


Fig. 10. CART Tree for numerical duration affecting factors.

Fig. 11. Optimal CART for duration model.


8. Evaluation and discussion

In terms of theoretical computational complexity, the CART model should outperform our FDT model. That is because, in the worst case, i.e. with completely overlapping subsets, the complexity of building a balanced fuzzy decision tree will be O(‖GS‖^2 × ‖a‖), where ‖GS‖ is the number of learning instances used for building the tree and ‖a‖ is the number of candidate attributes. This evaluates to O(‖300‖^2 × ‖7‖) = O(6.3 × 10^5) for our FDT model. That is significantly worse than O(‖GS‖ log‖GS‖ × ‖a‖), i.e. O(‖300‖ log‖300‖ × ‖7‖) = O(5.2 × 10^3), which is the complexity of building a crisp decision tree. Also, the search for an optimal dichotomy will be significantly more demanding in FDT than in the crisp discretisation procedure (Boyen and Wehenkel, 1999; Olaru and Wehenkel, 2003).
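These two order-of-magnitude figures can be reproduced directly; the quoted 5.2 × 10^3 implies a base-10 logarithm:

```python
# Reproduce the worst-case complexity estimates quoted above.
import math

instances, attributes = 300, 7
fdt_cost = instances ** 2 * attributes                        # ~6.3e5
crisp_cost = instances * math.log10(instances) * attributes   # ~5.2e3
print(f"FDT:   {fdt_cost:.1e}")    # 6.3e+05
print(f"crisp: {crisp_cost:.1e}")  # 5.2e+03
```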

However, the theoretical evaluation does not necessarily correlate with practical performance. To assess the practical performance of the models, we carried out qualitative and quantitative evaluations on both duration models. We have used both the training and test data sets discussed in Section 3 for our evaluations. For the quantitative evaluation, we applied the Root Mean Square Error (RMSE) (Hermes, 1998; Clark and Dusterhoff, 1999) and the Pearson’s correlation of the actual versus predicted duration for the two models. The transcription accuracy and the Mean Opinion Scores (MOS) (Donovan, 2003; Sakurai et al., 2003) were used to evaluate the intelligibility and naturalness, respectively, in the qualitative evaluation.
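For reference, the two quantitative measures can be computed as in the sketch below; the actual and predicted duration values shown are hypothetical.

```python
# Sketch of the two quantitative measures: RMSE and Pearson's correlation
# between actual and predicted syllable durations (hypothetical values, ms).
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson_corr(actual, predicted):
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    return cov / (sa * sp)

actual = [120.0, 150.0, 180.0, 200.0, 140.0]
predicted = [125.0, 140.0, 175.0, 210.0, 150.0]
print(rmse(actual, predicted), pearson_corr(actual, predicted))
```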



8.1. Quantitative evaluation

The quantitative evaluation provides a performance index of how the model fits the data. A high correlation and low RMSE indicate a good fit. The results of the quantitative evaluation of the two duration models are shown in Fig. 12. When considering the quantitative evaluation results from individual syllable types, the FDT-based duration model produces lower RMSE and higher correlation for the CV and N type syllables from the training data set. For example, while the FDT model produced an RMSE of 15.11 ms and a correlation of 0.91 for the training set for CV type syllables (see Fig. 12(a) and (b)), the CART model produced an RMSE of 17.65 ms and a correlation value of 0.87 (see Fig. 12(c) and (d)). This pattern is repeated for the N type syllables, where the FDT model (RMSE = 10.51 ms, Corr = 0.91) is better than the CART model (RMSE = 10.99 ms, Corr = 0.87). When the overall duration database is considered, the CART model (RMSE = 13.92 ms, Corr = 0.88) performs slightly better than the FDT model (RMSE = 14.12 ms, Corr = 0.87) on the training data, but the FDT model (RMSE = 17.59 ms, Corr = 0.79) outperforms the CART model (RMSE = 22.15 ms, Corr = 0.75) on the test data. We observed that the difference in this quantitative performance is consistent but relatively small. Our results confirm the observations from Riley (1992) that CART is weak in extrapolating accurately from known to unknown contexts.

To put our evaluation results in the context of contemporary work on duration modelling for other languages, we have included the quantitative results of those models in Table 8. The results show that our FDT and CART models compare well with other state-of-the-art models. However, it is well known that quantitative results need not correspond to the perceptual quality of the synthesised speech. In order to establish the practical performance of our models, we performed qualitative evaluations.

Fig. 12. Quantitative evaluation of the FDT and CART based duration models.


Table 8
Quantitative results for various duration models (based on test set results)

Language                              Model type                       RMSE (ms)   Corr.
American English (van Santen, 1994)   SOP                              –           0.90
Korean (Lee and Oh, 1999)             CART                             22.00       0.82
Korean (Chung, 2002)                  CART                             25.11       0.77
Czech (Batusek, 2002)                 CART                             20.30       0.79
Mandarin (Chen et al., 2003)          Regression                       15.47       –
Mandarin (Chen et al., 2003)          Hybrid statistical/regression    11.18       –
Mandarin (Lin et al., 2003)           Recurrent fuzzy neural network   20.16       –
Our SY FDT model                      FDT                              17.59       0.72
Our SY CART model                     CART                             22.15       0.75


8.2. Qualitative evaluation

Our preliminary qualitative evaluation involves a measure of how the perceptual quality of the synthesised speech mimics that of the natural speech in terms of intelligibility and naturalness. The same training and test sets used for the quantitative evaluation were also used for the qualitative evaluation.

Nineteen naïve adult native SY speakers were invited to participate in the qualitative tests. To ascertain their hearing ability, they were all subjected to an initial screening process. This process involved playing some natural speech sounds to them and asking them to write down what they heard. Those who failed to produce 100% accuracy in this test were excluded from the evaluation experiment. Other participants were removed because their responses were inconsistent. For example, some of them rated the quality of some synthetic speech higher than the natural speech. As a result, a total of seven participants were removed and 12 participated in the final qualitative evaluations. Each of the 12 participants took about 45 min to complete the evaluation. The intelligibility evaluation was done first, and after a 5 min break the naturalness evaluation followed.

In carrying out the qualitative evaluations, we used two kinds of stimuli: modified and unmodified (Wu and Chen, 2001; Sakurai et al., 2003). The unmodified stimuli are naturally produced utterances recorded without any modification to the acoustic data. The modified stimuli are versions of the same naturally produced utterances in which the duration data has been replaced by those generated by our FDT and CART based models. The duration tier manipulation for the modified stimuli was achieved using the Praat speech processing software. In all, 90 stimuli were created, 30 for each of the natural (unmodified), the FDT model and the CART model. The Pitch Synchronous Overlap-Add (PSOLA) method was then used to synthesise the utterances (Moulines and Charpentier, 1990).

8.3. Intelligibility evaluation

For our intelligibility test, only the speech sounds synthesised using computed syllable durations were played to each participant. After a speech sound was played, the participant was asked to repeat what they heard. The transcription error, in terms of the number of syllables in the original sentences that were wrongly identified by the participants, was recorded. Our intelligibility evaluation is very rigorous in that the participants were not only required to identify the tones on each synthesised utterance, they were also required to accurately identify the syllables associated with each tone.

Table 9
Results for the intelligibility evaluation

Data set       Duration model   Intelligibility score   Significance
Training set   FDT              4.50 (0.63)             Not significant (p > 0.05)
               CART             3.80 (0.67)
Test set       FDT              4.10 (0.71)             Not significant (p > 0.05)
               CART             3.60 (0.58)

Standard deviations are shown in parentheses.



We then computed the intelligibility score from the transcription errors using Eq. (7).

$$\mathrm{Intelligibility} = \left( \frac{T_{\mathrm{All}} - T_{\mathrm{Wrong}}}{T_{\mathrm{All}}} \right) \times 5.0 \qquad (7)$$

where T_All is the total number of syllables in a sentence and T_Wrong is the number of syllables that had been wrongly identified.
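Eq. (7) maps each sentence's transcription outcome onto the 0–5 scale reported in Table 9; a minimal sketch with hypothetical counts:

```python
# Minimal sketch of the per-sentence intelligibility score of Eq. (7).
def intelligibility(total_syllables, wrongly_identified):
    """Score in [0, 5]; 5 means every syllable (and its tone) was identified."""
    return ((total_syllables - wrongly_identified) / total_syllables) * 5.0

print(intelligibility(12, 2))  # 10 of 12 syllables correct -> ~4.17
```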

The results of the intelligibility tests are shown in Table 9. For the FDT-based duration model, a transcription accuracy of 4.50 (SD 0.63) was obtained for the training set. For the test set, the transcription accuracy is 4.10 (SD 0.71). The CART-based duration model produced a transcription accuracy of 3.60 (SD 0.58) for the test set and a transcription accuracy of 3.80 (SD 0.67) for the training set.

We used the sign test (Anderson et al., 2002; Rana et al., 2005) to assess the statistical difference in the perceived quality of the synthetic speech produced by the two models. The results show that the intelligibility scores for these duration models are not significantly different (p > 0.05). These results indicate that the listeners do not have a preference for the intelligibility of the synthetic speech produced using the FDT based model over the CART based model.
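The sign test used here can be sketched as follows: the two models' scores are compared pair by pair, ties are dropped, and the number of positive differences is referred to a binomial distribution with p = 0.5. The paired scores in the example are hypothetical.

```python
# Hypothetical sketch of a two-sided sign test on paired quality scores.
from math import comb

def sign_test(pairs):
    """Return the two-sided p-value for paired (a, b) scores; ties are dropped."""
    diffs = [a - b for a, b in pairs if a != b]
    n = len(diffs)
    k = sum(d > 0 for d in diffs)
    # Two-sided tail probability under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical paired ratings, (FDT, CART), one pair per listener.
pairs = [(4, 3), (4, 4), (5, 3), (3, 4), (4, 3), (5, 4), (4, 3), (3, 3)]
print(sign_test(pairs))  # ~0.219: not significant at p = 0.05
```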

8.4. Naturalness evaluation

For the naturalness test, the participants were asked to rank the naturalness of the utterance using a scale of 1–5 as shown in Table 10. The results for the training set (see Table 11) show that the naturalness quality of the unmodified speech, with an MOS score of 5.0, is higher than that of the CART (3.4) and FDT (3.7) models. A sign test shows that the unmodified speech is highly preferred (p ≤ 0.001) by the listeners when compared with the speech generated using the CART based duration model.

Although the FDT has a higher MOS score, the naturalness quality of speech synthesised using the FDT based duration model is not significantly (p > 0.05) better than that of the CART model. This indicates that, for the training data set, there is no evidence that the listeners preferred the synthetic speech generated using the FDT based duration model over that generated using the CART based model.

A similar pattern is repeated for the test set when the naturalness of the unmodified speech is compared with that of the synthesised speech produced by the two duration models (see Table 12). However, the naturalness of the synthetic speech generated by the FDT model is significantly (p ≤ 0.05) preferred over that of the CART model.

Table 10
Qualitative evaluation scores

Value   Description
5       Perfect, indistinguishable from natural speech quality
4       Very good
3       Average
2       Poor
1       Weak or not acceptable

Table 11
Results for naturalness evaluation (Training set)

Comparison         Duration model   MOS score     Significance
Natural vs. CART   Natural          5.0 (0.001)   Significant (p ≤ 0.001)
                   CART             3.4 (0.440)
Natural vs. FDT    Natural          5.0 (0.001)   Significant (p ≤ 0.001)
                   FDT              3.7 (0.130)
FDT vs. CART       FDT              3.7 (0.130)   Not significant (p > 0.05)
                   CART             3.4 (0.440)


Table 12
Results for naturalness evaluation (Test set)

Comparison         Duration model   MOS score    Significance
Natural vs. CART   Natural          4.9 (0.01)   Significant (p ≤ 0.001)
                   CART             3.1 (0.54)
Natural vs. FDT    Natural          4.8 (0.02)   Significant (p ≤ 0.001)
                   FDT              3.4 (0.39)
FDT vs. CART       FDT              3.4 (0.39)   Significant (p < 0.05)
                   CART             3.1 (0.54)


We can therefore conclude that, for the test data set, the synthetic speech generated using the FDT based duration model is relatively more natural.

8.5. Discussion

The training and test data sets contain similar combinations of factors that affect duration (cf. Section 3). Our evaluation results show that, when compared with our FDT model, the CART model performs better for the training data set in the quantitative evaluation since it has a higher correlation and lower RMSE. That implies that CART models the duration data in the training data set more accurately. However, its lower accuracy for modelling the test data set suggests that FDT is more capable of extrapolating from the training data to a new or unknown data set. On the other hand, our qualitative evaluations show that the FDT model performed better than the CART model on the training data set (although not significantly) and performed marginally better than the CART model on the test data set (cf. Tables 11 and 12).

Our results are in line with the findings that the ‘‘objective’’ performance statistics (i.e. RMSE and correlation) do not always predict the ‘‘subjective’’ perception judgements correctly (Brinckmann and Trouvain, 2003). The result of our analysis can be interpreted from two perspectives. First, it is well known that CART is very good at modelling training data accurately, but poor at extrapolating to unknown data (Riley, 1992). The results of our quantitative evaluations confirm this fact.

Second, based on our results, we can speculate that FDT is able to capture some salient aspects of the duration data that have greater perceptual significance. That speculation is based on the fact that our FDT model exploits linguistically meaningful terms in the partitioning of the affecting factors for the duration variables. These linguistic terms are determined based on a subjective treatment of how the duration variables influence the perception of the speech sound. This leads to an inclusion of factors according to their degree of relevance to the overall duration pattern predicted by the FDT based model. Hence, all of the potential factors that affect duration were taken into account with different weights in the interval [0.0, 1.0]. On the other hand, CART performs a binary partitioning of the variables that represent the factors affecting duration. That results in an ‘‘all-or-nothing’’ situation whereby a factor is either included or rejected.
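The contrast can be made concrete with a toy example: a crisp split either includes or excludes a value of an affecting factor, whereas a fuzzy partition assigns it a graded weight in [0.0, 1.0]. The factor, the ramp shape and the breakpoints below are purely illustrative.

```python
# Illustrative contrast between a crisp split and a fuzzy partition of one
# duration-affecting factor (e.g. the position of the syllable in the word).

def crisp_is_late(position, threshold=3):
    """CART-style test: the value is either in the partition or not."""
    return 1.0 if position >= threshold else 0.0

def fuzzy_is_late(position, start=2, full=4):
    """Ramp-shaped membership: graded degree of being a 'late' syllable."""
    if position <= start:
        return 0.0
    if position >= full:
        return 1.0
    return (position - start) / (full - start)

for pos in range(1, 6):
    print(pos, crisp_is_late(pos), round(fuzzy_is_late(pos), 2))
```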

Furthermore, our corpus only contains short to medium length SY sentences (i.e. sentences with 6–24 syllables). Our results indicate that FDT performs slightly better than CART in modelling short to medium length SY sentences. Nonetheless, the present investigation is biased towards that kind of sentence. Since a comparison with the Klatt duration model suggests that the CART model has a tendency to perform better when modelling duration for long sentences (i.e. >30 syllables) (Brinckmann and Trouvain, 2003), our evaluation results may be different when long sentences are considered.

9. Conclusion

We have presented duration models based on a Fuzzy Decision Tree (FDT) and a Classification And Regression Tree (CART) in the context of prosody modelling for SY text-to-speech synthesis. Since the duration modelling is syllable-based, we first carried out a set of exploratory analytical experiments to determine the most important factors affecting the duration of SY syllables. The results of that experiment led to the selection of seven duration factors which were then used to produce the duration models.



The duration of citation versus contextual syllables is used to predict a scale factor by which the duration of a citation syllable will be multiplied to reflect the perceptual quality of its contextual equivalent.

Results of our qualitative and quantitative evaluations show that CART models the training data more accurately than FDT. The FDT model, however, is better at extrapolating from the training data since it produced a better accuracy for the test data set. Synthesised speech produced by the FDT duration model was also ranked higher in quality than that of the CART model. These results confirm the well-known fact that CART possesses very good interpolation but poor extrapolation capabilities (Breiman et al., 1984; Barbosa and Bailly, 1994). The good extrapolation capability of FDT makes it an ideal model for implementing duration for TTS applications, given the sparseness of duration data.

We also observed that the expressiveness of FDT is better than that of CART. This is because the representation in FDT is not restricted to a set of piece-wise or discrete constant approximations. In addition, fuzzification of the input data imposes a continuity constraint at the boundaries of node splits in the FDT. This acts as a mechanism to limit the degree of overfitting of the FDT. Furthermore, fuzzification and global optimisation provide a continuous representation with the flexibility necessary to reproduce duration patterns at a finer granularity.

According to our qualitative and quantitative evaluations, CART produces better objective results than FDT, but FDT produces non-significantly better subjective results. This shows that neither model is precise enough to be clearly distinguished from the other. One may therefore speculate that the presented results imply that both modelling techniques are not well correlated with subjective scores. However, further work is required to confirm this speculation.

We can conclude, therefore, that the resulting fuzzy decision trees exhibit high comprehensibility, and that fuzzy set and approximate reasoning methods provide a natural means to deal with continuous domains, subjective linguistic terms as well as noisy data, which are characteristics of the duration dimension of the speech signal. When compared with the CART-based approach, our FDT-based duration model captures some salient aspects of the speech signal that have more perceptual significance. In this regard, the FDT model is more appropriate for modelling duration in the context of TTS applications for the Standard Yorùbá language.

Arguably, the data set used in the present study is relatively small and our corpus contains relatively short statement sentences. However, our data contain all important duration affecting factors. Furthermore, this is a preliminary investigation. In order to carry out more extensive analysis and develop a more robust model, we plan to extend the scope of the sentences to include other domains and sentence modes in future.

References

Akinlabı, A., 1993. Underspecification and phonology of Yoruba /r/. Linguistic Inquiry 24 (1), 139–160.
Allen, J.B., 1994. How do humans process and recognize speech? IEEE Transactions on Speech & Audio Processing 2 (4), 567–577.
Anderson, D.R., Sweeney, D.J., Williams, T.A., 2002. Statistics for Business and Economics, 8th ed. South-Western, United Kingdom.
Bamgbos:e, A., 1990. Fono:lo:ji ati Girama Yoruba. University Press PLC, Ibadan.
Barbosa, P.A., Bailly, G., 1994. Characterisation of rhythmic patterns for text-to-speech synthesis. Speech Communication 15, 127–137.
Batusek, R., 2002. A duration model for Czech text-to-speech synthesis. In: Proceedings of the First International Conference on Speech Prosody, Aix-en-Provence, pp. 167–170. Available from: http://www.ipl.univ-aix.fr/sp2002/pdf/bastusek.pdf. Visited: Sep 2004.
Bellegarda, J.R., Silverman, K.E.A., Lenzo, K., Anderson, V., 2001. Statistical prosody modelling: from corpus design to parameter estimation. IEEE Transactions on Speech & Audio Processing 9 (1), 52–66.
Black, A., Clark, R., King, S., Heiga, Z., Taylor, P., Caley, R., 1999. The Festival speech synthesis system: system documentation, version 1.4.0. Available from: http://www.cstr.ed.ac.uk/projects/festival/manual/festival-25.html#SEC112. Visited: Apr 2004.
Boersma, P., Weenink, D., 2004. Praat, doing phonetics by computer. Available from: http://www.fon.hum.uva.nl/praat/. Visited: Mar 2004.
Boyen, X., Wehenkel, L., 1999. Automatic induction of fuzzy decision trees and its application to power systems' security assessment. Fuzzy Sets & Systems 102, 3–19.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Tree. Wadworth, CA, USA.
Brinckmann, C., Trouvain, J., 2003. The role of duration models and symbolic representation for timing in synthetic speech. International Journal of Speech Technology 6, 21–31.
Campbell, N., 2000. Timing in speech: a multilevel process. In: Prosody: Theory and Experiment. Kluwer, Dordrecht, pp. 281–334.
Campbell, N., Isard, S.D., 1991. Segmental durations in a syllable frame. Journal of Phonetics 19 (1), 37–47.
Carvalho, D.R., Freitas, A.A., 2002. A genetic-algorithm for discovering small-disjunct rules in data mining. Applied Soft Computing 2, 75–88.
Chen, S.-H., Hwang, S.-H., Wang, Y.-R., 1998. An RNN-based prosodic information synthesiser for Mandarin text-to-speech. IEEE Transactions on Speech & Audio Processing 6 (3), 226–239.
Chen, S.-H., Lai, W.H., Wang, Y.-R., 2003. A new duration modelling approach for Mandarin speech. IEEE Transactions on Speech & Audio Processing 11 (4), 308–320.
Chung, H., 2002. Duration models and the perceptual evaluation of spoken Korean. In: International Conference on Speech Prosody, Aix-en-Provence, France, pp. 219–222.
Chung, H., Huckvale, M., 2001. Linguistic factors affecting timing in Korean with application to speech synthesis. In: Proceedings of EuroSpeech'01, Aalborg, Denmark, pp. 815–818.
Clark, R.A.J., Dusterhoff, K.E., 1999. Objective methods for evaluating synthetic intonation. In: Proceedings of the Sixth European Conference on Speech Communication Technology, vol. 4, Budapest, pp. 1623–1626.
Connell, B., Ladd, D.R., 1990. Aspect of pitch realisation in Yoruba. Phonology 7, 1–29.
Crozier, D.H., Blench, R.M., 1976. An Index of Nigerian Languages, second ed. Summer Institute of Linguistics, Dallas.
d'Alessandro, C., Mertens, P., 1995. Automatic pitch contour stylization using a model of tonal perception. Computer Speech & Language 9, 257–288.
Dong, M., Kothari, R., 2001. Look-ahead based fuzzy decision tree induction. IEEE Transactions on Fuzzy Systems 9 (3), 461–468.
Donovan, R.E., 1996. Trainable Speech Synthesis. PhD thesis, Cambridge University Engineering Department, Cambridge.
Donovan, R.E., 2003. Topics in decision tree based speech synthesis. Computer Speech & Language 17, 43–67.
Ehrich, R.W., Foith, J.P., 1976. Representation of random waveforms by relational trees. IEEE Transactions on Computers C-25 (7), 725–736.
Fletcher, J., McVeigh, A., 1993. Segment and syllable duration in Australian English. Speech Communication 13, 355–365.
Goubanova, O., Taylor, P., 2000. Using Bayesian belief networks for model duration in text-to-speech systems. In: Proceedings of ICSLP2000.
Hermes, D.J., 1998. Measuring the perceptual similarity of pitch contour. Journal of Speech Language and Hearing Research 41, 73–82.
Hohne, H.D., Coker, C., Levinson, S.E., Rabiner, L.R., 1983. On the temporal alignment of sentence of natural and synthetic speech. IEEE Transactions on Speech & Audio Processing ASPP-31 (4), 807–813.
Huang, H.-P., Liang, C.-C., 2002. Strategy-based decision making of a soccer robot system using a real-time self-organising fuzzy decision tree. Fuzzy Sets & Systems 127, 49–64.
Huckvale, M., 2002. Speech synthesis, speech simulation and speech science. In: Proceedings of the International Conference on Speech and Language Processing, Denver, pp. 1261–1264.
Janikow, C.Z., 1998. Fuzzy decision trees: issues and methods. IEEE Transactions on Systems, Man, & Cybernetics 28 (1), 1–14.
Janikow, C.Z., 2004. FID33 fuzzy decision tree. Available from: http://www.cs.umsl.edu/~janikow/fid/fid32/overview.htm. Visited: Jan 2005.
Keller, E., Zellner, B., 1995. A statistical timing model for French. In: XIIIeme Cong. Int. Des. Sci. Phon., vol. 3, Stockholm, pp. 302–305.
Klatt, D.H., 1987. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82 (3), 737–793.
Kosko, B., 1994. Fuzzy systems as universal approximators. IEEE Transactions on Computers 43 (11), 1329–1333.
Ladd, D.R., 2000. Tones and turning points: Bruce, Pierrehumbert, and the elements of intonation phonology. In: Horne, M. (Ed.), Prosody: Theory and Experiment – Studies presented to Gosta Bruce. Kluwer, Dordrecht, pp. 37–50.
Lee, S., Oh, Y.-H., 1999. Tree-based modelling of prosodic phrasing and segmental duration for Korean TTS systems. Speech Communication 28 (4), 283–300.
Levinson, S.E., 1986. Continuously variable duration Hidden Markov Models for speech analysis. In: Proceedings of IEEE ICASSP, pp. 1241–1244.
Lin, C.-H., Wu, R.-C., Chang, J.-Y., Liang, S.-F., 2003. A novel prosodic-information synthesizer based on recurrent fuzzy neural networks for Chinese TTS system. IEEE Transactions on Systems, Man, & Cybernetics B, 1–16.
Minematsu, N., Kita, R., Hirose, K., 2003. Automatic estimation of accentual attribute values of words for accent sandhi rules of Japanese text-to-speech conversion. IEICE Transactions on Information & Systems E86-D (3), 550–557.
Mitaim, S., Kosko, B., 2001. The shape of fuzzy sets in adaptive function approximation. IEEE Transactions on Fuzzy Systems 9 (4), 637–656.
Mitra, S., Konwar, K.M., Pal, S.K., 2002. Fuzzy decision tree, linguistic rules and fuzzy knowledge-based network: generation and evaluation. IEEE Transactions on Systems, Man, & Cybernetics 32 (4), 328–339.
Mobius, B., 2003. Rare events and closed domains: two delicate concepts in speech synthesis. International Journal of Speech Technology 6, 57–71.
Moulines, E., Charpentier, F., 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9 (5–6), 453–467.
Ogunbo:wale, P.O., 1966. Asa Ibile Yoruba. University Press Limited, Jericho, Ibadan, Nigeria.
Olaru, C., Wehenkel, L., 2003. A complete fuzzy decision tree technique. Fuzzy Sets & Systems 138, 221–254.
Owolabı, K., 1998. Ijinle: Itupale: Ede Yoruba: Fone:tiiki ati Fono:lo:ji, first ed. Onibonoje Press & Book Industries (Nig.) Ltd., Ibadan.
O:de:jo:bı, O.A., Beaumont, A.J., Wong, S.H.S., 2004a. A computational model of intonation for Yoruba text-to-speech synthesis: design and analysis. In: Sojka, P., Kopecek, I., Pala, K. (Eds.), Lecture Notes in Artificial Intelligence, Lecture Notes in Computer Science (LNAI 3206). Springer-Verlag, Berlin, pp. 409–416.
O:de:jo:bı, O.A., Beaumont, A.J., Wong, S.H.S., 2004b. Experiments on stylisation of standard Yoruba language tones. Technical Report KEG/2004/003, Aston University, Birmingham.
Pedrycz, W., Sosnowski, Z.A., 2001. The design of decision trees in the framework of granular data and their application to software quality models. Fuzzy Sets & Systems 123, 271–290.
Quinlan, J.R., 1986. Induction on decision trees. Machine Learning 1, 81–106.
Rana, D.S., Hurst, G., Shepstone, L., Pilling, J., Cockburn, J., Crawford, M., 2005. Voice recognition for radiology reporting: is it good enough? Clinical Radiology 60, 1205–1212.
Riley, M.D., 1992. Tree-based modelling of segmental durations. In: Bailly, G., Benoit, C., Sawallis, T.R. (Eds.), Talking Machines: Theories, Models and Designs. Elsevier, Amsterdam, pp. 265–273.
Sakurai, A., Hirose, K., Minematsu, N., 2003. Data-driven generation of f0 contours using a superpositional model. Speech Communication 40 (4), 535–549.
Shen, X.S., Lin, M., Yan, J., 1993. f0 turning point as an f0 cue to tonal contrast: a case study of Mandarin tones 2 and 3. Journal of the Acoustical Society of America 93 (4), 2241–2243.
Sjolander, K., Beskow, J., 2004. Wavesurfer 1.7. Available from: http://www.speech.kth.se/wavesurfer/. Visited: Jun 2004.
Suarez, A., Lutsko, J.F., 1999. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis & Machine Intelligence 21 (12), 1297–1311.
Taylor, C., 2000. Typesetting African languages. Available from: http://www.ideography.co.uk/library/afrolingua.html. Visited: Apr 2004.
Vainio, M., 2001. Artificial neural network based prosody models for Finnish text-to-speech synthesis. PhD thesis, Department of Phonetics, University of Helsinki, Helsinki.
van Santen, J.P.H., 1992. Contextual effects on vowel duration. Speech Communication 11 (6), 513–546.
van Santen, J.P.H., 1994. Assignment of segmental duration in text-to-speech synthesis. Computer Speech & Language 8, 95–128.
Wu, C.-H., Chen, J.-H., 2001. Automatic generation of synthesis units and prosody information for Chinese concatenative synthesis. Speech Communication 35, 219–237.
Yuan, Y., Shaw, M.J., 1995. Induction of fuzzy decision trees. Fuzzy Sets & Systems 96, 125–139.