1
PREDICTION OF CREAKY VOICE FROM CONTEXTUAL FACTORS Thomas Drugman TCTS Lab University of Mons, Belgium John Kane Phonetics and Speech Laboratory Trinity College Dublin, Ireland Tuomo Raitio Department of Signal Processing and Acoustics Aalto University, Espoo, Finland Summary: Creaky voice is voice quality frequently produced in many languages The analysis shows that a few contextual factors, related to speech production preceding a silence or a pause, are of particular interest for prediction Four prediction methods based on training and generating a creaky probability stream with HMMs are compared on US English and Finnish speakers The best prediction technique performs comparable to the creaky detection al- gorithm on which HMMs were trained 1 Introduction Creaky voice (or vocal fry) is a voice quality frequently pro- duced in many languages. In order to enhance the natural- ness of speech synthesis, a proper use of creaky voice should be included. The goal of this paper is two-fold: 1. Analyse how informative contextual factors are in term of predicting creaky voice 2. Investigate various creaky voice prediction sche- mes based on HMMs 2 Creaky voice detection An automatic creaky voice detection technique, described in [1] is used. The algorithm involves the use of two acoustic features which characterise two different aspects of the creaky excitation: – H2–H1 of a resonator output – Residual peak prominence These features are used as part of a decision tree classifier for binary creaky decision. 3 Analysis of contextual features related to creaky voice The aim here is to investigate: 1) Which contextual factors are the most relevant in predicting creaky usage 2) To what extent contextual factors can be useful for the prediction For US English, the standard complete list of 53 contextual factors in the HTS implementation [2, 3] are used, relating to phoneme, syllable, word, phrase, utterance type and position. The predictability power of each contextual factor was assessed based on its mutual information (MI) with the creaky use decisions. Only 13 contextual factors are found to have interesting normalised MI values higher than 15%. The contextual factors are closely related with creaky use at the end of a sentence or a word group. Examples of contextual factors: • Phoneme {preceding,current,succeeding} phoneme position of current phoneme in current syllable • Syllable no. of phonemes at {preceding,current,succeeding} syllable accent of {preceding,current,succeeding} syllable stress of {preceding,current,succeeding} syllable position of current syllable in word • Word part of speech of {preceding, current, succeeding} word number of syllables in {preceding, current, succeeding} word position of current word in current phrase number of words {from previous, to next} content word • Phrase: number of syllables in {preceding, current, succeeding} phrase • Utterance: number of syllables in current utterance 4 Creaky voice prediction methods based on HMMs Four different creaky voice prediction methods are experimented with. The creaky voice related features are trained along with the conventional HMM-based synthesis features, F 0 and spectrum: I. PredictedFeat The two features given by the creaky detection algorithm are trained in two separa- te streams, after which the prediction is drawn from the decision tree used in de- tection method. II. Binary The bina- ry decision output of the creaky detection algorithm is trained in continuous stream. The final decision is made by thresholding the trained probability with a pre-specified value. III. Binary MSD The binary decision output of the creaky detection algo- rithm is trained in multi- space probability distribu- tion stream, aligned with F 0 . The final decision is the stream output. IV. PosteriorProb The posterior probability given by the creaky voice detec- tion algorithm is trained in a continuous stream. The fi- nal decision is made by th- resholding the trained pro- bability. BDL MV 0 10 20 30 40 50 60 70 80 TPR (%) BDL MV 0 0.5 1 1.5 2 2.5 3 3.5 FPR (%) BDL MV 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 PredictedFeat Binary Binary MSD PosteriorProb F1 score 5 Results The methods are tested on US English (BDL) and Finnish (MV) databases. Frame-level metrics, true positive rate (TPR or recall), false positive rate (FPR), and F1 score are evaluated. Across both speakers and all metrics, the Posterior- Prob method gives the best performance. Database Method Misses FAs Hits BDL PredictedFeat 66 24 98 Binary 68 19 96 Binary MSD 68 17 96 PosteriorProb 40 37 124 MV PredictedFeat 54 29 109 Binary 46 25 117 Binary MSD 47 24 116 PosteriorProb 28 39 135 6 Conclusions Firstly, it has been investigated how how contextual information is related to the use of creaky voice. Contextual factors linked to speech production prece- ding a silence or a pause appears to be highly relevant, leading to normalised mutual information values up to 32%. This confirms that vocal fry has a syn- tactic role by making a better delimitation of groups of words and by making phrase segmentation easier. In the second experiment, four methods are proposed to predict the use of creaky voice based on HMMs. It is shown that modelling the posterior probability given by the detection algorithm leads to the best results across all metrics. This technique achieves performance scores comparable to the determination rates obtained by the detection method on which it is trained. References [1] Kane, J., Drugman, T., Gobl, C., “Improved automatic detec- tion of creak”, Computer Speech and Language, 27(4):1028– 1047, 2013. [2] Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A. and Tokuda, K., “The HMM-based speech synthesis sys- tem (HTS) version 2.0”, in Sixth ISCA Workshop on Speech Synthesis, Aug. 2007, pp. 294–299. [3] [Online] “HMM-based speech synthesis system”, http://hts.sp.nitech.ac.jp. Acknowledgements This research was supported by FNRS, the Science Foundation Ireland (FastNet), the Irish Department of Arts, Heritage and the Gaeltacht (ABAIR project), the EU’s FP7 project Simple4All, Academy of Finland, and Aalto University’s MIDE UI-ART project. Contact [email protected], [email protected], tuomo.raitio@aalto.fi

PREDICTION OF CREAKY VOICE FROM …...PREDICTION OF CREAKY VOICE FROM CONTEXTUAL FACTORS Thomas Drugman TCTS Lab University of Mons, Belgium John Kane Phonetics and Speech Laboratory

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: PREDICTION OF CREAKY VOICE FROM …...PREDICTION OF CREAKY VOICE FROM CONTEXTUAL FACTORS Thomas Drugman TCTS Lab University of Mons, Belgium John Kane Phonetics and Speech Laboratory

PREDICTION OF CREAKY VOICE FROM CONTEXTUAL FACTORSThomas DrugmanTCTS LabUniversity of Mons, Belgium

John KanePhonetics and Speech LaboratoryTrinity College Dublin, Ireland

Tuomo RaitioDepartment of Signal Processing and AcousticsAalto University, Espoo, Finland

Summary:◦ Creaky voice is voice quality frequently produced in many languages◦ The analysis shows that a few contextual factors, related to speech productionpreceding a silence or a pause, are of particular interest for prediction◦ Four prediction methods based on training and generating a creaky probabilitystream with HMMs are compared on US English and Finnish speakers◦ The best prediction technique performs comparable to the creaky detection al-gorithm on which HMMs were trained

1 Introduction

Creaky voice (or vocal fry) is a voice quality frequently pro-duced in many languages. In order to enhance the natural-ness of speech synthesis, a proper use of creaky voice shouldbe included. The goal of this paper is two-fold:

1. Analyse how informative contextual factors are interm of predicting creaky voice

2. Investigate various creaky voice prediction sche-mes based on HMMs

2 Creaky voice detection

An automatic creaky voice detection technique,described in [1] is used. The algorithm involves theuse of two acoustic features which characterise twodifferent aspects of the creaky excitation:

– H2–H1 of a resonator output

– Residual peak prominence

These features are used as part of a decision treeclassifier for binary creaky decision.

3 Analysis of contextual features related to creaky voice

The aim here is to investigate:1) Which contextual factors are the most relevant in predicting creaky usage2) To what extent contextual factors can be useful for the prediction

For US English, the standard complete list of 53 contextual factors in the HTS implementation [2, 3] are used, relatingto phoneme, syllable, word, phrase, utterance type and position. The predictability power of each contextual factor wasassessed based on its mutual information (MI) with the creaky use decisions. Only 13 contextual factors are found tohave interesting normalised MI values higher than 15%. The contextual factors are closely related with creaky use atthe end of a sentence or a word group.

Examples of contextual factors:• Phoneme

– {preceding,current,succeeding} phoneme– position of current phoneme in current syllable

• Syllable– no. of phonemes at {preceding,current,succeeding} syllable– accent of {preceding,current,succeeding} syllable– stress of {preceding,current,succeeding} syllable– position of current syllable in word

• Word– part of speech of {preceding, current, succeeding} word– number of syllables in {preceding, current, succeeding} word– position of current word in current phrase– number of words {from previous, to next} content word

• Phrase:– number of syllables in {preceding, current, succeeding} phrase

• Utterance:– number of syllables in current utterance

4 Creaky voice prediction methods based on HMMs

Four different creaky voice prediction methods are experimented with. The creaky voice related features are trained alongwith the conventional HMM-based synthesis features, F0 and spectrum:

I. PredictedFeat Thetwo features given by thecreaky detection algorithmare trained in two separa-te streams, after which theprediction is drawn fromthe decision tree used in de-tection method.

II. Binary The bina-ry decision output of thecreaky detection algorithmis trained in continuousstream. The final decisionis made by thresholding thetrained probability with apre-specified value.

III. Binary MSD Thebinary decision output ofthe creaky detection algo-rithm is trained in multi-space probability distribu-tion stream, aligned withF0. The final decision is thestream output.

IV. PosteriorProb Theposterior probability givenby the creaky voice detec-tion algorithm is trained ina continuous stream. The fi-nal decision is made by th-resholding the trained pro-bability.

BDL MV0

10

20

30

40

50

60

70

80

TPR

(%)

BDL MV0

0.5

1

1.5

2

2.5

3

3.5

FPR

(%)

BDL MV0

0.10.20.30.40.50.60.70.80.9

PredictedFeatBinaryBinary MSDPosteriorProbF1

scor

e

5 Results

The methods are tested on US English (BDL) and Finnish (MV) databases.Frame-level metrics, true positive rate (TPR or recall), false positive rate (FPR),and F1 score are evaluated. Across both speakers and all metrics, the Posterior-Prob method gives the best performance.

Database Method Misses FAs Hits

BDL

PredictedFeat 66 24 98Binary 68 19 96

Binary MSD 68 17 96PosteriorProb 40 37 124

MV

PredictedFeat 54 29 109Binary 46 25 117

Binary MSD 47 24 116PosteriorProb 28 39 135

6 Conclusions

Firstly, it has been investigated how how contextual information is related tothe use of creaky voice. Contextual factors linked to speech production prece-ding a silence or a pause appears to be highly relevant, leading to normalisedmutual information values up to 32%. This confirms that vocal fry has a syn-tactic role by making a better delimitation of groups of words and by makingphrase segmentation easier.

In the second experiment, four methods are proposed to predict the useof creaky voice based on HMMs. It is shown that modelling the posteriorprobability given by the detection algorithm leads to the best results acrossall metrics. This technique achieves performance scores comparable to thedetermination rates obtained by the detection method on which it is trained.

References

[1] Kane, J., Drugman, T., Gobl, C., “Improved automatic detec-tion of creak”, Computer Speech and Language, 27(4):1028–1047, 2013.

[2] Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black,A. and Tokuda, K., “The HMM-based speech synthesis sys-tem (HTS) version 2.0”, in Sixth ISCA Workshop on SpeechSynthesis, Aug. 2007, pp. 294–299.

[3] [Online] “HMM-based speech synthesis system”,http://hts.sp.nitech.ac.jp.

Acknowledgements This research was supported by FNRS, the Science Foundation Ireland (FastNet), the Irish Department of Arts, Heritageand the Gaeltacht (ABAIR project), the EU’s FP7 project Simple4All, Academy of Finland, and Aalto University’s MIDE UI-ART project.Contact [email protected], [email protected], [email protected]