02/07/22 HMM-based speech synthesis: the new generation of artificial voices Thomas Drugman [email protected]

HMM-based speech synthesis: the new generation of artificial voices

Page 1: HMM-based speech synthesis: the new generation of artificial voices

20/04/23

HMM-based speech synthesis: the new generation of artificial voices

Thomas Drugman

Page 2

TCTS Lab

« Laboratoire de Théorie des Circuits et de Traitement du Signal » (Circuit Theory and Signal Processing Lab)

25 people: 3 Profs, 10 PhD Students

Audio & Speech

Image & Video

Numerical Arts

Page 3

Content

• Speech synthesis: history

• HMM-based speech synthesis

• Parametric modeling of speech

• Statistical generation

• Conclusions

Page 5

Speech Synthesis

« Hello » → Text-to-speech system

GOAL:

Read aloud any previously unseen text typed by the user

Page 6

Challenges

Naturalness

Intelligibility

Cost-effectiveness

Expressivity

Page 7

Challenge 3: Cost-effectiveness

Industry expects Intelligibility + Naturalness + …

• Small footprint: a few megabytes
• Small CPU requirements (embedded market)
• Easy extension to other languages
• Possibility to create new voices as fast as possible
    • Through an automatic recording/segmentation process
    • Through efficient voice conversion
• Possibility to bootstrap an existing TTS voice into any voice

Page 8

Challenge 4 (new): Expressivity

= “Emotional speech synthesis” (art!)

1. Being able to render an expressive voice
    • In terms of prosody
    • In terms of voice quality
2. Knowing when to do it (yet unsolved)

Today’s holy grail for the industry
• Strategic advantage for whoever gets it first
• New markets (ebooks?)

Page 9

Methods for Speech Synthesis

Expert-based (rule-based) approach

Corpus-based approach

• Diphone concatenation

• Unit Selection

• Statistical parametric synthesis (“HMM-based synthesis”)

Page 10

Prof. Thierry Dutoit

Von Kempelen’s talking machine (1791)

[Figure: the talking machine, with labeled parts: mouth, nostrils, main bellows, small bellows, 'S' pipe, 'Sh' pipe, 'S' lever, 'Sh' lever]

Page 11

Homer Dudley’s Voder (Bell Labs, 1936)

[Figure: the Voder, with labeled parts: noise source, oscillator, resonance control, amplifier, Voder console keyboard (keys 1 to 10), "Quiet" key, stop keys (t-d, p-b, k-g), energy switch, wrist bar, pitch-control pedal, UV / V selection]

Page 12

And other developments in articulatory synthesis

Work by: K. Stevens, G. Fant, P. Mermelstein, R. Carré (GNUSpeech), S. Maeda, J. Schroeter & M. Sondhi…

More recently: O. Engwall, S. Fels (ArtiSynth), Birkholz and Kröger, A. Alwan & S. Narayanan (MRI)…

Page 13

Rule-based synthesis

Intelligibility Naturalness Mem/CPU/Voices Expressivity

Page 14

Methods for Speech Synthesis

Expert-based (rule-based) approach

Corpus-based approach

• Diphone concatenation

• Unit Selection

• Statistical parametric synthesis (“HMM-based synthesis”)

Page 15

Diphone concatenation

Intelligibility Naturalness ~ Mem/CPU/Voices Expressivity

Page 16

Unit selection

Intelligibility Naturalness Mem/CPU/Voices ~ Expressivity ~

Page 17

Content

• Speech synthesis: history

• HMM-based speech synthesis

• Parametric modeling of speech

• Statistical generation

• Conclusions

Page 18

Statistical Parametric Speech Synthesis

[Block diagram of the SPS synthesizer]
TRAINING: Database → Speech Analysis → Speech Parameters → Statistical Modeling
SYNTHESIS: « Hello! » (text) → Statistical Generation → Speech Parameters → Speech Processing → "Hello!" (speech)

Page 19

HMM-based speech synthesis

Intelligibility Naturalness ? Mem/CPU/Voices Expressivity ?

http://hts.sp.nitech.ac.jp/

Page 20

TRAINING OF THE HMM-BASED SYNTHESIZER

Page 21

Parameter extraction

Page 22

Parameter extraction

[Diagram: pulse train or white noise → Filter → synthetic speech]
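The source-filter scheme above (a pulse train or white noise driving a filter) can be sketched in a few lines. This is only an illustration: the function names, the 16 kHz sample rate, and the single 700 Hz resonance used as a toy filter are assumptions, not values taken from the slides.

```python
import numpy as np

def excitation(n_samples, f0_hz, sample_rate=16000, seed=0):
    """Pulse train when f0_hz > 0 (voiced), white noise otherwise (unvoiced)."""
    if f0_hz > 0:
        period = int(round(sample_rate / f0_hz))
        e = np.zeros(n_samples)
        e[::period] = 1.0  # one unit pulse per pitch period
        return e
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n_samples)

def all_pole_filter(e, a):
    """Filter e through 1 / (1 + a[0] z^-1 + a[1] z^-2 + ...)."""
    y = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y

# Toy "vocal tract": one stable resonance near 700 Hz (hypothetical values).
sample_rate = 16000
r, freq = 0.97, 700.0
a = [-2.0 * r * np.cos(2.0 * np.pi * freq / sample_rate), r * r]

voiced = all_pole_filter(excitation(1600, 120.0, sample_rate), a)
unvoiced = all_pole_filter(excitation(1600, 0.0, sample_rate), a)
```

In a real vocoder the filter coefficients and the pitch would change frame by frame; here they are held constant to keep the sketch short.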

Page 23

Labels

Page 24

Labels

Labels consist of a description of the phonetic environment

Contextual factors:

- Phone identity
- Syntactic factors
- Stress-related factors
- Locational factors, …

Page 25

Labels

Example

Page 26

HMM training

Page 27

System architecture

Contextual factors may affect duration, source and filter differently

Context-oriented clustering using decision trees

Page 28

System architecture

[Diagram: HMM for source and filter, with decision trees for filter and decision trees for source; state duration model, with a decision tree for state duration]

Page 29

Training decision trees

An exhaustive list of possible questions is first drawn up

Example:

QS "LL-Nasal" {m^*,n^*,en^*,ng^*}
QS "LL-Fricative" {ch^*,dh^*,f^*,hh^*,hv^*,s^*,sh^*,th^*,v^*,z^*,zh^*}
QS "LL-Liquid" {el^*,hh^*,l^*,r^*,w^*,y^*}
QS "LL-Front" {ae^*,b^*,eh^*,em^*,f^*,ih^*,ix^*,iy^*,m^*,p^*,v^*,w^*}
QS "LL-Central" {ah^*,ao^*,axr^*,d^*,dh^*,dx^*,el^*,en^*,er^*,l^*,n^*,r^*,s^*,t^*,th^*,z^*,zh^*}
QS "LL-Back" {aa^*,ax^*,ch^*,g^*,hh^*,jh^*,k^*,ng^*,ow^*,sh^*,uh^*,uw^*,y^*}
QS "LL-Front_Vowel" {ae^*,eh^*,ey^*,ih^*,iy^*}
QS "LL-Central_Vowel" {aa^*,ah^*,ao^*,axr^*,er^*}
QS "LL-Back_Vowel" {ax^*,ow^*,uh^*,uw^*}
QS "LL-Long_Vowel" {ao^*,aw^*,el^*,em^*,en^*,en^*,iy^*,ow^*,uw^*}
QS "LL-Short_Vowel" {aa^*,ah^*,ax^*,ay^*,eh^*,ey^*,ih^*,ix^*,oy^*,uh^*}
QS "LL-Dipthong_Vowel" {aw^*,axr^*,ay^*,el^*,em^*,en^*,er^*,ey^*,oy^*}
QS "LL-Front_Start_Vowel" {aw^*,axr^*,er^*,ey^*}

Total: about 1500 questions
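Each question is a set of wildcard patterns tested against a context label; the answer is "yes" if any pattern matches. A minimal sketch of that test, assuming a simplified quinphone label of the form `LL^L-C+R=RR` (the label string and the `answer` helper below are hypothetical, and only two of the questions are copied):

```python
from fnmatch import fnmatchcase

# Two questions copied from the list above; "LL" refers to the phone
# two positions to the left of the current one.
QUESTIONS = {
    "LL-Nasal": ["m^*", "n^*", "en^*", "ng^*"],
    "LL-Back_Vowel": ["ax^*", "ow^*", "uh^*", "uw^*"],
}

def answer(question, label):
    """Return True if any pattern of the question matches the whole label."""
    return any(fnmatchcase(label, pattern) for pattern in QUESTIONS[question])

# Hypothetical label: left-left phone "n", left "ax", current "t",
# right "ih", right-right "s".
label = "n^ax-t+ih=s"

is_nasal = answer("LL-Nasal", label)            # True: "n^*" matches
is_back_vowel = answer("LL-Back_Vowel", label)  # False: label does not start with "ax^"
```

Because the patterns are anchored at the start of the label, `n^*` does not accidentally match a label beginning with `en^`; that case is covered by its own pattern.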

Page 30

Training decision trees

Decision trees are trained using a Maximum Likelihood criterion

Example :
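As an independent illustration of the maximum-likelihood criterion (not the example shown on the slide), the sketch below picks, among candidate questions, the one whose yes/no split most increases the log-likelihood of the data when each node is modeled by a single diagonal Gaussian. Working on raw frames rather than state-occupancy statistics, and the names `gaussian_loglik` and `best_question`, are simplifying assumptions.

```python
import numpy as np

def gaussian_loglik(x):
    """Log-likelihood of the rows of x under their own ML diagonal Gaussian."""
    n = x.shape[0]
    var = x.var(axis=0) + 1e-8  # small floor to avoid log(0)
    return -0.5 * n * np.sum(np.log(2.0 * np.pi * var) + 1.0)

def best_question(frames, answers):
    """answers maps a question name to a boolean array (True = "yes").

    Returns (name, gain) for the split with the largest likelihood gain.
    """
    parent = gaussian_loglik(frames)
    best_name, best_gain = None, 0.0
    for name, yes in answers.items():
        if yes.sum() < 2 or (~yes).sum() < 2:
            continue  # not enough data on one side of the split
        gain = gaussian_loglik(frames[yes]) + gaussian_loglik(frames[~yes]) - parent
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain

# Toy data: two well-separated groups of frames (hypothetical values).
rng = np.random.default_rng(0)
frames = np.concatenate([rng.normal(0.0, 1.0, (50, 2)),
                         rng.normal(5.0, 1.0, (50, 2))])
answers = {
    "LL-Nasal": np.arange(100) < 50,       # separates the two groups
    "LL-Front": np.arange(100) % 2 == 0,   # unrelated to the groups
}
name, gain = best_question(frames, answers)
```

Applying this choice recursively to each child node, until the gain falls below some threshold, grows the tree.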

Page 31

Emission likelihood and training

Finally, each leaf is modeled by a Gaussian Mixture Model (GMM)

Training is guided by the Viterbi and Baum-Welch re-estimation algorithms
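For reference, the emission log-likelihood of one observation vector under a leaf's GMM can be computed as below. This is a minimal sketch assuming diagonal covariances (an assumption; the slides do not state the covariance structure), with a hypothetical function name.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """log p(x) under a diagonal-covariance GMM.

    x: shape (D,); weights: shape (M,); means, variances: shape (M, D).
    """
    log_comp = (
        np.log(weights)
        - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
        - 0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    )
    m = log_comp.max()  # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Two-component toy model in two dimensions (hypothetical values).
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [0.5, 0.5]])
ll = gmm_loglik(np.array([0.2, -0.1]), weights, means, variances)
```

Baum-Welch re-estimation uses these per-state likelihoods to compute state occupancies, which in turn update the weights, means and variances.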

Page 32

SYNTHESIS BY THE HMM-BASED SYNTHESIZER

Page 33

Text analysis

Page 34

Parameter generation

Page 35

Parameter generation

Given the sequence of labels, durations are determined by maximizing the state sequence likelihood

The trajectory through the context-dependent HMM states is then known
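A minimal sketch of the duration step, assuming each state's duration is modeled by a single Gaussian (an assumption; the slides do not specify the duration density). Under that assumption the most likely duration of each state is its mean, and an optional total length can be imposed by shifting each duration in proportion to its variance. The function name and the numbers are hypothetical.

```python
import numpy as np

def state_durations(means, variances, total_frames=None):
    """Most likely number of frames per state under Gaussian duration models.

    If total_frames is given, durations are shifted by rho * variance so that
    they sum (before rounding) to total_frames.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if total_frames is None:
        rho = 0.0
    else:
        rho = (total_frames - means.sum()) / variances.sum()
    durations = means + rho * variances
    return np.maximum(1, np.rint(durations)).astype(int)  # at least one frame each

# Three states with hypothetical duration statistics (in frames).
natural = state_durations([3.2, 5.6, 4.1], [1.0, 4.0, 1.0])
stretched = state_durations([3.2, 5.6, 4.1], [1.0, 4.0, 1.0], total_frames=19)
```

States with a larger duration variance absorb more of the stretching, which matches the intuition that they are the most elastic parts of the utterance.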

Page 36

Parameter generation

Using this trajectory, source and filter parameters are generated by maximizing the output probability

Taking dynamic features into account makes the parameter evolution smoother and more realistic
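The effect of dynamic features can be illustrated with a one-dimensional sketch of maximum-likelihood parameter generation. Each frame has a Gaussian on the static value c_t and on its delta, and the most likely static trajectory solves the linear system (WᵀΣ⁻¹W) c = WᵀΣ⁻¹μ, where W stacks the static and delta windows. The delta window 0.5·(c_{t+1} − c_{t−1}), the edge handling, and the function name `mlpg` are assumptions made for this example.

```python
import numpy as np

def mlpg(mu_static, var_static, mu_delta, var_delta):
    """Most likely static trajectory given per-frame static and delta Gaussians."""
    mu_static = np.asarray(mu_static, dtype=float)
    T = len(mu_static)
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row: c_t
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[2 * t + 1, hi] += 0.5                 # delta row: 0.5 * (c_hi - c_lo)
        W[2 * t + 1, lo] -= 0.5
    mu = np.empty(2 * T)
    mu[0::2], mu[1::2] = mu_static, mu_delta
    prec = np.empty(2 * T)                      # diagonal of the inverse covariance
    prec[0::2], prec[1::2] = 1.0 / np.asarray(var_static), 1.0 / np.asarray(var_delta)
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)

# State means form a step; the delta means say "no change" with high confidence.
step = np.array([0.0] * 5 + [1.0] * 5)
ones, zeros = np.ones(10), np.zeros(10)
smooth = mlpg(step, ones, zeros, 0.01 * ones)
```

Without the delta stream the output would simply be the piecewise-constant sequence of state means; with it, the trajectory moves gradually between states.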

Page 37

Speech synthesizers comparison

Page 38

Speech synthesizers comparison

[Chart: quality versus footprint for three synthesizers: HTS (<1 MB), diphone concatenation (5 MB), unit selection (200 MB)]

Page 39

Content

• Speech synthesis: history

• HMM-based speech synthesis

• Parametric modeling of speech

• Statistical generation

• Conclusions

Page 40

Problem positioning

Parametric speech synthesizers generally suffer from a typical buzziness, as encountered in LPC-like vocoders

Source–Filter approach:

[Diagram: pulse train or white noise → Filter → synthetic speech]

Enhance the excitation signal

Page 41

Proposed solution

[Figure: SOURCE and FILTER]

T. Drugman, G. Wilfart, T. Dutoit, « A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis », Interspeech 2009

Page 42

Results

Traditional:

Proposed:

Page 43

Content

• Speech synthesis: history

• HMM-based speech synthesis

• Parametric modeling of speech

• Statistical generation

• Conclusions

Page 44

Problem of oversmoothing

Page 45

Compensation of oversmoothing

Page 46

Global Variance

Page 47

Global Variance

Page 48

Results

Page 49

Content

• Speech synthesis: history

• HMM-based speech synthesis

• Parametric modeling of speech

• Statistical generation

• Conclusions

Page 50

Speech synthesizers comparison

Rule-based synthesis: Intelligibility, Naturalness, Mem/CPU/Voices, Expressivity
Diphone concatenation: Intelligibility, Naturalness ~, Mem/CPU/Voices, Expressivity
Unit selection: Intelligibility, Naturalness, Mem/CPU/Voices ~, Expressivity ~
HMM-based speech synthesis: Intelligibility, Naturalness ?, Mem/CPU/Voices, Expressivity ?

Page 51

Speech synthesizers comparison

[Chart: quality versus footprint for three synthesizers: HTS (<1 MB), diphone concatenation (5 MB), unit selection (200 MB)]

Page 52

Future Work

Voice Conversion

Expressive/emotional synthesis

Better parametric representation

Real-time speech synthesis

Page 53

Questions ?