Speech Synthesis as a Machine Learning Problem

Keiichi Tokuda
Nagoya Institute of Technology

(tokuda_o-cocosda2010.pdf · 2010. 12. 1.)

Introduction

Rule-based, formant synthesis (~’90s)
– Hand-crafting each phonetic unit by rules

Corpus-based, concatenative synthesis (’90s~)
– Concatenate speech units (waveforms) from a database
  • Single inventory: diphone synthesis
  • Multiple inventory: unit selection synthesis

Corpus-based, statistical parametric synthesis
– Source-filter model + statistical acoustic model
  • Hidden Markov model: HMM-based synthesis

How can we formulate and understand the whole corpus-based speech synthesis process in a unified statistical framework?

Problem of speech synthesis (1/2)

We have a speech database, i.e., a set of texts and corresponding speech waveforms.
Given a text to be synthesized, what is the speech waveform corresponding to the text?

Database:
  W : texts
  X : speech waveforms
Given:
  w : text to be synthesized
Unknown (?):
  x : speech waveform

Problem of speech synthesis (2/2)

Problem of statistical parametric speech synthesis:

$$\hat{o} = \arg\max_{o} P(o \mid l, O, L) = \arg\max_{o} \int P(o \mid l, \lambda)\, P(\lambda \mid O, L)\, d\lambda$$

  o : Synthesis data (speech parameter sequence)
  O : Training data (speech parameter sequence)
  L : Label sequence for training data (pronunciation, stress, POS, pause position, etc.)
  l : Label sequence for synthesis data
  λ : Model parameters (HMM parameters)

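Approximation

Maximum A Posteriori (MAP) & ML estimation

MAP estimation:

$$\hat{\lambda}_{\mathrm{MAP}} = \arg\max_{\lambda} P(\lambda \mid O, L) = \arg\max_{\lambda} P(O \mid L, \lambda)\, P(\lambda)$$

ML estimation:

$$\hat{\lambda}_{\mathrm{ML}} = \arg\max_{\lambda} P(O \mid L, \lambda)$$

Speech parameter generation using estimated parameters:

$$\hat{o} = \arg\max_{o} P(o \mid l, \hat{\lambda})$$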
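The slides do not spell out how the integral over λ on the previous slide turns into the plug-in form used for speech parameter generation. One common reading, stated here as an assumption rather than as the author's derivation, is that the posterior of λ is replaced by a point mass at the estimate:

```latex
P(\lambda \mid O, L) \approx \delta(\lambda - \hat{\lambda})
\quad\Longrightarrow\quad
\int P(o \mid l, \lambda)\, P(\lambda \mid O, L)\, d\lambda
\;\approx\; P(o \mid l, \hat{\lambda})
```

Under this reading, the MAP and ML estimates differ only in whether the prior P(λ) is included when choosing the point estimate.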

HMM-based speech synthesis system

[Figure: block diagram of the system]

Training part:
  SPEECH DATABASE → speech signal → excitation parameter extraction / spectral parameter extraction
  → excitation parameters, spectral parameters and labels → training HMMs
  → context-dependent HMMs & state duration models

Synthesis part:
  TEXT → text analysis → labels → parameter generation from HMMs
  → excitation parameters → excitation generation → excitation
  → excitation and spectral parameters → synthesis filter → SYNTHESIZED SPEECH

[Figure: system overview repeated; this slide covers training of the HMMs from spectral parameters and labels]

Training HMMs:

$$\hat{\lambda} = \arg\max_{\lambda} P(O \mid L, \lambda)$$

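Hidden Markov model (HMM)

[Figure: three-state left-to-right HMM with self-transitions a_11, a_22, a_33, forward transitions a_12, a_23, and output probabilities b_1(o_t), b_2(o_t), b_3(o_t)]

  Observation sequence  o : o_1 o_2 o_3 o_4 o_5 ... o_T
  State sequence        q : 1 1 1 1 2 2 3 3

  a_ij     : state transition probability
  b_q(o_t) : output probability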

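Output probability of HMM

$$P(o \mid \lambda) = \sum_{q} P(o, q \mid \lambda) = \sum_{q} \prod_{t=1}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)$$

[Figure: trellis of states i = 1, 2, 3 against frames t = 1, ..., 8 (= T); one path is annotated with the initial probability π_1, the transitions a_23 and a_33, and the output probability b_3(o_7)]

Model parameters can be estimated by an EM algorithm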
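The sum over all state sequences in the output probability is usually evaluated with the forward algorithm rather than by enumeration. The sketch below is a minimal illustration under assumed inputs: the model values are made up, the output probabilities b_j(o_t) are taken as already evaluated, and the first transition is replaced by an initial probability π. The brute-force function is included only to show that both compute the same quantity.

```python
from itertools import product

import numpy as np

def forward_likelihood(pi, A, B):
    """Return P(o | lambda) via the forward algorithm.

    pi : (N,)   initial state probabilities
    A  : (N, N) state transition probabilities a_ij
    B  : (T, N) output probabilities b_j(o_t), already evaluated per frame
    """
    alpha = pi * B[0]                  # alpha_1(j) = pi_j * b_j(o_1)
    for t in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[t]     # sum_i alpha_{t-1}(i) * a_ij, times b_j(o_t)
    return float(alpha.sum())

def brute_force_likelihood(pi, A, B):
    """Same quantity by summing over every state sequence (tiny cases only)."""
    T, N = B.shape
    total = 0.0
    for q in product(range(N), repeat=T):
        p = pi[q[0]] * B[0, q[0]]
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[t, q[t]]
        total += p
    return total

# Hypothetical 3-state left-to-right model and 4 frames of output probabilities
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1, 0.1],
              [0.5, 0.4, 0.1],
              [0.1, 0.8, 0.3],
              [0.1, 0.2, 0.9]])
likelihood = forward_likelihood(pi, A, B)
```

The forward recursion costs on the order of T·N² operations, whereas enumeration grows as N^T.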


[Figure: system overview repeated; this slide covers parameter generation from the trained HMMs]

Speech parameter generation:

$$\hat{o} = \arg\max_{o} P(o \mid l, \hat{\lambda})$$

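Speech parameter generation algorithm

For a given HMM λ, determine the speech parameter vector sequence

$$o = [o_1^\top, o_2^\top, \ldots, o_T^\top]^\top$$

which maximizes

$$P(o \mid \lambda) = \sum_{q} P(o \mid q, \lambda)\, P(q \mid \lambda) \approx \max_{q} P(o \mid q, \lambda)\, P(q \mid \lambda)$$

⇒

$$\hat{q} = \arg\max_{q} P(q \mid l, \lambda)$$

$$\hat{o} = \arg\max_{o} P(o \mid \hat{q}, \lambda)$$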

Determination of state sequence

[Figure: observation sequence aligned with the state sequence; state durations d = 4, 10, 5]

The state sequence can be determined by the state durations.
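As a small illustration of that statement, a left-to-right state sequence follows directly from the per-state durations. The helper name is hypothetical; the durations 4, 10, 5 are the ones shown in the figure.

```python
import numpy as np

def durations_to_state_sequence(durations):
    """Expand per-state durations into a frame-level state sequence (1-based states)."""
    states = np.arange(1, len(durations) + 1)
    return np.repeat(states, durations)

# Durations taken from the figure: state 1 lasts 4 frames, state 2 lasts 10, state 3 lasts 5
q = durations_to_state_sequence([4, 10, 5])
```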


Hidden Semi Markov Model

HMM (Hidden Markov Model)
– State duration probability depends only on the transition probability
– State duration probability decreases exponentially

HSMM (Hidden Semi Markov Model)
– HMM + explicit duration model ⇒ HSMM
– Introduce a state duration probability P(d | q) (Gaussian)

State durations are given by means of Gaussians.

[Figure: observations O_1, ..., O_9 grouped under states q_1, q_2, q_3 with durations d_1, d_2, d_3]
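The contrast between the two duration models can be made concrete. In a plain HMM, staying in state i for d frames has probability a_ii^(d-1) (1 - a_ii), which always peaks at d = 1, while an explicit Gaussian can peak at any duration. The numbers below (a_ii = 0.9, mean 10, variance 4) are made up for illustration and do not come from the slides.

```python
import numpy as np

def hmm_duration_pmf(a_ii, d):
    """Duration pmf implied by a self-transition probability: a_ii^(d-1) * (1 - a_ii)."""
    d = np.asarray(d, dtype=float)
    return a_ii ** (d - 1) * (1.0 - a_ii)

def gaussian_duration_pdf(mean, var, d):
    """Explicit Gaussian duration density, as in the HSMM description."""
    d = np.asarray(d, dtype=float)
    return np.exp(-0.5 * (d - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

d = np.arange(1, 31)
geometric = hmm_duration_pmf(0.9, d)             # a_ii = 0.9 is an assumed value
gaussian = gaussian_duration_pdf(10.0, 4.0, d)   # mean 10, variance 4 are assumed values
```

With these values the geometric curve decays monotonically from d = 1, whereas the Gaussian is centred on its mean.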

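Speech parameter generation algorithm

For a given HMM λ, determine the speech parameter vector sequence

$$o = [o_1^\top, o_2^\top, \ldots, o_T^\top]^\top$$

which maximizes

$$P(o \mid \lambda) = \sum_{q} P(o \mid q, \lambda)\, P(q \mid \lambda) \approx \max_{q} P(o \mid q, \lambda)\, P(q \mid \lambda)$$

⇒

$$\hat{q} = \arg\max_{q} P(q \mid l, \lambda)$$

$$\hat{o} = \arg\max_{o} P(o \mid \hat{q}, \lambda)$$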

Generated feature sequence

[Figure: generated feature sequence over frames for the phonemes "a" and "i", plotted with the state means and variances]

o becomes a sequence of mean vectors ⇒ discontinuous outputs between states

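Dynamic features

$$\Delta c_t = \frac{\partial c_t}{\partial t} \approx 0.5\,(c_{t+1} - c_{t-1})$$

$$\Delta^2 c_t = \frac{\partial^2 c_t}{\partial t^2} \approx c_{t+1} - 2 c_t + c_{t-1}$$

[Figure: Δc_t is computed from c_{t-1} and c_{t+1} with weights −0.5 and 0.5; Δ²c_t is computed from c_{t-1}, c_t, c_{t+1} with weights 1, −2, 1]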
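The two approximations can be applied frame by frame to a static feature sequence. The following is a minimal sketch; the function name is hypothetical, and repeating the first and last frame at the edges is an assumption of this example, since the slides do not say how edges are treated.

```python
import numpy as np

def dynamic_features(c):
    """Append delta and delta-delta features to a static sequence c of shape (T, D).

    Edge frames are handled by repeating the first and last frame (an assumption).
    """
    c = np.asarray(c, dtype=float)
    padded = np.concatenate([c[:1], c, c[-1:]], axis=0)
    prev, cur, nxt = padded[:-2], padded[1:-1], padded[2:]
    delta = 0.5 * (nxt - prev)             # 0.5 * (c_{t+1} - c_{t-1})
    delta2 = nxt - 2.0 * cur + prev        # c_{t+1} - 2 c_t + c_{t-1}
    return np.concatenate([c, delta, delta2], axis=1)

c = np.array([[0.0], [1.0], [4.0], [9.0], [16.0]])   # toy sequence c_t = t^2
o = dynamic_features(c)
```

For the toy sequence c_t = t², the interior frames give Δc_t = 2t and Δ²c_t = 2, matching the analytic derivatives.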

Integration of dynamic features

Relationship between the speech parameter vector sequence o and the static feature sequence c:

$$o = W c$$
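To make the linear relationship concrete, the sketch below builds W for a one-dimensional static feature so that each frame of o stacks the static, delta and delta-delta values. The function name, the one-dimensional restriction and the zero padding outside the sequence are assumptions made for this example only.

```python
import numpy as np

def build_window_matrix(T):
    """Build W (3T x T) so that o = W c stacks [c_t, delta c_t, delta^2 c_t] per frame.

    Assumes a one-dimensional static feature and zero padding outside frames 1..T.
    """
    windows = (
        (0.0, 1.0, 0.0),     # static
        (-0.5, 0.0, 0.5),    # delta
        (1.0, -2.0, 1.0),    # delta-delta
    )
    W = np.zeros((3 * T, T))
    for t in range(T):
        for k, window in enumerate(windows):
            for offset, weight in zip((-1, 0, 1), window):
                if 0 <= t + offset < T:
                    W[3 * t + k, t + offset] = weight
    return W

W = build_window_matrix(5)
c = np.array([0.0, 1.0, 4.0, 9.0, 16.0])   # toy static sequence
o = W @ c
```

Because W is sparse and banded, practical implementations typically exploit that structure instead of storing it densely as done here.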

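Solution for the problem (1/2)

By setting

$$\frac{\partial \log P(W c \mid \hat{q}, \lambda)}{\partial c} = 0$$

we obtain

$$W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$$

where

$$c = [c_1^\top, c_2^\top, \ldots, c_T^\top]^\top$$

$$\mu_{\hat{q}} = [\mu_{\hat{q}_1}^\top, \mu_{\hat{q}_2}^\top, \ldots, \mu_{\hat{q}_T}^\top]^\top$$

$$\Sigma_{\hat{q}} = \mathrm{diag}[\Sigma_{\hat{q}_1}, \Sigma_{\hat{q}_2}, \ldots, \Sigma_{\hat{q}_T}]$$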
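The linear system can be solved directly for small problems. The sketch below assumes a one-dimensional static feature, diagonal covariances, zero padding at the sequence edges and a dense solver; all names and the toy means and variances are hypothetical. It is an illustration of the equation on this slide, not the author's implementation.

```python
import numpy as np

def build_window_matrix(T):
    """W (3T x T) for a 1-D static feature: static, delta, delta-delta rows per frame."""
    windows = ((0.0, 1.0, 0.0), (-0.5, 0.0, 0.5), (1.0, -2.0, 1.0))
    W = np.zeros((3 * T, T))
    for t in range(T):
        for k, window in enumerate(windows):
            for offset, weight in zip((-1, 0, 1), window):
                if 0 <= t + offset < T:       # zero padding at the edges (assumption)
                    W[3 * t + k, t + offset] = weight
    return W

def generate_trajectory(mu, var):
    """Solve W' S^-1 W c = W' S^-1 mu for c.

    mu, var : (T, 3) per-frame means and variances of [static, delta, delta-delta]
    """
    T = mu.shape[0]
    W = build_window_matrix(T)
    precision = 1.0 / var.reshape(-1)          # diagonal of Sigma^-1
    lhs = W.T @ (precision[:, None] * W)
    rhs = W.T @ (precision * mu.reshape(-1))
    return np.linalg.solve(lhs, rhs)

# Toy example: two states of 5 frames each, static means 0 and 1, zero-mean dynamics
static_mean = np.repeat([0.0, 1.0], 5)
mu = np.stack([static_mean, np.zeros(10), np.zeros(10)], axis=1)
var = np.tile([1.0, 0.1, 0.1], (10, 1))
c = generate_trajectory(mu, var)
```

When the dynamic-feature variances are made very large, the solution falls back to the piecewise-constant sequence of static means, which is the discontinuous output described on the "Generated feature sequence" slide.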

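Generated speech parameter trajectory

[Figure: generated trajectories of the static feature c and the dynamic feature Δc over frames for the phonemes "a" and "i", plotted with the state means and variances]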

Generated spectra

[Figure: spectra (0 to 5 kHz) generated for "sil a i sil", without dynamic features (w/o dyn) and with dynamic features (w/ dyn)]

Spectra change smoothly between phonemes

Trajectory HMM

Solve inconsistency between training & synthesis

              Conventional HMM                            Trajectory HMM
  Training    $P(O \mid L, \lambda)$                      $P(C \mid L, \lambda)$
  Synthesis   $P(o \mid l, \hat{\lambda})\,|_{o = Wc}$    $P(c \mid l, \hat{\lambda})$

$P(o \mid l, \lambda) = P(Wc \mid l, \lambda)$ is not a proper distribution of c

⇒ improving the model accuracy


[Figure: system overview repeated (training part and synthesis part)]

ML-estimation of mel-cepstrum

Mel-cepstral representation of speech spectra

$$H(z) = \exp \sum_{m=0}^{M} c(m)\, \tilde{z}^{-m}$$

$$\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}} = e^{-j\tilde{\omega}}$$

ML estimation of spectral parameter

$$\hat{c} = \arg\max_{c} p(x \mid c)$$

  x : speech waveform (Gaussian process)
  c : mel-cepstrum

[Figure: warped frequency $\tilde{\omega}$ (rad) against frequency ω (rad), both from 0 to π, comparing the warped frequency with the mel-scale frequency]
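The warped frequency plotted in the figure is the phase response of the first-order all-pass function. The sketch below evaluates it in closed form; the function name is hypothetical, and α = 0.42 is an assumed value chosen for illustration, since the slides do not state which α is used.

```python
import numpy as np

def warped_frequency(omega, alpha):
    """Phase response of the first-order all-pass (z^-1 - alpha) / (1 - alpha z^-1)."""
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega))
    )

alpha = 0.42                              # assumed value, not given in the slides
omega = np.linspace(0.0, np.pi, 9)        # frequencies from 0 to pi
omega_tilde = warped_frequency(omega, alpha)
```

For positive α the mapping keeps the end points 0 and π fixed and stretches the low-frequency region, which is what lets it approximate a mel-like scale.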

Overview of speech vocoding

[Figure: the original speech is analysed into mel-cepstrum, voiced/unvoiced decision and F0; an excitation e(n), a pulse train or white noise, drives the synthesis filter H(z) to produce the speech x(n)]

These speech parameters are modeled by HMMs
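The excitation side of this source-filter picture can be sketched as follows: a pulse train at the pitch period for voiced frames and white noise for unvoiced frames. The function name, the 16 kHz sampling rate, the 80-sample frame shift and the pulse scaling are all assumptions of this example; the synthesis filter H(z) itself is not implemented here.

```python
import numpy as np

def make_excitation(f0, sample_rate=16000, frame_shift=80, seed=0):
    """Generate e(n): pulse train for voiced frames (f0 > 0), white noise otherwise.

    f0 : per-frame fundamental frequency in Hz, with 0 marking an unvoiced frame
    """
    rng = np.random.default_rng(seed)
    e = np.zeros(len(f0) * frame_shift)
    next_pulse = 0.0                       # sample position of the next pitch pulse
    for i, hz in enumerate(f0):
        start, end = i * frame_shift, (i + 1) * frame_shift
        if hz > 0:
            period = sample_rate / hz
            next_pulse = max(next_pulse, start)
            while next_pulse < end:
                e[int(next_pulse)] = np.sqrt(period)   # roughly unit average power
                next_pulse += period
        else:
            e[start:end] = rng.standard_normal(frame_shift)
    return e

# Two unvoiced frames, three voiced frames at 100 Hz, one unvoiced frame
e = make_excitation([0, 0, 100, 100, 100, 0])
```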


[Figure: system overview repeated (training part and synthesis part)]

Observation of F0

[Figure: log frequency against time]

  voiced   : Ω_1 = R^1
  unvoiced : Ω_2 = R^0

Unable to model by continuous or discrete distributions
⇒ Multi-space distribution HMM (MSD-HMM)

Structure of MSD-HMM

[Figure: three-state HMM in which each state i = 1, 2, 3 has, for each space Ω_g = R^{n_g} (g = 1, 2, 3), a space weight w_{i,g} and a pdf N_{ig}(x)]

MSD-HMM for F0 modeling

[Figure: three-state HMM for F0; each state has voiced/unvoiced weights, with a one-dimensional space R^1 for voiced observations and a zero-dimensional space R^0 for unvoiced observations]
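For the two-space F0 case, the output probability of a state reduces to a simple rule: an unvoiced frame contributes only the unvoiced weight, and a voiced frame contributes the voiced weight times a one-dimensional density. The sketch below assumes a single Gaussian for the voiced space and uses hypothetical names and values.

```python
import math

def msd_f0_output_prob(obs, w_voiced, mean, var):
    """Output probability of one MSD state for an F0 observation.

    obs : log F0 value for a voiced frame, or None for an unvoiced frame
    """
    if obs is None:
        return 1.0 - w_voiced             # R^0: only the space weight remains
    gauss = math.exp(-0.5 * (obs - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
    return w_voiced * gauss               # R^1: weight times Gaussian density

# Made-up state: 80% voiced weight, log F0 mean 5.0, variance 0.04
p_unvoiced = msd_f0_output_prob(None, 0.8, 5.0, 0.04)
p_voiced = msd_f0_output_prob(5.0, 0.8, 5.0, 0.04)
```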

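Generated F0

[Figure: generated F0 contour]

Speech samples

[Audio samples: combinations of mel-cepstrum generated w/ and w/o dynamic features and log F0 generated w/ and w/o dynamic features]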

Inclusion of speech analysis & waveform reconstruction

Problem of statistical parametric speech synthesis

$$\hat{x} = \arg\max_{x} P(x \mid l, X, L) = \arg\max_{x} \sum_{c} \sum_{C} \int P(x \mid c)\, P(c \mid l, \lambda)\, P(\lambda \mid C, L)\, P(C \mid X)\, d\lambda$$

  P(x | c)    : Waveform reconstruction
  P(c | l, λ) : Speech parameter generation
  P(λ | C, L) : HMM parameter estimation
  P(C | X)    : Speech analysis

c and C consist of mel-cepstrum & F0 parameters

Context clustering factors
• Preceding, succeeding two phonemes
• Current phoneme
• Position of current phoneme in current syllable
• # of phonemes at preceding, current, succeeding syllable
• Accent, stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• # of preceding, succeeding accented, stressed syllables in current phrase
• # of syllables from previous, to next accented, stressed syllable
• Vowel within current syllable
• Guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• # of preceding, succeeding content words in current phrase
• # of words from previous, to next content word
• # of syllables in preceding, current, succeeding phrase
• …

Vast # of combinations ⇒ Difficult to have all possible models

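Decision tree-based state clustering

[Figure: binary decision tree with yes/no questions on the context, e.g. R=silence?, L=“gy”?, L=voice?, L=“w”?; context-dependent models such as k-a+b, t-a+h, w-a+t, w-a+sil, w-a+sh, gy-a+sil, gy-a+pau, g-a+pau are assigned to leaf nodes, and the states to be synthesized are taken from the leaf nodes]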
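Looking up a context-dependent label in such a tree amounts to answering yes/no questions until a leaf is reached. The sketch below uses a small hypothetical tree, not the one in the figure; the leaf names, the question set, and the choice to treat both "sil" and "pau" as silence are assumptions made for illustration.

```python
def parse_triphone(label):
    """Split a label of the form 'L-C+R' into (left, centre, right)."""
    left, rest = label.split("-")
    centre, right = rest.split("+")
    return left, centre, right

# Hypothetical questions; treating "pau" as silence is an assumption
QUESTIONS = {
    "R=silence?": lambda left, centre, right: right in {"sil", "pau"},
    'L="gy"?': lambda left, centre, right: left == "gy",
    'L="w"?': lambda left, centre, right: left == "w",
}

# Hypothetical tree: (question, yes-branch, no-branch); strings are leaf names
TREE = (
    "R=silence?",
    ('L="gy"?', "leaf_1", "leaf_2"),
    ('L="w"?', "leaf_3", "leaf_4"),
)

def find_leaf(label, node=TREE):
    """Walk the tree from the root until a leaf (a string) is reached."""
    while not isinstance(node, str):
        question, yes_branch, no_branch = node
        answer = QUESTIONS[question](*parse_triphone(label))
        node = yes_branch if answer else no_branch
    return node

shared = find_leaf("gy-a+sil") == find_leaf("gy-a+pau")
```

Labels that reach the same leaf share one set of state parameters, so an unseen context still receives a model as long as the questions can be answered for it.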

Stream-dependent tree-based clustering

[Figure: the HMM has separate decision trees for mel-cepstrum and for F0; the state duration model has its own decision tree for state duration models]

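Inclusion of model structure parameter

$$\hat{o} = \arg\max_{o} P(o \mid l, O, L) = \arg\max_{o} \sum_{m} \int P(o \mid l, \lambda, m)\, P(\lambda \mid O, L, m)\, P(m \mid O, L)\, d\lambda$$

  P(o | l, λ, m) : Coefficient generation
  P(λ | O, L, m) : Posterior of model parameters
  P(m | O, L)    : Posterior of model structure

Usually a fixed m is used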


[Figure: system overview repeated; this slide covers text analysis in the training part and in the synthesis part]

Text analysis (training part):

$$\hat{L} = \arg\max_{L} P(L \mid W, \Lambda)$$

Text analysis (synthesis part):

$$\hat{l} = \arg\max_{l} P(l \mid w, \Lambda)$$

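Inclusion of text analysis

$$\hat{o} = \arg\max_{o} P(o \mid w, O, W) = \arg\max_{o} \sum_{l} \sum_{L} \iint P(o \mid l, \lambda)\, P(l \mid w, \Lambda)\, P(\lambda \mid O, L)\, P(L \mid W, \Lambda)\, P(\Lambda)\, d\lambda\, d\Lambda$$

  P(o | l, λ) : Speech parameter generation
  P(l | w, Λ) : Text processing
  P(λ | O, L) : HMM parameter estimation
  P(L | W, Λ) : Text processing
  P(Λ)        : Prior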


[Figure: system overview repeated (training part and synthesis part)]

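Inclusion of all components

$$
\begin{aligned}
\hat{x} &= \arg\max_{x} P(x \mid w, X, W) \\
&= \arg\max_{x} \sum_{c, C} \sum_{l, L} \sum_{m} \iint P(x \mid c)\, P(c \mid l, \lambda, m)\, P(l \mid w, \Lambda) \\
&\qquad \times P(\lambda \mid C, L, m)\, P(m \mid C, L) \\
&\qquad \times P(L \mid W, \Lambda)\, P(C \mid X)\, P(\Lambda)\, d\lambda\, d\Lambda
\end{aligned}
$$

  P(x | c)       : Waveform generation
  P(c | l, λ, m) : Parameter generation
  P(l | w, Λ)    : Text processing
  P(λ | C, L, m) : Posterior of model parameter
  P(m | C, L)    : Posterior of model structure
  P(L | W, Λ)    : Text processing
  P(C | X)       : Speech analysis
  P(Λ)           : Prior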

Emotional Speech Synthesis

[Audio samples: each text is synthesized in a neutral and an angry style]

• 「授業中に携帯いじってんじゃねえよ！電源切っとけ！」
  “Don’t touch your cell phone during class! Turn it off!”
• 「ミーティングには毎週参加しなさい！」
  “You must attend the weekly meeting!”

trained with 200 utterances

Speaker Adaptation (mimicking voices)

[Figure: a speaker-independent model and adaptation data are combined by MLLR-based adaptation to give the adapted model]

[Audio samples:]
• w/o adaptation (initial model)
• Adapted with 4 utterances
• Adapted with 50 utterances
• Speaker-dependent model

Speaker Interpolation (mixing voices)

Linear combination of two speaker-dependent models

[Figure: Model A and Model B combined into an interpolated model]

  Interpolation weights
  A:  1.00  0.75  0.50  0.25  0.00
  B:  0.00  0.25  0.50  0.75  1.00
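A linear combination with the weight pairs listed above can be illustrated on mean vectors. The slides do not say which model parameters are combined or how, so interpolating only the means, as done here, is an assumption; the function name and the mean values are made up.

```python
import numpy as np

def interpolate_models(mean_a, mean_b, weight_a):
    """Linearly combine two mean vectors with weights (weight_a, 1 - weight_a)."""
    return weight_a * np.asarray(mean_a) + (1.0 - weight_a) * np.asarray(mean_b)

mean_a = np.array([1.0, 2.0])     # made-up mean vector of model A
mean_b = np.array([3.0, 6.0])     # made-up mean vector of model B

# Weight pairs (A, B) from the slide: (1.00, 0.00), (0.75, 0.25), ..., (0.00, 1.00)
mixes = {w: interpolate_models(mean_a, mean_b, w) for w in (1.00, 0.75, 0.50, 0.25, 0.00)}
```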

Voice Morphing

Two voices: A B

Four voices: A B

Interpolation of Speaking Styles

[Figure: interpolation and extrapolation between Base model A and Base model B, on an axis from Neutral to High Tension]

Multilingual Speech Synthesis
• Japanese
• American English
• Chinese (Mandarin) (by ATR)
• Brazilian Portuguese (by Nitech and UFRJ)
• European Portuguese (by Nitech, Univ of Porto, and UFRJ)
• Slovenian (by Bostjan Vesnicer, University of Ljubljana, Slovenia)
• Swedish (by Anders Lundgren, KTH, Sweden)
• German (by University of Bonn and Nitech)
• Korean (by Sang-Jin Kim, ETRI, Korea)
• Hungarian (by Budapest University of Technology and Economics)
• Finnish (by TKK, Finland)
• Baby English (by Univ of Edinburgh, UK)
• Polish, Slovak, Finnish, Arabic, Farsi, Polyglot, etc.

Singing Voice Synthesis

[Figure: HMMs are trained from a singing voice database; given a musical score, the trained HMMs generate a singing voice for any piece of music]

[Audio samples: male and female voices]

Conclusion

• The whole speech synthesis process is described within a statistical framework
• It gives a unified view and reveals what is correct and what is wrong
• Importance of the database

Future work

• We still have many problems that should be solved:
  • Speech waveform modeling
  • Combination with the text processing part
