Statistical Speech Synthesis Heiga ZEN Toshiba Research Europe Ltd. Cambridge Research Laboratory Speech Synthesis Seminar Series @ CUED, Cambridge, UK January 11th, 2011


Page 1: Statistical Speech Synthesis

Statistical Speech Synthesis

Heiga ZEN

Toshiba Research Europe Ltd.

Cambridge Research Laboratory

Speech Synthesis Seminar Series @ CUED, Cambridge, UK

January 11th, 2011

Page 2: Statistical Speech Synthesis

Text-to-speech synthesis (TTS)

Text (seq of discrete symbols) Speech (continuous time series)

Automatic speech recognition (ASR)

Speech (continuous time series) Text (seq of discrete symbols)

Machine Translation (MT)

Text (seq of discrete symbols) Text (seq of discrete symbols)

2

Text-to-speech as a mapping problem

(Figure: e.g., MT maps the text “Dobré ráno” to the text “Good morning”; TTS maps the text “Good morning” to spoken “Good morning”; ASR maps it back to text.)

Page 3: Statistical Speech Synthesis

Speech production process

(Figure: text (concept) → speech. Air flow drives a sound source — voiced: pulse, unvoiced: noise — with a fundamental frequency; the vocal tract imposes frequency transfer characteristics (magnitude, start–end). Speech is the modulation of a carrier wave by speech information: fundamental freq., voiced/unvoiced, freq. transfer char.)

Page 4: Statistical Speech Synthesis

Rule-based, formant synthesis (~’90s)

– Based on parametric representation of speech

– Hand-crafted rules to control phonetic units

DECtalk (or KlattTalk / MITTalk) [Klatt;‘82]

4

Speech synthesis methods (1)

Block diagram of KlattTalk

Page 5: Statistical Speech Synthesis

Corpus-based, concatenative synthesis (’90s~)

– Concatenate small speech units (e.g., phone) from a database

– Large data + automatic learning → high-quality synthetic voices

Single inventory: diphone synthesis [Moulines;’90]

Multiple inventory: unit selection synthesis [Sagisaka;’92, Black;’96]

5

Speech synthesis methods (2)

Page 6: Statistical Speech Synthesis

Corpus-based, statistical parametric synthesis (mid ’90s~)

– Large data + automatic training → automatic voice building

– Source-filter model + statistical modeling → flexible to change voice characteristics

Hidden Markov models (HMMs) as its statistical acoustic model

HMM-based speech synthesis (HTS) [Yoshimura;‘02]

6

Speech synthesis methods (3)

(Pipeline: Feature extraction → Model training → Parameter generation → Waveform generation)

Page 7: Statistical Speech Synthesis

7

Popularity of statistical speech synthesis

(Figure: bar chart of the number of statistical-speech-synthesis-related papers in ICASSP, 1995–2009; y-axis 0–20 papers.)

Page 8: Statistical Speech Synthesis

Statistical speech synthesis is getting popular, but…

not many researchers fully understand how it works

Formulate & understand the whole corpus-based speech

synthesis process in a unified statistical framework

Aim of this talk

Page 9: Statistical Speech Synthesis

Outline

9

HMM-based speech synthesis

– Overview

– Implementation of individual components

Bayesian framework for speech synthesis

– Formulation

– Realizations in HMM-based speech synthesis

– Recent work

Conclusions

– Summary

– Future research topics

Page 10: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(Block diagram —
Training part: speech signals from the SPEECH DATABASE go through excitation parameter extraction and spectral parameter extraction; the resulting excitation and spectral parameters, together with labels, are used for training HMMs, yielding context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation → excitation, and spectral parameters → synthesis filter → SYNTHESIZED SPEECH.)

Page 11: Statistical Speech Synthesis

Text analysis

(The same HTS block diagram, with the text-analysis module in focus.)

HMM-based speech synthesis system (HTS)

Page 12: Statistical Speech Synthesis

Speech production process

(The same speech-production figure as before: air flow drives a sound source — voiced: pulse, unvoiced: noise — shaped by the frequency transfer characteristics of the vocal tract; speech is the modulation of a carrier wave by speech information.)

Page 13: Statistical Speech Synthesis

Divide speech into frames

13

Speech is a non-stationary signal

… but can be assumed to be quasi-stationary

Divide speech into short-time frames (e.g., 5ms shift, 25ms length)
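As an illustrative sketch (not code from the talk; `frame_signal` and the 16 kHz signal are hypothetical), dividing a signal into frames with a 25 ms window and 5 ms shift might look like:

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25.0, shift_ms=5.0):
    """Split a 1-D signal into overlapping quasi-stationary frames."""
    frame_len = int(sr * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # e.g. 80 samples at 16 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

x = np.random.default_rng(0).standard_normal(16000)  # 1 s of audio at 16 kHz
frames = frame_signal(x, sr=16000)
print(frames.shape)                                   # (196, 400)
```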

Page 14: Statistical Speech Synthesis

(Figure: excitation — a pulse train (voiced) or white noise (unvoiced) — drives a linear time-invariant system to produce speech. Taking the Fourier transform, the speech spectrum factors into an excitation (source) part and a spectral (filter) part.)

Source-filter model

14
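A minimal sketch of the source-filter idea (all names are hypothetical, not the talk's own code): a pulse-train or white-noise excitation is passed through a simple all-pole linear time-invariant filter.

```python
import numpy as np

def excitation(n, f0=None, sr=16000, rng=np.random.default_rng(0)):
    """Pulse train if voiced (f0 given), white noise if unvoiced."""
    if f0 is None:
        return rng.standard_normal(n)
    e = np.zeros(n)
    period = int(sr / f0)
    e[::period] = 1.0
    return e

def synthesis_filter(e, a):
    """All-pole (AR) filter: s[t] = e[t] - sum_k a[k] * s[t-1-k]."""
    p = len(a)
    s = np.zeros(len(e))
    for t in range(len(e)):
        s[t] = e[t] - sum(a[k] * s[t - 1 - k] for k in range(min(p, t)))
    return s

e = excitation(800, f0=100)              # 100 Hz voiced excitation
s = synthesis_filter(e, a=[-1.3, 0.9])   # stable 2-pole resonant filter
```

Swapping the voiced excitation for white noise while keeping the same filter is exactly the binary V/UV switch described above.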

Page 15: Statistical Speech Synthesis

Parametric models of the speech spectrum

Autoregressive (AR) model / Exponential (EX) model

ML estimation of spectral model parameters

– AR model → linear prediction (LP) analysis [Itakura;’70]

– EX model → ML-based cepstral analysis

Spectral (filter) model

15

Page 16: Statistical Speech Synthesis

……

LP analysis (1)

Page 17: Statistical Speech Synthesis

LP analysis (2)
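The slide's derivation is not reproduced here; as an illustrative sketch (the standard autocorrelation method, not necessarily the talk's exact formulation), LP coefficients can be computed with the Levinson-Durbin recursion:

```python
def autocorr(x, p):
    """Autocorrelations r[0..p] of a short frame."""
    n = len(x)
    return [sum(x[t] * x[t - k] for t in range(k, n)) for k in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the Toeplitz normal equations for LP coefficients a[1..p]
    of A(z) = 1 + a1 z^-1 + ... + ap z^-p; returns (a, prediction error)."""
    a = [0.0] * p
    err = r[0]
    for i in range(p):
        acc = r[i + 1] + sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(i):                  # update lower-order coefficients
            new_a[j] = a[j] + k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err

# AR(1)-like autocorrelation sequence with rho = 0.5:
a, err = levinson_durbin([1.0, 0.5, 0.25], p=2)
# a == [-0.5, 0.0]: the predictor is x̂[t] = 0.5 * x[t-1]
```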

Page 18: Statistical Speech Synthesis

Excitation (source) model

(Figure: excitation switching between a pulse train and white noise.)

Excitation model: pulse/noise excitation

– Voiced (periodic) → pulse train

– Unvoiced (aperiodic) → white noise

Excitation model parameters

– V/UV decision

– If voiced: fundamental frequency (F0)

Page 19: Statistical Speech Synthesis

Speech samples

19

Natural speech

Reconstructed speech from extracted parameters (cepstral

coefficients & F0 with V/UV decisions)

Quality degrades, but main characteristics are preserved

Page 20: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(The same HTS block diagram as before.)

Page 21: Statistical Speech Synthesis

Structure of state-output (observation) vector

Spectrum part

Excitation part

Spectral parameters

(e.g., cepstrum, LSPs)

log F0 with V/UV

21

Page 22: Statistical Speech Synthesis

Dynamic features

22
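The dynamic features are commonly defined (a standard formulation; the symbol $c_t$ for the static feature vector at frame $t$ is assumed, not taken from the slide) as:

```latex
\Delta c_t = \frac{1}{2}\left(c_{t+1} - c_{t-1}\right), \qquad
\Delta^2 c_t = c_{t+1} - 2\,c_t + c_{t-1},
\qquad
o_t = \left[c_t^\top,\ \Delta c_t^\top,\ \Delta^2 c_t^\top\right]^\top
```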

Page 23: Statistical Speech Synthesis

HMM-based modeling

(Figure: the label sequence “sil a i sil” is converted to a sentence HMM; each model has states 1–3, and a state sequence such as 1 1 2 3 3 … N aligns with the observation sequence.)

Page 24: Statistical Speech Synthesis

Multi-stream HMM structure

Stream 1 — spectrum (cepstrum or LSP, & dynamic features)
Streams 2–4 — excitation (log F0 & dynamic features)

24

Page 25: Statistical Speech Synthesis

(Figure: log frequency vs. time — F0 is continuous in voiced regions and undefined in unvoiced regions.)

Unable to model with a single continuous or discrete distribution

Observation of F0

25

Page 26: Statistical Speech Synthesis

(Figure: a 3-state HMM; each state's output distribution has voiced/unvoiced weights — continuous densities for the voiced space, a discrete probability for the unvoiced space.)

Multi-space probability distribution (MSD)

26
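A sketch of the MSD state-output distribution (standard formulation; the symbols are assumed, not taken from the slide): an observation is either a continuous log F0 value (voiced space) or the discrete "unvoiced" symbol,

```latex
b_j(o_t) =
\begin{cases}
  w_j^{(v)}\,\mathcal{N}\!\left(o_t;\ \mu_j,\ \sigma_j^2\right) & \text{voiced frame} \\[4pt]
  w_j^{(u)} & \text{unvoiced frame}
\end{cases}
\qquad w_j^{(v)} + w_j^{(u)} = 1
```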

Page 27: Statistical Speech Synthesis

Structure of state-output distributions

Stream 1 — spectral params: single Gaussian
Streams 2, 3, 4 — log F0: MSD (Gaussian & discrete), each with voiced & unvoiced components

27

Page 28: Statistical Speech Synthesis

Training process

(Flowchart: data & labels → compute variance floor → initialize CI-HMMs (monophone, context-independent) by segmental k-means → reestimate CI-HMMs by the EM algorithm → copy CI-HMMs to CD-HMMs (fullcontext, context-dependent) → reestimate CD-HMMs by EM → decision tree-based clustering → reestimate CD-HMMs by EM → untie parameter-tying structure → estimated HMMs; in parallel, estimate CD duration models from forward-backward stats → decision tree-based clustering → estimated duration models.)

29

Page 29: Statistical Speech Synthesis

HMM-based modeling

(Figure: the full-context transcription “sil sil-a+i a-i+sil sil” is converted to a sentence HMM; a state sequence such as 1 1 2 3 3 … N aligns with the observation sequence.)

Page 30: Statistical Speech Synthesis

Phoneme current phoneme

{preceding, succeeding} two phonemes

Syllable # of phonemes at {preceding, current, succeeding} syllable

{accent, stress} of {preceding, current, succeeding} syllable

Position of current syllable in current word

# of {preceding, succeeding} {accented, stressed} syllables in current phrase

# of syllables {from previous, to next} {accented, stressed} syllable

Vowel within current syllable

Word Part of speech of {preceding, current, succeeding} word

# of syllables in {preceding, current, succeeding} word

Position of current word in current phrase

# of {preceding, succeeding} content words in current phrase

# of words {from previous, to next} content word

Phrase # of syllables in {preceding, current, succeeding} phrase

…..

Huge # of combinations ⇒ Difficult to have all possible models

Context-dependent modeling

31

Page 31: Statistical Speech Synthesis

Training process

(The same training-process flowchart as before.)

32

Page 32: Statistical Speech Synthesis

(Figure: full-context models — e.g., k-a+b/A:…, t-e+h/A:…, w-a+t/A:… — are split by a binary tree of yes/no questions such as R=silence?, L=“gy”?, C=voiced?, L=“w”?; the leaf nodes hold the clustered (synthesized) states.)

Decision tree-based context clustering [Odell;’95]

33

Page 33: Statistical Speech Synthesis

Spectrum & excitation have different context dependency

Build decision trees separately

Decision trees

for

mel-cepstrum

Decision trees

for F0

Stream-dependent clustering

34

Page 34: Statistical Speech Synthesis

Training process

(The same training-process flowchart as before.)

35

Page 35: Statistical Speech Synthesis

(Figure: occupancy of state i across frames t0 … tT; duration statistics are collected from the forward-backward alignment.)

Estimation of state duration models [Yoshimura;’98]

36

Page 36: Statistical Speech Synthesis

Decision trees

for

mel-cepstrum

Decision trees

for F0

HMM

State duration

model

Decision tree

for state dur.

models

Stream-dependent clustering

37

Page 37: Statistical Speech Synthesis

Training process

(The same training-process flowchart as before.)

38

Page 38: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(The same HTS block diagram as before.)

39

Page 39: Statistical Speech Synthesis

Composition of sentence HMM for given text

TEXT → text analysis (text normalization, G2P, POS tagging, pause prediction) → context-dependent label sequence → sentence HMM given the labels

Page 40: Statistical Speech Synthesis

Speech parameter generation algorithm

41

Page 41: Statistical Speech Synthesis

Determination of state sequence (1)

(Figure: state durations 4, 10, 5 expand a 3-state HMM into the state sequence 1 1 1 1 2 2 3 3 … aligned with the observation sequence.)

Determine the state sequence by determining the state durations

42

Page 42: Statistical Speech Synthesis

Determination of state sequence (2)

43

Page 43: Statistical Speech Synthesis

(Figure: state-duration probability (0.0–0.5) vs. duration 1–8, comparing a geometric distribution with a Gaussian duration model.)

Determination of state sequence (3)

44

Page 44: Statistical Speech Synthesis

45

Speech parameter generation algorithm

Page 45: Statistical Speech Synthesis

Without dynamic features

(Figure: state means and variances; the generated parameter trajectory is step-wise, following the state mean values.)

46

Page 46: Statistical Speech Synthesis

Speech parameter vectors include both static & dynamic features (2M-dimensional: M static + M dynamic)

The relationship between the static features and the full observation vectors can be arranged in matrix form: the observation sequence is a window matrix applied to the static-feature sequence

Integration of dynamic features

47

Page 47: Statistical Speech Synthesis

48

Speech parameter generation algorithm

Page 48: Statistical Speech Synthesis

Solution

49
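In the standard speech parameter generation algorithm (symbols assumed, not taken from the slide: $c$ the static-feature sequence, $W$ the window matrix appending dynamic features, $\mu_q$ and $\Sigma_q$ the mean and covariance for state sequence $q$), maximizing the output probability of $o = Wc$ gives the linear system:

```latex
\frac{\partial}{\partial c}\,
\log \mathcal{N}\!\left(Wc;\ \mu_q,\ \Sigma_q\right) = 0
\quad\Longrightarrow\quad
W^\top \Sigma_q^{-1} W\, c \;=\; W^\top \Sigma_q^{-1} \mu_q
```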

Page 49: Statistical Speech Synthesis

Generated speech parameter trajectory

(Figure: means and variances of the static and dynamic features, and the smooth trajectory generated from them.)

50

Page 50: Statistical Speech Synthesis

Generated spectra

(Figure: spectrograms, 0–5 kHz, of the utterance “sil a i sil” — generated w/o dynamic features vs. w/ dynamic features.)

Page 51: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(The same HTS block diagram as before.)

52

Page 52: Statistical Speech Synthesis

Source-filter model

(Figure: the generated excitation parameters (log F0 with V/UV) drive a pulse-train / white-noise excitation; filtering it through a linear time-invariant system built from the generated spectral parameters (cepstrum, LSP) produces the synthesized speech.)

Page 53: Statistical Speech Synthesis

Unvoiced frames: given the LP spectral coefficients, drive the linear filter with white noise → synthesized speech

Equivalent to sampling from a Gaussian distribution

Page 54: Statistical Speech Synthesis

Speech samples

55

w/o dynamic features

w/ dynamic features

Use of dynamic features can reduce discontinuity

Page 55: Statistical Speech Synthesis

Outline

56

HMM-based speech synthesis

– Overview

– Implementation of individual components

Bayesian framework for speech synthesis

– Formulation

– Realizations in HMM-based speech synthesis

– Recent work

Conclusions

– Summary

– Future research topics

Page 56: Statistical Speech Synthesis

57

We have a speech database, i.e., a set of texts &

corresponding speech waveforms.

Given a text to be synthesized, what is the speech waveform

corresponding to the text?

Given (the database): the speech waveforms & the corresponding set of texts

Unknown: the speech waveform for the text to be synthesized

Statistical framework for speech synthesis (1)

Page 57: Statistical Speech Synthesis

58

Bayesian framework for prediction

1. Estimate the predictive distribution of the unknown waveform given the known variables

2. Draw a sample from that distribution

Given (the database): the speech waveforms & the corresponding set of texts

Unknown: the speech waveform for the text to be synthesized

Bayesian framework for speech synthesis (2)

Page 58: Statistical Speech Synthesis

59

1. Estimating the predictive distribution directly is hard

→ Introduce acoustic model parameters (e.g., of an HMM)

Bayesian framework for speech synthesis (3)
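With hypothetical symbols (not preserved from the slides: $x$ the waveform to synthesize, $w$ its text, $X$ and $W$ the database waveforms and texts, $\lambda$ the acoustic model parameters), steps 1–3 can be written as:

```latex
p(x \mid w, X, W) \;=\; \int p(x \mid w, \lambda)\, p(\lambda \mid X, W)\, d\lambda
```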

Page 59: Statistical Speech Synthesis

60

Bayesian framework for speech synthesis (4)

2. Using the speech waveform directly is difficult

→ Introduce its parametric representation (e.g., cepstrum, LPC, LSP, F0, aperiodicity)

Page 60: Statistical Speech Synthesis

61

Bayesian framework for speech synthesis (5)

3. The same text can have multiple pronunciations, POS tags, etc.

→ Introduce labels derived from the text (e.g., pronunciations, POS, lexical stress, grammar, pauses)

Page 61: Statistical Speech Synthesis

62

Bayesian framework for speech synthesis (6)

4. The integral & sum over the auxiliary variables are difficult to perform

→ Approximate them by joint maximization

Page 62: Statistical Speech Synthesis

63

Bayesian framework for speech synthesis (7)

5. Joint maximization is hard

→ Approximate it by step-by-step maximizations

Page 63: Statistical Speech Synthesis

64

Bayesian framework for speech synthesis (8)

6. Training also requires the parametric representations of the waveforms & the labels

→ Introduce them & approximate by step-by-step maximizations

(Parametric representations of the speech waveforms; labels derived from the texts)

Page 64: Statistical Speech Synthesis

65

Bayesian framework for speech synthesis (9)

Page 65: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(The same HTS block diagram as before; each module corresponds to one of the sub-problems in the step-by-step decomposition.)

Page 66: Statistical Speech Synthesis

67

Problems

Many approximations

– Integral & sum ≈ max

– Joint max ≈ step-by-step max

→ Poor approximations

Recent work to relax the approximations

– Max → integral & sum

Bayesian acoustic modeling

Multiple labels

– Step-wise max → joint max

Statistical vocoding

Page 67: Statistical Speech Synthesis

68

Bayesian acoustic modeling (1)

Page 68: Statistical Speech Synthesis

69

Bayesian acoustic modeling (2)

Bayesian approach

– Parameters are treated as hidden variables & marginalized out

– The Bayesian approach with hidden variables is intractable

→ Variational Bayes [Attias;’99], via Jensen’s inequality
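Jensen's inequality yields the standard variational lower bound (symbols assumed, not taken from the slide: $q(\lambda)$ a variational posterior over the model parameters):

```latex
\log p(X \mid W)
= \log \int q(\lambda)\, \frac{p(X, \lambda \mid W)}{q(\lambda)}\, d\lambda
\;\ge\; \int q(\lambda)\, \log \frac{p(X, \lambda \mid W)}{q(\lambda)}\, d\lambda
\;\equiv\; \mathcal{F}(q)
```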

Page 69: Statistical Speech Synthesis

70

Bayesian acoustic modeling (3)

Variational Bayesian acoustic modeling for speech

synthesis [Nankaku;’03]

– Fully VB-based speech synthesis

Training posterior distribution of model parameters

Parameter generation from predictive distribution

– Automatic model selection

Bayesian approach provides posterior probability of model structure

– Setting priors

Evidence maximization [Hashimoto;’06]

Cross validation [Hashimoto;’09]

– The VB approach works better than the ML one when the data is small and/or the model is large

Page 70: Statistical Speech Synthesis

Multiple labels (1)

Label sequence is regarded as hidden variable & marginalized

Page 71: Statistical Speech Synthesis

72

Multiple labels (2)

Joint front-end / back-end model training [Oura;’08]

– Labels are regarded as hidden variables & marginalized out

→ Robust against label errors

– Front- & back-end models are trained simultaneously

Combine text analysis & acoustic models as a unified model

Page 72: Statistical Speech Synthesis

73

Simple pulse/noise vocoding

(Figure: pulse-train / white-noise excitation → vocal tract filter → synthesized speech.)

Basic pulse/noise vocoder

– Binary switching between voiced & unvoiced excitations

→ Difficult to represent a mix of voiced & unvoiced sounds

– Excitation signals of human speech are neither pure pulses nor pure noise

→ Colored voiced/unvoiced excitations

Page 73: Statistical Speech Synthesis

74

State-dependent filtering [Maia;’07]

(Figure: a sentence HMM generates log F0 values and mel-cepstral coefficients; voiced excitation from a pulse-train generator and unvoiced excitation from white noise are passed through state-dependent filters and summed into a mixed excitation, which produces the synthesized speech.)

Page 74: Statistical Speech Synthesis

75

Waveform-level statistical model (1) [Maia;’10]

(Figure: the same mixed-excitation structure — pulse-train generator (voiced excitation) + white noise (unvoiced excitation) → mixed excitation → synthesized speech.)

Page 75: Statistical Speech Synthesis

76

Waveform-level statistical model (2) [Maia;’10]

Integral & sum are intractable

→ Approximate integral & sum by joint maximization

– Conventional: step-by-step maximization

– Proposed: iterative joint maximization

Page 76: Statistical Speech Synthesis

Outline

77

HMM-based speech synthesis

– Overview

– Implementation of individual components

Bayesian framework for speech synthesis

– Formulation

– Realizations in HMM-based speech synthesis

– Recent work

Conclusions

– Summary

– Future research topics

Page 77: Statistical Speech Synthesis

Summary

78

HMM-based speech synthesis

– Statistical parametric speech synthesis approach

– Source-filter representation of speech + statistical acoustic modeling

– Getting popular

Bayesian framework for speech synthesis

– Formulation

– Decomposition to sub-problems

– Correspondence between sub-problems & modules in HMM-based

speech synthesis system

– Recent work to relax the approximations

Page 78: Statistical Speech Synthesis

Drawbacks of HMM-based speech synthesis

Quality of synthesized speech

– Buzzy

– Flat

– Muffled

Three major factors degrade the quality

– Poor vocoding

how to parameterize speech?

– Inaccurate acoustic modeling

how to model extracted speech parameter trajectories?

– Over-smoothing

how to recover generated speech parameter trajectories?

Still needs a lot of work to improve the quality

79

Page 79: Statistical Speech Synthesis

Future challenging topics in speech synthesis

Keynote speech by Simon King at ISCA SSW7 (2010)

Speech synthesis is easy, if ...

• voice is built offline & carefully checked for errors

• speech is recorded in clean conditions

• word transcriptions are correct

• accurate phonetic labels are available or can be obtained

• speech is in the required language & speaking style

• speech is from a suitable speaker

• a native speaker is available, preferably a linguist

Speech synthesis is not easy if we don’t have the right data

80

Page 80: Statistical Speech Synthesis

Future challenging topics in speech synthesis

Non-professional speakers

• AVM + adaptation (CSTR)

Too little speech data

• VTLN-based rapid speaker adaptation (Titech, IDIAP)

Noisy recordings

• Spectral subtraction & AVM + adaptation (CSTR)

No labels

• Un- / Semi-supervised voice building (CSTR, NICT, CMU, Toshiba)

Insufficient knowledge of the language or accent

• Letter (grapheme)-based synthesis (CSTR)

• No prosodic contexts (CSTR, Titech)

Wrong language

• Cross-lingual speaker adaptation (MSRA, EMIME)

• Speaker & language adaptive training (Toshiba)

81

Page 81: Statistical Speech Synthesis

Thanks!

82