Statistical Speech Synthesis Heiga ZEN Toshiba Research Europe Ltd. Cambridge Research Laboratory Speech Synthesis Seminar Series @ CUED, Cambridge, UK January 11th, 2011


Page 1: Statistical Speech Synthesis

Statistical Speech Synthesis

Heiga ZEN

Toshiba Research Europe Ltd.

Cambridge Research Laboratory

Speech Synthesis Seminar Series @ CUED, Cambridge, UK

January 11th, 2011

Page 2: Statistical Speech Synthesis

Text-to-speech synthesis (TTS)

Text (seq of discrete symbols) Speech (continuous time series)

Automatic speech recognition (ASR)

Speech (continuous time series) Text (seq of discrete symbols)

Machine Translation (MT)

Text (seq of discrete symbols) Text (seq of discrete symbols)

2

Text-to-speech as a mapping problem

(Figure: e.g., MT maps the text “Dobré ráno” to the text “Good morning”; TTS maps the text “Good morning” to spoken “Good morning”; ASR maps it back to text.)

Page 3: Statistical Speech Synthesis

Speech production process

(Figure: text (concept) → speech. Air flow drives a sound source — voiced: pulse, unvoiced: noise — with a fundamental frequency; the vocal tract imposes frequency transfer characteristics (magnitude, start–end). Speech is the modulation of a carrier wave by speech information: fundamental freq., voiced/unvoiced, freq. transfer char.)

Page 4: Statistical Speech Synthesis

Rule-based, formant synthesis (~’90s)

– Based on parametric representation of speech

– Hand-crafted rules to control phonetic units

DECtalk (or KlattTalk / MITTalk) [Klatt;‘82]

4

Speech synthesis methods (1)

Block diagram of KlattTalk

Page 5: Statistical Speech Synthesis

Corpus-based, concatenative synthesis (’90s~)

– Concatenate small speech units (e.g., phone) from a database

– Large data + automatic learning → high-quality synthetic voices

Single inventory: diphone synthesis [Moulines;’90]

Multiple inventory: unit selection synthesis [Sagisaka;’92, Black;’96]

5

Speech synthesis methods (2)

Page 6: Statistical Speech Synthesis

Corpus-based, statistical parametric synthesis (mid ’90s~)

– Large data + automatic training → automatic voice building

– Source-filter model + statistical modeling → flexible to change voice characteristics

Hidden Markov models (HMMs) as its statistical acoustic model

HMM-based speech synthesis (HTS) [Yoshimura;‘02]

6

Speech synthesis methods (3)

(Pipeline: Feature extraction → Model training → Parameter generation → Waveform generation)

Page 7: Statistical Speech Synthesis

7

Popularity of statistical speech synthesis

(Figure: bar chart of the number of statistical-speech-synthesis-related papers in ICASSP, 1995–2009; y-axis 0–20 papers.)

Page 8: Statistical Speech Synthesis

Statistical speech synthesis is getting popular, but…

not many researchers fully understand how it works

Formulate & understand the whole corpus-based speech

synthesis process in a unified statistical framework

Aim of this talk

Page 9: Statistical Speech Synthesis

Outline

9

HMM-based speech synthesis

– Overview

– Implementation of individual components

Bayesian framework for speech synthesis

– Formulation

– Realizations in HMM-based speech synthesis

– Recent work

Conclusions

– Summary

– Future research topics

Page 10: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(Block diagram —
Training part: speech signals from the SPEECH DATABASE go through excitation parameter extraction and spectral parameter extraction; the resulting excitation and spectral parameters, together with labels, are used for training HMMs, yielding context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation → excitation, and spectral parameters → synthesis filter → SYNTHESIZED SPEECH.)

Page 11: Statistical Speech Synthesis

Text analysis

(The same HTS block diagram, with the text-analysis module in focus.)

HMM-based speech synthesis system (HTS)

Page 12: Statistical Speech Synthesis

Speech production process

(The same speech-production figure as before: air flow drives a sound source — voiced: pulse, unvoiced: noise — shaped by the frequency transfer characteristics of the vocal tract; speech is the modulation of a carrier wave by speech information.)

Page 13: Statistical Speech Synthesis

Divide speech into frames

13

Speech is a non-stationary signal

… but can be assumed to be quasi-stationary

Divide speech into short-time frames (e.g., 5ms shift, 25ms length)
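As an illustrative sketch (not code from the talk; `frame_signal` and the 16 kHz signal are hypothetical), dividing a signal into frames with a 25 ms window and 5 ms shift might look like:

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25.0, shift_ms=5.0):
    """Split a 1-D signal into overlapping quasi-stationary frames."""
    frame_len = int(sr * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # e.g. 80 samples at 16 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    return np.stack([x[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

x = np.random.default_rng(0).standard_normal(16000)  # 1 s of audio at 16 kHz
frames = frame_signal(x, sr=16000)
print(frames.shape)                                   # (196, 400)
```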

Page 14: Statistical Speech Synthesis

(Figure: excitation — a pulse train (voiced) or white noise (unvoiced) — drives a linear time-invariant system to produce speech. Taking the Fourier transform, the speech spectrum factors into an excitation (source) part and a spectral (filter) part.)

Source-filter model

14
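A minimal sketch of the source-filter idea (all names are hypothetical, not the talk's own code): a pulse-train or white-noise excitation is passed through a simple all-pole linear time-invariant filter.

```python
import numpy as np

def excitation(n, f0=None, sr=16000, rng=np.random.default_rng(0)):
    """Pulse train if voiced (f0 given), white noise if unvoiced."""
    if f0 is None:
        return rng.standard_normal(n)
    e = np.zeros(n)
    period = int(sr / f0)
    e[::period] = 1.0
    return e

def synthesis_filter(e, a):
    """All-pole (AR) filter: s[t] = e[t] - sum_k a[k] * s[t-1-k]."""
    p = len(a)
    s = np.zeros(len(e))
    for t in range(len(e)):
        s[t] = e[t] - sum(a[k] * s[t - 1 - k] for k in range(min(p, t)))
    return s

e = excitation(800, f0=100)              # 100 Hz voiced excitation
s = synthesis_filter(e, a=[-1.3, 0.9])   # stable 2-pole resonant filter
```

Swapping the voiced excitation for white noise while keeping the same filter is exactly the binary V/UV switch described above.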

Page 15: Statistical Speech Synthesis

Parametric models of the speech spectrum

Autoregressive (AR) model / Exponential (EX) model

ML estimation of spectral model parameters

– AR model → linear prediction (LP) analysis [Itakura;’70]

– EX model → ML-based cepstral analysis

Spectral (filter) model

15

Page 16: Statistical Speech Synthesis

……

LP analysis (1)

Page 17: Statistical Speech Synthesis

LP analysis (2)
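The slide's derivation is not reproduced here; as an illustrative sketch (the standard autocorrelation method, not necessarily the talk's exact formulation), LP coefficients can be computed with the Levinson-Durbin recursion:

```python
def autocorr(x, p):
    """Autocorrelations r[0..p] of a short frame."""
    n = len(x)
    return [sum(x[t] * x[t - k] for t in range(k, n)) for k in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the Toeplitz normal equations for LP coefficients a[1..p]
    of A(z) = 1 + a1 z^-1 + ... + ap z^-p; returns (a, prediction error)."""
    a = [0.0] * p
    err = r[0]
    for i in range(p):
        acc = r[i + 1] + sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(i):                  # update lower-order coefficients
            new_a[j] = a[j] + k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err

# AR(1)-like autocorrelation sequence with rho = 0.5:
a, err = levinson_durbin([1.0, 0.5, 0.25], p=2)
# a == [-0.5, 0.0]: the predictor is x̂[t] = 0.5 * x[t-1]
```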

Page 18: Statistical Speech Synthesis

Excitation (source) model

(Figure: excitation switching between a pulse train and white noise.)

Excitation model: pulse/noise excitation

– Voiced (periodic) → pulse train

– Unvoiced (aperiodic) → white noise

Excitation model parameters

– V/UV decision

– If voiced: fundamental frequency (F0)

Page 19: Statistical Speech Synthesis

Speech samples

19

Natural speech

Reconstructed speech from extracted parameters (cepstral

coefficients & F0 with V/UV decisions)

Quality degrades, but main characteristics are preserved

Page 20: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(The same HTS block diagram as before.)

Page 21: Statistical Speech Synthesis

Structure of state-output (observation) vector

Spectrum part

Excitation part

Spectral parameters

(e.g., cepstrum, LSPs)

log F0 with V/UV

21

Page 22: Statistical Speech Synthesis

Dynamic features

22
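The dynamic features are commonly defined (a standard formulation; the symbol $c_t$ for the static feature vector at frame $t$ is assumed, not taken from the slide) as:

```latex
\Delta c_t = \frac{1}{2}\left(c_{t+1} - c_{t-1}\right), \qquad
\Delta^2 c_t = c_{t+1} - 2\,c_t + c_{t-1},
\qquad
o_t = \left[c_t^\top,\ \Delta c_t^\top,\ \Delta^2 c_t^\top\right]^\top
```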

Page 23: Statistical Speech Synthesis

HMM-based modeling

(Figure: the label sequence “sil a i sil” is converted to a sentence HMM; each model has states 1–3, and a state sequence such as 1 1 2 3 3 … N aligns with the observation sequence.)

Page 24: Statistical Speech Synthesis

Multi-stream HMM structure

Stream 1 — spectrum (cepstrum or LSP, & dynamic features)
Streams 2–4 — excitation (log F0 & dynamic features)

24

Page 25: Statistical Speech Synthesis

(Figure: log frequency vs. time — F0 is continuous in voiced regions and undefined in unvoiced regions.)

Unable to model with a single continuous or discrete distribution

Observation of F0

25

Page 26: Statistical Speech Synthesis

(Figure: a 3-state HMM; each state's output distribution has voiced/unvoiced weights — continuous densities for the voiced space, a discrete probability for the unvoiced space.)

Multi-space probability distribution (MSD)

26
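A sketch of the MSD state-output distribution (standard formulation; the symbols are assumed, not taken from the slide): an observation is either a continuous log F0 value (voiced space) or the discrete "unvoiced" symbol,

```latex
b_j(o_t) =
\begin{cases}
  w_j^{(v)}\,\mathcal{N}\!\left(o_t;\ \mu_j,\ \sigma_j^2\right) & \text{voiced frame} \\[4pt]
  w_j^{(u)} & \text{unvoiced frame}
\end{cases}
\qquad w_j^{(v)} + w_j^{(u)} = 1
```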

Page 27: Statistical Speech Synthesis

Structure of state-output distributions

Stream 1 — spectral params: single Gaussian
Streams 2, 3, 4 — log F0: MSD (Gaussian & discrete), each with voiced & unvoiced components

27

Page 28: Statistical Speech Synthesis

Training process

(Flowchart: data & labels → compute variance floor → initialize CI-HMMs (monophone, context-independent) by segmental k-means → reestimate CI-HMMs by the EM algorithm → copy CI-HMMs to CD-HMMs (fullcontext, context-dependent) → reestimate CD-HMMs by EM → decision tree-based clustering → reestimate CD-HMMs by EM → untie parameter-tying structure → estimated HMMs; in parallel, estimate CD duration models from forward-backward stats → decision tree-based clustering → estimated duration models.)

29

Page 29: Statistical Speech Synthesis

HMM-based modeling

(Figure: the full-context transcription “sil sil-a+i a-i+sil sil” is converted to a sentence HMM; a state sequence such as 1 1 2 3 3 … N aligns with the observation sequence.)

Page 30: Statistical Speech Synthesis

Phoneme current phoneme

{preceding, succeeding} two phonemes

Syllable # of phonemes at {preceding, current, succeeding} syllable

{accent, stress} of {preceding, current, succeeding} syllable

Position of current syllable in current word

# of {preceding, succeeding} {accented, stressed} syllables in current phrase

# of syllables {from previous, to next} {accented, stressed} syllable

Vowel within current syllable

Word Part of speech of {preceding, current, succeeding} word

# of syllables in {preceding, current, succeeding} word

Position of current word in current phrase

# of {preceding, succeeding} content words in current phrase

# of words {from previous, to next} content word

Phrase # of syllables in {preceding, current, succeeding} phrase

…..

Huge # of combinations ⇒ Difficult to have all possible models

Context-dependent modeling

31

Page 31: Statistical Speech Synthesis

Training process

(The same training-process flowchart as before.)

32

Page 32: Statistical Speech Synthesis

(Figure: full-context models — e.g., k-a+b/A:…, t-e+h/A:…, w-a+t/A:… — are split by a binary tree of yes/no questions such as R=silence?, L=“gy”?, C=voiced?, L=“w”?; the leaf nodes hold the clustered (synthesized) states.)

Decision tree-based context clustering [Odell;’95]

33

Page 33: Statistical Speech Synthesis

Spectrum & excitation have different context dependency

Build decision trees separately

Decision trees

for

mel-cepstrum

Decision trees

for F0

Stream-dependent clustering

34

Page 34: Statistical Speech Synthesis

Training process

(The same training-process flowchart as before.)

35

Page 35: Statistical Speech Synthesis

(Figure: occupancy of state i across frames t0 … tT; duration statistics are collected from the forward-backward alignment.)

Estimation of state duration models [Yoshimura;’98]

36

Page 36: Statistical Speech Synthesis

Decision trees

for

mel-cepstrum

Decision trees

for F0

HMM

State duration

model

Decision tree

for state dur.

models

Stream-dependent clustering

37

Page 37: Statistical Speech Synthesis

Training process

(The same training-process flowchart as before.)

38

Page 38: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(The same HTS block diagram as before.)

39

Page 39: Statistical Speech Synthesis

Composition of sentence HMM for given text

TEXT → text analysis (text normalization, G2P, POS tagging, pause prediction) → context-dependent label sequence → sentence HMM given the labels

Page 40: Statistical Speech Synthesis

Speech parameter generation algorithm

41

Page 41: Statistical Speech Synthesis

Determination of state sequence (1)

(Figure: state durations 4, 10, 5 expand a 3-state HMM into the state sequence 1 1 1 1 2 2 3 3 … aligned with the observation sequence.)

Determine the state sequence by determining the state durations

42

Page 42: Statistical Speech Synthesis

Determination of state sequence (2)

43

Page 43: Statistical Speech Synthesis

(Figure: state-duration probability (0.0–0.5) vs. duration 1–8, comparing a geometric distribution with a Gaussian duration model.)

Determination of state sequence (3)

44

Page 44: Statistical Speech Synthesis

45

Speech parameter generation algorithm

Page 45: Statistical Speech Synthesis

Without dynamic features

(Figure: state means and variances; the generated parameter trajectory is step-wise, following the state mean values.)

46

Page 46: Statistical Speech Synthesis

Speech parameter vectors include both static & dynamic features (2M-dimensional: M static + M dynamic)

The relationship between the static features and the full observation vectors can be arranged in matrix form: the observation sequence is a window matrix applied to the static-feature sequence

Integration of dynamic features

47

Page 47: Statistical Speech Synthesis

48

Speech parameter generation algorithm

Page 48: Statistical Speech Synthesis

Solution

49
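In the standard speech parameter generation algorithm (symbols assumed, not taken from the slide: $c$ the static-feature sequence, $W$ the window matrix appending dynamic features, $\mu_q$ and $\Sigma_q$ the mean and covariance for state sequence $q$), maximizing the output probability of $o = Wc$ gives the linear system:

```latex
\frac{\partial}{\partial c}\,
\log \mathcal{N}\!\left(Wc;\ \mu_q,\ \Sigma_q\right) = 0
\quad\Longrightarrow\quad
W^\top \Sigma_q^{-1} W\, c \;=\; W^\top \Sigma_q^{-1} \mu_q
```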

Page 49: Statistical Speech Synthesis

Generated speech parameter trajectory

(Figure: means and variances of the static and dynamic features, and the smooth trajectory generated from them.)

50

Page 50: Statistical Speech Synthesis

Generated spectra

(Figure: spectrograms, 0–5 kHz, of the utterance “sil a i sil” — generated w/o dynamic features vs. w/ dynamic features.)

Page 51: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(The same HTS block diagram as before.)

52

Page 52: Statistical Speech Synthesis

Source-filter model

(Figure: the generated excitation parameters (log F0 with V/UV) drive a pulse-train / white-noise excitation; filtering it through a linear time-invariant system built from the generated spectral parameters (cepstrum, LSP) produces the synthesized speech.)

Page 53: Statistical Speech Synthesis

Unvoiced frames: given the LP spectral coefficients, drive the linear filter with white noise → synthesized speech

Equivalent to sampling from a Gaussian distribution

Page 54: Statistical Speech Synthesis

Speech samples

55

w/o dynamic features

w/ dynamic features

Use of dynamic features can reduce discontinuity

Page 55: Statistical Speech Synthesis

Outline

56

HMM-based speech synthesis

– Overview

– Implementation of individual components

Bayesian framework for speech synthesis

– Formulation

– Realizations in HMM-based speech synthesis

– Recent work

Conclusions

– Summary

– Future research topics

Page 56: Statistical Speech Synthesis

57

We have a speech database, i.e., a set of texts &

corresponding speech waveforms.

Given a text to be synthesized, what is the speech waveform

corresponding to the text?

Given (the database): the speech waveforms & the corresponding set of texts

Unknown: the speech waveform for the text to be synthesized

Statistical framework for speech synthesis (1)

Page 57: Statistical Speech Synthesis

58

Bayesian framework for prediction

1. Estimate the predictive distribution of the unknown waveform given the known variables

2. Draw a sample from that distribution

Given (the database): the speech waveforms & the corresponding set of texts

Unknown: the speech waveform for the text to be synthesized

Bayesian framework for speech synthesis (2)

Page 58: Statistical Speech Synthesis

59

1. Estimating the predictive distribution directly is hard

→ Introduce acoustic model parameters (e.g., of an HMM)

Bayesian framework for speech synthesis (3)
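With hypothetical symbols (not preserved from the slides: $x$ the waveform to synthesize, $w$ its text, $X$ and $W$ the database waveforms and texts, $\lambda$ the acoustic model parameters), steps 1–3 can be written as:

```latex
p(x \mid w, X, W) \;=\; \int p(x \mid w, \lambda)\, p(\lambda \mid X, W)\, d\lambda
```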

Page 59: Statistical Speech Synthesis

60

Bayesian framework for speech synthesis (4)

2. Using the speech waveform directly is difficult

→ Introduce its parametric representation (e.g., cepstrum, LPC, LSP, F0, aperiodicity)

Page 60: Statistical Speech Synthesis

61

Bayesian framework for speech synthesis (5)

3. The same text can have multiple pronunciations, POS tags, etc.

→ Introduce labels derived from the text (e.g., pronunciations, POS, lexical stress, grammar, pauses)

Page 61: Statistical Speech Synthesis

62

Bayesian framework for speech synthesis (6)

4. The integral & sum over the auxiliary variables are difficult to perform

→ Approximate them by joint maximization

Page 62: Statistical Speech Synthesis

63

Bayesian framework for speech synthesis (7)

5. Joint maximization is hard

→ Approximate it by step-by-step maximizations

Page 63: Statistical Speech Synthesis

64

Bayesian framework for speech synthesis (8)

6. Training also requires the parametric representations of the waveforms & the labels

→ Introduce them & approximate by step-by-step maximizations

(Parametric representations of the speech waveforms; labels derived from the texts)

Page 64: Statistical Speech Synthesis

65

Bayesian framework for speech synthesis (9)

Page 65: Statistical Speech Synthesis

HMM-based speech synthesis system (HTS)

(The same HTS block diagram as before; each module corresponds to one of the sub-problems in the step-by-step decomposition.)

Page 66: Statistical Speech Synthesis

67

Problems

Many approximations

– Integral & sum ≈ max

– Joint max ≈ step-by-step max

→ Poor approximations

Recent work to relax the approximations

– Max → integral & sum

Bayesian acoustic modeling

Multiple labels

– Step-wise max → joint max

Statistical vocoding

Page 67: Statistical Speech Synthesis

68

Bayesian acoustic modeling (1)

Page 68: Statistical Speech Synthesis

69

Bayesian acoustic modeling (2)

Bayesian approach

– Parameters are treated as hidden variables & marginalized out

– The Bayesian approach with hidden variables is intractable

→ Variational Bayes [Attias;’99], via Jensen’s inequality
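Jensen's inequality yields the standard variational lower bound (symbols assumed, not taken from the slide: $q(\lambda)$ a variational posterior over the model parameters):

```latex
\log p(X \mid W)
= \log \int q(\lambda)\, \frac{p(X, \lambda \mid W)}{q(\lambda)}\, d\lambda
\;\ge\; \int q(\lambda)\, \log \frac{p(X, \lambda \mid W)}{q(\lambda)}\, d\lambda
\;\equiv\; \mathcal{F}(q)
```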

Page 69: Statistical Speech Synthesis

70

Bayesian acoustic modeling (3)

Variational Bayesian acoustic modeling for speech

synthesis [Nankaku;’03]

– Fully VB-based speech synthesis

Training posterior distribution of model parameters

Parameter generation from predictive distribution

– Automatic model selection

Bayesian approach provides posterior probability of model structure

– Setting priors

Evidence maximization [Hashimoto;’06]

Cross validation [Hashimoto;’09]

– The VB approach works better than the ML one when the data is small and/or the model is large

Page 70: Statistical Speech Synthesis

Multiple labels (1)

Label sequence is regarded as hidden variable & marginalized

Page 71: Statistical Speech Synthesis

72

Multiple labels (2)

Joint front-end / back-end model training [Oura;’08]

– Labels are regarded as hidden variables & marginalized out

→ Robust against label errors

– Front- & back-end models are trained simultaneously

Combine text analysis & acoustic models as a unified model

Page 72: Statistical Speech Synthesis

73

Simple pulse/noise vocoding

(Figure: pulse-train / white-noise excitation → vocal tract filter → synthesized speech.)

Basic pulse/noise vocoder

– Binary switching between voiced & unvoiced excitations

→ Difficult to represent a mix of voiced & unvoiced sounds

– Excitation signals of human speech are neither pure pulses nor pure noise

→ Colored voiced/unvoiced excitations

Page 73: Statistical Speech Synthesis

74

State-dependent filtering [Maia;’07]

(Figure: a sentence HMM generates log F0 values and mel-cepstral coefficients; voiced excitation from a pulse-train generator and unvoiced excitation from white noise are passed through state-dependent filters and summed into a mixed excitation, which produces the synthesized speech.)

Page 74: Statistical Speech Synthesis

75

Waveform-level statistical model (1) [Maia;’10]

(Figure: the same mixed-excitation structure — pulse-train generator (voiced excitation) + white noise (unvoiced excitation) → mixed excitation → synthesized speech.)

Page 75: Statistical Speech Synthesis

76

Waveform-level statistical model (2) [Maia;’10]

Integral & sum are intractable

→ Approximate integral & sum by joint maximization

– Conventional: step-by-step maximization

– Proposed: iterative joint maximization

Page 76: Statistical Speech Synthesis

Outline

77

HMM-based speech synthesis

– Overview

– Implementation of individual components

Bayesian framework for speech synthesis

– Formulation

– Realizations in HMM-based speech synthesis

– Recent work

Conclusions

– Summary

– Future research topics

Page 77: Statistical Speech Synthesis

Summary

78

HMM-based speech synthesis

– Statistical parametric speech synthesis approach

– Source-filter representation of speech + statistical acoustic modeling

– Getting popular

Bayesian framework for speech synthesis

– Formulation

– Decomposition to sub-problems

– Correspondence between sub-problems & modules in HMM-based

speech synthesis system

– Recent work to relax the approximations

Page 78: Statistical Speech Synthesis

Drawbacks of HMM-based speech synthesis

Quality of synthesized speech

– Buzzy

– Flat

– Muffled

Three major factors degrade the quality

– Poor vocoding

how to parameterize speech?

– Inaccurate acoustic modeling

how to model extracted speech parameter trajectories?

– Over-smoothing

how to recover generated speech parameter trajectories?

Still needs a lot of work to improve the quality

79

Page 79: Statistical Speech Synthesis

Future challenging topics in speech synthesis

Keynote speech by Simon King at ISCA SSW7 (2010)

Speech synthesis is easy, if ...

• voice is built offline & carefully checked for errors

• speech is recorded in clean conditions

• word transcriptions are correct

• accurate phonetic labels are available or can be obtained

• speech is in the required language & speaking style

• speech is from a suitable speaker

• a native speaker is available, preferably a linguist

Speech synthesis is not easy if we don’t have the right data

80

Page 80: Statistical Speech Synthesis

Future challenging topics in speech synthesis

Non-professional speakers

• AVM + adaptation (CSTR)

Too little speech data

• VTLN-based rapid speaker adaptation (Titech, IDIAP)

Noisy recordings

• Spectral subtraction & AVM + adaptation (CSTR)

No labels

• Un- / Semi-supervised voice building (CSTR, NICT, CMU, Toshiba)

Insufficient knowledge of the language or accent

• Letter (grapheme)-based synthesis (CSTR)

• No prosodic contexts (CSTR, Titech)

Wrong language

• Cross-lingual speaker adaptation (MSRA, EMIME)

• Speaker & language adaptive training (Toshiba)

81

Page 81: Statistical Speech Synthesis

Thanks!

82