
Generative Model-Based Text-to-Speech Synthesis

Heiga Zen (Google London)

February 3rd, 2017 @MIT

Outline

Generative TTS

Generative acoustic models for parametric TTS: Hidden Markov models (HMMs), Neural networks

Beyond parametric TTS: Learned features, WaveNet, End-to-end

Conclusion & future topics


Text-to-speech as sequence-to-sequence mapping

Automatic speech recognition (ASR):
(speech waveform) → “Hello my name is Heiga Zen”

Machine translation (MT):
“Hello my name is Heiga Zen” → “Ich heiße Heiga Zen”

Text-to-speech synthesis (TTS):
“Hello my name is Heiga Zen” → (speech waveform)

Speech production process

[Figure: the speech production process. Air flow from the lungs acts as a carrier wave that is modulated by speech information: the sound source (voiced: pulse train, unvoiced: noise) sets the fundamental frequency and voicing, and the vocal tract shapes the frequency transfer characteristics, producing speech.]

Typical flow of TTS system

TEXT
↓
Text analysis (frontend, NLP; discrete → discrete):
sentence segmentation → word segmentation → text normalization → part-of-speech tagging → pronunciation
↓
Speech synthesis (backend, speech; discrete → continuous):
prosody prediction → waveform generation
↓
SYNTHESIZED SPEECH


Speech synthesis approaches

Rule-based, formant synthesis [1]

Sample-based, concatenative synthesis [2]

Model-based, generative synthesis:
p(speech = x | text = “Hello, my name is Heiga Zen.”)


Probabilistic formulation of TTS

Random variables:
X: speech waveforms (data), observed
W: transcriptions (data), observed
w: given text, observed
x: synthesized speech, unobserved

Synthesis:
• Estimate the posterior predictive distribution p(x | w, X, W)
• Sample \bar{x} from the posterior distribution

Probabilistic formulation

Introduce auxiliary variables (representations) + factorize the dependency:

p(x | w, X, W) = \sum_{\forall l} \sum_{\forall L} \iiint \frac{p(x | o)\, p(o | l, \lambda)\, p(l | w)\, p(X | O)\, p(O | L, \lambda)\, p(\lambda)\, p(L | W)}{p(X)} \, do \, dO \, d\lambda

where
O, o: acoustic features
L, l: linguistic features
\lambda: model

Approximation (1)

Approximate {sum & integral} by best point estimates (like MAP) [3]:

p(x | w, X, W) \approx p(x | \hat{o})

where

\{\hat{o}, \hat{l}, \hat{O}, \hat{L}, \hat{\lambda}\} = \arg\max_{o, l, O, L, \lambda} p(x | o)\, p(o | l, \lambda)\, p(l | w)\, p(X | O)\, p(O | L, \lambda)\, p(\lambda)\, p(L | W)


Approximation (2): Joint → step-by-step maximization [3]

\hat{O} = \arg\max_O p(X | O)   (Extract acoustic features)
\hat{L} = \arg\max_L p(L | W)   (Extract linguistic features)
\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} | \hat{L}, \lambda)\, p(\lambda)   (Learn mapping)
\hat{l} = \arg\max_l p(l | w)   (Predict linguistic features)
\hat{o} = \arg\max_o p(o | \hat{l}, \hat{\lambda})   (Predict acoustic features)
\bar{x} \sim f_x(\hat{o}) = p(x | \hat{o})   (Synthesize waveform)

Representations: acoustic, linguistic, mapping
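To make the step-by-step recipe concrete, here is a minimal, self-contained Python sketch of the resulting pipeline. Every component is a toy stand-in of my own (a log-energy "vocoder", character-level "text analysis", a per-symbol mean table as the acoustic model), not an actual TTS toolkit; it only mirrors the shape of the six maximization steps above.

import numpy as np

def vocoder_analysis(x, frame=160):                  # O^ = argmax_O p(X | O)
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    return np.log(np.mean(frames ** 2, axis=1) + 1e-8)

def text_analysis(text):                             # L^ = argmax_L p(L | W)
    return list(text.lower())

def train_acoustic_model(O, L):                      # λ^ = argmax_λ p(O^ | L^, λ) p(λ)
    model = {}
    for o, l in zip(O, L):                           # assign frames uniformly to symbols
        for sym, seg in zip(l, np.array_split(o, len(l))):
            model.setdefault(sym, []).append(seg.mean())
    return {s: float(np.mean(v)) for s, v in model.items()}

def synthesize(model, text, frames_per_symbol=5):
    l_hat = text_analysis(text)                      # l^ = argmax_l p(l | w)
    o_hat = [model.get(s, 0.0) for s in l_hat
             for _ in range(frames_per_symbol)]      # o^ = argmax_o p(o | l^, λ^)
    return np.array(o_hat)                           # a vocoder would then render x̄

X, W = [np.random.randn(16000)], ["hello"]           # toy one-utterance corpus
model = train_acoustic_model([vocoder_analysis(x) for x in X],
                             [text_analysis(w) for w in W])
print(synthesize(model, "hello")[:10])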

Representation – Linguistic features

Hello, world. → hello world → h-e2 l-ou1 w-er1-l-d → h e l ou w er l d

Phone: voicing, manner, ...
Syllable: stress, tone, ...
Word: POS, grammar, ...
Phrase: intonation, ...
Sentence: length, ...

→ Based on knowledge about spoken language
• Lexicon, letter-to-sound rules
• Tokenizer, tagger, parser
• Phonology rules

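As an illustration of how such knowledge-based linguistic features look in practice, here is a small sketch that assembles context features for one phone; the field names and context window are illustrative assumptions, not a standard feature scheme.

def phone_features(phones, stresses, i):
    """Toy context features for the i-th phone of an utterance."""
    return {
        "prev_phone":  phones[i - 1] if i > 0 else "<s>",
        "curr_phone":  phones[i],
        "next_phone":  phones[i + 1] if i + 1 < len(phones) else "</s>",
        "syll_stress": stresses[i],   # e.g., lexical stress of the parent syllable
        "pos_in_sent": i,             # simplistic position feature
        "n_phones":    len(phones),   # sentence-level length feature
    }

# "hello" → h e l ou, with stress on the second syllable
print(phone_features(["h", "e", "l", "ou"], [0, 0, 1, 1], 1))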

Representation – Acoustic features

Piece-wise stationary, source-filter generative model p(x | o)

[Figure: pulse train (voiced) / white noise (unvoiced) → vocal source e(n) → vocal tract filter h(n) → speech x(n) = h(n) * e(n). Fundamental frequency, aperiodicity, voicing, ... parameterize the source; cepstrum, LPC, ... parameterize the filter; analysis is done on overlapping, shifted windows.]

→ Needs to solve an inverse problem
• Estimate parameters from signals
• Use estimated parameters (e.g., cepstrum) as acoustic features

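The source-filter model above is easy to sketch directly. Below is a minimal numpy rendering of x(n) = h(n) * e(n): a pulse-train (voiced) or white-noise (unvoiced) excitation convolved with a short, made-up impulse response standing in for the vocal tract filter.

import numpy as np

def excitation(n, f0=100.0, fs=16000, voiced=True):
    if not voiced:
        return np.random.randn(n)       # unvoiced source: white noise
    e = np.zeros(n)
    e[:: int(fs / f0)] = 1.0            # voiced source: pulse train at F0
    return e

h = np.array([1.0, 0.8, 0.5, 0.2, 0.05])   # toy vocal-tract impulse response h(n)
e = excitation(800)                         # 50 ms of excitation at 16 kHz
x = np.convolve(e, h)[: len(e)]             # speech: x(n) = h(n) * e(n)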

Representation – Mapping

Rule-based, formant synthesis [1]:

\hat{O} = \arg\max_O p(X | O)   (Vocoder analysis)
\hat{L} = \arg\max_L p(L | W)   (Text analysis)
\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} | \hat{L}, \lambda)\, p(\lambda)   (Extract rules)
\hat{l} = \arg\max_l p(l | w)   (Text analysis)
\hat{o} = \arg\max_o p(o | \hat{l}, \hat{\lambda})   (Apply rules)
\bar{x} \sim f_x(\hat{o}) = p(x | \hat{o})   (Vocoder synthesis)

→ Hand-crafted rules on knowledge-based features


Representation – Mapping

HMM-based [4], statistical parametric synthesis [5]:

\hat{O} = \arg\max_O p(X | O)   (Vocoder analysis)
\hat{L} = \arg\max_L p(L | W)   (Text analysis)
\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} | \hat{L}, \lambda)\, p(\lambda)   (Train HMMs)
\hat{l} = \arg\max_l p(l | w)   (Text analysis)
\hat{o} = \arg\max_o p(o | \hat{l}, \hat{\lambda})   (Parameter generation)
\bar{x} \sim f_x(\hat{o}) = p(x | \hat{o})   (Vocoder synthesis)

→ Replace rules by an HMM-based generative acoustic model


Outline

Generative TTS

Generative acoustic models for parametric TTS: Hidden Markov models (HMMs), Neural networks

Beyond parametric TTS: Learned features, WaveNet, End-to-end

Conclusion & future topics

HMM-based generative acoustic model for TTS

• Context-dependent subword HMMs
• Decision trees to cluster & tie HMM states → interpretable

[Figure: discrete linguistic features l enter decision trees whose leaves give the state-output distributions of a left-to-right HMM with hidden states q_1, q_2, ... emitting continuous acoustic features o_1, ..., o_T.]

p(o | l, \lambda) = \sum_{\forall q} \prod_{t=1}^{T} p(o_t | q_t, \lambda)\, P(q | l, \lambda)   (q_t: hidden state at time t)
                  = \sum_{\forall q} \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_{q_t}, \Sigma_{q_t})\, P(q | l, \lambda)
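The inner term of this likelihood is simple to compute once an alignment is fixed. Here is a minimal sketch of the log-probability of a feature sequence under one state sequence q with diagonal-covariance Gaussian emissions; the full model additionally sums over all alignments (e.g., with the forward algorithm), which is omitted here.

import numpy as np

def log_lik_fixed_alignment(o, q, means, variances, log_trans):
    """o: (T, D) features; q: (T,) state indices; means/variances: (S, D)."""
    ll = 0.0
    for t in range(len(o)):
        m, v = means[q[t]], variances[q[t]]
        ll += -0.5 * np.sum(np.log(2 * np.pi * v) + (o[t] - m) ** 2 / v)
        if t > 0:
            ll += log_trans[q[t - 1], q[t]]   # contribution of P(q | l, λ)
    return ll

T, D, S = 10, 3, 2                            # toy sizes
print(log_lik_fixed_alignment(np.zeros((T, D)), np.zeros(T, dtype=int),
                              np.zeros((S, D)), np.ones((S, D)),
                              np.log(np.full((S, S), 0.5))))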

HMM-based generative acoustic model for TTS

• Non-smooth, step-wise statistics → smoothing is essential

• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)

• Data fragmentation → ineffective, local representation

A lot of research has been done to address these issues

Alternative acoustic model

HMM: handles variable length & alignment
Decision tree: maps linguistic → acoustic

[Figure: a binary (yes/no) regression tree mapping linguistic features l to statistics of acoustic features o.]

Regression tree: linguistic features → statistics of acoustic features

Replace the tree with a general-purpose regression model → artificial neural network


FFNN-based acoustic model for TTS [6]

Input: frame-level linguistic feature l_t → Target: frame-level acoustic feature o_t

h_t = g(W_{hl} l_t + b_h)
\hat{o}_t = W_{oh} h_t + b_o
\lambda = \{W_{hl}, W_{oh}, b_h, b_o\}
\hat{\lambda} = \arg\min_{\lambda} \sum_t \| o_t - \hat{o}_t \|^2

\hat{o}_t \approx E[o_t | l_t] → replaces decision trees & Gaussian distributions
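A minimal numpy sketch of this model: one tanh hidden layer and a linear output, evaluated under the squared error above. The weight sizes and random data are placeholders; training would backpropagate this loss.

import numpy as np

def ffnn_predict(L, W_hl, b_h, W_oh, b_o):
    H = np.tanh(L @ W_hl.T + b_h)       # h_t = g(W_hl l_t + b_h)
    return H @ W_oh.T + b_o             # o^_t = W_oh h_t + b_o

rng = np.random.default_rng(0)
L = rng.standard_normal((100, 20))      # 100 frames of linguistic features
O = rng.standard_normal((100, 5))       # matching acoustic features
W_hl, b_h = 0.1 * rng.standard_normal((64, 20)), np.zeros(64)
W_oh, b_o = 0.1 * rng.standard_normal((5, 64)), np.zeros(5)
loss = np.sum((O - ffnn_predict(L, W_hl, b_h, W_oh, b_o)) ** 2)   # Σ_t ||o_t − o^_t||²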

RNN-based acoustic model for TTS [7]

Input: frame-level linguistic feature l_t → Target: frame-level acoustic feature o_t (with recurrent connections in the hidden layer)

h_t = g(W_{hl} l_t + W_{hh} h_{t-1} + b_h)
\hat{o}_t = W_{oh} h_t + b_o
\lambda = \{W_{hl}, W_{hh}, W_{oh}, b_h, b_o\}
\hat{\lambda} = \arg\min_{\lambda} \sum_t \| o_t - \hat{o}_t \|^2

FFNN: \hat{o}_t \approx E[o_t | l_t]   RNN: \hat{o}_t \approx E[o_t | l_1, \ldots, l_t]
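Only the hidden-state update changes relative to the FFNN sketch above. A minimal version of the recurrence, showing why the prediction now depends on the whole input history l_1, ..., l_t:

import numpy as np

def rnn_predict(L, W_hl, W_hh, b_h, W_oh, b_o):
    h, out = np.zeros(b_h.shape), []
    for l_t in L:                                    # left-to-right over frames
        h = np.tanh(W_hl @ l_t + W_hh @ h + b_h)     # h_t = g(W_hl l_t + W_hh h_{t-1} + b_h)
        out.append(W_oh @ h + b_o)                   # o^_t ≈ E[o_t | l_1, ..., l_t]
    return np.stack(out)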

NN-based generative acoustic model for TTS

• Non-smooth, step-wise statistics → RNN predicts smoothly varying acoustic features [7, 8]

• Difficult to use high-dimensional acoustic features (e.g., raw spectra) → layered architecture can handle high-dimensional features [9]

• Data fragmentation → distributed representation [10]

The NN-based approach is now mainstream in research & products
• Models: FFNN [6], MDN [11], RNN [7], Highway network [12], GAN [13]
• Products: e.g., Google [14]


Outline

Generative TTS

Generative acoustic models for parametric TTS: Hidden Markov models (HMMs), Neural networks

Beyond parametric TTS: Learned features, WaveNet, End-to-end

Conclusion & future topics

NN-based generative model for TTS

[Figure: the speech production process again, from text (concept) through modulation of the carrier wave by speech information (sound source: voiced pulse / unvoiced noise, fundamental frequency, voiced/unvoiced, frequency transfer characteristics) to speech.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform

Knowledge-based features → Learned features

Unsupervised feature learning

[Figure: an auto-encoder compresses the raw FFT spectrum x(t) into a low-dimensional acoustic feature o(t); word embeddings map 1-hot word representations w(n) ("hello", "world") to continuous linguistic features l(n).]

• Speech: auto-encoder on FFT spectra [9, 15] → positive results
• Text: word [16], phone & syllable [17] → less positive
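A minimal sketch of the auto-encoder idea for speech: a raw spectrum is squeezed through a low-dimensional bottleneck that serves as the learned acoustic feature. Sizes and the random weights are placeholders; training (omitted) would minimize the reconstruction error.

import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_feat = 513, 32                       # FFT bins → learned feature size
W_enc = 0.01 * rng.standard_normal((dim_feat, dim_in))
W_dec = 0.01 * rng.standard_normal((dim_in, dim_feat))

def encode(x):
    return np.tanh(W_enc @ x)                    # learned acoustic feature o(t)

def decode(o):
    return W_dec @ o                             # reconstructed spectrum

x = np.abs(rng.standard_normal(dim_in))          # stand-in for a raw FFT spectrum
recon_error = np.sum((x - decode(encode(x))) ** 2)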

Relax approximation: Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:

\hat{O} = \arg\max_O p(X | O)
\hat{\lambda} = \arg\max_{\lambda} p(\hat{O} | \hat{L}, \lambda)\, p(\lambda)

⇓

\{\hat{\lambda}, \hat{O}\} = \arg\max_{\lambda, O} p(X | O)\, p(O | \hat{L}, \lambda)\, p(\lambda)

Joint source-filter & acoustic model optimization
• HMM [18, 19, 20]
• NN [21, 22]

Relax approximation: Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [22]

[Figure: linguistic features l_t drive an LSTM-RNN that outputs cepstra of the voiced and unvoiced components; a pulse train p_n and noise excitation pass through the corresponding synthesis filters (e.g., H_u^{-1}(z)) to produce speech x_n. Forward propagation generates speech; back-propagation passes derivatives through the synthesis filter into the network.]

Relax approximation: Direct mapping from linguistic features to waveform

No explicit acoustic features:

\{\hat{\lambda}, \hat{O}\} = \arg\max_{\lambda, O} p(X | O)\, p(O | \hat{L}, \lambda)\, p(\lambda)

⇓

\hat{\lambda} = \arg\max_{\lambda} p(X | \hat{L}, \lambda)\, p(\lambda)

Generative models for raw audio
• LPC [23]
• WaveNet [24]
• SampleRNN [25]


WaveNet: A generative model for raw audio

Autoregressive (AR) modelling of speech signals:

x = \{x_0, x_1, \ldots, x_{N-1}\} : raw waveform

p(x | \lambda) = p(x_0, x_1, \ldots, x_{N-1} | \lambda) = \prod_{n=0}^{N-1} p(x_n | x_0, \ldots, x_{n-1}, \lambda)

WaveNet [24] → p(x_n | x_0, \ldots, x_{n-1}, \lambda) is modeled by a convolutional NN

Key components
• Causal dilated convolution: captures long-term dependencies
• Gated convolution + residual + skip connections: powerful non-linearity
• Softmax at the output: classification rather than regression
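The AR factorization translates directly into a sample-by-sample generation loop. A minimal sketch, where `model` is a stand-in function returning a 256-way categorical distribution (a real WaveNet computes it with the causal dilated convolutions described next):

import numpy as np

def ar_sample(model, n_samples, n_classes=256, seed=0):
    rng = np.random.default_rng(seed)
    x = []
    for _ in range(n_samples):
        probs = model(x)                  # p(x_n | x_0, ..., x_{n-1}, λ)
        x.append(rng.choice(n_classes, p=probs))
    return np.array(x)

uniform = lambda history: np.full(256, 1 / 256)   # toy stand-in for the network
print(ar_sample(uniform, 5))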

WaveNet – Causal dilated convolution

100 ms at 16 kHz sampling = 1,600 time steps → too long to be captured by a normal RNN/LSTM

Dilated convolution: exponentially increases the receptive field size w.r.t. the number of layers

[Figure: a stack of causal convolutions with dilations 1, 2, 4, 8; with three hidden layers the output models p(x_n | x_{n-1}, ..., x_{n-16}).]
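The exponential growth of the receptive field is easy to verify: with kernel size 2, each layer with dilation d reaches d further back. A small sketch matching the figure:

def receptive_field(dilations, kernel_size=2):
    rf = 1
    for d in dilations:
        rf += d * (kernel_size - 1)   # each layer reaches d·(k−1) further back
    return rf

print(receptive_field([1, 2, 4, 8]))                              # → 16, as in the figure
print(receptive_field([1, 2, 4, 8, 16, 32, 64, 128, 256, 512]))   # → 1024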

WaveNet – Non-linearity

[Figure: the waveform passes through a stack of residual blocks. Within each block, a 2x1 dilated convolution feeds a gated activation; a 1x1 convolution produces the output, which is added back to the block input (residual connection) and also sent to a skip connection. The summed skip connections go through ReLU → 1x1 convolution (256) → ReLU → 1x1 convolution (256) → softmax over 256 values, producing p(x_n | x_{n-1}, ...).]
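A minimal numpy sketch of one such block: a causal dilated convolution feeds the tanh/sigmoid gated activation described in the WaveNet paper, and 1x1 projections produce the residual and skip outputs. The kernel size 2 matches the figure; channel sizes are left to the caller.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def causal_dilated_conv(x, w, dilation):
    """x: (T, C_in); w: (2, C_in, C_out) — kernel size 2, padded causally."""
    x_past = np.pad(x, ((dilation, 0), (0, 0)))[: len(x)]   # x_{t-dilation}
    return x_past @ w[0] + x @ w[1]

def residual_block(x, w_f, w_g, w_res, w_skip, dilation):
    z = np.tanh(causal_dilated_conv(x, w_f, dilation)) * \
        sigmoid(causal_dilated_conv(x, w_g, dilation))      # gated activation
    return x + z @ w_res, z @ w_skip                        # residual out, skip out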

WaveNet – Softmax

[Figure, three steps: an analog audio signal is sampled & quantized, so each sample becomes a category index; the softmax then outputs a categorical distribution over these indices, effectively a histogram over amplitude bins.]

Categorical distribution → histogram
- Unimodal
- Multimodal
- Skewed
- ...
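The quantization step that makes the softmax possible is the 8-bit μ-law companding used in the WaveNet paper: each waveform sample in [−1, 1] becomes one of 256 category indices. A minimal sketch:

import numpy as np

def mu_law_encode(x, mu=255):
    """x: float samples in [-1, 1] → integer categories in [0, 255]."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)

# monotone, perceptually motivated map: -1 → 0, 0 → 128, +1 → 255
print(mu_law_encode(np.array([-1.0, -0.1, 0.0, 0.1, 1.0])))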

WaveNet – Conditional modelling

[Figure: linguistic features l are embedded at each time step n into h_n, which is fed through 1x1 convolutions into the gated activation of every residual block, so the network models p(x_n | x_{n-1}, ..., l).]
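Conditioning changes only the gated activation: the linguistic embedding enters the filter and gate through learned 1x1 projections, as in the conditional WaveNet formulation. A minimal sketch extending the residual block above:

import numpy as np

def conditional_gated_activation(conv_f, conv_g, h, v_f, v_g):
    """conv_f, conv_g: (T, C) dilated-conv outputs; h: (T, E) linguistic embedding."""
    gate = 1.0 / (1.0 + np.exp(-(conv_g + h @ v_g)))
    return np.tanh(conv_f + h @ v_f) * gate   # z = tanh(W_f*x + V_f h) ⊙ σ(W_g*x + V_g h)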

WaveNet vs conventional audio generative models

Assumptions in conventional audio generative models [23, 26, 27]:
• Stationary process w/ fixed-length analysis window → model is estimated within a 20–30 ms window w/ 5–10 ms shift
• Linear, time-invariant filter within a frame → but the relationship between samples can be non-linear
• Gaussian process → assumes speech signals are normally distributed

WaveNet
• Sample-by-sample, non-linear, capable of taking additional inputs
• Arbitrarily-shaped signal distribution

State-of-the-art subjective naturalness with WaveNet-based TTS [24]
[Figure: mean opinion scores for HMM, LSTM, concatenative, and WaveNet systems.]

Relax approximation: Towards Bayesian end-to-end TTS

Integrated end-to-end:

\hat{L} = \arg\max_L p(L | W)
\hat{\lambda} = \arg\max_{\lambda} p(X | \hat{L}, \lambda)\, p(\lambda)

⇓

\hat{\lambda} = \arg\max_{\lambda} p(X | W, \lambda)\, p(\lambda)

Text analysis is integrated into the model

Relax approximation: Towards Bayesian end-to-end TTS

Bayesian end-to-end:

\hat{\lambda} = \arg\max_{\lambda} p(X | W, \lambda)\, p(\lambda)
\bar{x} \sim f_x(w, \hat{\lambda}) = p(x | w, \hat{\lambda})

⇓

\bar{x} \sim f_x(w, X, W) = p(x | w, X, W) = \int p(x | w, \lambda)\, p(\lambda | X, W)\, d\lambda \approx \frac{1}{K} \sum_{k=1}^{K} p(x | w, \hat{\lambda}_k)   (ensemble)

Marginalize model parameters & architecture
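The ensemble line lends itself to a direct sketch: train K models independently and average their predictive distributions. `models` below is a stand-in list of functions, each returning a categorical distribution over the next sample.

import numpy as np

def ensemble_predictive(models, w, history):
    probs = [m(w, history) for m in models]   # p(x_n | ·, λ^_k) for each k
    return np.mean(probs, axis=0)             # ≈ (1/K) Σ_k p(x | w, λ^_k)

models = [lambda w, h: np.full(4, 0.25) for _ in range(3)]   # toy "trained" models
print(ensemble_predictive(models, "hello", []))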

Outline

Generative TTS

Generative acoustic models for parametric TTS: Hidden Markov models (HMMs), Neural networks

Beyond parametric TTS: Learned features, WaveNet, End-to-end

Conclusion & future topics

Generative model-based text-to-speech synthesis

• Bayes formulation + factorization + approximations

• Representation: acoustic features, linguistic features, mapping
  - Mapping: rules → HMM → NN
  - Features: engineered → unsupervised, learned

• Fewer approximations
  - Joint training, direct waveform modelling
  - Moving towards integrated & Bayesian end-to-end TTS

Naturalness: Concatenative ≈ Generative
Flexibility: Concatenative ≪ Generative (e.g., multiple speakers)

Beyond “text”-to-speech synthesis

TTS on conversational assistants

• Texts aren't fully self-contained

• More context is needed
  - Location, to resolve homographs
  - User query, to put the right emphasis

We need a representation that can organize the world's information & make it accessible & useful from TTS generative models


Beyond “generative” TTS

Generative model-based TTS

• The model represents the process behind speech production
  - Trained to minimize the error against human-produced speech
  - Learned model → speaker

• Speech is for communication
  - Goal: maximize the amount of information received

Missing: the “listener” → put a “listener” into the training / into the model itself?


Thanks!


References

[1] D. Klatt. Real-time speech synthesis by rule. Journal of the Acoustical Society of America, 68(S1), 1980.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] K. Tokuda. Speech synthesis as a statistical machine learning problem. https://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf. Invited talk given at ASRU 2011.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. IEICE Trans. Inf. Syst., J83-D-II(11), 2000. (in Japanese).

[5] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Communication, 51(11), 2009.

[6] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014.

[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. http://research.google.com/pubs/pub44630.html. Invited talk given at ASRU 2015.

[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, 2016.

[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representations. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.

[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 2014.

[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW9, 2016.

[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.

[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.

[15] P. Muthukumar and A. Black. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.

[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proc. ICASSP, 2015.

[17] X. Wang, S. Takaki, and J. Yamagishi. Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech synthesis. IEICE Trans. Inf. Syst., E99-D, 2016.

[18] T. Toda and K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. ICASSP, 2008.

[19] Y.-J. Wu and K. Tokuda. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, 2008.

[20] R. Maia, H. Zen, and M. Gales. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In Proc. ISCA SSW7, 2010.

[21] K. Tokuda and H. Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In Proc. ICASSP, 2015.

[22] K. Tokuda and H. Zen. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In Proc. ICASSP, 2016.

[23] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, 53-A, 1970.

[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.

[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.

[26] S. Imai and C. Furuichi. Unbiased estimation of log spectrum. In Proc. EURASIP, 1988.

[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux. Speech analysis with multi-kernel linear prediction. In Proc. Spring Conference of ASJ, 2010. (in Japanese).

[Figure: summary of the progression of generative TTS formulations: (1) Bayesian; (2) auxiliary variables + factorization; (3) joint maximization; (4) step-by-step maximization, e.g., statistical parametric TTS; (5) joint acoustic feature extraction + model training; (6) conditional WaveNet-based TTS; (7) integrated end-to-end; (8) Bayesian end-to-end.]