Prosody-Controllable HMM-Based Speech Synthesis Using Speech Input

2015©Shinnosuke TAKAMICHI

09/19/2015

Yuri Nishigaki, Shinnosuke Takamichi, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura (NAIST)

MLSLP2015 in Aizu Univ.

Speech-based creative activities and HMM-based speech synthesis

[Figure: creative activities using singing voice (advertisement, live concert) and speech (narration). Next? Video avatar, voice actor, …]

Useful method: HMM-based speech synthesis [Tokuda et al., 2013]

[Figure: text (“Synthesize!”) is converted to synthetic speech parameters, and then to speech.]

Manual control of synthetic speech

– Multi-Regression HMM [Nose et al., 2007]: the user controls styles such as "laugh" and "sad" through regression.

– Manually manipulating HMM parameters

Both are very useful, but it is difficult to get exactly the result the user wants.

Motivation of this study

Functions we want

– Original capability of HMM-based TTS

– Speech-based control

• Intuitive to control

• Make synthetic speech mimic input speech prosody

Our work

– Speech synthesis having both functions

[Figure: the system synthesizes “Synthesize” from text and the user's spoken “Synthesize.”, alongside control by MR-HMM etc.]

Similar to VocaListener for singing voice control

Overview of the proposed system (Only text is input.)

[Figure: original HMM-based speech synthesis. Input text → text analysis → parameter generation (synthesis HMM) → waveform generation → synthetic speech.]

Overview of the proposed system (Text & speech are input.)

[Figure: block diagram. Input text passes through text analysis and input speech through speech analysis. Duration extraction (alignment HMM, synthesis HMM) feeds parameter generation (synthesis HMM), whose F0 goes through F0 modification before waveform generation produces the synthetic speech.]

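Duration extraction module

[Figure: the features of the input speech and the context of the input text are aligned with the alignment HMM (HMM alignment) to obtain the duration of the input speech. Duration generation with the synthesis HMM then gives the state duration of the synthetic speech, which is passed to parameter generation (Parm. gen.).]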

Alignment accuracy & duration unit

How to build alignment HMMs suitable for input speech?

– Use pre-recorded speech uttered by the user.

– Large amounts → user-dependent HMMs

– Small amounts → HMMs adapted from original alignment HMMs

How to map the input speech duration to synthetic speech?

– The states of the alignment HMMs and of the synthesis HMMs represent different speech segments.

– Which duration unit is best: HMM-state, phone, or mora level?
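
One way to use a phone-level duration unit is sketched below: the phone duration measured on the input speech is shared among the states of the synthesis HMM in proportion to the state durations the synthesis HMM would generate on its own. The proportional rule, the function name `rescale_state_durations`, and the one-frame-per-state minimum are assumptions for illustration; the slides do not specify the mapping.

```python
def rescale_state_durations(model_state_durs, target_total):
    """Scale one phone's synthesis-HMM state durations to a target frame count.

    model_state_durs: positive frame counts generated by the synthesis HMM
                      (hypothetical input for this sketch)
    target_total:     phone duration in frames, taken from the input speech
    """
    n_states = len(model_state_durs)
    if target_total < n_states:
        raise ValueError("target duration is shorter than one frame per state")
    model_total = sum(model_state_durs)
    # Reserve one frame per state, then share the rest proportionally.
    spare = target_total - n_states
    shares = [d / model_total * spare for d in model_state_durs]
    durs = [1 + int(s) for s in shares]
    # Hand out the frames lost to truncation, largest fractional part first.
    leftover = target_total - sum(durs)
    order = sorted(range(n_states),
                   key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        durs[i] += 1
    return durs


# A 5-state phone that the synthesis HMM would render in 24 frames,
# stretched to the 30 frames observed in the input speech.
print(rescale_state_durations([3, 5, 8, 5, 3], 30))  # [4, 6, 10, 6, 4]
```

With a state-level unit no redistribution is needed, but the result depends on alignment and synthesis states covering the same segments. A mora-level unit would apply the same rule over all states of the phones in one mora.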

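Speech parameter generation module

[Figure: parameter generation uses the synthesis HMM, the context of the input text, and the state duration from duration extraction (Dur. ext.). It outputs the spectrum of the synthetic speech, passed to waveform generation (Wav. gen.), and the F0 generated from HMMs, passed to F0 modification (F0 mod.).]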

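F0 modification module

[Figure: F0 conversion takes the features of the input speech and the F0 generated from HMMs by parameter generation (Parm. gen.). U/V region modification then yields the F0 of the synthetic speech, which is passed to waveform generation (Wav. gen.).]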

F0 conversion & unvoiced/voiced modification

[Figure: F0 over time for the reference (generated from HMMs), the input speech, the F0-converted contour (linear conversion), and the U/V-modified contour (spline interpolation).]

F0 conversion adjusts the F0 range of the input speech to fit that of the reference.

U/V modification adjusts the U/V regions of the input speech to match those of the reference.
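
The two steps above can be sketched as follows. The slides name only "linear conversion" and "spline interpolation", so the details here are assumptions: the linear conversion is taken to be mean and standard-deviation matching of voiced log F0, the spline is a natural cubic spline through the voiced frames, unvoiced frames are marked by 0, and both contours are assumed to be frame-aligned already (as they would be once the input duration has been copied). All function names are hypothetical.

```python
import bisect
import math
from statistics import mean, pstdev

UNVOICED = 0.0  # assumed marker for unvoiced frames


def convert_f0_range(input_f0, reference_f0):
    """Linearly map voiced log F0 of the input onto the reference's mean and spread."""
    log_in = [math.log(f) for f in input_f0 if f > 0]
    log_ref = [math.log(f) for f in reference_f0 if f > 0]
    mu_in, sd_in = mean(log_in), pstdev(log_in)
    mu_ref, sd_ref = mean(log_ref), pstdev(log_ref)
    scale = sd_ref / sd_in if sd_in > 0 else 1.0
    return [math.exp((math.log(f) - mu_in) * scale + mu_ref) if f > 0 else UNVOICED
            for f in input_f0]


def natural_cubic_spline(xs, ys):
    """Return a function that evaluates the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # second derivatives; natural ends: m[0] = m[n] = 0
    cp, dp = [], []      # Thomas algorithm for the interior second derivatives
    for i in range(1, n):
        rhs = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
        denom = 2.0 * (h[i - 1] + h[i])
        if cp:
            denom -= h[i - 1] * cp[-1]
            rhs -= h[i - 1] * dp[-1]
        cp.append(h[i] / denom)
        dp.append(rhs / denom)
    for i in range(n - 1, 0, -1):
        m[i] = dp[i - 1] - cp[i - 1] * m[i + 1]

    def evaluate(t):
        i = min(max(bisect.bisect_right(xs, t) - 1, 0), n - 1)
        a, b = xs[i + 1] - t, t - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate


def modify_uv(converted_f0, reference_f0):
    """Make the voiced/unvoiced pattern of converted_f0 match reference_f0."""
    voiced = [i for i, f in enumerate(converted_f0) if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced input frames to interpolate")
    spline = natural_cubic_spline(voiced, [math.log(converted_f0[i]) for i in voiced])
    out = []
    for i, (f, ref) in enumerate(zip(converted_f0, reference_f0)):
        if ref <= 0:
            out.append(UNVOICED)  # V->U: the reference is unvoiced here
        elif f > 0:
            out.append(f)         # voiced in both: keep the input contour
        else:
            # U->V: fill from the spline; hold the edge value outside the voiced span
            t = min(max(i, voiced[0]), voiced[-1])
            out.append(math.exp(spline(t)))
    return out
```

Interpolating in the log-F0 domain keeps the filled values positive and matches the log F0 feature listed in the experimental setup; interpolating in Hz would be an equally plausible reading of the slide.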

EXPERIMENTAL EVALUATION

Experimental Setup

Content                           Value/Setting
User                              4 Japanese speakers (2 male & 2 female)
Target speaker                    1 Japanese female speaker
Training data of synthesis HMMs   450 phoneme-balanced sentences, 16 kHz sampling, 5 ms shift, reading style
Evaluation data                   53 phoneme-balanced sentences
Speech features                   25-dim. mel-cepstrum, log F0, 5-band aperiodicity
Speech analyzer                   STRAIGHT [Kawahara et al., 1999]
Text analyzer                     Open JTalk
Acoustic model                    5-state HSMM [Zen et al., 2007]

1. duration unit & alignment HMM adaptation

2. synthesis HMM adaptation

3. effect of U/V modification

Evaluation 1: duration unit & alignment HMM adaptation

3 duration units

– State / phoneme / mora-level duration

4 HMMs using different amounts of pre-recorded speech

– 0 … target-speaker-dependent HMMs (= synthesis HMM)

– 1 … HMMs adapted using 1 utterance uttered by the user

– 56 … HMMs adapted using 56 utterances

– 450 … user-dependent HMMs

Evaluation

– MOS test on naturalness of synthetic speech

– DMOS test on prosody mimicking ability of synthetic speech

• Input speech is presented as reference.

Result 1: duration unit & alignment HMM adaptation

[Figure: two bar charts, MOS on naturalness and DMOS on prosody mimicking ability (scale 1 to 5), for state, phone, and mora-level duration units with alignment HMMs built from 0, 1, 56, and 450 utterances. Each chart carries a "No significant diff." annotation.]

We can confirm that (1) adaptation is effective, and (2) phoneme-level duration is relatively robust.

Experiment 2: Effectiveness of U/V modification in naturalness

[Figure: two bar charts for Spkr1 to Spkr4. One shows the preference score on naturalness [%] (0 to 100) without and with modification. The other shows the U/V modification ratio [%] (0 to 20), split into U->V and V->U modification.]

U/V modification can improve naturalness, especially when many U frames of the input speech are fixed.

Conclusion

2 functions to control synthetic speech

– An original function of HMM-based TTS

• MR-HMM or manual control

– Speech-based control

• Intuitive for users

2 main modules of our system

– Mimic duration.

• Copy duration of input speech to synthetic speech.

– Mimic F0 patterns.

• Copy dynamic F0 pattern of input speech to synthetic speech.

Future work

– HMM selection using text & speech
