22
Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011 Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda Nagoya Institute of Technology 2 September, 2011

Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

  • Upload
    caia

  • View
    39

  • Download
    0

Embed Size (px)

DESCRIPTION

Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011. Kei Hashimoto, Shinji Takaki, Keiichiro Oura, and Keiichi Tokuda Nagoya Institute of Technology 2 September, 2011. Background. HMM-based speech synthesis Quality of synthesized speech depends on acoustic models - PowerPoint PPT Presentation

Citation preview

Page 1: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Overview of NIT HMM-basedspeech synthesis system

for Blizzard Challenge 2011

Kei Hashimoto, Shinji Takaki, Keiichiro Oura,

and Keiichi Tokuda

Nagoya Institute of Technology

2 September, 2011

Page 2: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Background

HMM-based speech synthesis Quality of synthesized speech depends on

acoustic models Model estimation is one of the most important

problem

Appropriate training algorithm is required Deterministic annealing EM (DAEM) algorithm

To overcome the local maxima problem Step-wise model selection

To perform the joint optimization of model structures and state sequences

2

Page 3: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Outline

HMM-based speech synthesis system

Deterministic annealing EM (DAEM) algorithm

Step-wise model selection

Experiments

Conclusion & future work

3

Page 4: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Overview of HMM-based system

4

Speech signal

Label

Contest-dependent HMMs & duration models Training part

Label

Spectralparameters

Synthesis part

Synthesizedspeech

Speechdatabase

Excitation parametersextraction

Spectral parametersextraction

Training of HMM

Parameter generationfrom HMM

Text analysis

Excitationparameters

Synthesisfilter

Excitationgeneration

TEXT

Page 5: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Base techniques

Hidden semi-Markov Model (HSMM) HMM with explicit state duration probability dist. Estimate state output and duration probability dists.

STRAIGHT A high quality speech vocoding method Spectrum, F0, and aperiodicity measures

Parameter generation considering GV Calculate GV features from only speech region

excluding silence and pause Context dependent GV models

5

Page 6: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Outline

HMM-based speech synthesis system

Deterministic annealing EM (DAEM) algorithm

Step-wise model selection

Experiments

Conclusion & future work

6

Page 7: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

EM algorithm

Maximum likelihood (ML) criterion

Expectation Maximization (EM) algorithm

7

: Model parameter: Training data: HMM state seq.

・ E-step:

・ M-step:

Occur the local maxima problem

Page 8: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

DAEM algorithm

Posterior probability

Model update process

8

: Temperature parameter

・ E-step:

・ M-step:

・ Increase temperature parameter

Page 9: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Optimization of state sequence

Likelihood function in the DAEM algorithm

9

Time

All state sequences have uniform probability

State output probability State transition probability

Page 10: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Optimization of state sequence

Likelihood function in the DAEM algorithm

10

Time

Change from uniform to sharp

State output probability State transition probability

Page 11: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Optimization of state sequence

Likelihood function in the DAEM algorithm

11

Time

State output probability

Estimate reliable acoustic models

State transition probability

Page 12: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Outline

HMM-based speech synthesis system

Deterministic annealing EM (DAEM) algorithm

Step-wise model selection

Experiments

Conclusion & future work

12

Page 13: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Problem of context clustering

Context-dependent model Appropriate model structures are required

Decision tree based context clustering

Assumption: state occupancies are not changed State occupancies depend on model structures State sequences and model structures should be

optimized simultaneously13

/a/? Silence?

Vowel?

Page 14: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Step-wise model selection

Gradually change the size of decision tree

Perform joint optimization of model structures and state sequences

Minimum Description Length (MDL) criterion

14: Dimension of feature vec.: Number of nodes

: Amount of training data assigned to the root node

: Tuning parameter

Page 15: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Model training process

1. Estimate monophone models (DAEM) # of temperature parameter updates is 10 # of EM-steps at each temperature is 5

2. Select decision trees by the MDL criterion using the tuning parameter

3. Estimate context-dependent models (EM) # of EM-steps is 5

4. Decrease the tuning parameter Tuning parameter decreases as 4, 2, 1

5. Repeat from step. 2

15

Page 16: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Outline

HMM-based speech synthesis system

Deterministic annealing EM (DAEM) algorithm

Step-wise model selection

Experiments

Conclusion & future work

16

Page 17: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Speech analysis conditions

17

Training data10,000 utterances

(pruned by the alignment likelihood)

Sampling rate 48 kHz

Window F0-adaptive Gaussian window

Frame shift 5 ms

Feature vector

49-dim. STRAIGHT mel-cepstrum, log F0

26 band-filtered aperiodicity measure

+ Δ + ΔΔ (231 dimension)

HMM5-state left-to-right HSMM

without skip transition

Page 18: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Likelihood & model structure

Average log likelihood of monophone model

Number of leaf nodes

Phone set: Unilex (58 phoneme) Number of leaf nodes (Full-context): 6,175,466

18

EM DAEM

Ave. Log Likelihood 227.716 229.174

Tuning parameter Mel-Cep. Log F0 Dur. Sum

Monophone 290 290 58 638

4 1,934 3,454 914 6,302

2 3,270 7,899 1,760 12,929

1 11,721 24,897 3,923 40,541

Page 19: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Experimental results

Compare with the benchmark HMM-based system NIT system achieved the same performance High intelligibility

Compare with the benchmark unit-selection system Worse in speaker similarity Better in intelligibility

19

HMM-based(16kHz)

HMM-based(48kHz)

Unit-selection

Naturalness ― ― ―

Speaker similarity ― ― ×

Intelligibility ― ― ○

Page 20: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Speech samples

Generate high intelligible speech Include voiced/unvoiced errors Need to improve feature extraction and excitation

20

Speech samples

Original

NIT system

Page 21: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Conclusion

NIT HMM-based speech synthesis system DAEM algorithm

Overcome the local maxima problem Step-wise model selection

Perform joint optimization of state sequences and model structures

Generate high intelligible speech

Future work Improve feature extraction and excitation Investigate the schedule of temperature parameters

and step-wise model selection21

Page 22: Overview of NIT HMM-based speech synthesis system for Blizzard Challenge 2011

Thank you