LEARNING AND MODELING UNIT EMBEDDINGS FOR IMPROVING HMM …home.ustc.edu.cn/~xiaozh/Interspeech2018/pdf/poster.pdf · 2018. 8. 28. · LEARNING AND MODELING UNIT EMBEDDINGS FOR IMPROVING

LEARNING AND MODELING UNIT EMBEDDINGS FOR IMPROVING HMM-BASED UNIT SELECTION SPEECH SYNTHESIS

Xiao Zhou, Zhen-Hua Ling, Zhi-Ping Zhou, Li-Rong Dai

National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R. China

Abstract

• This paper presents a method that uses unit embeddings, called unit vectors, to improve HMM-based unit selection speech synthesis.

• The model is composed of three DNNs: Unit2Vec, which learns the unit embeddings, and two further DNNs, which model the cost calculations.

• The concatenation cost can capture long-term dependencies between adjacent units.

• The method improves on the traditional HMM-based method.

• Unit vectors display phone-dependent clustering properties.

Unit2Vec – to learn unit embedding
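As a rough illustration of what learning a unit embedding can look like, the sketch below keeps one row of an embedding matrix per unit and maps that row to an acoustic feature vector through a small network. Only the 43-dimensional acoustic feature size comes from the poster; the number of units, the embedding size, the hidden width, the single hidden layer, and all function names are hypothetical, and only the forward pass is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_UNITS = 1000    # hypothetical number of units in the corpus
EMBED_DIM = 32      # hypothetical unit-vector size
HIDDEN_DIM = 64     # hypothetical hidden-layer width
ACOUSTIC_DIM = 43   # acoustic feature size stated on the poster

# Embedding matrix: one trainable row (unit vector) per unit.
embedding = rng.normal(scale=0.1, size=(NUM_UNITS, EMBED_DIM))
w1 = rng.normal(scale=0.1, size=(EMBED_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
w2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, ACOUSTIC_DIM))
b2 = np.zeros(ACOUSTIC_DIM)


def unit_vectors(unit_ids):
    """Look up the fixed-length vectors of the given unit indices."""
    return embedding[np.asarray(unit_ids)]


def predict_acoustics(unit_ids):
    """Map unit vectors to acoustic features (assumed one-hidden-layer net)."""
    hidden = np.tanh(unit_vectors(unit_ids) @ w1 + b1)
    return hidden @ w2 + b2


def reconstruction_loss(unit_ids, targets):
    """Mean squared error between predicted and reference acoustic features."""
    return float(np.mean((predict_acoustics(unit_ids) - targets) ** 2))
```

In a trained model the rows of `embedding` would be updated by backpropagating this loss, and the resulting matrix would then be stored for use in cost calculation.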

Experiments

Average preference scores (%) on speech quality using the Chinese corpus (N/P: no preference)

Baseline   Prop_TC   Prop_All   N/P     p
40.00      31.67     -          28.33   0.0882
15.33      -         55.67      29.00   <0.001
-          15.00     52.67      32.33   <0.001

• Prop_All > Baseline and Prop_All > Prop_TC (p < 0.001 in both comparisons).

• Prop_TC vs. Baseline: similar preference (p > 0.05).

• Prop_TC makes the speech sound more expressive than Baseline, but it introduces more glitches.

Calculate target and concatenation cost

• Because the linguistic features are high-dimensional, the 𝑓𝑐 model extracts bottleneck (BN) features from them.

• 𝐶targ measures the overall acoustic difference between the target and a candidate unit.

• 𝐶con measures the long-term dependencies between a candidate unit and the preceding units.

• A pruned search strategy reduces the amount of computation.
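The poster does not say which pruning strategy is used. As one possible reading, the sketch below uses beam search as a stand-in: at each target position only the `beam_width` cheapest partial paths are kept, and the concatenation cost sees up to `history` preceding units. The function name, its parameters, and the callback signatures are all hypothetical.

```python
import heapq


def beam_search(candidates, target_cost, concat_cost, beam_width=3, history=2):
    """Select one candidate per position while keeping few partial paths.

    candidates:  list of candidate-id lists, one list per target position
    target_cost: callable (position, candidate) -> float
    concat_cost: callable (position, previous_units, candidate) -> float,
                 where previous_units is a tuple of up to `history` units
    Returns (total_cost, path) of the cheapest surviving path.
    """
    beams = [(0.0, ())]  # (accumulated cost, selected units so far)
    for position, position_candidates in enumerate(candidates):
        expanded = []
        for cost, path in beams:
            # path[-0:] would return the whole path, so handle history == 0.
            previous = path[-history:] if history > 0 else ()
            for cand in position_candidates:
                step = target_cost(position, cand)
                step += concat_cost(position, previous, cand)
                expanded.append((cost + step, path + (cand,)))
        # Pruning: discard all but the cheapest partial paths.
        beams = heapq.nsmallest(beam_width, expanded, key=lambda item: item[0])
    return beams[0]
```

With a history-dependent concatenation cost, an exact search would have to track every combination of preceding units, which is why some form of pruning is needed; the result of a beam search is therefore approximate.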

Conditions

Database: Chinese newspaper corpus with 12219 utterances from a female speaker. Training/validation/test sets: 11608/611/100.

Acoustic features: 12th-order MCCs, 1-dimensional power, 1-dimensional F0, and a 1-dimensional binary U/V flag. With dynamic features included, the total is (12+1+1)*3+1 = 43 dimensions.

Systems: Baseline, Prop_TC, Prop_All. Prop_TC replaces part of the target cost in Baseline. Prop_All further replaces part of the concatenation cost in Prop_TC.
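The feature dimension given in the conditions can be checked with a few lines of arithmetic. The factor of 3 is read here as static, delta, and delta-delta streams, which is an assumption about what the dynamic properties are; the names in the sketch are hypothetical.

```python
# Static dimensions per stream, as listed in the conditions.
STATIC_DIMS = {"mcc": 12, "power": 1, "f0": 1}
NUM_STREAMS = 3    # assumed: static + delta + delta-delta
UV_FLAG_DIMS = 1   # binary U/V flag, carried without dynamic features


def acoustic_feature_dim(static_dims=STATIC_DIMS, streams=NUM_STREAMS,
                         uv_dims=UV_FLAG_DIMS):
    """Total frame-level feature size: (12 + 1 + 1) * 3 + 1."""
    return sum(static_dims.values()) * streams + uv_dims
```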

Average reconstruction errors of DNNs

                 Unit2Vec   𝑓𝑡        𝑓𝑐
MCD (dB)         2.1338     3.4267    3.3449
F0-RMSE (Hz)     18.4240    38.2256   35.1925
CORR             0.9673     0.8524    0.8746
V/UV error (%)   0.7049     5.2502    4.9525

• MCD: distortion in mel-cepstrum

• F0-RMSE and V/UV error: distortion in F0

• CORR: Pearson correlation coefficient

• 𝑓𝑐 reconstructs more accurately than 𝑓𝑡 because 𝑓𝑐 also uses the history of preceding units.
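For readers who want to compute the same four measures, a minimal sketch follows. It uses the common MCD definition, (10 / ln 10) * sqrt(2 * sum of squared differences) averaged over frames, and computes F0-RMSE and CORR only on frames that are voiced in both reference and prediction. Whether the poster follows exactly these conventions is not stated, so they are assumptions, as are the function names.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcc, pred_mcc):
    """Frame-averaged MCD in dB for arrays of shape (frames, coefficients)."""
    diff = np.asarray(ref_mcc, dtype=float) - np.asarray(pred_mcc, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


def f0_metrics(ref_f0, pred_f0, ref_voiced, pred_voiced):
    """Return (F0 RMSE in Hz, Pearson correlation, V/UV error in %)."""
    ref_f0 = np.asarray(ref_f0, dtype=float)
    pred_f0 = np.asarray(pred_f0, dtype=float)
    ref_voiced = np.asarray(ref_voiced, dtype=bool)
    pred_voiced = np.asarray(pred_voiced, dtype=bool)

    # Assumed convention: compare F0 only where both signals are voiced.
    both = ref_voiced & pred_voiced
    rmse = float(np.sqrt(np.mean((ref_f0[both] - pred_f0[both]) ** 2)))
    corr = float(np.corrcoef(ref_f0[both], pred_f0[both])[0, 1])

    # V/UV error: share of frames whose voicing decision disagrees.
    vuv_error = float(100.0 * np.mean(ref_voiced != pred_voiced))
    return rmse, corr, vuv_error
```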

HMM-Based Unit Selection

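Figure: (a) 𝑓𝑡 model; (b) 𝑓𝑐 model. Input labels in the figure: target linguistic feature, candidate linguistic feature, preceding candidate unit vectors.

\[ C_{\mathrm{targ}}(u_n, c_n) = \left\| f_t(\omega_n) - f_t(c_n) \right\|^2 \]

\[ C_{\mathrm{con}}(u_{n-T}, \cdots, u_{n-1}, u_n, c_n) = \left\| f_t(\omega_n) - f_c(v_{n-T}, \cdots, v_{n-1}, c_n) \right\|^2 \]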
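The two cost definitions can be sketched numerically as follows. Both costs are read here as squared Euclidean distances between model outputs, which is an interpretation of the trailing "2" in the formulas. The trained DNNs 𝑓𝑡 and 𝑓𝑐 are replaced by random linear maps purely as placeholders, and the input sizes, the history length, and the representation of a candidate by its linguistic feature vector are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

LING_DIM = 20      # hypothetical linguistic feature size
UNIT_DIM = 8       # hypothetical unit-vector size
HISTORY = 2        # hypothetical number T of preceding units
ACOUSTIC_DIM = 43  # acoustic feature size stated on the poster

# Placeholder weights standing in for the trained f_t and f_c networks.
W_t = rng.normal(size=(LING_DIM, ACOUSTIC_DIM))
W_c = rng.normal(size=(HISTORY * UNIT_DIM + LING_DIM, ACOUSTIC_DIM))


def f_t(ling):
    """Stand-in for the f_t model: linguistic features -> acoustics."""
    return ling @ W_t


def f_c(prev_unit_vectors, cand_ling):
    """Stand-in for the f_c model: preceding unit vectors plus candidate."""
    inputs = np.concatenate([np.ravel(prev_unit_vectors), cand_ling])
    return inputs @ W_c


def target_cost(target_ling, cand_ling):
    """C_targ: squared distance between f_t outputs for target and candidate."""
    return float(np.sum((f_t(target_ling) - f_t(cand_ling)) ** 2))


def concat_cost(target_ling, prev_unit_vectors, cand_ling):
    """C_con: squared distance between f_t(target) and the f_c output."""
    diff = f_t(target_ling) - f_c(prev_unit_vectors, cand_ling)
    return float(np.sum(diff ** 2))
```

The target cost is zero when target and candidate have identical linguistic features, while the concatenation cost changes with the preceding unit vectors, which is how the history enters the selection.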

Conclusion

• Unit2Vec learns a fixed-length vector as the embedding of each unit.

• 𝑓𝑡 and 𝑓𝑐 model the cost calculations.

• Subjective evaluation demonstrates the effectiveness of these models.

• 𝑓𝑐 can handle long-term dependencies among candidate units.

• The baseline HMM-based unit selection system follows [1].

[1] Z.-H. Ling and R.-H. Wang, “HMM-based hierarchical unit selection combining Kullback-Leibler divergence with likelihood criterion,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, 2007, pp. IV–1245.

Visualization of the phone-dependent distributions of learnt unit vectors using t-SNE

Proposed Method

Save the matrix
