
LEARNING AND MODELING UNIT EMBEDDINGS FOR IMPROVING HMM …home.ustc.edu.cn/~xiaozh/Interspeech2018/pdf/poster.pdf · 2018. 8. 28. · LEARNING AND MODELING UNIT EMBEDDINGS FOR IMPROVING


LEARNING AND MODELING UNIT EMBEDDINGS FOR IMPROVING HMM-BASED UNIT SELECTION SPEECH SYNTHESIS

Xiao Zhou, Zhen-Hua Ling, Zhi-Ping Zhou, Li-Rong Dai

National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R. China

Abstract

• This paper presents a method that uses unit embeddings, called unit vectors, to improve HMM-based unit selection speech synthesis.

• The model is composed of three DNNs: Unit2Vec, which learns the unit embeddings, and two further DNNs that model the cost calculations.

• The concatenation cost can capture long-term dependencies between adjacent units.

• The method improves on the traditional HMM-based method.

• Unit vectors display phone-dependent clustering properties.

Unit2Vec – to learn unit embedding

Experiments

Average preference scores (%) on speech quality using the Chinese corpus (N/P: no preference)

Baseline   Prop_TC   Prop_All   N/P     p
40.00      31.67     -          28.33   0.0882
15.33      -         55.67      29.00   <0.001
-          15.00     52.67      32.33   <0.001

• Prop_All is preferred over Prop_TC (p < 0.001).

• Prop_TC vs. Baseline: similar preference (p > 0.05).

• Prop_TC sounds more expressive than the baseline, but it introduces more glitches.

Calculate target and concatenation cost

• Because the linguistic features are high-dimensional, the $f_c$ model extracts bottleneck (BN) features from them.

• $C_{\text{targ}}$ measures the overall acoustic difference.

• $C_{\text{con}}$ accounts for long-term dependencies.

• A pruning search strategy reduces the amount of calculation.
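The poster does not say which pruning strategy is used. As a minimal sketch of the general idea, the following beam search keeps only the cheapest partial unit sequences at each position, so that a concatenation cost depending on several preceding units stays tractable. The function name `beam_search`, the parameters `beam_width` and `history`, and the toy cost functions are all assumptions made for illustration, not the authors' implementation.

```python
def beam_search(candidates, target_cost, concat_cost, beam_width=3, history=2):
    """Pick one candidate unit per position while pruning to `beam_width` paths.

    candidates  : list of candidate-unit lists, one list per target position
    target_cost : callable (position, unit) -> float
    concat_cost : callable (preceding_units, unit) -> float
    history     : number of preceding units passed to concat_cost (hypothetical T)
    """
    beams = [(0.0, [])]  # each entry: (accumulated cost, chosen units so far)
    for n, cands in enumerate(candidates):
        expanded = []
        for cost, path in beams:
            prev = path[-history:]
            for unit in cands:
                step = target_cost(n, unit)
                if prev:  # no concatenation cost for the first unit
                    step += concat_cost(prev, unit)
                expanded.append((cost + step, path + [unit]))
        # Pruning: keep only the cheapest partial paths.
        expanded.sort(key=lambda item: item[0])
        beams = expanded[:beam_width]
    return beams[0]


# Toy usage with made-up costs: units are integers, targets are integers.
toy_targets = [0, 1, 1]
toy_candidates = [[0, 1], [0, 1], [0, 1]]
best_cost, best_path = beam_search(
    toy_candidates,
    target_cost=lambda n, u: abs(u - toy_targets[n]),
    concat_cost=lambda prev, u: 0.5 * abs(prev[-1] - u),
    beam_width=2,
)
```

A smaller `beam_width` reduces the number of cost evaluations per position at the risk of discarding the globally best sequence.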

Conditions

Database: Chinese newspaper corpus with 12219 utterances from a female speaker. Training/validation/test sets: 11608/611/100.

Acoustic features: 12-order MCCs, 1-dimensional power, 1-dimensional F0, and a 1-dimensional binary U/V flag. All features except the U/V flag include their dynamic components, giving (12+1+1)*3+1 = 43 dimensions.
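The dimensionality arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `acoustic_dim` and its parameter names are hypothetical, and reading the factor of 3 as static plus two dynamic streams is an assumption based on the formula given on the poster.

```python
def acoustic_dim(mcc_order=12, power_dim=1, f0_dim=1, streams=3, vuv_dim=1):
    """Return the acoustic feature dimensionality.

    MCCs, power, and F0 are each repeated over `streams` (assumed: static
    plus two dynamic components); the binary U/V flag is appended once.
    """
    return (mcc_order + power_dim + f0_dim) * streams + vuv_dim


total_dim = acoustic_dim()
```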

Systems: Baseline, Prop_TC, Prop_All

• Prop_TC replaces part of the target cost in Baseline.

• Prop_All further replaces part of the concatenation cost in Prop_TC.

Average reconstruction errors of DNNs

                 Unit2Vec   $f_t$      $f_c$
MCD (dB)         2.1338     3.4267     3.3449
F0-RMSE (Hz)     18.4240    38.2256    35.1925
CORR             0.9673     0.8524     0.8746
V/UV error (%)   0.7049     5.2502     4.9525

MCD: distortion in mel-cepstrum

F0-RMSE and V/UV error: distortion in F0

CORR: Pearson correlation coefficient

$f_c$ performs better than $f_t$ because $f_c$ also uses the history of preceding units.
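The poster names these metrics without giving formulas. The sketch below uses definitions that are common in speech synthesis evaluation; whether the authors used exactly these (for example, the same MCD constant or the same frame selection for F0) is an assumption, and all function names are hypothetical.

```python
import math


def mcd_db(ref_frames, est_frames):
    """Mean mel-cepstral distortion in dB over paired frames (common definition)."""
    const = 10.0 / math.log(10.0)
    per_frame = [
        const * math.sqrt(2.0 * sum((r - e) ** 2 for r, e in zip(ref, est)))
        for ref, est in zip(ref_frames, est_frames)
    ]
    return sum(per_frame) / len(per_frame)


def f0_rmse(ref_f0, est_f0):
    """Root-mean-square F0 error in Hz; inputs are assumed to be voiced frames."""
    squared = [(r - e) ** 2 for r, e in zip(ref_f0, est_f0)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vuv_error_percent(ref_flags, est_flags):
    """Percentage of frames whose voiced/unvoiced decision differs."""
    mismatches = sum(1 for r, e in zip(ref_flags, est_flags) if r != e)
    return 100.0 * mismatches / len(ref_flags)
```

For instance, `vuv_error_percent([1, 1, 0, 0], [1, 0, 0, 0])` counts one mismatch in four frames.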

HMM-Based Unit Selection

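Target cost, where $\omega_n$ is the target linguistic feature and $c_n$ is the candidate linguistic feature:

$$C_{\text{targ}}(u_n, c_n) = \left\| f_t(\omega_n) - f_t(c_n) \right\|^2$$

Concatenation cost, where $v_{n-T}, \dots, v_{n-1}$ are the preceding candidate unit vectors:

$$C_{\text{con}}(u_{n-T}, \dots, u_{n-1}, u_n, c_n) = \left\| f_t(\omega_n) - f_c(v_{n-T}, \dots, v_{n-1}, c_n) \right\|^2$$

[Figure: (a) $f_t$ model. (b) $f_c$ model.]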
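As a rough illustration of the two costs, the sketch below treats each as a squared Euclidean distance between predicted unit vectors: the target cost compares the $f_t$ output for the target linguistic feature with the $f_t$ output for the candidate linguistic feature, and the concatenation cost compares the $f_t$ output for the target with the $f_c$ output given the preceding candidate unit vectors. Reading the trailing "2" in the poster's formulas as a squared norm is an assumption, and the toy `f_t` and `f_c` below are stand-ins for the trained DNNs, not the authors' models.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def target_cost(f_t, omega_n, c_n):
    """Distance between unit vectors predicted from target and candidate features."""
    return squared_distance(f_t(omega_n), f_t(c_n))


def concatenation_cost(f_t, f_c, omega_n, preceding_vectors, c_n):
    """Distance between the target prediction and the history-aware prediction."""
    return squared_distance(f_t(omega_n), f_c(preceding_vectors, c_n))


# Toy stand-ins for the trained networks (purely illustrative).
def toy_f_t(features):
    return [2.0 * value for value in features]


def toy_f_c(preceding_vectors, features):
    # Element-wise mean of the preceding unit vectors, shifted by the features.
    count = len(preceding_vectors)
    mean = [sum(column) / count for column in zip(*preceding_vectors)]
    return [m + f for m, f in zip(mean, features)]


omega = [1.0, 0.0]               # target linguistic feature (toy)
cand = [0.0, 0.0]                # candidate linguistic feature (toy)
hist = [[1.0, 1.0], [3.0, 1.0]]  # preceding candidate unit vectors (toy)

toy_targ = target_cost(toy_f_t, omega, cand)
toy_con = concatenation_cost(toy_f_t, toy_f_c, omega, hist, cand)
```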

Conclusion

• Unit2Vec learns a fixed-length vector as the embedding of each unit.

• $f_t$ and $f_c$ model the cost calculations.

• Subjective evaluation demonstrates the effectiveness of these models.

• $f_c$ can handle long-term dependencies among candidate units.

For the baseline HMM-based unit selection system, refer to [1].

[1] Z.-H. Ling and R.-H. Wang, “HMM-based hierarchical unit selection combining Kullback-Leibler divergence with likelihood criterion,” in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 4. IEEE, 2007, pp. IV–1245.

Visualization of the phone-dependent distributions of learnt unit vectors using t-SNE

Proposed Method

Save the matrix