LEARNING AND MODELING UNIT EMBEDDINGS FOR IMPROVING HMM-BASED UNIT SELECTION SPEECH SYNTHESIS
Xiao Zhou, Zhen-Hua Ling, Zhi-Ping Zhou, Li-Rong Dai
National Engineering Laboratory for Speech and Language Information Processing,
University of Science and Technology of China, Hefei, P.R.China
Abstract
This paper presents a method that uses unit embeddings, called unit vectors, to improve HMM-based unit selection speech synthesis.
The model is composed of three DNNs:
Unit2Vec learns the unit embeddings.
The other two, 𝑓𝑡 and 𝑓𝑐, model the target and concatenation cost calculations.
The concatenation cost can capture long-term dependencies across a sequence of adjacent units.
The method improves on the traditional HMM-based method.
Unit vectors display phone-dependent clustering properties.
Unit2Vec – to learn unit embedding
Experiments
Average preference scores (%) on speech
quality using the Chinese corpus
N/P: no preference
Prop_All is preferred over both Baseline and Prop_TC (p < 0.001).
Prop_TC vs. Baseline: similar preference (p > 0.05).
Prop_TC made the speech sound more expressive than Baseline, but it introduced more glitches.
Baseline   Prop_TC   Prop_All   N/P     p
40.00      31.67     -          28.33   0.0882
15.33      -         55.67      29.00   <0.001
-          15.00     52.67      32.33   <0.001
Calculating the target and concatenation costs
• Because the linguistic features are high-dimensional, the 𝑓𝑐 model extracts bottleneck (BN) features from them
• 𝐶targ measures the overall acoustic difference
• 𝐶con measures the long-term dependencies
• A pruned search strategy reduces the amount of computation
Conditions
Database: Chinese newspaper corpus with 12219 utterances from a female speaker. Training/validation/test sets: 11608/611/100 utterances.
Acoustic features: 12th-order MCCs, 1-dimensional power, 1-dimensional F0, and a 1-dimensional binary U/V flag. The MCCs, power, and F0 also include their dynamic features, giving (12+1+1)*3+1 = 43 dimensions.
Systems: Baseline, Prop_TC, Prop_All. Prop_TC replaces part of the target cost in Baseline. Prop_All further replaces part of the concatenation cost in Prop_TC.
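The dimension count (12+1+1)*3+1 = 43 can be checked with a short sketch: the 14 static values (12 MCCs, power, F0) each gain two dynamic terms, and the U/V flag is appended once without dynamics. The delta and delta-delta windows used here are a common choice and an assumption; the poster does not say which windows were used.

```python
from typing import List


def add_dynamics(static: List[List[float]]) -> List[List[float]]:
    """Append delta and delta-delta features to each frame.

    Assumed windows (not given on the poster):
      delta  = 0.5 * (x[t+1] - x[t-1])
      delta2 = x[t-1] - 2 * x[t] + x[t+1]
    Edge frames reuse their nearest neighbour.
    """
    n = len(static)
    out = []
    for t in range(n):
        prev = static[max(t - 1, 0)]
        cur = static[t]
        nxt = static[min(t + 1, n - 1)]
        delta = [0.5 * (b - a) for a, b in zip(prev, nxt)]
        delta2 = [a - 2.0 * c + b for a, c, b in zip(prev, cur, nxt)]
        out.append(cur + delta + delta2)
    return out


def build_frames(
    mcc: List[List[float]],
    power: List[float],
    f0: List[float],
    uv: List[int],
) -> List[List[float]]:
    """Stack 12 MCCs + power + F0, add dynamics, then append the U/V flag."""
    static = [m + [p, f] for m, p, f in zip(mcc, power, f0)]
    return [frame + [float(v)] for frame, v in zip(add_dynamics(static), uv)]


# Three dummy frames, used only to check the dimensionality.
frames = build_frames(
    mcc=[[0.0] * 12 for _ in range(3)],
    power=[1.0, 2.0, 3.0],
    f0=[100.0, 110.0, 120.0],
    uv=[1, 1, 0],
)
assert len(frames[0]) == (12 + 1 + 1) * 3 + 1 == 43
```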
Average reconstruction errors of DNNs
                 Unit2Vec   𝑓𝑡        𝑓𝑐
MCD (dB)         2.1338     3.4267    3.3449
F0-RMSE (Hz)     18.4240    38.2256   35.1925
CORR             0.9673     0.8524    0.8746
V/UV error (%)   0.7049     5.2502    4.9525
MCD: mel-cepstral distortion
F0-RMSE and V/UV error: distortion in F0
CORR: Pearson correlation coefficient
𝑓𝑐 achieves lower errors than 𝑓𝑡 because 𝑓𝑐 also uses the history of preceding units
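The four metrics in the table can be sketched as follows. The poster does not give their formulas, so this uses commonly used definitions as an assumption: MCD as (10 / ln 10) * sqrt(2 * sum of squared coefficient differences) averaged over frames, F0-RMSE and the Pearson correlation computed on frames voiced in both sequences, and V/UV error as the percentage of frames whose voicing flags disagree. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def mcd_db(ref: Sequence[Sequence[float]], pred: Sequence[Sequence[float]]) -> float:
    """Mean mel-cepstral distortion in dB over paired frames."""
    const = 10.0 / math.log(10.0)
    per_frame = [
        const * math.sqrt(2.0 * sum((a - b) ** 2 for a, b in zip(r, p)))
        for r, p in zip(ref, pred)
    ]
    return sum(per_frame) / len(per_frame)


def f0_metrics(
    ref_f0: Sequence[float],
    pred_f0: Sequence[float],
    ref_uv: Sequence[int],
    pred_uv: Sequence[int],
) -> Tuple[float, float, float]:
    """Return (F0-RMSE in Hz, Pearson correlation, V/UV error in %)."""
    # RMSE and correlation use only frames voiced in both sequences.
    pairs = [
        (r, p)
        for r, p, ru, pu in zip(ref_f0, pred_f0, ref_uv, pred_uv)
        if ru and pu
    ]
    n = len(pairs)
    rmse = math.sqrt(sum((r - p) ** 2 for r, p in pairs) / n)

    mean_r = sum(r for r, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_p = sum((p - mean_p) ** 2 for _, p in pairs)
    corr = cov / math.sqrt(var_r * var_p)  # assumes non-constant F0 contours

    mismatches = sum(1 for ru, pu in zip(ref_uv, pred_uv) if bool(ru) != bool(pu))
    vuv_error = 100.0 * mismatches / len(ref_uv)
    return rmse, corr, vuv_error


# Dummy values, used only to exercise the functions.
rmse, corr, vuv = f0_metrics(
    ref_f0=[100.0, 200.0, 300.0, 0.0],
    pred_f0=[110.0, 190.0, 310.0, 50.0],
    ref_uv=[1, 1, 1, 0],
    pred_uv=[1, 1, 1, 1],
)
```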
HMM-Based Unit Selection
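$$C_{\mathrm{targ}}(u_n, c_n) = \left\| f_t(\omega_n) - f_t(c_n) \right\|^2$$
$$C_{\mathrm{con}}(u_{n-T}, \cdots, u_{n-1}, u_n, c_n) = \left\| f_t(\omega_n) - f_c(v_{n-T}, \cdots, v_{n-1}, c_n) \right\|^2$$
where $\omega_n$ is the target linguistic feature, $c_n$ is the candidate linguistic feature, and $v_{n-T}, \cdots, v_{n-1}$ are the preceding candidate unit vectors.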
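The two cost definitions above both reduce to a squared Euclidean distance between predicted unit vectors, which can be sketched as below. On the poster 𝑓𝑡 and 𝑓𝑐 are trained DNNs; here they are replaced by hypothetical toy functions (`toy_f_t`, `toy_f_c`) so that the example runs, and all names are illustrative assumptions rather than the authors' code.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def target_cost(
    f_t: Callable[[Vector], Vector],
    omega_n: Vector,
    c_n: Vector,
) -> float:
    """Distance between unit vectors predicted from target and candidate features."""
    return sq_dist(f_t(omega_n), f_t(c_n))


def concat_cost(
    f_t: Callable[[Vector], Vector],
    f_c: Callable[[List[Vector], Vector], Vector],
    omega_n: Vector,
    prev_vectors: List[Vector],
    c_n: Vector,
) -> float:
    """Distance between the target prediction and the history-based prediction."""
    return sq_dist(f_t(omega_n), f_c(prev_vectors, c_n))


# Hypothetical stand-ins for the trained DNNs.
def toy_f_t(features: Vector) -> Vector:
    return [2.0 * v for v in features]


def toy_f_c(prev_vectors: List[Vector], features: Vector) -> Vector:
    # Average the preceding unit vectors, then shift by the candidate features.
    dim = len(features)
    mean = [sum(v[d] for v in prev_vectors) / len(prev_vectors) for d in range(dim)]
    return [m + f for m, f in zip(mean, features)]


omega = [1.0, 2.0]                  # target linguistic feature
cand = [1.0, 1.0]                   # candidate linguistic feature
history = [[1.0, 1.0], [3.0, 3.0]]  # preceding candidate unit vectors

c_targ = target_cost(toy_f_t, omega, cand)
c_con = concat_cost(toy_f_t, toy_f_c, omega, history, cand)
```

Because both costs compare against the same prediction from the target features, they live in the same unit-vector space and can be combined directly during the search.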
(a) 𝑓𝑡 model
(b) 𝑓𝑐 model
Conclusion
Unit2Vec learns a fixed-length vector as the embedding of each unit
𝑓𝑡 and 𝑓𝑐 model the target and concatenation cost calculations
Subjective evaluation demonstrates the effectiveness of these models
𝑓𝑐 can handle long-term dependencies among candidate units
The baseline HMM-based unit selection system follows [1]
[1] Z.-H. Ling and R.-H. Wang, “HMM-based hierarchical unit selection combining Kullback-Leibler divergence with likelihood criterion,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007, vol. 4, pp. IV–1245.
Visualization of the phone-dependent distributions of learnt unit vectors using t-SNE
Proposed Method
Save the matrix