Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda




Slide 2: Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology
28 August, 2011

Slide 3: Background (1/2)
- Model estimation
  - Maximum likelihood (ML) approach
  - Bayesian approach: estimation of posterior distributions, utilization of prior distributions, and model selection according to the posterior probability
- Bayesian speech synthesis [Hashimoto et al., 08]
  - Model estimation and speech parameter generation can both be derived from the predictive distribution
  - The predictive distribution represents the whole problem of speech synthesis

Slide 4: Background (2/2)
- Existing techniques exploit acoustic features common to every speaker:
  - Speaker Adaptive Training (SAT) [Anastasakos et al., 97]
  - Shared Tree Clustering (STC) [Yamagishi et al., 03]
  - Universal Background Model (UBM) [Reynolds et al., 00]
- This work: multi-speaker modeling with shared prior distributions and model structures
  - Appropriate acoustic models can be estimated from the training data of multiple speakers

Slide 5: Outline
- Bayesian speech synthesis: framework, variational Bayesian method
- Shared model structures and prior distributions: multi-speaker modeling, shared model structures, shared prior distributions
- Experiments
- Conclusion & future work

Slide 6: Bayesian speech synthesis (1/3)
- Model training and speech synthesis
  - ML: training λ̂ = argmax_λ p(O | λ, L), then synthesis ô = argmax_o p(o | λ̂, l)
  - Bayes: training and synthesis combined through the predictive distribution, ô = argmax_o p(o | O, l, L)
- o: synthesis data; O: training data; λ: model parameters; l: label seq. for synthesis; L: label seq. for training

Slide 7: Bayesian speech synthesis (2/3)
- Introduce the model structure m into the predictive distribution
- Model selection according to the posterior probability of m
- Approximate the predictive distribution:
  p(o | O) = Σ_m p(o | O, m) P(m | O) ≈ p(o | O, m̂), m̂ = argmax_m P(m | O)

Slide 8: Bayesian speech synthesis (3/3)
- Predictive distribution (marginal likelihood):
  p(o | O) = (1 / p(O)) Σ_{z,Z} ∫ p(o, z | λ) p(O, Z | λ) p(λ) dλ
- Evaluated with the variational Bayesian method [Attias, 99]
- p(o, z | λ): likelihood of the synthesis data; p(O, Z | λ): likelihood of the training data; z, Z: HMM state sequences for the synthesis and training data; p(λ): prior distribution of the model parameters

Slide 9: Variational Bayesian method (1/2)
- Estimate an approximate posterior distribution Q(z, Z, λ) by maximizing the lower bound F given by Jensen's inequality:
  log p(o, O) ≥ ⟨ log { p(o, z | λ) p(O, Z | λ) p(λ) / Q(z, Z, λ) } ⟩_{Q(z,Z,λ)} = F
- ⟨·⟩_Q: expectation w.r.t. the approximate posterior distribution Q

Slide 10: Variational Bayesian method (2/2)
- Assume the random variables are statistically independent: Q(z, Z, λ) = Q(z) Q(Z) Q(λ)
- Optimal posterior distributions:
  Q(λ) = (1 / C_λ) p(λ) exp ⟨ log p(o, z | λ) p(O, Z | λ) ⟩_{Q(z) Q(Z)}
  Q(Z) = (1 / C_Z) exp ⟨ log p(O, Z | λ) ⟩_{Q(λ)}
  Q(z) = (1 / C_z) exp ⟨ log p(o, z | λ) ⟩_{Q(λ)}
- Updated iteratively, as in the EM algorithm; C_λ, C_Z, C_z: normalization terms

Slide 11: Outline (revisited)
- Next: shared model structures and prior distributions

Slide 12: Multi-speaker modeling
- Acoustic features are common to every speaker, so training data of multiple speakers can be used to estimate appropriate acoustic models (as in SAT, STC, UBM, etc.)
- Multi-speaker modeling maximizes the log marginal likelihood over all speakers:
  log Π_s p(o^(s), O^(s)) = Σ_s log p(o^(s), O^(s)), s: speaker index
- Proposal: share model structures and prior distributions among the speakers

Slide 13: Shared model structures
- STC based on Bayesian model selection: all speakers share one decision tree, so a question such as "Is this phoneme a vowel?" splits the same node (yes/no) for every speaker
- Select the splits that maximize the sum of the lower bounds Σ_s F^(s), computed with the posterior distribution Q^(s) of each speaker
- Stopping condition: stop splitting when no question increases Σ_s F^(s)

Slide 14: Prior distributions
- Use a conjugate prior distribution, determined from prior data
- Use the training data of multiple speakers as the prior data, giving a speaker-independent prior distribution
- State output probability distribution: Gaussian N(o_t | μ, Λ^-1)
- Prior distribution: p(μ, Λ) = N(μ | ν, (ξΛ)^-1) W(Λ | η, B), with hyper-parameters ξ (amount of prior data, used as a tuning parameter), ν (mean of the prior data), and η, B (derived from the covariance of the prior data)
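As a concrete illustration of the speaker-independent prior, the sketch below pools the training data of several speakers and converts its statistics into Normal-Wishart hyper-parameters. It is a minimal sketch under stated assumptions: the function name `speaker_independent_prior`, the tuning parameter `xi`, and the mappings chosen for `eta` and `B` are illustrative, not the authors' implementation.

```python
import numpy as np

def speaker_independent_prior(speaker_data, xi=1.0):
    """Sketch: Normal-Wishart hyper-parameters for a Gaussian state-output
    distribution, estimated from pooled multi-speaker data.

    speaker_data: list of (T_s, D) feature arrays, one per speaker
    xi: tuning parameter controlling the effective amount of prior data
    """
    pooled = np.vstack(speaker_data)      # use all speakers' data as prior data
    nu = pooled.mean(axis=0)              # mean of the prior data
    cov = np.cov(pooled, rowvar=False)    # covariance of the prior data
    D = pooled.shape[1]
    # Hyper-parameters of p(mu, Lambda) = N(mu | nu, (xi*Lambda)^-1) W(Lambda | eta, B).
    # Tying eta and B to xi as below is an assumption made for this sketch.
    eta = xi + D - 1                      # Wishart degrees of freedom
    B = xi * cov                          # Wishart scale matrix
    return {"xi": xi, "nu": nu, "eta": eta, "B": B}
```

The tuning parameter ξ plays the role of the amount of prior data: larger values pull each speaker's posterior distribution more strongly toward the pooled multi-speaker statistics.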
Slide 15: Speaker adaptive prior distribution
- Estimate the shared prior distribution by maximizing the sum of the lower bounds:
  p̂(λ) = argmax_{p(λ)} Σ_s F^(s)
- The prior distribution is estimated so that the posterior distribution of each speaker is estimated well
- Same fashion as speaker adaptive training
- Q^(s): posterior dist. of each speaker; p(λ): shared prior dist.

Slide 16: Outline (revisited)
- Next: experiments

Slide 17: Experimental conditions

  Database:           NIT Japanese speech database
  Speakers:           5 male speakers
  Training data:      450 utterances per speaker
  Test data:          53 utterances per speaker
  Sampling rate:      16 kHz
  Window:             Blackman window
  Frame size / shift: 25 ms / 5 ms
  Feature vector:     24 mel-cepstrum + Δ + ΔΔ, and log F0 + Δ + ΔΔ (78 dimensions)
  HMM:                5-state left-to-right HSMM without skip transitions

Slide 18: Comparison methods
- Five sharing methods are compared; sharing is among all speakers:

  Method     | Shared model structure | Shared prior distribution
  SD         | no                     | no
  Tree       | yes                    | no
  Prior      | no                     | yes
  Tree-Prior | yes                    | yes (speaker independent)
  Tree-SAT   | yes                    | yes (speaker adaptive)

Slide 19: Experimental result
[Figure: subjective evaluation results, 5-point Mean Opinion Score]

Slide 20: Conclusions and future work
- Investigated sharing prior distributions and model structures among multiple speakers
  - Appropriate acoustic models are estimated: robust model structures and reliable prior distributions
  - The sharing methods outperform single-speaker modeling
- Future work
  - Investigate speaker selection for the sharing methods
  - Experiments comparing with conventional multi-speaker modeling

Slide 21: Thank you

Slide 23: Background
- Bayesian speech synthesis [Hashimoto et al., 08] represents the problem of speech synthesis: all processes can be derived from the predictive distribution
- Model structures affect the quality of synthesized speech, and prior distributions affect Bayesian model selection, so the determination of the prior distribution and model selection should be performed simultaneously
- Acoustic features are common to every speaker: investigate the prior distribution and model structure, and share both among all speakers

Slide 24: Bayesian speech synthesis
- The model structure m is marginalized:
  p(o | O) = Σ_m p(o | O, m) P(m | O)
- Select the model structure that maximizes the posterior, approximating the predictive distribution:
  p(o | O) ≈ p(o | O, m̂), m̂ = argmax_m P(m | O)
- P(m): prior distribution of the model structure m

Slide 25: Context clustering based on VB
- Construct the decision tree by maximizing the marginal likelihood
- At each node, select the question (e.g. "Is this phoneme a vowel?") with the largest gain ΔF of the lower bound between the split (yes/no) nodes and the unsplit node
- Split the node based on the gain; stopping condition: ΔF ≤ 0
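To make the splitting loop concrete, here is a minimal sketch of greedy tree construction driven by the lower-bound gain, assuming a callable `lower_bound(states)` that evaluates F for the states pooled in a node; the data layout and names are illustrative assumptions rather than the authors' implementation.

```python
def build_tree(states, questions, lower_bound):
    """Greedy Bayesian context clustering (sketch).

    states: context-dependent HMM states pooled in the root node
    questions: iterable of (name, predicate) context questions
    lower_bound: callable returning the VB lower bound F of a node
    """
    def split_node(node):
        f_parent = lower_bound(node)
        best = None
        for name, pred in questions:
            yes = [s for s in node if pred(s)]
            no = [s for s in node if not pred(s)]
            if not yes or not no:
                continue                      # question does not split this node
            # Gain of the lower bound obtained by splitting on this question
            gain = lower_bound(yes) + lower_bound(no) - f_parent
            if best is None or gain > best[0]:
                best = (gain, name, yes, no)
        # Stopping condition: become a leaf when no split increases F
        if best is None or best[0] <= 0:
            return node
        _, name, yes, no = best
        return (name, split_node(yes), split_node(no))

    return split_node(list(states))
```

Each candidate question is scored by ΔF = F(yes) + F(no) - F(parent), and a node becomes a leaf as soon as the best gain is non-positive, matching the stopping condition stated above.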
Slide 26: Multi-speaker modeling
- Data of multiple speakers can be used
- The marginal likelihood of multiple speakers is bounded by the sum of the lower bounds of the individual speakers:
  log Π_s p(o^(s), O^(s)) ≥ Σ_s F^(s)
- O^(s): training data of speaker s; o^(s): synthesis data of speaker s

Slide 27: Shared model structures
- Usually, a model structure is selected for each speaker separately
- Instead, share the model structure among speakers: Shared Tree Clustering (STC) based on Bayesian model selection

Slide 28: Variational Bayesian method (2/2): identical to Slide 10

Slide 29: Outline
- Bayesian speech synthesis: variational Bayesian method, speech parameter generation
- Problem & proposed method: approximation of the posterior, integration of the training and synthesis processes
- Experiments
- Conclusion & future work

Slide 30: Bayesian speech synthesis
- Maximize the lower bound of the log marginal likelihood consistently for both the estimation of posterior distributions and speech parameter generation
- All processes are derived from the single predictive distribution

Slide 31: Approximation of posterior
- The posterior Q(λ) depends on the synthesis data, but the synthesis data is not observed
- Assume that Q(λ) is independent of the synthesis data [Hashimoto et al., 08], and estimate the posterior from the training data only

Slide 32: Use of generated data
- Problem: the posterior distribution depends on the synthesis data, which is not observed
- Proposed method: use generated data instead of observed data for estimating the posterior distribution, with iterative updates as in the EM algorithm

Slide 33: Prior distribution
- With a conjugate prior distribution, the posterior becomes the same family of distributions as the prior
- The prior is determined from the statistics of prior data: the number of prior data points, the mean of the prior data, and the covariance of the prior data (D: dimension of the feature vector)

Slide 34: Relation between Bayes and ML
- Compared with the ML criterion, the Bayesian output distribution replaces the point estimates of the model parameters with their posterior expectations:
  ML: N(o_t | μ̂, Σ̂); Bayes: expectations ⟨μ⟩, ⟨Λ⟩ w.r.t. Q(λ)
- Generation can therefore be solved in the same fashion as ML

Slide 35: Impact of prior distribution
- The prior hyper-parameters affect model selection and act as tuning parameters, so a technique for determining the prior distribution is required
- Maximizing the marginal likelihood w.r.t. the hyper-parameters leads to the over-fitting problem, as in ML, and tuning parameters are still required
- Determination technique of the prior distribution using cross validation [Hashimoto, 08]

Slide 36: Speech parameter generation
- The speech parameter vector consists of static and dynamic features, but only the static feature sequence is generated
- Speech parameter generation based on the Bayesian approach: maximize the lower bound F with respect to the synthesis data o
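As an illustration of this generation step, the sketch below solves the standard problem of finding the static sequence c that maximizes a Gaussian likelihood over o = Wc, where W appends delta features to the static ones; following the Bayes/ML relation on Slide 34, the per-frame means and precisions would be posterior expectations rather than ML point estimates. The function name, the 1-dimensional stream, and the (-0.5, 0, 0.5) delta window are illustrative assumptions.

```python
import numpy as np

def generate_static_features(means, precisions):
    """Sketch: generate the static sequence c maximizing a Gaussian
    likelihood of o = W c, where o stacks [static, delta] per frame.

    means, precisions: (T, 2) arrays of (posterior-expected) means and
    diagonal precisions for [static, delta], 1-dimensional stream.
    """
    T = means.shape[0]
    # Build W: maps c to stacked [c_t, delta c_t] frames
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static feature row
        if 0 < t < T - 1:                 # delta window (-0.5, 0, 0.5)
            W[2 * t + 1, t - 1] = -0.5
            W[2 * t + 1, t + 1] = 0.5
    mu = means.reshape(-1)                # interleaved [static, delta] means
    P = np.diag(precisions.reshape(-1))   # diagonal precision matrix
    # Normal equations: (W^T P W) c = W^T P mu
    A = W.T @ P @ W
    b = W.T @ P @ mu
    return np.linalg.solve(A, b)
```

Because the objective is quadratic in c, the maximizer is the solution of the normal equations (W^T P W) c = W^T P μ, the same computation used in ML-based speech parameter generation.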