IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 10, OCTOBER 2013

Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Zhen-Hua Ling, Member, IEEE, Li Deng, Fellow, IEEE, and Dong Yu, Senior Member, IEEE

Abstract—This paper presents a new spectral modeling method for statistical parametric speech synthesis. In the conventional methods, high-level spectral parameters, such as mel-cepstra or line spectral pairs, are adopted as the features for hidden Markov model (HMM)-based parametric speech synthesis. Our proposed method improves the conventional method in two ways. First, distributions of low-level, un-transformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of using a single Gaussian distribution, we adopt graphical models with multiple hidden variables, including restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), to represent the distribution of the low-level spectral envelopes at each HMM state. At synthesis time, the spectral envelopes are predicted from the RBM-HMMs or the DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion with the constraints of the dynamic features. A Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state in order to achieve a closed-form solution to the parameter generation problem. Our experimental results show that both RBM-HMM and DBN-HMM are able to generate spectral envelope parameter sequences better than the conventional Gaussian-HMM, with superior generalization capabilities, and that DBN-HMM and RBM-HMM perform similarly, possibly due to the use of the Gaussian approximation. As a result, our proposed method can significantly alleviate the over-smoothing effect and improve the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.

Index Terms—Deep belief network, hidden Markov model, restricted Boltzmann machine, spectral envelope, speech synthesis.

Manuscript received January 25, 2013; revised April 18, 2013; accepted June 13, 2013. Date of publication June 18, 2013; date of current version July 22, 2013. This work was supported in part by the National Natural Science Foundation of China under Grant No. 61273032 and in part by the China Scholarship Council Young Teacher Study Abroad Project. This paper is an expanded version of the conference paper published at ICASSP 2013 [1]. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Chung-Hsien Wu.

Z.-H. Ling is with the National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China (e-mail: [email protected]).

L. Deng and D. Yu are with Microsoft Research, Redmond, WA 98052 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2013.2269291

I. INTRODUCTION

The hidden Markov model (HMM)-based parametric speech synthesis method has become a mainstream speech synthesis method in recent years [2], [3]. In this method, the spectrum, F0, and segment durations are modeled simultaneously within a unified HMM framework [2]. At synthesis time, these parameters are predicted so as to maximize their output probabilities from the HMM of the input sentence. The constraints of the dynamic features are considered during parameter generation in order to guarantee the smoothness of the generated spectral and F0 trajectories [4]. Finally, the predicted parameters are sent to a speech synthesizer to reconstruct the speech waveforms. This method is able to synthesize highly intelligible and smooth speech [5], [6]. However, the quality of the synthetic speech is degraded by three main factors: limitations of the parametric synthesizer itself, inadequacy of the acoustic modeling used in the synthesizer, and the over-smoothing effect of parameter generation [7].

Many improved approaches have been proposed to overcome the disadvantages of these three factors. In terms of the speech synthesizer, STRAIGHT [8], a high-performance speech vocoder, has been widely used in current HMM-based speech synthesis systems. It follows the source-filter model of speech production. In order to represent the excitation and vocal tract characteristics separately, F0 and a smooth spectral envelope without periodicity interference are extracted at each frame. Then, mel-cepstra [5] or line spectral pairs [6] can be derived from the spectral envelopes of the training data for the subsequent HMM modeling. During synthesis, the generated spectral parameters are used either to reconstruct speech waveforms directly or to recover the spectral envelopes for further speech reconstruction by STRAIGHT.

Acoustic modeling is another key component of HMM-based parametric speech synthesis. In the common spectral modeling methods, the probability density function (PDF) of each HMM state is represented by a single Gaussian distribution with a diagonal covariance matrix, and the distribution parameters are estimated under the maximum likelihood (ML) criterion [2]. Because single Gaussian distributions are used as the state PDFs, the outputs of maximum output probability parameter generation tend to distribute near the modes (also the means) of the Gaussians, which are estimated by averaging observations with similar context descriptions during ML training. Although this averaging process improves the robustness of parameter generation, the detailed characteristics of the spectral parameters are lost. Therefore, the reconstructed spectral envelopes are over-smoothed, which leads to a muffled voice quality in the synthetic speech. Existing refinements of acoustic modeling include increasing the number of Gaussians for each HMM state [4], reformulating the HMM as a trajectory model [9], and improving the model training criterion by minimizing the generation error [10].



In order to alleviate the over-smoothing effect, many improved parameter generation methods have also been proposed, such as modifying the parameter generation criterion by integrating a global variance model [11] or minimizing model divergences [12], post-filtering after parameter generation [6], [13], using real speech parameters or segments to generate the speech waveform [14], [15], or sampling trajectories from the predictive distribution [16], [17]. In this paper, we propose a new spectral modeling method which copes with the first two factors mentioned above. First, the raw spectral envelopes extracted by the STRAIGHT vocoder are utilized directly, without further deriving spectral parameters from them during feature extraction. Compared with the high-level¹ spectral parameters, such as mel-cepstra or line spectral pairs, the low-level spectral envelopes are more physically meaningful and more directly related to the subjective perception of speech quality. Thus, the influence of spectral parameter extraction on the spectral modeling can be avoided. A similar approach can be found in [18], where spectral envelopes derived from harmonic amplitudes replace the mel-cepstra for HMM-based Arabic speech synthesis and a naturalness improvement is achieved. Second, graphical models with multiple hidden layers, such as restricted Boltzmann machines (RBMs) [19] and deep belief networks (DBNs) [20], are introduced to represent the distribution of the spectral envelopes at each HMM state instead of a single Gaussian distribution. An RBM is a bipartite undirected graphical model with a two-layer architecture, and a DBN contains more hidden layers, which can be estimated using a stack of RBMs. Both models are better at describing the distribution of high-dimensional observations with cross-dimension correlations, i.e., the spectral envelopes, than the single Gaussian distribution and the Gaussian mixture model (GMM). Acoustic modeling, which describes the production, perception, and distribution of speech signals, has always been an important research topic in speech signal processing [21]. In recent years, RBMs and DBNs have been successfully applied to modeling speech signals, such as spectrogram coding [22], speech recognition [23], [24], and acoustic-articulatory inversion mapping [25], where they mainly act as pre-training methods for a deep autoencoder or a deep neural network (DNN). The architectures used in deep learning as applied to speech processing have been motivated by the multi-layered structures in both speech production and perception, involving phonological features, motor control, articulatory dynamics, and acoustic and auditory parameters [26], [27]. Approaches applying RBMs, DBNs, and other deep learning methods to statistical parametric speech synthesis have also been studied very recently [1], [28]-[30]. In [28], a DNN-based statistical parametric speech synthesis method is presented, which maps the input context information to the acoustic features using a neural network with a deep structure. In [29], a DNN which is pre-trained by DBN learning is adopted as a feature extractor for Gaussian-process-based F0 contour prediction. Furthermore, RBMs and DBNs can be used as density models instead of as DNN initialization methods for the speech synthesis application. In [30], a single DBN model is trained to represent the joint distribution between the tonal syllable ID and the acoustic features. In [1], a set of RBMs is estimated to describe the distributions of the spectral envelopes in the context-dependent HMM states. In this paper, we extend our previous work in [1] by incorporating the dynamic features of the spectral envelopes into the RBM modeling and by extending the RBMs to DBNs, which have more layers of hidden units.

This paper is organized as follows. In Section II, we briefly review the basic techniques of RBMs and DBNs. In Section III, we describe the details of our proposed method. Section IV reports our experimental results. Section V gives the conclusion and a discussion of our future work.

¹Here, the "level" refers to the steps of the signal processing procedures involved in the spectral feature extraction. The high-level spectral parameters are commonly derived from the low-level ones by functional representation and parameterization.

II. RESTRICTED BOLTZMANN MACHINES AND DEEP BELIEF NETWORKS

A. Restricted Boltzmann Machines

An RBM is a kind of bipartite undirected graphical model (i.e., a Markov random field) which is used to describe the dependency among a set of random variables using a two-layer architecture [19]. In this model, the visible stochastic units v = [v_1, ..., v_V]^T are connected to the hidden stochastic units h = [h_1, ..., h_H]^T as shown in Fig. 1(a), where V and H are the numbers of units of the visible and hidden layers respectively, and ^T denotes the matrix transpose. Assuming v_i \in \{0, 1\} and h_j \in \{0, 1\} are both binary stochastic variables, the energy function of the state \{v, h\} is defined as

E(v, h) = -\sum_{i=1}^{V} \sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{i=1}^{V} a_i v_i - \sum_{j=1}^{H} b_j h_j,  (1)

where w_{ij} represents the symmetric interaction between v_i and h_j, and a_i and b_j are bias terms. The model parameters are composed of W = [w_{ij}]_{V \times H}, a = [a_1, ..., a_V]^T, and b = [b_1, ..., b_H]^T. The joint distribution over the visible and hidden units is defined as

P(v, h) = \frac{1}{Z} \exp(-E(v, h)),  (2)

where

Z = \sum_{v} \sum_{h} \exp(-E(v, h))  (3)

is the partition function, which can be estimated using the annealed importance sampling (AIS) method [31]. Therefore, the probability density function over the visible vector can be calculated as

P(v) = \frac{1}{Z} \sum_{h} \exp(-E(v, h)).  (4)

Given a training set, the RBM model parameters \{W, a, b\} can be estimated by maximum likelihood learning using the contrastive divergence (CD) algorithm [32].


Fig. 1. The graphical model representations for (a) an RBM and (b) a three-hidden-layer DBN.

An RBM can also be applied to model the distribution of real-valued data (e.g., the speech parameters) by adopting its Gaussian-Bernoulli form, which means that v_i are real-valued and h_j \in \{0, 1\} are binary. Thus, the energy function of the state \{v, h\} is defined as

E(v, h) = \sum_{i=1}^{V} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{i=1}^{V} \sum_{j=1}^{H} w_{ij} \frac{v_i}{\sigma_i} h_j - \sum_{j=1}^{H} b_j h_j,  (5)

where the variance parameters \sigma_i^2 are commonly fixed to a predetermined value instead of being learned from the training data [33].
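To make the CD training concrete, here is a minimal numpy sketch of one CD-1 update for a Gaussian-Bernoulli RBM with unit variances, matching the energy function (5). The function name and array layout are our own; the default learning rate follows the value reported later in Section IV-A, and a full trainer would loop this update over mini-batches and epochs (Section IV-A uses a batch size of 10 and 200 epochs).

import numpy as np

def cd1_update(v0, W, a, b, lr=1e-4):
    # One CD-1 step for a Gaussian-Bernoulli RBM with sigma_i = 1, cf. (5).
    # v0: (batch, V) real-valued frames; W: (V, H); a: (V,) visible bias; b: (H,) hidden bias.
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    ph0 = sigmoid(v0 @ W + b)                      # positive phase: P(h=1|v0)
    h0 = (np.random.rand(*ph0.shape) < ph0) * 1.0  # sample binary hidden states
    v1 = a + h0 @ W.T                              # one Gibbs step: mean of P(v|h0)
    ph1 = sigmoid(v1 @ W + b)                      # negative phase: P(h=1|v1)
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n        # <v h>_data - <v h>_model
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b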

B. Deep Belief Networks

A deep belief network (DBN) is a probabilistic generative model which is composed of many layers of hidden units [20]. The graphical model representation for a three-hidden-layer DBN is shown in Fig. 1(b). In this model, each layer captures the correlations among the activities of the hidden features in the layer below. The top two layers of the DBN form an undirected bipartite graph. The lower layers form a directed graph with a top-down direction to generate the visible units. Mathematically, the joint distribution over the visible and all hidden units can be written as

P(v, h^{(1)}, ..., h^{(L)}) = P(v | h^{(1)}) \left[ \prod_{l=1}^{L-2} P(h^{(l)} | h^{(l+1)}) \right] P(h^{(L-1)}, h^{(L)}),  (6)

where h^{(l)} = [h_1^{(l)}, ..., h_{H_l}^{(l)}]^T is the hidden stochastic vector of the l-th hidden layer, H_l is the dimensionality of h^{(l)}, and L is the number of hidden layers. The joint distribution P(h^{(L-1)}, h^{(L)}) is represented by an RBM as in (2) with the weight matrix W^{(L)} and the bias vectors b^{(L-1)} and b^{(L)}. P(v | h^{(1)}) and P(h^{(l)} | h^{(l+1)}) are represented by sigmoid belief networks [34]. Each sigmoid belief network is described by a weight matrix W^{(l+1)} and a bias vector b^{(l)}. Assuming v_i are real-valued and h_j^{(l)} are binary, the dependency between v and h^{(1)} in the sigmoid belief network is described by

P(v | h^{(1)}) = N(v; W^{(1)} h^{(1)} + b^{(0)}, \Sigma),  (7)

where N(\cdot; \mu, \Sigma) denotes a Gaussian distribution, and \Sigma turns into an identity matrix when the variances \sigma_i^2 are fixed to 1 during model training. For l \in \{1, ..., L-2\}, the dependency between two adjacent hidden layers is represented by

P(h_j^{(l)} = 1 | h^{(l+1)}) = \sigma\left( w_{j\cdot}^{(l+1)} h^{(l+1)} + b_j^{(l)} \right),  (8)

where \sigma(x) = 1 / (1 + e^{-x}) is the sigmoid function and w_{j\cdot}^{(l+1)} denotes the j-th row of W^{(l+1)}. For an L-hidden-layer DBN, its model parameters are composed of \{W^{(1)}, ..., W^{(L)}, b^{(0)}, ..., b^{(L)}\}. Further, the marginal distribution of the visible variables for a DBN can be written as

P(v) = \sum_{h^{(1)}, ..., h^{(L)}} P(v, h^{(1)}, ..., h^{(L)}).  (9)

Given the training samples of the visible units, it is difficult to estimate the model parameters of a DBN directly under the maximum likelihood criterion due to the complex model structure with multiple hidden layers. Therefore, a greedy learning algorithm has been proposed and is popularly applied to train the DBN in a layer-by-layer manner [20]. A stack of RBMs is used in this algorithm. First, it estimates the parameters \{W^{(1)}, b^{(0)}, b^{(1)}\} of the first-layer RBM to model the visible training data. Then, it freezes the parameters of the first layer and draws samples from P(h^{(1)} | v) to train the next-layer RBM, where

P(h_j^{(1)} = 1 | v) = \sigma\left( v^T w_{\cdot j}^{(1)} + b_j^{(1)} \right),  (10)

and w_{\cdot j}^{(1)} denotes the j-th column of W^{(1)}. This training procedure is conducted recursively until it reaches the top layer and obtains \{W^{(L)}, b^{(L)}\}. It has been proved that this greedy learning algorithm can improve a lower bound on the log-likelihood of the training samples by adding each new hidden layer [20], [31]. Once the model parameters are estimated, calculating the log-probability that a DBN assigns to training or test data by (9) directly is also computationally intractable. A lower bound on the log-probability can be estimated by combining the AIS-based partition function estimation with approximate inference [31].
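The greedy stacking procedure can be sketched as follows, assuming a hypothetical single-RBM trainer train_rbm (e.g., the CD-1 routine above wrapped in an epoch loop) that returns the fitted parameters (W, a, b) of one layer:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_dbn(v_data, hidden_sizes, train_rbm):
    # Greedy stacking [20]: fit an RBM, then propagate the data upward by
    # sampling h ~ P(h|v) as in (10) and fit the next RBM on those samples.
    layers, x = [], v_data
    for h_size in hidden_sizes:
        W, a, b = train_rbm(x, h_size)
        layers.append((W, a, b))
        p_h = sigmoid(x @ W + b)
        x = (np.random.rand(*p_h.shape) < p_h) * 1.0
    return layers          # layers[-1] plays the role of the top RBM in (6)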

III. SPECTRAL ENVELOPE MODELING USING RBMs AND DBNs

A. HMM-Based Parametric Speech Synthesis

At first, the conventional HMM-based parametric speech synthesis method is briefly reviewed. It consists of a training stage and a synthesis stage. During training, the F0 and spectral parameters are extracted from the waveforms contained in the training set. Then a set of context-dependent HMMs \lambda are estimated to maximize the likelihood function P(O | \lambda) for the training acoustic features. Here O = [o_1^T, ..., o_T^T]^T is the observation feature sequence and T is the length of the sequence. The observation feature vector o_t \in R^{3D} for the t-th frame typically consists of static acoustic parameters c_t \in R^D and their delta and acceleration components as

o_t = [c_t^T, \Delta c_t^T, \Delta^2 c_t^T]^T,  (11)


where D is the dimension of the static component; the dynamic components are commonly calculated as

\Delta c_t = 0.5 (c_{t+1} - c_{t-1}), \quad 1 < t < T,  (12)

\Delta c_t = 0, \quad t = 1 \ \text{or} \ t = T,  (13)

and

\Delta^2 c_t = c_{t+1} - 2 c_t + c_{t-1}, \quad 1 < t < T,  (14)

\Delta^2 c_t = 0, \quad t = 1 \ \text{or} \ t = T.  (15)

Therefore, the complete feature sequence O can be considered to be a linear transform of the static feature sequence C = [c_1^T, ..., c_T^T]^T as

O = M C,  (16)

where M \in R^{3DT \times DT} is determined by the delta and acceleration calculation functions in (12)-(15) [4]. A multi-space probability distribution (MSD) [35] is applied to incorporate a distribution for F0 into the probabilistic framework of the HMM, considering that F0 is only defined for voiced speech frames. In order to deal with the data-sparsity problem of context-dependent model training with extensive context features, a decision-tree-based model clustering technique that uses a minimum description length (MDL) criterion [36] to guide the tree construction is adopted after the initial training of the context-dependent HMMs. Next, a state alignment is conducted using the trained HMMs to train context-dependent state duration probabilities [2] for state duration prediction. A single-mixture Gaussian distribution is used to model the duration probability for each state. A decision-tree-based model clustering technique is similarly applied to these duration distributions.
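As an illustration of (16), the sketch below builds the window matrix M for the delta and acceleration definitions in (12)-(15). It is a dense construction for clarity; practical systems keep M sparse or banded. The zero rows at the sequence boundaries follow the boundary convention of (13) and (15) as reconstructed above.

import numpy as np

def window_matrix(T, D):
    # M in (16): O = M C, with O stacking [c_t, delta c_t, delta^2 c_t] per frame.
    M = np.zeros((3 * D * T, D * T))
    I = np.eye(D)
    for t in range(T):
        r = 3 * D * t
        M[r:r + D, D * t:D * (t + 1)] = I                          # static copy
        if 0 < t < T - 1:
            M[r + D:r + 2 * D, D * (t + 1):D * (t + 2)] = 0.5 * I  # delta, cf. (12)
            M[r + D:r + 2 * D, D * (t - 1):D * t] = -0.5 * I
            M[r + 2 * D:r + 3 * D, D * (t - 1):D * t] = I          # accel, cf. (14)
            M[r + 2 * D:r + 3 * D, D * t:D * (t + 1)] = -2.0 * I
            M[r + 2 * D:r + 3 * D, D * (t + 1):D * (t + 2)] = I
    return M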

At the synthesis stage, the maximum output probability parameter generation algorithm is used to generate the acoustic parameters [4]. The result of front-end linguistic analysis on the input text is used to determine the sentence HMM \lambda. The state sequence q = \{q_1, ..., q_T\} is predicted using the trained state duration probabilities [2]. Then, the sequence of speech features is predicted by maximizing P(O | \lambda, q). Considering the constraints between static and dynamic features as in (16), the parameter generation criterion can be rewritten as

C^* = \arg\max_{C} P(M C | \lambda, q),  (17)

where C^* is the generated static feature sequence. If the emission distribution of each HMM state is represented by a single Gaussian distribution, a closed-form solution to (17) can be derived. By setting

\frac{\partial \log P(M C | \lambda, q)}{\partial C} = 0,  (18)

we obtain

C^* = (M^T U^{-1} M)^{-1} M^T U^{-1} \mu,  (19)

where \mu = [\mu_{q_1}^T, ..., \mu_{q_T}^T]^T and U = \text{diag}[U_{q_1}, ..., U_{q_T}] are the mean vector and covariance matrix of the sentence as decided by the state sequence [4].
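The closed-form solution (19) amounts to a single weighted least-squares solve. Below is a minimal numpy sketch under the diagonal-covariance assumption used in this paper; the function name is ours, and a production implementation would exploit the banded structure of M^T U^{-1} M (e.g., a banded Cholesky solver) rather than the dense solve shown here.

import numpy as np

def generate_static_features(mu, var, M):
    # Solve (19): C* = (M^T U^-1 M)^-1 M^T U^-1 mu, with U = diag(var).
    # mu, var: (3*D*T,) means/variances concatenated over the state sequence;
    # M: (3*D*T, D*T) window matrix from (16).
    u_inv = 1.0 / var
    A = M.T @ (u_inv[:, None] * M)     # M^T U^-1 M
    rhs = M.T @ (u_inv * mu)           # M^T U^-1 mu
    return np.linalg.solve(A, rhs)     # generated static sequence C*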

Fig. 2. Flowchart of our proposed method. The modules in solid lines represent the procedures of the conventional HMM-based speech synthesis using high-level spectral parameters, where "CD-HMM" stands for "context-dependent HMM." The modules in dashed lines describe the add-on procedures of our proposed method for modeling the spectral envelopes using RBMs or DBNs.

B. Spectral Envelope Modeling and Generation Using RBM and DBN

In this paper, we improve the conventional spectral modeling method in HMM-based parametric speech synthesis in two respects. First, the raw spectral envelopes extracted by the STRAIGHT vocoder are modeled directly, without further deriving high-level spectral parameters. Second, the RBM and DBN models are adopted to replace the single Gaussian distribution at each HMM state. In order to simplify the model training with high-dimensional spectral features, the decision trees for model clustering and the state alignment results are assumed to be given when the spectral envelopes are modeled. Thus, we can focus on comparing the performance of different models on the clustered model estimation. In the current implementation, the conventional context-dependent model training using high-level spectral parameters and single Gaussian state PDFs is conducted first to obtain the model clustering and state alignment results.

The flowchart of our proposed method is shown in Fig. 2. During the acoustic feature extraction using the STRAIGHT vocoder, the original linear frequency spectral envelopes² are stored besides the spectral parameters. The context-dependent HMMs for the conventional spectral parameters and F0 features are first estimated according to the method introduced in Section III-A. A single Gaussian distribution is used to model the spectral parameters at each HMM state. Next, a state alignment to the acoustic features is performed. The state boundaries are used to gather the spectral envelopes for each clustered context-dependent state. Similar to the high-level spectral parameters, the feature vector of the spectral envelope at each frame consists of static, velocity, and acceleration components as in (11)-(15). Then, an RBM or a DBN is estimated under the maximum likelihood criterion for each state according to the methods introduced in Section II. The model estimation of the RBMs or the DBNs is conducted only once, using the fixed state boundaries. Finally, the context-dependent RBM-HMMs or DBN-HMMs can be constructed for modeling the spectral envelopes.

²The mel-frequency spectral envelopes can also be used here to represent the speech perception properties. In this paper, we adopt the linear frequency spectral envelope because it is the most original description of the vocal tract characteristics, without any prior knowledge or assumptions about the spectral parameterization and speech perception.

At synthesis time, the same criterion as in (17) is followed to generate the spectral envelopes. The optimal sequence of spectral envelopes is estimated by maximizing the output probability from the RBM-HMM or the DBN-HMM of the input sentence. When single Gaussian distributions are adopted as the state PDFs, there is a closed-form solution, shown in (19), to this maximum output probability parameter generation with the constraints of dynamic features once the state sequence has been determined [4]. However, the marginal distribution defined in (4) for an RBM or in (9) for a DBN is much more complex than a single Gaussian, which makes a closed-form solution impractical. Therefore, a Gaussian approximation is applied before parameter generation to simplify the problem. For each HMM state s, a Gaussian distribution N(v; v_s^*, \Sigma_s) is constructed, where v is the spectral envelope feature vector containing static, velocity, and acceleration components;

v_s^* = \arg\max_{v} P(v)  (20)

is the mode estimated for each RBM or DBN, where P(v) is defined as in (4) or (9); and \Sigma_s is a diagonal covariance matrix estimated by calculating the sample covariances given the training samples of the state. These Gaussian distributions are used to replace the RBMs or the DBNs as the state PDFs at synthesis time. Therefore, the conventional parameter generation algorithm with the constraints of dynamic features can be followed to predict the spectral envelopes by solving the group of linear equations in (19). By incorporating the dynamic features of the spectral envelopes during model training and parameter generation, temporally smooth spectral trajectories can be generated at synthesis time. The detailed algorithms for the mode estimation in (20) for an RBM and for a DBN will be introduced in the following subsections.
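For illustration, this Gaussian approximation reduces to a small amount of glue code around the mode estimators of Sections III-C and III-D and the generation routine of (19); the sketch below uses our own naming, and the variance floor is a hypothetical safeguard rather than a detail from the paper.

import numpy as np

def approximate_state_pdf(mode, state_frames):
    # Gaussian approximation of Section III-B: the mean is the mode v* from (20),
    # the diagonal Sigma_s is the sample variance of the frames aligned to the state.
    var = state_frames.var(axis=0) + 1e-6   # small floor for numerical safety
    return mode, var

# Per utterance: concatenate (mode, var) along the predicted state sequence
# into mu and var, then call generate_static_features(mu, var, M) as in (19).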

C. Estimating RBM Mode

Here, we consider the RBM of the Gaussian-Bernoulli form because the spectral envelope features are real-valued. Given the estimated model parameters \{W, a, b\} of an RBM, the probability density function (4) over the visible vector v can be further calculated as³

P(v) = \frac{1}{Z} \exp\left( -\frac{1}{2} \| v - a \|^2 \right) \prod_{j=1}^{H} \left( 1 + \exp( b_j + v^T w_{\cdot j} ) \right),  (21)

where w_{\cdot j} denotes the j-th column of the matrix W. Because there is no closed-form solution to (20) for an RBM, the gradient descent algorithm is adopted here, i.e.,

v^{(k+1)} = v^{(k)} + \eta \left. \frac{\partial \log P(v)}{\partial v} \right|_{v = v^{(k)}},  (22)

where k denotes the iteration number and \eta is the step size; the gradient is given by

\frac{\partial \log P(v)}{\partial v} = -(v - a) + \sum_{j=1}^{H} \sigma( b_j + v^T w_{\cdot j} ) \, w_{\cdot j}.  (23)

Thus, the estimated mode of the RBM model is determined by a non-linear transform of the model parameters \{W, a, b\}. In contrast to the single Gaussian distribution, this mode is no longer the Gaussian mean, which is estimated by averaging the corresponding training vectors under the maximum likelihood criterion.

Because the likelihood of an RBM is multimodal, the gradient descent optimization in (22) only leads to a local maximum, and the result is sensitive to the initialization of v^{(0)}. In order to find a representative mode, we first calculate the means of the conditional distributions P(h | v) for all training vectors v of the state. These means are averaged and made binary using a fixed threshold of 0.5 to obtain \bar{h}. Then, the initial v^{(0)} for the iterative updating in (22) is set to the mean of P(v | \bar{h}).

³The variance parameters \sigma_i^2 in (5) are fixed to 1 to simplify the notation.
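A numpy sketch of the mode search (22)-(23), including the data-driven initialization described in the preceding paragraph; the step size and iteration count are illustrative choices, not values reported in the paper.

import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_mode_init(W, a, b, train_v):
    # Average P(h|v) over the state's training frames, binarize at 0.5,
    # then start from the mean of the Gaussian P(v | h_bar).
    h_bar = (_sigmoid(train_v @ W + b).mean(axis=0) > 0.5) * 1.0
    return a + W @ h_bar

def rbm_mode(W, a, b, v0, eta=0.01, n_iter=500):
    # Gradient updates (22) using the gradient (23) of log P(v) for a
    # Gaussian-Bernoulli RBM (sigma_i = 1); converges to a local maximum.
    v = v0.copy()
    for _ in range(n_iter):
        grad = -(v - a) + W @ _sigmoid(b + v @ W)
        v = v + eta * grad
    return v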

D. Estimating DBN Mode

Estimating the mode of a DBN model is more complex than dealing with an RBM. The marginal distribution of the visible variables in (9) can be rewritten as

P(v) = \sum_{h^{(1)}} P(v | h^{(1)}) \sum_{h^{(2)}, ..., h^{(L)}} P(h^{(1)} | h^{(2)}) \cdots P(h^{(L-2)} | h^{(L-1)}) P(h^{(L-1)}, h^{(L)})  (24)

     = \sum_{h^{(1)}} P(v | h^{(1)}) P(h^{(1)}),  (25)

where P(v | h^{(1)}) is described in (7) and P(h^{(1)}) can be calculated by applying

P(h^{(l-1)}) = \sum_{h^{(l)}} P(h^{(l-1)} | h^{(l)}) P(h^{(l)})  (26)

recursively for each l from L-1 down to 2. The conditional distribution P(h^{(l-1)} | h^{(l)}) is represented by (8), and P(h^{(L-1)}) is the marginal distribution (4) of the RBM representing the top two hidden layers. Similar to the RBM mode estimation, the gradient descent algorithm could be applied here to optimize (25) once the values of P(h^{(1)}) are determined for all possible h^{(1)}. However, this will lead to an exponential complexity with respect to the number of hidden units at each hidden layer due to the summations in (25) and (26). Thus, such optimization becomes impractical unless the number of hidden units is reasonably small.

In order to get a practical solution to the DBN mode estimation, an approximation is made to (24) in this paper. The summation over all possible values of the hidden units is simplified by considering only the optimal hidden vectors at each layer, i.e.,

P(v) \approx P(v | \hat{h}^{(1)}),  (27)

where

\{\hat{h}^{(L-1)}, \hat{h}^{(L)}\} = \arg\max_{h^{(L-1)}, h^{(L)}} P(h^{(L-1)}, h^{(L)}),  (28)

and, for each l from L-1 down to 2,

\hat{h}^{(l-1)} = \arg\max_{h^{(l-1)}} P(h^{(l-1)} | \hat{h}^{(l)}).  (29)

The joint distribution P(h^{(L-1)}, h^{(L)}) in (28) is modeled by a Bernoulli-Bernoulli RBM according to the definition of the DBN in Section II-B. Because h^{(L-1)} and h^{(L)} are both binary stochastic vectors, the iterated conditional modes (ICM) algorithm [37] is adopted to solve (28). This algorithm determines the configuration that maximizes the joint probability of a Markov random field by iteratively maximizing the probability of each variable conditioned on the rest. Applying the ICM algorithm here, we update h^{(L-1)} by maximizing P(h^{(L-1)} | h^{(L)}) and update h^{(L)} by maximizing P(h^{(L)} | h^{(L-1)}) iteratively. Both of these conditional distributions are multivariate Bernoulli distributions without cross-dimension correlation [20]. The optimal configuration at each step can be determined simply by applying a threshold of 0.5 to each binary unit. The initial h^{(L-1)} for the iterative updating is set to be the \hat{h}^{(L-1)} obtained by solving (28) for the DBN with L-1 hidden layers.

For each l from L-1 down to 2, (29) can be solved recursively according to the conditional distribution in (8). After \hat{h}^{(1)} is determined, the mode of the DBN can be estimated by substituting (27) into (20). Considering that P(v | \hat{h}^{(1)}) is a Gaussian distribution as in (7), we have

v^* = W^{(1)} \hat{h}^{(1)} + b^{(0)}.  (30)
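The complete DBN mode estimation (27)-(30) can be sketched as follows. The parameter layout, one (W, b_below, b_above) triple per layer with the last triple being the top Bernoulli-Bernoulli RBM, is our own convention, and the ICM iteration count is illustrative; in practice the ICM loop would stop once the binary vectors no longer change.

import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_mode(layers, h_top_init, n_icm=50):
    # layers[l] = (W, b_below, b_above); layers[0] links v and h^(1),
    # layers[-1] is the top RBM over (h^(L-1), h^(L)).
    W_top, b_lo, b_hi = layers[-1]
    h_lo = h_top_init.copy()
    for _ in range(n_icm):                                    # ICM for (28):
        h_hi = (_sigmoid(h_lo @ W_top + b_hi) > 0.5) * 1.0    # alternate the two
        h_lo = (_sigmoid(W_top @ h_hi + b_lo) > 0.5) * 1.0    # conditionals at 0.5
    h = h_lo                                                  # estimated h^(L-1)
    for W, b_below, _ in reversed(layers[1:-1]):
        h = (_sigmoid(W @ h + b_below) > 0.5) * 1.0           # solve (29) per layer
    W1, b0, _ = layers[0]
    return W1 @ h + b0                     # (30): mean of the Gaussian P(v | h^(1))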

IV. EXPERIMENTS

A. Experimental Conditions

A 1-hour Chinese speech database produced by a professional female speaker was used in our experiments. It consisted of 1,000 sentences together with the segmental and prosodic labels. 800 sentences were selected randomly for training, and the remaining 200 sentences were used as a test set. The waveforms were recorded in 16 kHz/16-bit format.

When constructing the baseline system, 41-order mel-cepstra (including the 0-th coefficient for frame power) were derived from the spectral envelope by STRAIGHT analysis at a 5 ms frame shift. The F0 and spectral features consisted of static, velocity, and acceleration components. A 5-state left-to-right HMM structure with no skips was adopted to train the context-dependent phone models. The covariance matrix of the single Gaussian distribution at each HMM state was set to be diagonal. After the decision-tree-based model clustering, we obtained 1,612 context-dependent states in total for the mel-cepstral stream. The model parameters of these states were estimated by maximum likelihood training.

Fig. 3. The cumulative probability curve for the number of frames belonging to each context-dependent state. The arrows indicate the numbers of frames of the three example states used for the analysis in Section IV-B and Fig. 4.

In the spectral envelope modeling, the FFT length of the STRAIGHT analysis was set to 1024, which led to 513 × 3 = 1539 visible units in the RBMs and DBNs, corresponding to the spectral amplitudes within the frequency range of [0, 8 kHz] together with their dynamic components. After the HMMs for the mel-cepstra and F0 features were trained, a state alignment was conducted on the training set and the test set to assign the frames to each state for the spectral envelope modeling and testing. The cumulative probability curve for the number of frames belonging to each context-dependent state is illustrated in Fig. 3. From this figure, we can see that the numbers of training samples vary greatly among different states. For each context-dependent state, the logarithmized spectral amplitudes at each frequency point were normalized to zero mean and unit variance. CD learning with 1-step Gibbs sampling (CD1) was adopted for the RBM training, and the learning rate was 0.0001. The batch size was set to 10, and 200 epochs were executed for estimating each RBM. The DBNs were estimated following the greedy layer-by-layer training algorithm introduced in Section II-B.

B. Comparison Between GMM and RBM as State PDFs

At first, we compared the performance of the GMM and the RBM in modeling the distribution of mel-cepstra and spectral envelopes for an HMM state. Three representative states were selected for this experiment, which have 270, 650, and 1,530 training frames and 60, 130, and 410 test frames, respectively. As shown in Fig. 3, the numbers of training frames of these three states correspond to the 0.1, 0.5, and 0.9 cumulative probabilities calculated over the numbers of training frames of all 1,612 context-dependent states. GMMs and RBMs were trained under the maximum likelihood criterion to model these three states. The covariance matrices in the GMMs were set to be diagonal, and the number of Gaussian mixtures varied from 1 to 64. The number of hidden units in the RBMs varied from 1 to 1,000.

Fig. 4. The average log-probabilities on the training and test sets when modeling (a) the mel-cepstra and (b) the spectral envelopes of state A (left column), state B (middle column), and state C (right column) using different models. The numbers of training samples belonging to these three selected states are indicated in Fig. 3.

The average log-probabilities on the training and test sets for the different models and states are shown in Fig. 4 for the mel-cepstra and the spectral envelopes, respectively. Examining the difference between the training and test log-probabilities for both the mel-cepstra and the spectral envelopes, we see that the GMMs have a clear tendency toward over-fitting as the model complexity increases. This over-fitting effect becomes less significant when a larger training set is available. On the other hand, the RBM shows consistently good generalization ability as the number of hidden units increases. This can be attributed to the binary hidden units, which create an information bottleneck and act as an effective regularizer during model training.

The differences between the test log-probabilities of the best GMM or RBM models and the single Gaussian distributions for the three states are listed in Table I. From Fig. 4 and Table I, we can see that the model accuracy improvements obtained by using density models more complex than a single Gaussian distribution are relatively small when the mel-cepstra are used for spectral modeling. Once the spectral envelopes are used, such improvements become much more significant for both the GMM and RBM models. Besides, the RBM also gives much higher log-probability to the test data than the GMM when modeling the spectral envelopes. These results can be attributed to the fact that mel-cepstral analysis is a kind of decorrelation processing applied to the spectra. A GMM with multiple components is able to describe the inter-dimensional correlations of a multivariate distribution to some extent, even if diagonal covariance matrices are used. An RBM with H hidden units can be considered a GMM with 2^H structured mixture components according to (21). Therefore, it is good at analyzing the latent patterns embedded in high-dimensional raw data with inter-dimensional correlations.

Fig. 5. Visualization of the estimated weight matrices in the RBMs when modeling the spectral envelopes for the three states. The number of hidden units is 50, and only the first 513 rows of the weight matrices are drawn. Each column in the gray-scale figures corresponds to the weights connecting one hidden unit with the 513 visible units which compose the static component of the spectral envelope feature vector.

Fig. 5 shows the estimated weight matrices in the RBMs when modeling the spectral envelopes for the three states. We can see that the weight matrices are somewhat sparse, indicating that each hidden unit tries to capture the characteristics of the spectral envelope in specific frequency bands. This is similar to a frequency analysis for spectral envelopes which makes use of the amplitudes of the critical frequency bands context-dependently.


TABLE I
THE DIFFERENCES BETWEEN THE TEST LOG-PROBABILITIES OF THE BEST GMM OR RBM MODELS AND THE SINGLE GAUSSIAN DISTRIBUTIONS FOR THE THREE SELECTED STATES. THE NUMBERS IN THE BRACKETS INDICATE THE NUMBERS OF GAUSSIAN MIXTURES FOR THE GMMS AND THE NUMBERS OF HIDDEN UNITS IN THE RBMS WHICH LEAD TO THE HIGHEST LOG-PROBABILITIES ON THE TEST SET

TABLE II
THE AVERAGE LOG-PROBABILITIES ON THE TRAINING AND TEST SETS WHEN MODELING THE SPECTRAL ENVELOPES USING AN RBM OF 50 HIDDEN UNITS AND A TWO-HIDDEN-LAYER DBN OF 50 HIDDEN UNITS AT EACH LAYER

TABLE III
SUMMARY OF THE DIFFERENT SYSTEMS CONSTRUCTED IN THE EXPERIMENTS

For the spectral envelope modeling, we can further improve the model accuracy by training RBMs layer-by-layer and constructing a DBN. The average log-probabilities on the training and test sets when modeling the spectral envelopes using an RBM and a two-hidden-layer DBN are compared in Table II. Here, the lower-bound estimation [31] of the log-probability of a DBN is adopted. From this table, we can observe a monotonic increase of the test log-probabilities when using more hidden layers.

C. System Construction

Seven systems were constructed, and their performance was compared in our experiments. The definitions of these systems are given in Table III. As shown in Table I, the model accuracy improvement achieved by adopting distributions more complicated than the single Gaussian is not significant when the mel-cepstra are used as spectral features. Therefore, we focus on the performance of spectral envelope modeling using different forms of state PDFs in our experiments. Considering the computational complexity of training state PDFs for all context-dependent states, the maximum number of hidden units in the RBM and DBN models was set to 50. All these systems shared the same decision trees for model clustering and the same state boundaries, which were derived from the Baseline system. The F0 and duration models of the seven systems were identical.

D. Mode Estimation for the RBMs and DBNs

When constructing the RBM(10), RBM(50), DBN(50-50), and DBN(50-50-50) systems, the mode of each RBM or DBN trained for a context-dependent state was estimated for the Gaussian approximation, following the methods introduced in Section III-C and Section III-D. For each system, the average log-probabilities of the estimated modes and the sample means were calculated. The results are listed in Table IV. From this table, we see that the estimated modes have much higher log-probabilities than the sample means, which are known to have the highest probability under a single Gaussian distribution. This means that when RBMs or DBNs are adopted to represent the state PDFs, the feature vector with the highest output probability is no longer the sample mean. This implies the superiority of the RBM and DBN over the single Gaussian distribution in alleviating the over-smoothing problem during parameter generation under the maximum output probability criterion.

TABLE IV
AVERAGE LOG-PROBABILITIES OF THE SAMPLE MEANS AND THE ESTIMATED MODES FOR THE FOUR RBM OR DBN BASED SYSTEMS

Fig. 6. The spectral envelopes recovered from the modes of different systems for one HMM state.

The spectral envelopes recovered from the modes of the different systems for one HMM state⁴ are illustrated in Fig. 6. Here, only the static components of the spectral envelope feature vectors are drawn. The mode of the GMM(1) system is just the Gaussian mean vector. The mode of the GMM(8) system is approximated as the Gaussian mean of the mixture with the highest mixture weight. Comparing GMM(8) with GMM(1), we can see that using more Gaussian mixtures can help alleviate the over-smoothing effect on the spectral envelope. Besides, the estimated state modes of the RBM and DBN based systems have sharper formant structures than the GMM-based ones. Comparing RBM(50) with RBM(10), we can see the advantage of using more hidden units in an RBM, while the differences among the estimated modes of the RBM(50), DBN(50-50), and DBN(50-50-50) systems are less significant. We will investigate the performance of these systems further in the following subjective evaluation.

⁴This state is not one of the three states used in Section IV-B.


TABLE V
SUBJECTIVE PREFERENCE SCORES (%) AMONG SPEECH SYNTHESIZED USING THE BASELINE, GMM(8), RBM(10), AND RBM(50) SYSTEMS, WHERE N/P DENOTES "NO PREFERENCE" AND p MEANS THE p-VALUE OF A t-TEST BETWEEN THESE TWO SYSTEMS

E. Subjective Evaluation

Because the mel-cepstrum extraction can be considered a kind of linear transform of the logarithmized spectral envelope, the spectral envelope recovered from the mean of the mel-cepstra in a state is very close to the one recovered from the mean of the corresponding logarithmized spectral envelopes. Therefore, the Baseline and GMM(1) systems had very similar synthetic results, and the Baseline system was adopted as a representative of these two systems in the subjective evaluation to simplify the test design. For the GMM(8) system, the EM-based parameter generation algorithm [4] could be applied to predict the spectral envelope trajectories by iterative updating. In order to get a closed-form solution, we made a single Gaussian approximation to the GMMs at synthesis time by using only the Gaussian mixture with the highest mixture weight at each HMM state.

The first subjective evaluation compared the Baseline, GMM(8), RBM(10), and RBM(50) systems. Fifteen sentences out of the training database were selected and synthesized using each of these four systems.⁵ Five groups of preference tests were conducted, and each one compared two of the four systems, as shown in each row of Table V. Each pair of synthetic sentences was evaluated in random order by five native Chinese listeners. The listeners were asked to identify which sentence in each pair sounded more natural. Table V summarizes the preference scores among these four systems and the p-values given by the t-test. From this table, we can see that introducing density models more complex than the single Gaussian, such as the GMM and the RBM, to model the spectral envelopes at each HMM state achieves significantly better naturalness than the single-Gaussian-based methods. Compared with the GMM(8) system, the RBM(50) system has a much higher preference score for naturalness. This demonstrates the superiority of the RBM over the GMM in modeling the spectral envelope features. A comparison between the spectral envelopes generated by the Baseline system and the RBM(50) system is shown in Fig. 7. From this figure, we can observe the enhanced formant structures after modeling the spectral envelopes using RBMs. Besides, we can also see in Table V that the performance of the RBM-based systems is influenced by the number of hidden units used in the model definition, when comparing RBM(10) with RBM(50). These results are consistent with the formant sharpness of the estimated modes of the different systems shown in Fig. 6.

⁵Some examples of the synthetic speech using the seven systems listed in Table III can be found at http://staff.ustc.edu.cn/~zhling/DBNSyn/demo.html.

Fig. 7. The spectrograms of a segment of synthetic speech using (a) the Baseline system and (b) the RBM(50) system. These spectrograms are not calculated by STFT analysis of the synthetic waveforms. For the Baseline system, the spectrogram is drawn based on the spectral envelopes recovered from the generated mel-cepstra. For the RBM(50) system, the spectrogram is drawn based on the generated spectral envelopes directly.

TABLE VI
SUBJECTIVE PREFERENCE SCORES (%) AMONG THE RBM(50), DBN(50-50), AND DBN(50-50-50) SYSTEMS

In order to investigate the effect of extending the RBM to a DBN with more hidden layers, another subjective evaluation was conducted among the RBM(50), DBN(50-50), and DBN(50-50-50) systems. Another fifteen sentences out of the training database were used, and two groups of preference tests were conducted by five native Chinese listeners. The results are shown in Table VI. We can see that there are no significant differences among these three systems at the 0.05 significance level. Although we can improve the model accuracy by introducing more hidden layers, as shown in Table II, the naturalness of the synthetic speech is not improved correspondingly. One possible reason is the approximation we make in (27) when estimating the DBN mode.

F. Objective Evaluation

Besides the subjective evaluation, we also calculated the spectral distortions on the test set between the spectral envelopes generated by the systems listed in Table III and the ones extracted from the natural recordings. The synthetic spectral envelopes used the state boundaries of the natural recordings to simplify the frame alignment. Then, the natural and synthetic spectral envelopes at each frame were normalized to the same power, and the calculation of the spectral distortion between them followed the method introduced in [38]. For the Baseline system, the generated mel-cepstra were converted to spectral envelopes before the calculation. The average spectral distortions of all the systems are listed in Table VII. We can see that the objective evaluation results are inconsistent with the subjective preference scores shown in Table V. For example, the RBM(50) system has significantly better naturalness than the Baseline system in the subjective evaluation, while its average spectral distortion is the highest. The reason is that the spectral distortion in [38] is a Euclidean distance between two logarithmized spectral envelopes, which treats each dimension of the spectral envelopes independently and equally.


TABLE VII
AVERAGE SPECTRAL DISTORTIONS (SD) ON THE TEST SET BETWEEN THE SPECTRAL ENVELOPES GENERATED BY THE SYSTEMS LISTED IN TABLE III AND THE ONES EXTRACTED FROM THE NATURAL RECORDINGS

However, the superiority of our proposed method lies in providing a better representation of the cross-dimension correlations for the spectral envelope modeling, which cannot be reflected by this spectral distortion measurement. A similar inconsistency between subjective evaluation results and objective acoustic distortions for speech synthesis has been observed in [12], [39].

V. CONCLUSION AND FUTURE WORK

We have proposed an RBM and DBN based spectral envelope modeling method for statistical parametric speech synthesis in this paper. The spectral envelopes extracted by the STRAIGHT vocoder are modeled by an RBM or a DBN for each HMM state. At synthesis time, the mode vectors of the trained RBMs and DBNs are estimated and used in place of the Gaussian means for parameter generation. Our experimental results show the superiority of the RBM and DBN over the Gaussian mixture model in describing the distribution of spectral envelopes as density models and in mitigating the over-smoothing effect in the synthetic speech.

As discussed in Section I, there are also other approaches that can significantly reduce the over-smoothing and improve the quality of the synthetic speech, such as GV-based parameter generation [11] and post-filtering techniques [6], [13]. In this paper, we focused on the acoustic modeling to tackle the over-smoothing problem. It is worth investigating, in the future, alternative parameter generation and post-filtering algorithms that are appropriate for our proposed spectral envelope modeling method.

This paper makes only a preliminary exploration of applying the ideas of deep learning to statistical parametric speech synthesis. There are still several issues in the current implementation that require further investigation. First, it is worth examining the system performance when the number of hidden units in the RBMs keeps increasing. As shown in Fig. 3, the training samples are distributed among the context-dependent HMM states in a highly unbalanced manner. Thus, it may be difficult to optimize the model complexity for all states simultaneously. An alternative solution is to train the joint distribution between the observations and the context labels using a single network [20] which is estimated using all training samples. A similar approach for statistical parametric speech synthesis has been studied in [30], where the joint distribution between the tonal syllable ID and the spectral and excitation features is modeled using a multi-distribution DBN. Second, increasing the number of hidden layers in the DBNs did not achieve an improvement in our subjective evaluation. A better algorithm to estimate the mode of a DBN with less approximation is necessary. We plan, as future work, to implement the Gaussian approximation according to (25) when the number of hidden units is reasonably small and to compare its performance with our current implementation. Another strategy is to adopt sampled outputs rather than the model modes during parameter generation. As better density models, the RBM and DBN are more appropriate than the GMM for generating acoustic features by sampling, which may help make the synthetic speech less monotonous. Third, in the work presented in this paper, the decision trees for model clustering are still constructed using mel-cepstra and single Gaussian state PDFs. Extending the RBM and DBN modeling from PDF estimation for the clustered states to model clustering for the fully context-dependent states will also be a task of our future work. Besides the spectral envelopes used in this paper, it is also straightforward to apply our proposed method to the modeling and generation of other forms of speech parameters, such as the articulatory features recorded by electromagnetic articulography (EMA) for articulatory movement prediction [40], the joint distribution between the acoustic and articulatory features for articulatory control of HMM-based speech synthesis [41], and the joint spectral distribution between the source and target speakers for voice conversion [42].

REFERENCES

[1] Z.-H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using restricted Boltzmann machines for statistical parametric speech synthesis,” in Proc. ICASSP, 2013, pp. 7825–7829.

[2] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis,” in Proc. Eurospeech, 1999, pp. 2347–2350.

[3] K. Tokuda, H. Zen, and A. W. Black, “HMM-based approach to multilingual speech synthesis,” in Text to Speech Synthesis: New Paradigms and Advances, S. Narayanan and A. Alwan, Eds. Upper Saddle River, NJ, USA: Prentice-Hall, 2004.

[4] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in Proc. ICASSP, 2000, vol. 3, pp. 1315–1318.

[5] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, “Details of Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005,” IEICE Trans. Inf. Syst., vol. E90-D, no. 1, pp. 325–333, 2007.

[6] Z.-H. Ling, Y.-J. Wu, Y.-P. Wang, L. Qin, and R.-H. Wang, “USTC system for Blizzard Challenge 2006: An improved HMM-based speech synthesis method,” in Proc. Blizzard Challenge Workshop, 2006.

[7] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis,” Speech Commun., vol. 51, pp. 1039–1064, 2009.

[8] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, “Restructuring speech representations using pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Commun., vol. 27, pp. 187–207, 1999.

[9] H. Zen, K. Tokuda, and T. Kitamura, “Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences,” Comput. Speech Lang., vol. 21, no. 1, pp. 153–173, 2006.

[10] Y. Wu and R. Wang, “Minimum generation error training for HMM-based speech synthesis,” in Proc. ICASSP, 2006, pp. 89–92.

[11] T. Toda and K. Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816–824, 2007.

[12] Z.-H. Ling and L.-R. Dai, “Minimum Kullback-Leibler divergence parameter generation for HMM-based speech synthesis,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 5, pp. 1492–1502, Jul. 2012.

[13] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Mixed excitation for HMM-based speech synthesis,” in Proc. Eurospeech, 2001, pp. 2263–2266.

[14] J. Yu, M. Zhang, J.-H. Tao, and X. Wang, “A novel HMM-based TTS system using both continuous HMMs and discrete HMMs,” in Proc. ICASSP, 2007, pp. 709–712.

[15] Z.-H. Ling and R.-H. Wang, “HMM-based unit selection using frame sized speech segments,” in Proc. Interspeech, 2006, pp. 2034–2037.

[16] K. Tokuda, H. Zen, and T. Kitamura, “Reformulating the HMM as a trajectory model,” Tech. Rep. of IEICE, 2004.

[17] M. Shannon, H. Zen, and W. Byrne, “The effect of using normalized models in statistical speech synthesis,” in Proc. Interspeech, 2011, pp. 121–124.

[18] O. Abdel-Hamid, S. Abdou, and M. Rashwan, “Improving Arabic HMM based speech synthesis quality,” in Proc. Interspeech, 2006, pp. 1332–1335.

[19] P. Smolensky, “Information processing in dynamical systems: Foundations of harmony theory,” in Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA, USA: MIT Press, 1986, vol. 1, ch. 6, pp. 194–281.

[20] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.

[21] L. Deng and D. O’Shaughnessy, Speech Processing: A Dynamic and Optimization-Oriented Approach. Boca Raton, FL, USA: CRC, 2003.

[22] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. E. Hinton, “Binary coding of speech spectrograms using a deep auto-encoder,” in Proc. Interspeech, 2010, pp. 1692–1695.

[23] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012.

[24] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.

[25] B. Uria, S. Renals, and K. Richmond, “A deep neural network for acoustic-articulatory speech inversion,” in Proc. NIPS 2011 Workshop Deep Learn. Unsupervised Feature Learn., 2011.

[26] L. Deng and D. X. Sun, “A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features,” J. Acoust. Soc. Amer., vol. 95, no. 5, pp. 2702–2722, 1994.

[27] L. Deng and J. Ma, “Spontaneous speech recognition using a statistical coarticulatory model for the vocal-tract-resonance dynamics,” J. Acoust. Soc. Amer., vol. 108, no. 6, pp. 3036–3050, 2000.

[28] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Proc. ICASSP, 2013, pp. 7962–7966.

[29] R. Fernandez, A. Rendel, B. Ramabhadran, and R. Hoory, “F0 contour prediction with a deep belief network-Gaussian process hybrid model,” in Proc. ICASSP, 2013, pp. 6885–6889.

[30] S.-Y. Kang, X.-J. Qian, and H. Meng, “Multi-distribution deep belief network for speech synthesis,” in Proc. ICASSP, 2013, pp. 8012–8016.

[31] R. Salakhutdinov, “Learning deep generative models,” Ph.D. dissertation, Univ. of Toronto, Toronto, ON, Canada, 2009.

[32] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Comput., vol. 14, no. 8, pp. 1771–1800, 2002.

[33] G. E. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[34] R. M. Neal, “Connectionist learning of belief networks,” Artif. Intell., vol. 56, no. 1, pp. 71–113, 1992.

[35] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, “Multi-space probability distribution HMM (invited paper),” IEICE Trans. Inf. Syst., vol. E85-D, no. 3, pp. 455–464, 2002.

[36] K. Shinoda and T. Watanabe, “MDL-based context-dependent subword modeling for speech recognition,” J. Acoust. Soc. Jpn. (E), vol. 21, no. 2, pp. 79–86, 2000.

[37] J. E. Besag, “On the statistical analysis of dirty pictures,” J. R. Statist. Soc., Ser. B, vol. 48, no. 3, pp. 259–302, 1986.

[38] K. K. Paliwal and B. S. Atal, “Efficient vector quantization of LPC parameters at 24 bits/frame,” IEEE Trans. Speech Audio Process., vol. 1, no. 1, pp. 3–14, Jan. 1993.

[39] T. Toda, “Modeling of speech parameter sequence considering global variance for HMM-based speech synthesis,” in Hidden Markov Models, Theory and Applications, P. Dymarski, Ed. New York, NY, USA: InTech, 2011.

[40] Z.-H. Ling, K. Richmond, and J. Yamagishi, “An analysis of HMM-based prediction of articulatory movements,” Speech Commun., vol. 52, no. 10, pp. 834–846, 2010.

[41] Z.-H. Ling, K. Richmond, J. Yamagishi, and R.-H. Wang, “Integrating articulatory features into HMM-based parametric speech synthesis,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1171–1185, Aug. 2009.

[42] L.-H. Chen, Z.-H. Ling, Y. Song, and L.-R. Dai, “Joint spectral distribution modeling using restricted Boltzmann machines for voice conversion,” in Proc. Interspeech, 2013.

Zhen-Hua Ling (M’10) received the B.E. degree in electronic information engineering and the M.S. and Ph.D. degrees in signal and information processing from the University of Science and Technology of China, Hefei, China, in 2002, 2005, and 2008, respectively. From October 2007 to March 2008, he was a Marie Curie Fellow at the Centre for Speech Technology Research (CSTR), University of Edinburgh, UK. From July 2008 to February 2011, he was a joint postdoctoral researcher at the University of Science and Technology of China and iFLYTEK Co., Ltd., China. He is currently an associate professor at the University of Science and Technology of China. He also worked at the University of Washington, USA, as a visiting scholar from August 2012 to August 2013. His research interests include speech processing, speech synthesis, voice conversion, speech analysis, and speech coding. He was awarded the IEEE Signal Processing Society Young Author Best Paper Award in 2010.

Li Deng (SM’92–F’04) received the Ph.D. degree from the University of Wisconsin-Madison. He joined the Department of Electrical and Computer Engineering, University of Waterloo, Ontario, Canada, in 1989 as an assistant professor, where he became a tenured full professor in 1996. In 1999, he joined Microsoft Research, Redmond, WA, as a Senior Researcher, where he is currently a Principal Researcher. Since 2000, he has also been an Affiliate Full Professor and graduate committee member in the Department of Electrical Engineering at the University of Washington, Seattle. Prior to MSR, he also worked or taught at the Massachusetts Institute of Technology, ATR Interpreting Telecom. Research Lab. (Kyoto, Japan), and HKUST. In the general areas of speech/language technology, machine learning, and signal processing, he has published over 300 refereed papers in leading journals and conferences and 4 books. He is a Fellow of the Acoustical Society of America, a Fellow of the IEEE, and a Fellow of ISCA. He served on the Board of Governors of the IEEE Signal Processing Society (2008–2010). More recently, he served as Editor-in-Chief for the IEEE Signal Processing Magazine (2009–2011), which earned the highest impact factor among all IEEE publications and for which he received the 2011 IEEE SPS Meritorious Service Award. He currently serves as Editor-in-Chief for the IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING and General Chair of ICASSP-2013. His recent technical work since 2009 on, and leadership in, industry-scale deep learning with colleagues and academic collaborators has created significant impact in speech recognition, signal processing, and related applications.

Dong Yu (M’97–SM’06) joined Microsoft Corporation in 1998 and the Microsoft Speech Research Group in 2002, where he currently is a senior researcher. He holds a Ph.D. degree in computer science from the University of Idaho, an MS degree in computer science from Indiana University at Bloomington, an MS degree in electrical engineering from the Chinese Academy of Sciences, and a BS degree (with honor) in electrical engineering from Zhejiang University (China). His current research interests include speech processing, robust speech recognition, discriminative training, and machine learning. He has published over 120 papers in these areas and is the inventor/coinventor of more than 50 granted/pending patents.

His most recent work focuses on deep learning and its application in large vocabulary speech recognition. The context-dependent deep neural network hidden Markov model (CD-DNN-HMM) he co-proposed and developed has been seriously challenging the dominant position of the conventional GMM-based system for large vocabulary speech recognition.

Dr. Dong Yu is a senior member of IEEE, a member of ACM, and a member of ISCA. He is currently serving as a member of the IEEE Speech and Language Processing Technical Committee (2013–) and an associate editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING (2011–). He has served as an associate editor of the IEEE Signal Processing Magazine (2008–2011) and the lead guest editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING Special Issue on Deep Learning for Speech and Language Processing (2010–2011).