48
Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013

Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

  • Upload
    others

  • View
    15

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Deep Learning in Speech SynthesisHeiga ZenGoogleAugust 31st, 2013

Page 2: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Outline

Background

Deep Learning

Deep Learning in Speech SynthesisMotivationDeep learning-based approachesDNN-based statistical parametric speech synthesisExperiments

Conclusion

Page 3: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Text-to-speech as sequence-to-sequence mapping

• Automatic speech recognition (ASR)Speech (continuous time series) → Text (discrete symbol sequence)

• Machine translation (MT)Text (discrete symbol sequence) → Text (discrete symbol sequence)

• Text-to-speech synthesis (TTS)Text (discrete symbol sequence) → Speech (continuous time series)

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 1 of 50

Page 4: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Speech production process

text(concept)

frequencytransfercharacteristics

magnitudestart--end

fundamentalfrequency

mod

ulat

ion

of c

arrie

r wav

eby

spe

ech

info

rmat

ion

fund

amen

tal f

req

voic

ed/u

nvoi

ced

freq

trans

fer c

har

air flow

Sound sourcevoiced: pulseunvoiced: noise

speech

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 2 of 50

Page 5: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Typical flow of TTS system

Sentence segmentaitonWord segmentationText normalization

Part-of-speech taggingPronunciation

Prosody prediction

Waveform generation

TEXT

Text analysis

SYNTHESIZEDSPEECH

Speech synthesisdiscrete ⇒ discrete

discrete ⇒ continuous

NLP

Speech

Frontend

Backend

This talk focuses on backend

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 3 of 50

Page 6: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Statistical parametric speech synthesis (SPSS) [2]

Modeltraining

Text Text

Featureextraction

Parametergeneration

WaveformsynthesisSpeech Synthesized

Speech

• Large data + automatic training→ Automatic voice building

• Parametric representation of speech→ Flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model→ HMM-based speech synthesis system (HTS) [1]

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 4 of 50

Page 7: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics− Small footprint− Robustness

• Drawback

− Quality

• Major factors for quality degradation [2]

− Vocoder− Acoustic model → Deep learning− Oversmoothing

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 5 of 50

Page 8: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Deep learning [3]

• Machine learning methodology using multiple-layered models

• Motivated by brains, which organize ideas and concepts hierarchically

• Typically artificial neural network (NN) w/ 3 or more levels ofnon-linear operations

Shallow Neural Network Deep Neural Network (DNN)

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 6 of 50

Page 9: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Basic components in NN

Non-linear unit Network of units

j

i

hi = f(z i)

hj

xi... ...z j =

i

xiwij

Examples of activation functions

Logistic sigmoid: f(zj) =1

1 + e−zj

Hyperbolic tangent: f(zj) = tanh (zj)

Rectified linear: f(zj) = max (zj , 0)

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 7 of 50

Page 10: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Deep architecture

• Logistic regression → depth=1

• Kernel machines, decision trees → depth=2

• Ensemble learning (e.g., Boosting [4], tree intersection [5]) →depth++

• N -layer neural network → depth=N + 1In

put v

ecto

r x

Input units

Hidden units

Output vector y

Output units............

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 8 of 50

Page 11: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Difficulties to train DNN

• NN w/ many layers used to give worse performance than NNw/ few layers

− Slow to train− Vanishing gradients [6]− Local minimum

• Since 2006, training DNN significantly improved

− GPU [7]− More data− Unsupervised pretraining (RBM [8], auto-encoder [9])

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 9 of 50

Page 12: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Restricted Boltzmann Machine (RBM) [11]

h ={0,1}jh

v v ={0,1}i

W

• Undirected graphical model

• No connection between visible & hidden units

p(v,h |W ) =1

Z(W )exp {−E(v,h;W )} wij : weight

E(v,h;W ) = −∑

i

bivi −∑

j

cjhj −∑

i,j

viwijhj bi, cj : bias

• Parameters can be estimated by contrastive divergence learning [10]

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 10 of 50

Page 13: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Deep Belief Network (DBN) [8]

• RBMs are stacked to form a DBN• Layer-wise training of RBM is repeated over multiple layers

(pretraining)• Joint optimization as DBN or supervised learning as DNN with

additional final layer (fine tuning)

⇒copy

⇒stacking

DBN

RBM1

RBM2

Input

(Jointly toptimize as DBN)

Supervisedlearningas DNN⇒

DNN

Input

Output

⇒Input

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 11 of 50

Page 14: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Representation learning

Input

Unsupervisedlayer-wise

pre-training

⇒Input

Output

Addingoutput layer

(e.g., softmax)

Input

Output

Supervisedfine-tuning

(backpropagation)

DBN(feature extractor)

DBN + classification layer(feature → classifier)

DNN(feature + classifier)

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 12 of 50

Page 15: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Success of DNN in various machine learning tasks

Tasks

• Vision [12]

• Language

• Speech [13]

Word error rates (%)

Hours of HMM-GMM HMM-GMMTask data HMM-DNN w/ same data w/ more data

Voice Input 5,870 12.3 N/A 16.0YouTube 1,400 47.6 52.3 N/A

Products

• Personalized photo search [14, 15]

• Voice search [16, 17].

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 13 of 50

Page 16: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Conventional HMM-GMM [1]

• Decision tree-clustered HMM with GMM state-output distributions

Acousticfeatures y

Acousticfeatures y

...

Linguisticfeatures x

yes no

yes no yes no

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 15 of 50

Page 17: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Limitation of HMM-GMM approach (1)Hard to integrate feature extraction & modeling

. . . . . .c1 c2 c3 c4 c5 cT. .

. . . . . .s1 s2 s3 s4 s5 sT. .

Cepstra

Spectra⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ ⇒ dimensinality

reduction

• Typically use lower dimensional approximation of speech spectrum asacoustic feature (e.g., cepstrum, line spectral pairs)

• Hard to model spectrum directly by HMM-GMM due to highdimensionality & strong correlation

→ Waveform-level model [18], mel-cepstral analysis-integrated model[19], STAVOCO [20], MGE-LSD [21]

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 16 of 50

Page 18: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Limitation of HMM-GMM approach (2)Data fragmentation

yes noyes no

...

yes no

yes no yes no

Acoustic space

• Linguistic-to-acoustic mapping by decision trees

• Decision tree splits input space into sub-clusters

• Inefficient to represent complex dependencies between linguistic &acoustic features

→ Boosting [4], tree intersection [5], product of experts [22]Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 17 of 50

Page 19: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Motivation to use deep learning in speech synthesis

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently− Layered architecture with non-linear operations offers feature

extraction to be integrated with acoustic modeling

• Distributed representation

− Can be exponentially more efficient than fragmentedrepresentation

− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 18 of 50

Page 20: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Deep learning-based approaches

Recent applications of deep learning to speech synthesis

• HMM-DBN (USTC/MSR [23, 24])

• DBN (CUHK [25])

• DNN (Google [26])

• DNN-GP (IBM [27])

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 20 of 50

Page 21: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

HMM-DBN [23, 24]

Acousticfeatures y

Acousticfeatures y

...

Linguisticfeatures x

yes no

yes no yes no

DBN i DBN j

• Decision tree-clustered HMM with DBN state-output distributions

• DBNs replaces GMMs

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 21 of 50

Page 22: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

DBN [25]

Acousticfeatures y

Linguisticfeatures x

h1

h2

h3

v

v

• DBN represents joint distribution of linguistic & acoustic features

• DBN replaces decision trees and GMMs

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 22 of 50

Page 23: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

DNN [26]

Acoustic

features y

Linguistic

features x

h1

h2

h3

• DNN represents conditional distribution of acoustic features givenlinguistic features

• DNN replaces decision trees and GMMs

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 23 of 50

Page 24: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

DNN-GP [27]

Gaussian

Process

Regression

Acoustic

features y

Linguistic

features x

h1

h2

h3

• Uses last hidden layer output as input for Gaussian Process (GP)regression

• Replaces last layer of DNN by GP regression

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 24 of 50

Page 25: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Comparison

cep: mel-cepstrum, ap: band aperiodicitiesx: linguistic features, y: acoustic features, c: cluster indexy | x: conditional distribution of y given x(y, x): joint distribution between x and y

HMM HMM DNN-GMM -DBN DBN DNN -GP

cep, ap, F0 spectra cep, ap, F0 cep, ap, F0 F0

parametric parametric parametric parametric non-parametric

y | c← c | x y | c← c | x (y,x) y | x y | h← h | x

HMM-GMM is more computationally efficients than others

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 25 of 50

Page 26: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Framework

Input layer Output layerHidden layers

TEXT

SPEECH Parametergeneration

...... ...

Waveformsynthesis

Inpu

t fea

ture

s in

clud

ing

bina

ry &

num

eric

feat

ures

at f

ram

e 1

Inpu

t fea

ture

s in

clud

ing

bina

ry &

num

eric

fe

atur

es a

t fra

me T

Textanalysis

Input featureextraction

...

Statistics (m

ean & var) of speech param

eter vector sequence

...

Binaryfeatures

NumericfeaturesDuration

featureFrame position

feature

Spectralfeatures

ExcitationfeaturesV/UVfeature

Durationprediction

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 27 of 50

Page 27: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Framework

Is this new? . . . no

• NN [28]

• RNN [29]

What’s the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 28 of 50

Page 28: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Experimental setup

Database US English female speaker

Training / test data 33000 & 173 sentences

Sampling rate 16 kHz

Analysis window 25-ms width / 5-ms shift

Linguistic 11 categorical featuresfeatures 25 numeric features

Acoustic 0–39 mel-cepstrumfeatures logF0, 5-band aperiodicity, ∆,∆2

HMM 5-state, left-to-right HSMM [30],topology MSD F0 [31], MDL [32]

DNN 1–5 layers, 256/512/1024/2048 units/layerarchitecture sigmoid, continuous F0 [33]

Postprocessing Postfiltering in cepstrum domain [34]

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 30 of 50

Page 29: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Preliminary experiments

• w/ vs w/o grouping questions (e.g., vowel, fricative)

− Grouping (OR operation) can be represented by NN− w/o grouping questions worked more efficiently

• How to encode numeric features for inputs

− Decision tree clustering uses binary questions− Neural network can have numerical values as inputs− Feeding numerical values directly worked more efficiently

• Removing silences

− Decision tree splits silence & speech at the top of the tree− Single neural network handles both of them− Neural network tries to reduce error for silence− Better to remove silence frames as preprocessing

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 31 of 50

Page 30: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

-1

0

1

0 100 200 300 400 500

5-t

h M

el-ce

pstr

um

Frame

Natural speechHMM (α=1)DNN (4x512)

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 32 of 50

Page 31: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Objective evaluations

• Objective measures

− Aperiodicity distortion (dB)− Voiced/Unvoiced error rates (%)− Mel-cepstral distortion (dB)− RMSE in logF0

• Sizes of decision trees in HMM systems were tuned by scaling(α) the penalty term in the MDL criterion

− α < 1: larger trees (more parameters)− α = 1: standard setup− α > 1: smaller trees (fewer parameters)

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 33 of 50

Page 32: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Aperiodicity distortion

DNN (512 units / layer)DNN (256 units / layer)

DNN (1024 units / layer) DNN (2048 units / layer)

HMM

α=1

α=16

α=4

1.20

1.22

1.24

1.26

1.28

1.30

1.32

1e+05 1e+06 1e+07

Ape

riodic

ity d

isto

rtio

n (

dB

)

Total number of parameters

4

1

2

3 5 4

1

2

3

5 4

1

2 3

54

1

2

35

α=0.375

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 34 of 50

Page 33: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

V/UV errors

3.2

3.4

3.6

3.8

4.0

4.2

4.4

4.6

1e+05 1e+06 1e+07

Voic

ed

/Unvoic

ed E

rror

Rate

(%

)

Total number of parameters

DNN (512 units / layer)DNN (256 units / layer)

DNN (1024 units / layer) DNN (2048 units / layer)

HMM

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 35 of 50

Page 34: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Mel-cepstral distortion

4.6

4.8

5.0

5.2

5.4

1e+05 1e+06 1e+07

Mel-cepstr

al dis

tort

ion (

dB

)

Total number of parameters

DNN (512 units / layer)DNN (256 units / layer)

DNN (1024 units / layer) DNN (2048 units / layer)

HMM

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 36 of 50

Page 35: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

RMSE in logF0

0.12

0.13

1e+05 1e+06 1e+07

RM

SE

in log F

0

Total number of parameters

DNN (512 units / layer)DNN (256 units / layer)

DNN (1024 units / layer) DNN (2048 units / layer)

HMM

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 37 of 50

Page 36: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Subjective evaluations

Compared HMM-based systems with DNN-based ones withsimilar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM DNN(α) (#layers × #units) Neutral p value z value

15.8 (16) 38.5 (4 × 256) 45.7 < 10−6 -9.916.1 (4) 27.2 (4 × 512) 56.8 < 10−6 -5.112.7 (1) 36.6 (4 × 1 024) 50.7 < 10−6 -11.5

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 38 of 50

Page 37: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

Conclusion

Deep learning in speech synthesis

• Aims to replace HMM with acoustic model based on deeparchitectures

• Different groups presented different architectures at ICASSP 2013

− HMM-DBN− DBN− DNN− DNN-GP

• DNN-based approach achieved reasonable performance

• Many possible future research topics

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 39 of 50

Page 38: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References I

[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura.

Simultaneous modeling of spectrum, pitch and duration in HMM-basedspeech synthesis.

In Proc. Eurospeech, pages 2347–2350, 1999.

[2] H. Zen, K. Tokuda, and A. Black.

Statistical parametric speech synthesis.

Speech Commun., 51(11):1039–1064, 2009.

[3] Y. Bengio.

Learning deep architectures for AI.

Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[4] Y. Qian, H. Liang, and F. Soong.

Generating natural F0 trajectory with additive trees.

In Proc. Interspeech, pages 2126–2129, 2008.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 40 of 50

Page 39: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References II

[5] K. Yu, H. Zen, F. Mairesse, and S. Young.

Context adaptive training with factorized decision trees for HMM-basedstatistical parametric speech synthesis.

Speech Commun., 53(6):914–923, 2011.

[6] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber.

Gradient flow in recurrent nets: the difficulty of learning long-termdependencies.

In S. Kremer and J. Kolen, editors, A field guide to dynamical recurrentneural networks. IEEE Press, 2001.

[7] R. Raina, A. Madhavan, and A. Ng.

Large-scale deep unsupervised learning using graphics processors.

In Proc. ICML, volume 9, pages 873–880, 2009.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 41 of 50

Page 40: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References III

[8] G. Hinton, S. Osindero, and Y.W. Teh.

A fast learning algorithm for deep belief nets.

Neural Computation, 18(7):1527–1554, 2006.

[9] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol.

Stacked denoising autoencoders: Learning useful representations in adeep network with a local denoising criterion.

Journal of Machine Learning Research, 11:3371–3408, 2010.

[10] G.E. Hinton.

Training products of experts by minimizing contrastive divergence.

Neural Computation, 14(8):1771–1800, 2002.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 42 of 50

Page 41: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References IV

[11] P Smolensky.

Information processing in dynamical systems: Foundations of harmonytheory.

In D. Rumelhard and J. McClelland, editors, Parallel DistributedProcessing, volume 1, chapter 6, pages 194–281. MIT Press, 1986.

[12] A. Krizhevsky, I. Sutskever, and G. Hinton.

Imagenet classification with deep convolutional neural networks.

In Proc. NIPS, pages 1106–1114, 2012.

[13] G. Hinton, L. Deng, D. Yu, G. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior,V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury.

Deep neural networks for acoustic modeling in speech recognition: Theshared views of four research groups.

IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 43 of 50

Page 42: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References V

[14] C. Rosenberg.

Improving photo search: a step across the semantic gap.

http://googleresearch.blogspot.co.uk/2013/06/

improving-photo-search-step-across.html.

[15] K. Yu.

https://plus.sandbox.google.com/103688557111379853702/

posts/fdw7EQX87Eq.

[16] V. Vanhoucke.

Speech recognition and deep learning.

http://googleresearch.blogspot.co.uk/2012/08/

speech-recognition-and-deep-learning.html.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 44 of 50

Page 43: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References VI

[17] Bing makes voice recognition on Windows Phone more accurate andtwice as fast.

http://www.bing.com/blogs/site_blogs/b/search/archive/

2013/06/17/dnn.aspx.

[18] R. Maia, H. Zen, and M. Gales.

Statistical parametric speech synthesis with joint estimation of acousticand excitation model parameters.

In Proc. ISCA SSW7, pages 88–93, 2010.

[19] K. Nakamura, K. Hashimoto, Y. Nankaku, and K. Tokuda.

Integration of acoustic modeling and mel-cepstral analysis forHMM-based speech synthesis.

In Proc. ICASSP, pages 7883–7887, 2013.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 45 of 50

Page 44: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References VII

[20] T. Toda and K. Tokuda.

Statistical approach to vocal tract transfer function estimation based onfactor analyzed trajectory hmm.

In Proc. ICASSP, pages 3925–3928, 2008.

[21] Y.-J. Wu and K. Tokuda.

Minimum generation error training with direct log spectral distortion onLSPs for HMM-based speech synthesis.

In Proc. Interspeech, pages 577–580, 2008.

[22] H. Zen, M. Gales, Y. Nankaku, and K. Tokuda.

Product of experts for statistical parametric speech synthesis.

IEEE Trans. Audio Speech Lang. Process., 20(3):794–805, 2012.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 46 of 50

Page 45: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References VIII

[23] Z.-H. Ling, L. Deng, and D. Yu.

Modeling spectral envelopes using restricted Boltzmann machines forstatistical parametric speech synthesis.

In Proc. ICASSP, pages 7825–7829, 2013.

[24] Z.-H. Ling, L. Deng, and D. Yu.

Modeling spectral envelopes using restricted Boltzmann machines anddeep belief networks for statistical parametric speech synthesis.

IEEE Trans. Audio Speech Lang. Process., 21(10):2129–2139, 2013.

[25] S. Kang, X. Qian, and H. Meng.

Multi-distribution deep belief network for speech synthesis.

In Proc. ICASSP, pages 8012–8016, 2013.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 47 of 50

Page 46: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References IX

[26] H. Zen, A. Senior, and M. Schuster.

Statistical parametric speech synthesis using deep neural networks.

In Proc. ICASSP, pages 7962–7966, 2013.

[27] R. Fernandez, A. Rendel, B. Ramabhadran, and R. Hoory.

F0 contour prediction with a deep belief network-Gaussian process hybridmodel.

In Proc. ICASSP, pages 6885–6889, 2013.

[28] O. Karaali, G. Corrigan, and I. Gerson.

Speech synthesis with neural networks.

In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[29] C. Tuerk and T. Robinson.

Speech synthesis using artificial network trained on cepstral coefficients.

In Proc. Eurospeech, pages 1713–1716, 1993.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 48 of 50

Page 47: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References X

[30] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura.

A hidden semi-Markov model-based speech synthesis system.

IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[31] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi.

Multi-space probability distribution HMM.

IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[32] K. Shinoda and T. Watanabe.

Acoustic modeling based on the MDL criterion for speech recognition.

In Proc. Eurospeech, pages 99–102, 1997.

[33] K. Yu and S. Young.

Continuous F0 modelling for HMM based statistical parametric speechsynthesis.

IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 49 of 50

Page 48: Deep Learning in Speech Synthesis · Deep Learning in Speech Synthesis Heiga Zen Google August 31st, 2013 ... statistical parametric speech synthesis Experiments Conclusion. Text-to-speech

References XI

[34] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura.

Incorporation of mixed excitation model and postfilter into HMM-basedtext-to-speech synthesis.

IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

Heiga Zen Deep Learning in Speech Synthesis August 31st, 2013 50 of 50