
Non-Autoregressive Neural Text-to-Speech

Kainan Peng∗ 1 Wei Ping∗ 1 Zhao Song∗ 1 Kexin Zhao∗ 1

Abstract

In this work, we propose ParaNet, a non-autoregressive seq2seq model that converts text to spectrogram. It is fully convolutional and brings 46.7 times speed-up over the lightweight Deep Voice 3 at synthesis, while obtaining reasonably good speech quality. ParaNet also produces stable alignment between text and speech on the challenging test sentences by iteratively improving the attention in a layer-by-layer manner. Furthermore, we build the parallel text-to-speech system and test various parallel neural vocoders, which can synthesize speech from text through a single feed-forward pass. We also explore a novel VAE-based approach to train the inverse autoregressive flow (IAF) based parallel vocoder from scratch, which avoids the need for distillation from a separately trained WaveNet as in previous work.

1. Introduction

Text-to-speech (TTS), also called speech synthesis, has long been a vital tool in a variety of applications, such as human-computer interaction, virtual assistants, and content creation. Traditional TTS systems are based on multi-stage hand-engineered pipelines (Taylor, 2009). In recent years, deep neural network based autoregressive models have attained state-of-the-art results, including high-fidelity audio synthesis (van den Oord et al., 2016), and much simpler seq2seq pipelines (Sotelo et al., 2017; Wang et al., 2017; Ping et al., 2018b). In particular, one of the most popular neural TTS pipelines (a.k.a. "end-to-end") consists of two components (Ping et al., 2018b; Shen et al., 2018): (i) an autoregressive seq2seq model that generates mel spectrogram from text, and (ii) an autoregressive neural vocoder (e.g., WaveNet) that synthesizes raw waveform from mel spectrogram.

*Equal contribution. 1 Baidu Research, 1195 Bordeaux Dr, Sunnyvale, CA. Speech samples can be found in: https://parallel-neural-tts-demo.github.io/. Correspondence to: Wei Ping <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

This pipeline requires much less expert knowledge and only needs pairs of audio and transcript as training data.

However, the autoregressive nature of these models makes them quite slow at synthesis, because they operate sequentially at a high temporal resolution of waveform samples and spectrogram frames. Most recently, several models have been proposed for parallel waveform generation (e.g., van den Oord et al., 2018; Ping et al., 2018a; Prenger et al., 2019; Kumar et al., 2019; Bińkowski et al., 2020; Ping et al., 2020). In the end-to-end pipeline, these models (e.g., ClariNet, WaveFlow) still rely on an autoregressive component to predict the spectrogram features (e.g., 100 frames per second). In the linguistic feature-based pipeline, the models (e.g., Parallel WaveNet, GAN-TTS) are conditioned on aligned linguistic features from a phoneme duration model and on F0 from a frequency model, which are recurrent or autoregressive models. Both of these TTS pipelines can be slow at synthesis on modern hardware optimized for parallel execution.

In this work, we present a fully parallel neural TTS system by proposing a non-autoregressive text-to-spectrogram model. Our major contributions are as follows:

1. We propose ParaNet, a non-autoregressive attention-based architecture for text-to-speech, which is fully convolutional and converts text to mel spectrogram. It runs 254.6 times faster than real-time at synthesis on a 1080 Ti GPU, and brings 46.7 times speed-up over its autoregressive counterpart (Ping et al., 2018b), while obtaining reasonably good speech quality using neural vocoders.

2. ParaNet distills the attention from the autoregressive text-to-spectrogram model, and iteratively refines the alignment between text and spectrogram in a layer-by-layer manner. It can produce more stable attentions than the autoregressive Deep Voice 3 (Ping et al., 2018b) on the challenging test sentences, because it does not have the discrepancy between teacher-forced training and autoregressive inference.

3. We build the fully parallel neural TTS system by combining ParaNet with a parallel neural vocoder, thus it can generate speech from text through a single feed-forward pass. We investigate several parallel vocoders, including the distilled IAF vocoder (Ping et al., 2018a) and WaveGlow (Prenger et al., 2019). To explore the possibility of training the IAF vocoder without distillation, we also propose an alternative approach, WaveVAE, which can be trained from scratch within the variational autoencoder (VAE) framework (Kingma & Welling, 2014).

We organize the rest of the paper as follows. Section 2 discusses related work. We introduce the non-autoregressive ParaNet architecture in Section 3. We discuss parallel neural vocoders in Section 4, and report experimental settings and results in Section 5. We conclude the paper in Section 6.

2. Related work

Neural speech synthesis has obtained state-of-the-art results and gained a lot of attention. Several neural TTS systems have been proposed, including WaveNet (van den Oord et al., 2016), Deep Voice (Arık et al., 2017a), Deep Voice 2 (Arık et al., 2017b), Deep Voice 3 (Ping et al., 2018b), Tacotron (Wang et al., 2017), Tacotron 2 (Shen et al., 2018), Char2Wav (Sotelo et al., 2017), VoiceLoop (Taigman et al., 2018), WaveRNN (Kalchbrenner et al., 2018), ClariNet (Ping et al., 2018a), and Transformer TTS (Li et al., 2019). In particular, Deep Voice 3, Tacotron and Char2Wav employ the seq2seq framework with an attention mechanism (Bahdanau et al., 2015), yielding a much simpler pipeline compared to the traditional multi-stage pipeline. Their excellent extensibility leads to promising results for several challenging tasks, such as voice cloning (Arik et al., 2018; Nachmani et al., 2018; Jia et al., 2018; Chen et al., 2019). All of these state-of-the-art systems are based on autoregressive models.

RNN-based autoregressive models, such as Tacotron and WaveRNN (Kalchbrenner et al., 2018), lack parallelism at both training and synthesis. CNN-based autoregressive models, such as Deep Voice 3 and WaveNet, enable parallel processing at training, but they still operate sequentially at synthesis, since each output element must be generated before it can be passed in as input at the next time-step. Recently, several non-autoregressive models have been proposed for neural machine translation. Gu et al. (2018) train a feed-forward neural network conditioned on fertility values, which are obtained from an external alignment system. Kaiser et al. (2018) propose a latent variable model for fast decoding, although it remains autoregressive between latent variables. Lee et al. (2018) iteratively refine the output sequence through a denoising autoencoder framework. Arguably, non-autoregressive models play an even more important role in text-to-speech, where the output speech spectrogram usually consists of hundreds of time-steps for a short text input with a few words. Our work is one of the first non-autoregressive seq2seq models for TTS and provides as much as 46.7 times speed-up at synthesis over its autoregressive counterpart (Ping et al., 2018b). There is a concurrent work, FastSpeech (Ren et al., 2019), which is based on the autoregressive Transformer TTS (Li et al., 2019) and can generate mel spectrograms in parallel. Our ParaNet is fully convolutional and lightweight: in contrast to FastSpeech, it has half the number of model parameters, requires a smaller batch size (16 vs. 64) for training, and provides faster synthesis speed (see Table 2 for a detailed comparison).

Flow-based generative models (Rezende & Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2017; Kingma & Dhariwal, 2018) transform a simple initial distribution into a more complex one by applying a series of invertible transformations. In previous work, flow-based models have obtained state-of-the-art results for parallel waveform synthesis (van den Oord et al., 2018; Ping et al., 2018a; Prenger et al., 2019; Kim et al., 2019; Yamamoto et al., 2019; Ping et al., 2020).

The variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) has been applied to representation learning of natural speech for years. It models either the generative process of the raw waveform (Chung et al., 2015; van den Oord et al., 2017) or of spectrograms (Hsu et al., 2019). In previous work, autoregressive or recurrent neural networks are employed as the decoder of the VAE (Chung et al., 2015; van den Oord et al., 2017), but they can be quite slow at synthesis. In this work, we employ a feed-forward IAF as the decoder, which enables parallel waveform synthesis.

3. Text-to-spectrogram model

Our parallel TTS system has two components: 1) a feed-forward text-to-spectrogram model, and 2) a parallel waveform synthesizer conditioned on mel spectrogram. In this section, we first present an autoregressive model derived from Deep Voice 3 (DV3) (Ping et al., 2018b). We then introduce ParaNet, a non-autoregressive text-to-spectrogram model (see Figure 1).

3.1. Autoregressive architecture

Our autoregressive model is based on DV3, a convolutional text-to-spectrogram architecture, which consists of three components:

• Encoder: A convolutional encoder, which takes text inputs and encodes them into an internal hidden representation.

• Decoder: A causal convolutional decoder, which decodes the encoder representation with an attention mechanism into log-mel spectrograms in an autoregressive manner with an ℓ1 loss. It starts with a 1×1 convolution to preprocess the input log-mel spectrograms.

• Converter: A non-causal convolutional post-processing network, which processes the hidden representation from the decoder using both past and future context information and predicts the log-linear spectrograms with an ℓ1 loss. It enables bidirectional processing.

All these components use the same 1-D convolution block with a gated linear unit as in DV3 (see Figure 2 (b) for more details). The major difference between our model and DV3 is the decoder architecture. The decoder of DV3 has multiple attention-based layers, where each layer consists of a causal convolution block followed by an attention block. To simplify the attention distillation described in Section 3.3.1, our autoregressive decoder has only one attention block, at its first layer. We find that reducing the number of attention blocks does not hurt the generated speech quality in general.

Figure 1. (a) Autoregressive seq2seq model. The dashed line depicts the autoregressive decoding of mel spectrogram at inference. (b) Non-autoregressive ParaNet model, which distills the attention from a pretrained autoregressive model.

Figure 2. (a) Architecture of ParaNet. Its encoder provides key and value as the textual representation. The first attention block in the decoder gets positional encoding as the query and is followed by non-causal convolution blocks and attention blocks. (b) The convolution block, which appears in both encoder and decoder, consists of a 1-D convolution with a gated linear unit (GLU) and a residual connection.
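To make the building block concrete, the following PyTorch snippet sketches the convolution block of Figure 2 (b): dropout, a 1-D convolution, a gated linear unit, and a residual connection. The channel size, kernel width, dropout rate, and the sqrt(0.5) residual scaling are illustrative assumptions in the spirit of the DV3 convention, not our exact training configuration.

```python
import math
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Dropout -> 1-D convolution -> GLU -> residual connection (Figure 2 (b))."""

    def __init__(self, channels: int, kernel_size: int, dropout: float = 0.05,
                 causal: bool = False):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.causal = causal
        # Pad so that the output keeps the input length; the causal (decoder)
        # variant trims its look-ahead frames below.
        self.padding = kernel_size - 1 if causal else (kernel_size - 1) // 2
        # The convolution doubles the channels so GLU can split them into a
        # linear part and a sigmoid gate.
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, padding=self.padding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        residual = x
        y = self.conv(self.dropout(x))
        if self.causal and self.padding > 0:
            y = y[:, :, :-self.padding]            # drop look-ahead frames
        y = nn.functional.glu(y, dim=1)            # gated linear unit over channels
        return (y + residual) * math.sqrt(0.5)     # scaled residual connection


# Example: a non-causal block as used in the ParaNet decoder.
block = ConvBlock(channels=64, kernel_size=7)
out = block(torch.randn(2, 64, 100))               # -> (2, 64, 100)
```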

3.2. Non-autoregressive architecture

The proposed ParaNet (see Figure 2) uses the same encoder architecture as the autoregressive model. The decoder of ParaNet, conditioned solely on the hidden representation from the encoder, predicts the entire sequence of log-mel spectrograms in a feed-forward manner. As a result, both its training and synthesis can be done in parallel. Specifically, we make the following major architecture modifications from the autoregressive text-to-spectrogram model to the non-autoregressive model:

1. Non-autoregressive decoder: Without the autoregressive generative constraint, the decoder can use non-causal convolution blocks to take advantage of future context information and to improve model performance. In addition to log-mel spectrograms, it also predicts log-linear spectrograms with an ℓ1 loss for slightly better performance. We also remove the 1×1 convolution at the beginning, because the decoder does not take log-mel spectrograms as input.

2. No converter: The non-autoregressive model removes the non-causal converter, since it already employs a non-causal decoder. Note that the major motivation for introducing the non-causal converter in DV3 is to refine the decoder predictions based on the bidirectional context information provided by non-causal convolutions.

3.3. Parallel attention mechanism

It is challenging for the feed-forward model to learn the accurate alignment between the input text and the output spectrogram. In particular, we need full parallelism within the attention mechanism. For example, location-sensitive attention (Chorowski et al., 2015; Shen et al., 2018) improves attention stability, but it operates sequentially at both training and synthesis, because it uses the cumulative attention weights from previous decoder time steps as an additional feature for the next time step. Previous non-autoregressive decoders rely on an external alignment system (Gu et al., 2018) or an autoregressive latent variable model (Kaiser et al., 2018).

Figure 3. Our ParaNet iteratively refines the attention alignment in a layer-by-layer way. One can see that the first-layer attention is mostly dominated by the positional encoding prior; it becomes more and more confident about the alignment in the subsequent layers.

In this work, we present several simple and effective techniques that obtain accurate and stable attention alignment. In particular, our non-autoregressive decoder can iteratively refine the attention alignment between text and mel spectrogram in a layer-by-layer manner, as illustrated in Figure 3. Specifically, the decoder adopts a dot-product attention mechanism and consists of K attention blocks (see Figure 2 (a)), where each attention block uses the per-time-step query vectors from its convolution block and the per-time-step key vectors from the encoder to compute the attention weights (Ping et al., 2018b). The attention block then computes context vectors as the weighted average of the value vectors from the encoder. The non-autoregressive decoder starts with an attention block in which the query vectors are solely the positional encoding (see Section 3.3.2 for details). The first attention block then provides the input for the convolution block at the next attention-based layer.
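For illustration, a single decoder attention block can be sketched as below. The projection layers, hidden size, and the way the positional encodings enter the block are assumptions made for the sake of a self-contained example; only the overall dot-product-attention structure follows the description above.

```python
import torch
from torch import nn


class AttentionBlock(nn.Module):
    """Single dot-product attention block of the ParaNet decoder (sketch)."""

    def __init__(self, query_dim: int, key_dim: int, hidden: int = 128):
        super().__init__()
        self.q_proj = nn.Linear(query_dim, hidden)
        self.k_proj = nn.Linear(key_dim, hidden)
        self.v_proj = nn.Linear(key_dim, hidden)
        self.out_proj = nn.Linear(hidden, query_dim)

    def forward(self, query, key, value, query_pe, key_pe):
        # query: (B, N, query_dim) from the preceding convolution block (or the
        # positional encoding alone in the first block); key/value: (B, M, key_dim)
        # from the encoder. Positional encodings bias the attention toward a
        # monotonic alignment.
        q = self.q_proj(query + query_pe)
        k = self.k_proj(key + key_pe)
        v = self.v_proj(value)
        scores = torch.bmm(q, k.transpose(1, 2)) / q.shape[-1] ** 0.5    # (B, N, M)
        weights = torch.softmax(scores, dim=-1)                          # rows sum to 1
        context = torch.bmm(weights, v)                                  # (B, N, hidden)
        # The weights are what the attention distillation loss in Section 3.3.1 uses.
        return self.out_proj(context), weights
```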

3.3.1. ATTENTION DISTILLATION

We use the attention alignments from a pretrained autoregressive model to guide the training of the non-autoregressive model. Specifically, we minimize the cross entropy between the attention distributions from the non-autoregressive ParaNet and a pretrained autoregressive teacher. We denote the attention weights from the non-autoregressive ParaNet as W^(k)_{i,j}, where i and j index the time-steps of the encoder and decoder respectively, and k refers to the k-th attention block within the decoder. Note that the attention weights {W^(k)_{i,j}}_{i=1}^{M} form a valid distribution. We compute the attention loss as the average cross entropy between the ParaNet and teacher's attention distributions:

$$ l_{\text{atten}} = -\frac{1}{KN} \sum_{k=1}^{K} \sum_{j=1}^{N} \sum_{i=1}^{M} W^{t}_{i,j} \log W^{(k)}_{i,j}, \qquad (1) $$

where W^t_{i,j} are the attention weights from the autoregressive teacher, and M and N are the lengths of the encoder and decoder, respectively. Our final loss function is a linear combination of l_atten and the ℓ1 losses from spectrogram predictions. We set the coefficient of l_atten to 4, and the other coefficients to 1 in all experiments.
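The attention distillation loss of Eq. (1) amounts to a few lines of tensor code. The sketch below assumes the student attention weights are stacked into a single (K, N, M) tensor and the teacher weights into an (N, M) tensor; names and shapes are illustrative.

```python
import torch


def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor,
                                eps: float = 1e-8) -> torch.Tensor:
    """
    student_attn: (K, N, M) attention weights from the K decoder attention
                  blocks of ParaNet; each row over the M encoder steps sums to 1.
    teacher_attn: (N, M) attention weights from the autoregressive teacher.
    Returns l_atten = -1/(K*N) * sum_{k,j,i} W^t_{i,j} * log W^(k)_{i,j}.
    """
    K, N, _ = student_attn.shape
    log_student = torch.log(student_attn + eps)                         # avoid log(0)
    cross_entropy = -(teacher_attn.unsqueeze(0) * log_student).sum(-1)  # (K, N)
    return cross_entropy.sum() / (K * N)


# Final loss sketch: l1_mel + l1_linear + 4.0 * attention_distillation_loss(W_student, W_teacher)
```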

3.3.2. POSITIONAL ENCODING

We use a similar positional encoding as in DV3 at every attention block (Ping et al., 2018b). The positional encoding is added to both the key and query vectors in the attention block, which forms an inductive bias for monotonic attention. Note that the non-autoregressive model relies solely on its attention mechanism to decode mel spectrograms from the encoded textual features, without any autoregressive input. This makes the positional encoding even more crucial in guiding the attention to follow a monotonic progression over time at the beginning of training. The positional encoding is h_p(i, k) = sin(ω_s · i / 10000^(k/d)) (for even i) and cos(ω_s · i / 10000^(k/d)) (for odd i), where i is the time-step index, k is the channel index, d is the total number of channels in the positional encoding, and ω_s is the position rate, which indicates the average slope of the line in the attention distribution and roughly corresponds to the speed of speech (a small sketch of this encoding follows the list below). We set ω_s in the following ways:

• For the autoregressive teacher, ω_s is set to one for the positional encoding of the query. For the key, it is set to the averaged ratio of the time-steps of the spectrograms to the time-steps of the textual features, which is around 6.3 across our training dataset. Taking into account that a reduction factor of 4 is used to simplify the learning of the attention mechanism (Wang et al., 2017), ω_s is simply set to 6.3/4 for the key at both training and synthesis.

• For ParaNet, ω_s is also set to one for the query, while ω_s for the key is calculated differently. At training, ω_s is set to the ratio of the lengths of the spectrogram and the text for each individual training instance, again divided by the reduction factor of 4. At synthesis, we need to specify the length of the output spectrogram and the corresponding ω_s, which actually controls the speech rate of the generated audio (see Section II on the demo website). In all of our experiments, we simply set ω_s to 6.3/4 as in the autoregressive model, and the length of the output spectrogram to 6.3/4 times the length of the input text.


Such a setup yields an initial attention in the form of a diagonal line and guides the non-autoregressive decoder to refine its attention layer by layer (see Figure 3).
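As a small illustration of the encoding above, the NumPy sketch below builds h_p(i, k) for a given position rate ω_s. The sin/cos split over channels follows the common sinusoidal-encoding convention and should be read as an approximation of the exact definition used in our implementation.

```python
import numpy as np


def positional_encoding(num_steps: int, dim: int, omega_s: float) -> np.ndarray:
    """Return h_p of shape (num_steps, dim) for position rate omega_s."""
    i = np.arange(num_steps)[:, None]                 # time-step index
    k = np.arange(dim)[None, :]                       # channel index
    angles = omega_s * i / np.power(10000.0, k / dim)
    enc = np.zeros((num_steps, dim))
    enc[:, 0::2] = np.sin(angles[:, 0::2])            # sin on even channels
    enc[:, 1::2] = np.cos(angles[:, 1::2])            # cos on odd channels
    return enc


# Queries use rate 1.0; keys use 6.3 divided by the reduction factor of 4,
# mirroring the settings described above.
query_pe = positional_encoding(num_steps=400, dim=128, omega_s=1.0)
key_pe = positional_encoding(num_steps=64, dim=128, omega_s=6.3 / 4)
```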

3.3.3. ATTENTION MASKING

Inspired by the attention masking in Deep Voice 3, we propose an attention masking scheme for the non-autoregressive ParaNet at synthesis:

• For each query from the decoder, instead of computing the softmax over the entire set of encoder key vectors, we compute the softmax only over a fixed window centered around the target position and going forward and backward several time-steps (e.g., 3). The target position is calculated as ⌊i_query × 4/6.3⌉, where i_query is the time-step index of the query vector and ⌊·⌉ is the rounding operator.

We observe that this strategy reduces serious attention errors such as repeating or skipping words, and also yields clearer pronunciations, thanks to its more condensed attention distribution. Note that this attention masking is shared across all attention blocks once it is generated, and it does not prevent the parallel synthesis of the non-autoregressive model.
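A rough sketch of this masking scheme at synthesis is given below: for each query position we keep only a window of encoder positions around the expected target position and mask the rest before the softmax. The tensor layout and masking value are illustrative assumptions; the window width of 3 and the 4/6.3 rate follow the text.

```python
import torch


def build_attention_mask(num_queries: int, num_keys: int,
                         rate: float = 4.0 / 6.3, width: int = 3) -> torch.Tensor:
    """Boolean mask of shape (num_queries, num_keys); True marks masked positions."""
    query_pos = torch.arange(num_queries, dtype=torch.float32)
    target = torch.round(query_pos * rate).long()        # expected encoder position
    key_pos = torch.arange(num_keys)
    distance = (key_pos[None, :] - target[:, None]).abs()
    return distance > width                              # outside the fixed window


def masked_attention_weights(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """scores: (num_queries, num_keys) raw dot-product attention scores."""
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1)
```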

4. Parallel waveform model

As an indispensable component in our parallel neural TTS system, the parallel waveform model converts the mel spectrogram predicted by ParaNet into the raw waveform. In this section, we discuss several existing parallel waveform models, and explore a new alternative in the system.

4.1. Flow-based waveform models

Inverse autoregressive flow (IAF) (Kingma et al., 2016) is a special type of normalizing flow where each invertible transformation is based on an autoregressive neural network. IAF performs synthesis in parallel and can easily reuse an expressive autoregressive architecture, such as WaveNet (van den Oord et al., 2016), which leads to state-of-the-art results for speech synthesis (van den Oord et al., 2018; Ping et al., 2018a). However, the likelihood evaluation in IAF is autoregressive and slow, so previous training methods rely on probability density distillation from a pretrained autoregressive WaveNet. This two-stage distillation process complicates the training pipeline and may introduce pathological optimization (Huang et al., 2019).

RealNVP (Dinh et al., 2017) and Glow (Kingma & Dhariwal, 2018) are different types of normalizing flows, where both synthesis and likelihood evaluation can be performed in parallel by enforcing bipartite architecture constraints. Most recently, both of them have been applied as parallel neural vocoders and can be trained from scratch (Prenger et al., 2019; Kim et al., 2019). However, these models are less expressive than their autoregressive and IAF counterparts; one can find a detailed analysis in the WaveFlow paper (Ping et al., 2020). In general, these bipartite flows require a larger number of layers and hidden units, which leads to a huge number of parameters. For example, a WaveGlow vocoder (Prenger et al., 2019) has 87.88M parameters, whereas the IAF vocoder has a much smaller footprint with only 2.17M parameters (Ping et al., 2018a), making it better suited for production deployment.

4.2. WaveVAE

Given the advantages of the IAF vocoder, it is interesting to investigate whether it can be trained without density distillation. One related work trains an IAF within an auto-encoder (Huang et al., 2019). Our method uses the VAE framework and is thus termed WaveVAE. In contrast to van den Oord et al. (2018) and Ping et al. (2018a), WaveVAE can be trained from scratch by jointly optimizing the encoder qφ(z|x, c) and decoder pθ(x|z, c), where z denotes the latent variables and c is the mel spectrogram conditioner. We omit c hereafter for concise notation.

4.2.1. ENCODER

The encoder of WaveVAE, qφ(z|x), is parameterized by a Gaussian autoregressive WaveNet (Ping et al., 2018a) that maps the ground truth audio x into a latent representation z of the same length. Specifically, the Gaussian WaveNet models x_t given the previous samples x_{<t} as x_t ∼ N(μ(x_{<t}; φ), σ(x_{<t}; φ)), where the mean μ(x_{<t}; φ) and scale σ(x_{<t}; φ) are predicted by the WaveNet. The encoder posterior is constructed as

$$ q_\phi(z \mid x) = \prod_t q_\phi(z_t \mid x_{\leq t}), \quad \text{where} \quad q_\phi(z_t \mid x_{\leq t}) = \mathcal{N}\Big(\frac{x_t - \mu(x_{<t}; \phi)}{\sigma(x_{<t}; \phi)}, \; \varepsilon\Big). $$

Note that the mean μ(x_{<t}; φ) and scale σ(x_{<t}; φ) are applied to "whiten" the posterior distribution. We introduce a trainable scalar ε > 0 to decouple the global variation, which makes the optimization process easier. Given the observed x, qφ(z|x) admits parallel sampling of the latents z. One can draw a connection between the encoder of WaveVAE and the teacher model of ClariNet, as both of them use a Gaussian WaveNet to guide the training of the IAF for parallel wave generation.

4.2.2. DECODER

Our decoder pθ(x|z) is parameterized by the one-step-ahead predictions from an IAF (Ping et al., 2018a). We let z^(0) = z and apply a stack of IAF transformations from z^(0) → ... → z^(i) → ... → z^(n), where each transformation z^(i) = f(z^(i−1); θ) is defined as

$$ z^{(i)} = z^{(i-1)} \cdot \sigma^{(i)} + \mu^{(i)}, \qquad (2) $$

where μ^(i)_t = μ(z^(i−1)_{<t}; θ) and σ^(i)_t = σ(z^(i−1)_{<t}; θ) are shifting and scaling variables modeled by a Gaussian WaveNet. One can show that, given z^(0) ∼ N(μ^(0), σ^(0)) from the Gaussian prior or encoder, the per-step p(z^(n)_t | z^(0)_{<t}) also follows a Gaussian with scale and mean

$$ \sigma^{\text{tot}} = \prod_{i=0}^{n} \sigma^{(i)}, \qquad \mu^{\text{tot}} = \sum_{i=0}^{n} \mu^{(i)} \prod_{j > i}^{n} \sigma^{(j)}. \qquad (3) $$

Lastly, we set x = ε · σ^tot + μ^tot, where ε ∼ N(0, I). Thus, pθ(x | z) = N(μ^tot, σ^tot). For the generative process, we use the standard Gaussian prior p(z) = N(0, I).
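The composition of the IAF flows in Eq. (2) into the totals of Eq. (3) reduces to tracking a running affine transform. The sketch below assumes the per-flow shifts and scales have already been computed by the corresponding Gaussian WaveNets and are passed in as tensors.

```python
from typing import List, Tuple
import torch


def collapse_iaf(mus: List[torch.Tensor],
                 sigmas: List[torch.Tensor]) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    mus[i], sigmas[i]: shift and scale of the i-th transformation, i = 0..n,
    where i = 0 corresponds to the Gaussian prior or encoder sample.
    Returns (mu_tot, sigma_tot) of Eq. (3):
        sigma_tot = prod_i sigma^(i),
        mu_tot    = sum_i mu^(i) * prod_{j>i} sigma^(j).
    """
    mu_tot = torch.zeros_like(mus[0])
    sigma_tot = torch.ones_like(sigmas[0])
    # Applying z <- z * sigma^(i) + mu^(i) for i = 0..n and tracking the running
    # affine transform yields exactly the totals above.
    for mu_i, sigma_i in zip(mus, sigmas):
        mu_tot = mu_tot * sigma_i + mu_i
        sigma_tot = sigma_tot * sigma_i
    return mu_tot, sigma_tot


# Sampling: x = eps * sigma_tot + mu_tot with eps ~ N(0, I),
# i.e. p_theta(x | z) = N(mu_tot, sigma_tot).
```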

4.2.3. TRAINING OBJECTIVE

We maximize the evidence lower bound (ELBO) for the observed x in the VAE,

$$ \max_{\phi, \theta} \; \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \text{KL}\big(q_\phi(z|x) \,\|\, p(z)\big), \qquad (4) $$

where the KL divergence can be calculated in closed form as both qφ(z|x) and p(z) are Gaussians:

$$ \text{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) = \sum_t \bigg[ \log \frac{1}{\varepsilon} + \frac{1}{2}\Big( \varepsilon^2 - 1 + \Big(\frac{x_t - \mu(x_{<t})}{\sigma(x_{<t})}\Big)^2 \Big) \bigg]. $$

The reconstruction term in Eq. (4) is intractable to compute exactly. We perform stochastic optimization by drawing a sample z from the encoder qφ(z|x) through the reparameterization trick, and evaluating the likelihood log pθ(x|z). To avoid "posterior collapse", in which the posterior distribution qφ(z|x) quickly collapses to the white noise prior p(z) at the early stage of training, we apply an annealing strategy for the KL divergence, where its weight is gradually increased from 0 to 1 via a sigmoid function (Bowman et al., 2016). In this way, the encoder can encode sufficient information into the latent representation early in training, and the latent representation is then gradually regularized by increasing the weight of the KL divergence.
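For concreteness, the closed-form KL term and a sigmoid annealing schedule for its weight can be written as below. The annealing midpoint and steepness are illustrative assumptions; only the KL expression itself follows the formula above.

```python
import torch


def kl_posterior_prior(x: torch.Tensor, mu: torch.Tensor, sigma: torch.Tensor,
                       log_eps: torch.Tensor) -> torch.Tensor:
    """KL( N((x - mu)/sigma, eps) || N(0, I) ), summed over time, averaged over batch."""
    eps = log_eps.exp()
    whitened = (x - mu) / sigma
    kl_per_step = -torch.log(eps) + 0.5 * (eps ** 2 - 1.0 + whitened ** 2)
    return kl_per_step.sum(dim=-1).mean()


def kl_weight(step: int, midpoint: int = 50_000, steepness: float = 1e-4) -> float:
    """Sigmoid annealing of the KL weight from ~0 to 1 over training."""
    return torch.sigmoid(torch.tensor(steepness * (step - midpoint))).item()


# Negative-ELBO sketch:
# loss = -log_likelihood + kl_weight(step) * kl_posterior_prior(x, mu, sigma, log_eps)
```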

STFT loss: Similar to Ping et al. (2018a), we also add a short-time Fourier transform (STFT) loss to improve the quality of the synthesized speech. We define the STFT loss as the summation of the ℓ2 loss on the magnitudes of the STFT and the ℓ1 loss on the log-magnitudes of the STFT between the output audio and the ground truth audio (Ping et al., 2018a; Arık et al., 2019; Wang et al., 2019). For the STFT, we use a 12.5 ms frame shift, a 50 ms Hanning window, and an FFT size of 2048. We consider two STFT losses in our objective: (i) the STFT loss between the ground truth audio and the reconstructed audio using the encoder qφ(z|x); (ii) the STFT loss between the ground truth audio and the synthesized audio using the prior p(z), with the purpose of reducing the gap between reconstruction and synthesis. Our final loss is a linear combination of the VAE objective in Eq. (4) and the STFT losses. The corresponding coefficients are simply set to one in all of our experiments.
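A minimal sketch of this STFT loss is given below, using the 50 ms Hann window, 12.5 ms hop, and FFT size 2048 at 24 kHz described above. The normalization and the small clamping constant are assumptions for numerical stability.

```python
import torch


def stft_magnitude(x: torch.Tensor, n_fft: int = 2048,
                   hop: int = 300, win: int = 1200) -> torch.Tensor:
    """x: (batch, samples) waveform at 24 kHz; hop = 300 (12.5 ms), win = 1200 (50 ms)."""
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop, win_length=win,
                      window=torch.hann_window(win, device=x.device),
                      return_complex=True)
    return spec.abs()


def stft_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L2 on STFT magnitudes plus L1 on log-magnitudes, as described above."""
    pred_mag, target_mag = stft_magnitude(pred), stft_magnitude(target)
    l2_mag = torch.mean((pred_mag - target_mag) ** 2)
    l1_log = torch.mean(torch.abs(torch.log(pred_mag + 1e-7) - torch.log(target_mag + 1e-7)))
    return l2_mag + l1_log
```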

5. Experiment

In this section, we present several experiments to evaluate the proposed ParaNet and WaveVAE.

5.1. Settings

Data: In our experiment, we use an internal English speech dataset containing about 20 hours of speech data from a female speaker with a sampling rate of 48 kHz. We downsample the audios to 24 kHz.

Text-to-spectrogram models: For both ParaNet and Deep Voice 3 (DV3), we use the mixed representation of characters and phonemes (Ping et al., 2018b). The default hyperparameters of ParaNet and DV3 are provided in Table 1. Both ParaNet and DV3 are trained for 500K steps using the Adam optimizer (Kingma & Ba, 2015). We find that a larger kernel width and deeper layers generally help improve the performance of ParaNet. In terms of the number of parameters, our ParaNet (17.61M params) is 2.57× larger than Deep Voice 3 (6.85M params) and 1.71× smaller than FastSpeech (30.1M params) (Ren et al., 2019). We use an open source reimplementation of FastSpeech¹ by adapting the hyperparameters to handle the 24 kHz dataset.

Neural vocoders: In this work, we compare various neural vocoders paired with text-to-spectrogram models, including WaveNet (van den Oord et al., 2016), ClariNet (Ping et al., 2018a), WaveVAE, and WaveGlow (Prenger et al., 2019). We train all neural vocoders on 8 Nvidia 1080 Ti GPUs using randomly chosen 0.5s audio clips.

We train two 20-layer WaveNets with residual channel 256, conditioned on the predicted mel spectrograms from ParaNet and DV3, respectively. We apply two layers of convolution blocks to process the predicted mel spectrogram, and use two layers of transposed 2-D convolution (in time and frequency), interleaved with leaky ReLU (α = 0.4), to upsample the outputs from the frame level to the sample level. We use the Adam optimizer (Kingma & Ba, 2015) with a batch size of 8 and a learning rate of 0.001 at the beginning, which is annealed by half every 200K steps. We train the models for 1M steps.

We use the same IAF architecture as ClariNet (Ping et al., 2018a).

¹ https://github.com/xcmyz/FastSpeech


Table 1. Hyperparameters of the autoregressive text-to-spectrogram model and the non-autoregressive ParaNet in the experiments.

Hyperparameter                               Autoregressive Model    Non-autoregressive Model
FFT Size                                     2048                    2048
FFT Window Size / Shift                      1200 / 300              1200 / 300
Audio Sample Rate                            24000                   24000
Reduction Factor r                           4                       4
Mel Bands                                    80                      80
Character Embedding Dim.                     256                     256
Encoder Layers / Conv. Width / Channels      7 / 5 / 64              7 / 9 / 64
Decoder PreNet Affine Size                   128, 256                N/A
Decoder Layers / Conv. Width                 4 / 5                   17 / 7
Attention Hidden Size                        128                     128
Position Weight / Initial Rate               1.0 / 6.3               1.0 / 6.3
PostNet Layers / Conv. Width / Channels      5 / 5 / 256             N/A
Dropout Keep Probability                     0.95                    1.0
ADAM Learning Rate                           0.001                   0.001
Batch Size                                   16                      16
Max Gradient Norm                            100                     100
Gradient Clipping Max. Value                 5.0                     5.0
Total Number of Parameters                   6.85M                   17.61M

Table 2. The model footprint, synthesis time for 1 second of speech (on a 1080 Ti with FP32), and the 5-scale Mean Opinion Score (MOS) ratings with 95% confidence intervals.

Neural TTS system                    # parameters                 synthesis time (ms)    MOS score
DV3 + WaveNet                        6.85 + 9.08 = 15.93 M        181.8 + 5×10^5         4.09 ± 0.26
ParaNet + WaveNet                    17.61 + 9.08 = 26.69 M       3.9 + 5×10^5           4.01 ± 0.24
DV3 + ClariNet                       6.85 + 2.17 = 9.02 M         181.8 + 64.9           3.88 ± 0.25
ParaNet + ClariNet                   17.61 + 2.17 = 19.78 M       3.9 + 64.9             3.62 ± 0.23
DV3 + WaveVAE                        6.85 + 2.17 = 9.02 M         181.8 + 64.9           3.70 ± 0.29
ParaNet + WaveVAE                    17.61 + 2.17 = 19.78 M       3.9 + 64.9             3.31 ± 0.32
DV3 + WaveGlow                       6.85 + 87.88 = 94.73 M       181.8 + 117.6          3.92 ± 0.24
ParaNet + WaveGlow                   17.61 + 87.88 = 105.49 M     3.9 + 117.6            3.27 ± 0.28
FastSpeech (re-impl) + WaveGlow      31.77 + 87.88 = 119.65 M     6.2 + 117.6            3.56 ± 0.26

It consists of four stacked Gaussian IAF blocks, which are parameterized by [10, 10, 10, 30]-layer WaveNets, respectively, with 64 residual & skip channels and filter size 3 in the dilated convolutions. The IAF is conditioned on log-mel spectrograms with two layers of transposed 2-D convolution as in ClariNet. We use the same teacher-student setup for ClariNet as in Ping et al. (2018a) and train a 20-layer Gaussian autoregressive WaveNet as the teacher model. For the encoder in WaveVAE, we also use a 20-layer Gaussian WaveNet conditioned on log-mel spectrograms. For the decoder, we use the same architecture as the distilled IAF. The encoder and decoder of WaveVAE share the same conditioner network. Both the distilled IAF and WaveVAE are trained on ground-truth mel spectrograms. We use the Adam optimizer with 1000K steps for the distilled IAF. We train WaveVAE for 400K steps because it converges much faster. The learning rate is set to 0.001 at the beginning and annealed by half every 200K steps for both models.

We use the open source implementation of WaveGlow with default hyperparameters (residual channel 256)², except that we change the sampling rate from 22.05 kHz to 24 kHz, the FFT window length from 1024 to 1200, and the FFT window shift from 256 to 300 to handle the 24 kHz dataset. The model is trained for 2M steps.

5.2. Results

Speech quality: We use the crowdMOS toolkit (Ribeiro et al., 2011) for subjective Mean Opinion Score (MOS) evaluation. We report the MOS results in Table 2. ParaNet can provide comparable quality of speech as the autoregressive DV3 using the WaveNet vocoder (MOS: 4.09 vs. 4.01).

² https://github.com/NVIDIA/waveglow


Table 3. Attention error counts for text-to-spectrogram models on the 100-sentence test set. One or more mispronunciations, skips, and repeats count as a single mistake per utterance. The non-autoregressive ParaNet (17-layer decoder) with attention mask obtains the fewest attention errors in total. For the ablation study, we include the results for two additional ParaNet models with 6 and 12 decoder layers, denoted as ParaNet-6 and ParaNet-12, respectively.

Model           Attention mask    Repeat    Mispronounce    Skip    Total
Deep Voice 3    No                12        10              15      37
ParaNet         No                1         4               7       12
ParaNet-12      No                5         7               5       17
ParaNet-6       No                4         11              11      26
Deep Voice 3    Yes               1         4               3       8
ParaNet         Yes               2         4               0       6
ParaNet-12      Yes               4         6               2       12
ParaNet-6       Yes               3         10              3       16

When we use the ClariNet vocoder, ParaNet can still provide reasonably good speech quality (MOS: 3.62) as a fully feed-forward TTS system. WaveVAE obtains worse results than the distilled IAF vocoder, but it can be trained from scratch and simplifies the training pipeline. When conditioned on predicted mel spectrograms, WaveGlow tends to produce constant frequency artifacts. To remedy this, we applied the denoising function with strength 0.1, as recommended in the WaveGlow repository. It is effective when the predicted mel spectrograms are from DV3, but not when they are from ParaNet, and as a result the MOS score degrades seriously. We add the comparison with FastSpeech after the paper submission. Because it is costly to relaunch the MOS evaluations of all the models, we perform a separate MOS evaluation for FastSpeech. Note that the group of human raters can be different on Mechanical Turk, and the subjective scores may not be directly comparable. One can find the synthesized speech samples at: https://parallel-neural-tts-demo.github.io/.

Synthesis speed: We test the synthesis speed of all models on an NVIDIA GeForce GTX 1080 Ti with 32-bit floating point (FP32) arithmetic. We compare ParaNet with the autoregressive DV3 in terms of inference latency. We construct a custom 15-sentence test set (see Appendix A) and run inference for 50 runs on each of the 15 sentences (batch size is set to 1). The average audio duration of the utterances is 6.11 seconds. The average inference latencies over 50 runs and 15 sentences are 0.024 and 1.12 seconds for ParaNet and DV3, respectively. Hence, our ParaNet runs 254.6 times faster than real-time and brings about 46.7 times speed-up over its small-footprint autoregressive counterpart at synthesis. It also runs 1.58 times faster than FastSpeech. We summarize the synthesis speed of the TTS systems in Table 2. One can observe that the latency bottleneck is the autoregressive text-to-spectrogram model when the system uses a parallel neural vocoder. The ClariNet and WaveVAE vocoders have a much smaller footprint and faster synthesis speed than WaveGlow.

Attention error analysis: In autoregressive models, there is a noticeable discrepancy between teacher-forced training and autoregressive inference, which can yield accumulated errors along the generated sequence at synthesis (Bengio et al., 2015). In neural TTS, this discrepancy leads to serious attention errors at autoregressive inference, including (i) repeated words, (ii) mispronunciations, and (iii) skipped words (see Ping et al. (2018b) for detailed examples), which is a critical problem for online deployment of attention-based neural TTS systems. We perform an attention error analysis for our non-autoregressive ParaNet on a 100-sentence test set (see Appendix B), which includes particularly challenging cases from deployed TTS systems (e.g., dates, acronyms, URLs, repeated words, proper nouns, and foreign words). In Table 3, we find that the non-autoregressive ParaNet has far fewer attention errors than its autoregressive counterpart at synthesis (12 vs. 37) without the attention mask. Although our ParaNet distills the (teacher-forced) attentions from an autoregressive model, it only takes textual inputs at both training and synthesis and therefore does not suffer from a similar discrepancy. In previous work, attention masking was applied to enforce monotonic attention and reduce attention errors, and was demonstrated to be effective in Deep Voice 3 (Ping et al., 2018b). We find that our non-autoregressive ParaNet still has fewer attention errors than the autoregressive DV3 (6 vs. 8) when both of them use attention masking.

5.3. Ablation study

We perform ablation studies to verify the effectiveness of several techniques used in ParaNet, including attention distillation, positional encoding, and stacking decoder layers to refine the attention alignment in a layer-by-layer manner. We evaluate the performance of a non-autoregressive ParaNet model trained without attention distillation and find that it fails to learn meaningful attention alignment; the synthesized audios are unintelligible and mostly pure noise. Similarly, we train another non-autoregressive ParaNet model without adding positional encoding in the attention block. The resulting model only learns a very blurry attention alignment and cannot synthesize intelligible speech. Finally, we train two non-autoregressive ParaNet models with 6 and 12 decoder layers, respectively, and compare them with the default non-autoregressive ParaNet model, which has 17 decoder layers. We conduct the same attention error analysis on the 100-sentence test set and report the results in Table 3. We find that increasing the number of decoder layers for the non-autoregressive ParaNet reduces the total number of attention errors, both with and without applying the attention mask at synthesis.

6. Conclusion

In this work, we build a feed-forward neural TTS system by proposing a non-autoregressive text-to-spectrogram model. The proposed ParaNet obtains reasonably good speech quality and brings 46.7 times speed-up over its autoregressive counterpart at synthesis. We also compare various neural vocoders within the TTS system. Our results suggest that parallel vocoders are generally less robust than the WaveNet vocoder when the front-end acoustic model is non-autoregressive. As a result, it would be interesting to investigate small-footprint and robust parallel neural vocoders (e.g., WaveFlow) in future work.

References

Arık, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Raiman, J., Sengupta, S., and Shoeybi, M. Deep Voice: Real-time neural text-to-speech. In ICML, 2017a.

Arık, S. O., Diamos, G., Gibiansky, A., Miller, J., Peng, K., Ping, W., Raiman, J., and Zhou, Y. Deep Voice 2: Multi-speaker neural text-to-speech. In NIPS, 2017b.

Arik, S. O., Chen, J., Peng, K., Ping, W., and Zhou, Y. Neural voice cloning with a few samples. In NIPS, 2018.

Arık, S. Ö., Jun, H., and Diamos, G. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 2019.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, 2015.

Bińkowski, M., Donahue, J., Dieleman, S., Clark, A., Elsen, E., Casagrande, N., Cobo, L. C., and Simonyan, K. High fidelity speech synthesis with adversarial networks. In ICLR, 2020.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, 2016.

Chen, Y., Assael, Y., Shillingford, B., Budden, D., Reed, S., Zen, H., Wang, Q., Cobo, L. C., Trask, A., Laurie, B., et al. Sample efficient adaptive text-to-speech. In ICLR, 2019.

Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pp. 577–585, 2015.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In NIPS, 2015.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In ICLR, 2017.

Gu, J., Bradbury, J., Xiong, C., Li, V. O., and Socher, R. Non-autoregressive neural machine translation. In ICLR, 2018.

Hsu, W.-N., Zhang, Y., Weiss, R. J., Zen, H., Wu, Y., Wang, Y., Cao, Y., Jia, Y., Chen, Z., Shen, J., et al. Hierarchical generative modeling for controllable speech synthesis. In ICLR, 2019.

Huang, C.-W., Ahmed, F., Kumar, K., Lacoste, A., and Courville, A. Probability distillation: A caveat and alternatives. In UAI, 2019.

Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., Chen, Z., Nguyen, P., Pang, R., and Moreno, I. L. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In NIPS, 2018.

Kaiser, Ł., Roy, A., Vaswani, A., Pamar, N., Bengio, S., Uszkoreit, J., and Shazeer, N. Fast decoding in sequence models using discrete latent variables. In ICML, 2018.

Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., Oord, A. v. d., Dieleman, S., and Kavukcuoglu, K. Efficient neural audio synthesis. In ICML, 2018.

Kim, S., Lee, S.-g., Song, J., and Yoon, S. FloWaveNet: A generative flow for raw audio. In ICML, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In NIPS, 2018.


Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving variational inference with inverse autoregressive flow. In NIPS, 2016.

Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Brébisson, A., Bengio, Y., and Courville, A. C. MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems, pp. 14910–14921, 2019.

Lee, J., Mansimov, E., and Cho, K. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP, 2018.

Li, N., Liu, S., Liu, Y., Zhao, S., Liu, M., and Zhou, M. Neural speech synthesis with transformer network. In AAAI, 2019.

Nachmani, E., Polyak, A., Taigman, Y., and Wolf, L. Fitting new speakers based on a short untranscribed sample. In ICML, 2018.

Ping, W., Peng, K., and Chen, J. ClariNet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281, 2018a.

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. In ICLR, 2018b.

Ping, W., Peng, K., Zhao, K., and Song, Z. WaveFlow: A compact flow-based model for raw audio. In ICML, 2020.

Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP, 2019.

Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. FastSpeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3165–3174, 2019.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In ICML, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

Ribeiro, F., Florêncio, D., Zhang, C., and Seltzer, M. CrowdMOS: An approach for crowdsourcing mean opinion score studies. In ICASSP, 2011.

Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP, 2018.

Sotelo, J., Mehri, S., Kumar, K., Santos, J. F., Kastner, K., Courville, A., and Bengio, Y. Char2Wav: End-to-end speech synthesis. In ICLR Workshop, 2017.

Taigman, Y., Wolf, L., Polyak, A., and Nachmani, E. VoiceLoop: Voice fitting and synthesis via a phonological loop. In ICLR, 2018.

Taylor, P. Text-to-Speech Synthesis. Cambridge University Press, 2009.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In NIPS, 2017.

van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. v. d., Lockhart, E., Cobo, L. C., Stimberg, F., et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In ICML, 2018.

Wang, X., Takaki, S., and Yamagishi, J. Neural source-filter-based waveform model for statistical parametric speech synthesis. In ICASSP, 2019.

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q., Agiomyrgiannakis, Y., Clark, R., and Saurous, R. A. Tacotron: Towards end-to-end speech synthesis. In Interspeech, 2017.

Yamamoto, R., Song, E., and Kim, J.-M. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. arXiv preprint arXiv:1904.04472, 2019.


Appendix

A. 15-Sentence Test Set

The 15 sentences used to quantify the inference speed-up in Table 2 are listed below (note that % corresponds to a pause):

1. WHEN THE SUNLIGHT STRIKES RAINDROPS IN THE AIR%THEY ACT AS A PRISM AND FORM A RAINBOW%.
2. THESE TAKE THE SHAPE OF A LONG ROUND ARCH%WITH ITS PATH HIGH ABOVE%AND ITS TWO ENDS APPARENTLY BEYOND THE HORIZON%.
3. WHEN A MAN LOOKS FOR SOMETHING BEYOND HIS REACH%HIS FRIENDS SAY HE IS LOOKING FOR THE POT OF GOLD AT THE END OF THE RAINBOW%.
4. IF THE RED OF THE SECOND BOW FALLS UPON THE GREEN OF THE FIRST%THE RESULT IS TO GIVE A BOW WITH AN ABNORMALLY WIDE YELLOW BAND%.
5. THE ACTUAL PRIMARY RAINBOW OBSERVED IS SAID TO BE THE EFFECT OF SUPER IMPOSITION OF A NUMBER OF BOWS%.
6. THE DIFFERENCE IN THE RAINBOW DEPENDS CONSIDERABLY UPON THE SIZE OF THE DROPS%.
7. IN THIS PERSPECTIVE%WE HAVE REVIEWED SOME OF THE MANY WAYS IN WHICH NEUROSCIENCE HAS MADE FUNDAMENTAL CONTRIBUTIONS%.
8. IN ENHANCING AGENT CAPABILITIES%IT WILL BE IMPORTANT TO CONSIDER OTHER SALIENT PROPERTIES OF THIS PROCESS IN HUMANS%.
9. IN A WAY THAT COULD SUPPORT DISCOVERY OF SUBGOALS AND HIERARCHICAL PLANNING%.
10. DISTILLING INTELLIGENCE INTO AN ALGORITHMIC CONSTRUCT AND COMPARING IT TO THE HUMAN BRAIN MIGHT YIELD INSIGHTS%.
11. THE VAULT THAT WAS SEARCHED HAD IN FACT BEEN EMPTIED EARLIER THAT SAME DAY%.
12. ANT LIVES NEXT TO GRASSHOPPER%ANT SAYS%I LIKE TO WORK EVERY DAY%.
13. YOUR MEANS OF TRANSPORT FULFIL ECONOMIC REQUIREMENTS IN YOUR CHOSEN COUNTRY%.
14. SLEEP STILL FOGGED MY MIND AND ATTEMPTED TO FIGHT BACK THE PANIC%.
15. SUDDENLY%I SAW TWO FAST AND FURIOUS FEET DRIBBLING THE BALL TOWARDS MY GOAL%.

B. 100-Sentence Test Set

The 100 sentences used to quantify the results in Table 3 are listed below (note that % corresponds to a pause):

1. A B C%.
2. X Y Z%.
3. HURRY%.
4. WAREHOUSE%.
5. REFERENDUM%.
6. IS IT FREE%?
7. JUSTIFIABLE%.
8. ENVIRONMENT%.
9. A DEBT RUNS%.
10. GRAVITATIONAL%.
11. CARDBOARD FILM%.
12. PERSON THINKING%.
13. PREPARED KILLER%.
14. AIRCRAFT TORTURE%.
15. ALLERGIC TROUSER%.
16. STRATEGIC CONDUCT%.
17. WORRYING LITERATURE%.
18. CHRISTMAS IS COMING%.
19. A PET DILEMMA THINKS%.
20. HOW WAS THE MATH TEST%?


21. GOOD TO THE LAST DROP%.
22. AN M B A AGENT LISTENS%.
23. A COMPROMISE DISAPPEARS%.
24. AN AXIS OF X Y OR Z FREEZES%.
25. SHE DID HER BEST TO HELP HIM%.
26. A BACKBONE CONTESTS THE CHAOS%.
27. TWO A GREATER THAN TWO N NINE%.
28. DON’T STEP ON THE BROKEN GLASS%.
29. A DAMNED FLIPS INTO THE PATIENT%.
30. A TRADE PURGES WITHIN THE B B C%.
31. I’D RATHER BE A BIRD THAN A FISH%.
32. I HEAR THAT NANCY IS VERY PRETTY%.
33. I WANT MORE DETAILED INFORMATION%.
34. PLEASE WAIT OUTSIDE OF THE HOUSE%.
35. N A S A EXPOSURE TUNES THE WAFFLE%.
36. A MIST DICTATES WITHIN THE MONSTER%.
37. A SKETCH ROPES THE MIDDLE CEREMONY%.
38. EVERY FAREWELL EXPLODES THE CAREER%.
39. SHE FOLDED HER HANDKERCHIEF NEATLY%.
40. AGAINST THE STEAM CHOOSES THE STUDIO%.
41. ROCK MUSIC APPROACHES AT HIGH VELOCITY%.
42. NINE ADAM BAYE STUDY ON THE TWO PIECES%.
43. AN UNFRIENDLY DECAY CONVEYS THE OUTCOME%.
44. ABSTRACTION IS OFTEN ONE FLOOR ABOVE YOU%.
45. A PLAYED LADY RANKS ANY PUBLICIZED PREVIEW%.
46. HE TOLD US A VERY EXCITING ADVENTURE STORY%.
47. ON AUGUST TWENTY EIGTH%MARY PLAYS THE PIANO%.
48. INTO A CONTROLLER BEAMS A CONCRETE TERRORIST%.
49. I OFTEN SEE THE TIME ELEVEN ELEVEN ON CLOCKS%.
50. IT WAS GETTING DARK%AND WE WEREN’T THERE YET%.
51. AGAINST EVERY RHYME STARVES A CHORAL APPARATUS%.
52. EVERYONE WAS BUSY%SO I WENT TO THE MOVIE ALONE%.
53. I CHECKED TO MAKE SURE THAT HE WAS STILL ALIVE%.
54. A DOMINANT VEGETARIAN SHIES AWAY FROM THE G O P%.
55. JOE MADE THE SUGAR COOKIES%SUSAN DECORATED THEM%.
56. I WANT TO BUY A ONESIE%BUT KNOW IT WON’T SUIT ME%.
57. A FORMER OVERRIDE OF Q W E R T Y OUTSIDE THE POPE%.
58. F B I SAYS THAT C I A SAYS%I’LL STAY AWAY FROM IT%.
59. ANY CLIMBING DISH LISTENS TO A CUMBERSOME FORMULA%.
60. SHE WROTE HIM A LONG LETTER%BUT HE DIDN’T READ IT%.
61. DEAR%BEAUTY IS IN THE HEAT NOT PHYSICAL%I LOVE YOU%.
62. AN APPEAL ON JANUARY FIFTH DUPLICATES A SHARP QUEEN%.
63. A FAREWELL SOLOS ON MARCH TWENTY THIRD SHAKES NORTH%.
64. HE RAN OUT OF MONEY%SO HE HAD TO STOP PLAYING POKER%.
65. FOR EXAMPLE%A NEWSPAPER HAS ONLY REGIONAL DISTRIBUTION T%.
66. I CURRENTLY HAVE FOUR WINDOWS OPEN UP%AND I DON’T KNOW WHY%.
67. NEXT TO MY INDIRECT VOCAL DECLINES EVERY UNBEARABLE ACADEMIC%.
68. OPPOSITE HER SOUNDING BAG IS A M C’S CONFIGURED THOROUGHFARE%.
69. FROM APRIL EIGHTH TO THE PRESENT%I ONLY SMOKE FOUR CIGARETTES%.
70. I WILL NEVER BE THIS YOUNG AGAIN%EVER%OH DAMN%I JUST GOT OLDER%.
71. A GENEROUS CONTINUUM OF AMAZON DOT COM IS THE CONFLICTING WORKER%.
72. SHE ADVISED HIM TO COME BACK AT ONCE%THE WIFE LECTURES THE BLAST%.
73. A SONG CAN MAKE OR RUIN A PERSON’S DAY IF THEY LET IT GET TO THEM%.
74. SHE DID NOT CHEAT ON THE TEST%FOR IT WAS NOT THE RIGHT THING TO DO%.


75. HE SAID HE WAS NOT THERE YESTERDAY%HOWEVER%MANY PEOPLE SAW HIM THERE%.
76. SHOULD WE START CLASS NOW%OR SHOULD WE WAIT FOR EVERYONE TO GET HERE%?
77. IF PURPLE PEOPLE EATERS ARE REAL%WHERE DO THEY FIND PURPLE PEOPLE TO EAT%?
78. ON NOVEMBER EIGHTEENTH EIGHTEEN TWENTY ONE%A GLITTERING GEM IS NOT ENOUGH%.
79. A ROCKET FROM SPACE X INTERACTS WITH THE INDIVIDUAL BENEATH THE SOFT FLAW%.
80. MALLS ARE GREAT PLACES TO SHOP%I CAN FIND EVERYTHING I NEED UNDER ONE ROOF%.
81. I THINK I WILL BUY THE RED CAR%OR I WILL LEASE THE BLUE ONE%THE FAITH NESTS%.
82. ITALY IS MY FAVORITE COUNTRY%IN FACT%I PLAN TO SPEND TWO WEEKS THERE NEXT YEAR%.
83. I WOULD HAVE GOTTEN W W W DOT GOOGLE DOT COM%BUT MY ATTENDANCE WASN’T GOOD ENOUGH%.
84. NINETEEN TWENTY IS WHEN WE ARE UNIQUE TOGETHER UNTIL WE REALISE%WE ARE ALL THE SAME%.
85. MY MUM TRIES TO BE COOL BY SAYING H T T P COLON SLASH SLASH W W W B A I D U DOT COM%.
86. HE TURNED IN THE RESEARCH PAPER ON FRIDAY%OTHERWISE%HE EMAILED A S D F AT YAHOO DOT ORG%.
87. SHE WORKS TWO JOBS TO MAKE ENDS MEET%AT LEAST%THAT WAS HER REASON FOR NOT HAVING TIME TO JOIN US%.
88. A REMARKABLE WELL PROMOTES THE ALPHABET INTO THE ADJUSTED LUCK%THE DRESS DODGES ACROSS MY ASSAULT%.
89. A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ONE TWO THREE FOUR FIVE SIX SEVEN EIGHT NINE TEN%.
90. ACROSS THE WASTE PERSISTS THE WRONG PACIFIER%THE WASHED PASSENGER PARADES UNDER THE INCORRECT COMPUTER%.
91. IF THE EASTER BUNNY AND THE TOOTH FAIRY HAD BABIES WOULD THEY TAKE YOUR TEETH AND LEAVE CHOCOLATE FOR YOU%?
92. SOMETIMES%ALL YOU NEED TO DO IS COMPLETELY MAKE AN ASS OF YOURSELF AND LAUGH IT OFF TO REALISE THAT LIFE ISN’T SO BAD AFTER ALL%.
93. SHE BORROWED THE BOOK FROM HIM MANY YEARS AGO AND HASN’T YET RETURNED IT%WHY WON’T THE DISTINGUISHING LOVE JUMP WITH THE JUVENILE%?
94. LAST FRIDAY IN THREE WEEK’S TIME I SAW A SPOTTED STRIPED BLUE WORM SHAKE HANDS WITH A LEGLESS LIZARD%THE LAKE IS A LONG WAY FROM HERE%.
95. I WAS VERY PROUD OF MY NICKNAME THROUGHOUT HIGH SCHOOL BUT TODAY%I COULDN’T BE ANY DIFFERENT TO WHAT MY NICKNAME WAS%THE METAL LUSTS%THE RANGING CAPTAIN CHARTERS THE LINK%.
96. I AM HAPPY TO TAKE YOUR DONATION%ANY AMOUNT WILL BE GREATLY APPRECIATED%THE WAVES WERE CRASHING ON THE SHORE%IT WAS A LOVELY SIGHT%THE PARADOX STICKS THIS BOWL ON TOP OF A SPONTANEOUS TEA%.
97. A PURPLE PIG AND A GREEN DONKEY FLEW A KITE IN THE MIDDLE OF THE NIGHT AND ENDED UP SUNBURNT%THE CONTAINED ERROR POSES AS A LOGICAL TARGET%THE DIVORCE ATTACKS NEAR A MISSING DOOM%THE OPERA FINES THE DAILY EXAMINER INTO A MURDERER%.
98. AS THE MOST FAMOUS SINGLER-SONGWRITER%JAY CHOU GAVE A PERFECT PERFORMANCE IN BEIJING ON MAY TWENTY FOURTH%TWENTY FIFTH%AND TWENTY SIXTH TWENTY THREE ALL THE FANS THOUGHT HIGHLY OF HIM AND TOOK PRIDE IN HIM ALL THE TICKETS WERE SOLD OUT%.
99. IF YOU LIKE TUNA AND TOMATO SAUCE%TRY COMBINING THE TWO%IT’S REALLY NOT AS BAD AS IT SOUNDS%THE BODY MAY PERHAPS COMPENSATES FOR THE LOSS OF A TRUE METAPHYSICS%THE CLOCK WITHIN THIS BLOG AND THE CLOCK ON MY LAPTOP ARE ONE HOUR DIFFERENT FROM EACH OTHER%.
100. SOMEONE I KNOW RECENTLY COMBINED MAPLE SYRUP AND BUTTERED POPCORN THINKING IT WOULD TASTE LIKE CARAMEL POPCORN%IT DIDN’T AND THEY DON’T RECOMMEND ANYONE ELSE DO IT EITHER%THE GENTLEMAN MARCHES AROUND THE PRINCIPAL%THE DIVORCE ATTACKS NEAR A MISSING DOOM%THE COLOR MISPRINTS A CIRCULAR WORRY ACROSS THE CONTROVERSY%.