



Language Style Transfer from Sentences with Arbitrary Unknown Styles

Yanpeng Zhao1∗, Wei Bi2†, Deng Cai, Xiaojiang Liu2, Kewei Tu1, Shuming Shi2
1ShanghaiTech University, Shanghai, China

2Tencent AI Lab, Shenzhen, China
{zhaoyp1,tukw}@shanghaitech.edu.cn, [email protected]

{victoriabi,kieranliu,shumingshi}@tencent.com

Abstract

Language style transfer is the problem of migrating the content of a source sentence to a target style. In many of its applications, parallel training data are not available, and the source sentences to be transferred may have arbitrary and unknown styles. Under this problem setting, we propose an encoder-decoder framework. First, each sentence is encoded into its content and style latent representations. Then, by recombining the content with the target style, we decode a sentence aligned in the target domain. To adequately constrain the encoding and decoding functions, we couple them with two loss functions. The first is a style discrepancy loss, enforcing that the style representation accurately encodes the style information, guided by the discrepancy between the sentence style and the target style. The second is a cycle consistency loss, which ensures that the transferred sentence preserves the content of the original sentence disentangled from its style. We validate the effectiveness of our model in three tasks: sentiment modification of restaurant reviews, dialog response revision with a romantic style, and sentence rewriting with a Shakespearean style.

1 Introduction

Style transfer aims at migrating the content of a sample from a source style to a target style. Recently, great progress has been achieved by applying deep neural networks to redraw an image in a particular style (Kulkarni et al., 2015; Liu and Tuzel, 2016; Gatys et al., 2016; Zhu et al., 2017; Luan et al., 2017). However, so far very few approaches have been proposed for style transfer of natural language sentences, i.e., changing the style or genre of a sentence while preserving its semantic content.

∗ This work was done while Yanpeng Zhao was with Tencent AI Lab.

†Corresponding author.

user        Have you ever been to Moribor before?
bot         Yes, I have been to Moribor twice.
transferred Yes, I have been to Moribor twice, it is beautiful.
user        How do you like your neighbors there?
bot         They were very polite and quiet throughout the night.
transferred They were very kind and made me feel comfortable.
user        What was the weather like during your stay?
bot         The muggy weather made me cranky.
transferred The warm sunlight made it a great day.
user        I heard you even met your father there.
bot         Um, I thought he would not go there.
transferred Um, I was very glad to meet him there.
user        How is he getting on?
bot         Terrible! No one wants to make friends with him.
transferred Great! Everyone wants to be his friend.

Table 1: Chatbot responses have inconsistent sentiments. Transferred responses have a consistent positive sentiment regardless of the sentiments of the original responses (bot).

For example, we would like a system that can convert a given text piece into the language of Shakespeare (Mueller et al., 2017), rewrite product reviews with a favored sentiment (Shen et al., 2017), or generate responses with a consistent persona (Li et al., 2016).

A big challenge faced by language style transfer is that large-scale parallel data are unavailable. However, parallel data are necessary for most text generation frameworks, such as the popular sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2014; Rush et al., 2015; Nallapati et al., 2016; Paulus et al., 2017). Hence these methods are not applicable to the language style transfer problem. A few approaches have been proposed to deal with non-parallel data (Hu et al., 2017; Shen et al., 2017). Most of these approaches try to learn a latent representation of the content disentangled from the source style, and then recombine it with the target style to generate the corresponding sentence.

All the above approaches assume that data have only two styles, and their task is to transfer sentences from one style to the other. However, in many practical settings, we may deal with sentences with arbitrary unknown styles. Suppose we



are building chatbots. A good chatbot needs to exhibit a consistent persona so that it can gain the trust of users. However, existing chatbots such as Siri, Cortana, and XiaoIce (Shum et al., 2018) lack the ability to generate responses with a consistent persona throughout a conversation. Table 1 shows some examples: the chatbot responds to the user with varying sentiments (neutral, positive, or negative). One possible solution is to transfer the generated chatbot responses into a target persona before sending them to users. Hence, in this paper, we study the setting of language style transfer in which the source data to be transferred can have arbitrary unknown styles.

Another challenge in language style transfer is that the transferred sentence should preserve the content of the original sentence disentangled from its style. To tackle this problem, Shen et al. (2017) assumed that the source and target domains share the same latent content space, and trained their model by aligning these two latent spaces. Hu et al. (2017) required that the latent content representation of the original sentence be inferable from the transferred sentence. However, these attempts consider content modification only in the latent content space, not in the sentence space.

The contributions of this paper are threefold:

1. We address a new style transfer task in which sentences in the source domain can have arbitrary language styles while those in the target domain share a single language style.
2. We propose a style discrepancy loss to learn disentangled representations of content and style. This loss enforces that the discrepancy between an arbitrary style representation and the target style representation should be consistent with the closeness of its sentence style to the target style. Additionally, we employ a cycle consistency loss to avoid content change.
3. We evaluate our model on three tasks: sentiment modification of restaurant reviews, dialog response revision with a romantic style, and sentence rewriting with a Shakespearean style. Experimental results show that our model surpasses the state-of-the-art style transfer model (Shen et al., 2017) on all three tasks.

2 Related Work

Image Style Transfer: Most style transfer approaches in the literature focus on vision data. Kulkarni et al. (2015) proposed to disentangle the content representations from image attributes, and to control image generation by manipulating the graphics code that encodes the attribute information. Gatys et al. (2016) used Convolutional Neural Networks (CNNs) to learn separate representations of the image content and style, and then created a new image from their combination. Some approaches align the two data domains with the idea of generative adversarial networks (GANs) (Goodfellow et al., 2014). Liu and Tuzel (2016) proposed a coupled GAN framework to learn a joint distribution of multi-domain data through a weight-sharing constraint. Zhu et al. (2017) introduced a cycle consistency loss, which minimizes the gap between the transferred images and the original ones. However, due to the discreteness of natural language, this loss function cannot be directly applied to text data. In our work, we show how the idea of cycle consistency can be used on text data.

Text Style Transfer: To handle the non-parallel data problem, Mueller et al. (2017) revised the latent representation of a sentence in a direction guided by a classifier, so that the decoded sentence imitates those favored by the classifier. Ficler and Goldberg (2017) encoded textual property values with embedding vectors, and adopted a conditioned language model to generate sentences satisfying the specified content and style properties. Li et al. (2018) demonstrated that a simple delete-retrieve-generate approach can achieve good performance on sentiment transfer tasks. Hu et al. (2017) used the variational auto-encoder (VAE) (Kingma and Welling, 2013) to encode a sentence into a latent content representation disentangled from the source style, and then recombined it with the target style to generate its counterpart. Shen et al. (2017) considered transferring between two styles simultaneously. They utilized adversarial training to align the generated sentences from one style to the data domain of the other style. We also adopt similar adversarial training in our model. However, since we assume the source domain contains data with various and possibly unknown styles, we cannot apply a discriminator to determine whether a sentence transferred from the target domain is aligned in the source domain as in Shen et al. (2017).


3 Problem Formulation

We now formally present our problem formulation. Suppose there are two data domains: a source domain $\mathcal{X}_s$, in which each sentence may have its own language style, and a target domain $\mathcal{X}_t$, consisting of data with the same language style. During training, we observe $n$ samples from $\mathcal{X}_s$ and $m$ samples from $\mathcal{X}_t$, denoted as $X_s = \{x_s^{(1)}, x_s^{(2)}, \ldots, x_s^{(n)}\}$ and $X_t = \{x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(m)}\}$. Note that we can hardly find a sentence pair $(x_s^{(i)}, x_t^{(j)})$ that describes the same content. Our task is to design a model that learns from these non-parallel training data such that, for an unseen test sentence $x \in \mathcal{X}_s$, we can transfer it into its counterpart $\tilde{x} \in \mathcal{X}_t$, where $\tilde{x}$ preserves the content of $x$ but has the language style of $\mathcal{X}_t$.
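To make the setting concrete, the following minimal Python sketch (our own illustration; the sentences are lifted from the paper's example tables and the variable names are ours) shows the shape of the training data: two unpaired corpora with no sentence-level alignment between them.

```python
# Toy illustration of the non-parallel setting (examples taken from Tables 1 and 5).
# X_s mixes arbitrary unknown styles; X_t has a single target style (positive);
# there is no pairing between the two lists.
X_s = [
    "the muggy weather made me cranky .",            # negative
    "service was okay but joseph is just rude .",    # negative
    "i have tried to go to them twice silly me .",   # closer to neutral
]
X_t = [
    "i am so happy to have found this place .",
    "they are very friendly and made you feel comfortable .",
]

# The model must learn the transfer x -> x_tilde from these unpaired sets alone;
# at test time an unseen source sentence is mapped into the target style.
```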

4 Model

4.1 Encoder-decoder Framework

We assume each sentence $x$ can be decomposed into two representations: a style representation $\mathbf{y} \in \mathcal{Y}$ and a content representation $\mathbf{z} \in \mathcal{Z}$, which is disentangled from its style. Each sentence $x_s^{(i)} \in \mathcal{X}_s$ has its individual style $\mathbf{y}_s^{(i)}$, while all sentences $x_t^{(j)} \in \mathcal{X}_t$ share the same style, denoted as $\mathbf{y}^*$. Our model is built upon the encoder-decoder framework. In the encoding module, we assume that $\mathbf{z}$ and $\mathbf{y}$ of a sentence $x$ can be obtained through two encoding functions $E_z(x)$ and $E_y(x)$ respectively:
$$\mathbf{z}_s^{(i)} = E_z(x_s^{(i)}), \quad \mathbf{y}_s^{(i)} = E_y(x_s^{(i)}),$$
$$\mathbf{z}_t^{(j)} = E_z(x_t^{(j)}), \quad \mathbf{y}^* = E_y(x_t^{(j)}),$$
where $i \in \{1, 2, \ldots, n\}$, $j \in \{1, 2, \ldots, m\}$, $E_y(x) = \mathbb{1}_{\{x \in \mathcal{X}_s\}} \cdot g(x) + \mathbb{1}_{\{x \in \mathcal{X}_t\}} \cdot \mathbf{y}^*$, and $\mathbb{1}_{\{\cdot\}}$ is an indicator function. When a sentence $x$ comes from the source domain, we use a function $g(x)$ to encode its style representation. For $x$ from the target domain, a shared style representation $\mathbf{y}^*$ is used. Both $\mathbf{y}^*$ and the parameters of $g(x)$ are learnt jointly with the other parameters of our model.
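As a concrete reading of the indicator-based definition of $E_y$, the following PyTorch-style sketch (our own illustration, not the authors' TensorFlow code; the class and argument names are assumptions) returns the shared learned vector y* for target-domain sentences and g(x) otherwise.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Sketch of E_y: a learned function g(x) for source-domain sentences and a single
    shared, trainable vector y* for every target-domain sentence (style_dim follows
    Section 5.2). The module g is left abstract, e.g. a CNN over word embeddings."""

    def __init__(self, g: nn.Module, style_dim: int = 500):
        super().__init__()
        self.g = g
        self.y_star = nn.Parameter(torch.zeros(style_dim))  # shared target style y*

    def forward(self, x, from_target_domain: bool):
        # E_y(x) = 1{x in X_s} * g(x) + 1{x in X_t} * y*
        if from_target_domain:
            return self.y_star.unsqueeze(0).expand(x.size(0), -1)
        return self.g(x)
```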

For the decoding module, we first employ a reconstruction loss to encourage that the sentence produced by the decoding function, given $\mathbf{z}$ and $\mathbf{y}$ of a sentence $x$, can well reconstruct $x$ itself. Here, we use a probabilistic generator $G$ as the decoding function, and the reconstruction loss is:
$$\mathcal{L}_{rec}(\theta_{E_z}, \theta_{E_y}, \theta_G) = \mathbb{E}_{x_s \sim \mathcal{X}_s}\big[-\log p_G(x_s \mid \mathbf{y}_s, \mathbf{z}_s)\big] + \mathbb{E}_{x_t \sim \mathcal{X}_t}\big[-\log p_G(x_t \mid \mathbf{y}^*, \mathbf{z}_t)\big], \tag{1}$$
where $\theta$ denotes the parameters of the corresponding module.
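A minimal sketch of Equation 1 in Python follows (our illustration, not the authors' code; tensor shapes and argument names are assumptions): each expectation becomes an average token-level negative log-likelihood of the reference sentence under the generator's output distribution.

```python
import torch.nn.functional as F

def reconstruction_loss(logits_s, x_s_tokens, logits_t, x_t_tokens, pad_id=0):
    """Sketch of Equation 1. `logits_*` are generator outputs of shape
    (batch, seq_len, vocab) conditioned on (y_s, z_s) and (y*, z_t) respectively;
    `*_tokens` are the reference token ids of shape (batch, seq_len)."""
    nll_s = F.cross_entropy(logits_s.transpose(1, 2), x_s_tokens, ignore_index=pad_id)
    nll_t = F.cross_entropy(logits_t.transpose(1, 2), x_t_tokens, ignore_index=pad_id)
    return nll_s + nll_t
```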

To enable style transfer using non-parallel training data, we enforce that for a sample $x_s \in \mathcal{X}_s$, its decoded sequence from $G$, given its content representation $\mathbf{z}_s$ and the target style $\mathbf{y}^*$, should lie in the target domain $\mathcal{X}_t$. We use the idea of GANs (Goodfellow et al., 2014) and introduce an adversarial loss to be minimized in decoding. The goal of the discriminator $D$ is to distinguish between $G(\mathbf{z}_s, \mathbf{y}^*)$ and $G(\mathbf{z}_t, \mathbf{y}^*)$, while the generator tries to bewilder the discriminator:
$$\mathcal{L}_{adv}(\theta_D, \theta_G, \theta_{E_z}, \theta_{E_y}) = \mathbb{E}_{x_s \sim \mathcal{X}_s}\big[-\log\big(1 - D(G(\mathbf{z}_s, \mathbf{y}^*))\big)\big] + \mathbb{E}_{x_t \sim \mathcal{X}_t}\big[-\log D(G(\mathbf{z}_t, \mathbf{y}^*))\big]. \tag{2}$$
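The adversarial term can be sketched as below (our PyTorch-style illustration; how the discriminator consumes generated text, e.g. via soft word distributions, is omitted, and the sign convention follows Equations 2 and 8).

```python
import torch

def adversarial_loss(d_fake, d_real):
    """Sketch of Equation 2. `d_fake` = D(G(z_s, y*)) and `d_real` = D(G(z_t, y*)) are
    the discriminator's output probabilities for generated sequences. Under Equation 8,
    which subtracts lambda_1 * L_adv, maximizing L over D pushes d_fake toward 0 and
    d_real toward 1, while minimizing L over E_z, E_y, G pushes d_fake toward 1."""
    eps = 1e-8                                          # numerical stability
    loss_fake = -torch.log(1.0 - d_fake + eps).mean()   # -log(1 - D(G(z_s, y*)))
    loss_real = -torch.log(d_real + eps).mean()         # -log D(G(z_t, y*))
    return loss_fake + loss_real
```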

Remarks: The above encoder-decoder framework is under-constrained in two aspects:

1. For a sample $x_s \in \mathcal{X}_s$, $\mathbf{y}_s$ can be an arbitrary value that minimizes the losses in Equations 1 and 2, which may not necessarily capture the sentence style. This will affect the other decomposed part $\mathbf{z}_s$, making it not fully represent the content, which should be invariant to the style.
2. The discriminator can only encourage the generated sentence to be aligned with the target domain $\mathcal{X}_t$, but cannot guarantee that the content of the source sentence is kept intact.

To address the first problem, we propose a style discrepancy loss, which constrains the learnt $\mathbf{y}_s$ to have its distance from $\mathbf{y}^*$ guided by another discriminator that evaluates the closeness of the sentence style to the target style. For the second problem, inspired by the ideas in He et al. (2016) and Zhu et al. (2017), we introduce a cycle consistency loss applicable to word sequences, which requires that the generated sentence $\tilde{x}$ can be transferred back to the original sentence $x$.

4.2 Style Discrepancy Loss

By using a portion of the training data, we can first train a discriminator $D_s$ to predict whether a given sentence $x$ has the target language style, with an output probability denoted as $p_{D_s}(x \in \mathcal{X}_t)$.


[Figure 1 diagram: panels (a) and (b) show the components described in the caption below: the content encoder E_z, the style encoder E_y, the generator G, the discriminator D (real/fake), the pre-trained discriminator D_s providing p_Ds(x_s ∈ X_t) for the style discrepancy d(y_s, y*) and loss L_dis, and the cycle consistency loss L_cyc.]

Figure 1: (a) Basic model with the style discrepancy loss. Solid lines: encode and decode the sample itself; dashed lines: transfer x_s ∈ X_s into X_t. (b) Proposed cycle consistency loss (it can be applied to samples in X_t similarly).

When learning the decomposed style representation $\mathbf{y}_s$ for a sample $x_s \in \mathcal{X}_s$, we enforce that the discrepancy between this style representation and the target style representation $\mathbf{y}^*$ should be consistent with the output probability from $D_s$. Specifically, since the styles are represented with embedding vectors, we measure the style discrepancy using the $\ell_2$ norm:
$$d(\mathbf{y}_s, \mathbf{y}^*) = \lVert \mathbf{y}_s - \mathbf{y}^* \rVert_2. \tag{3}$$
Intuitively, if a sentence has a larger probability of being considered to have the target style, its style representation should be closer to the target style representation $\mathbf{y}^*$. Thus, we would like $d(\mathbf{y}_s, \mathbf{y}^*)$ to be positively correlated with $1 - p_{D_s}(x_s \in \mathcal{X}_t)$. To incorporate this idea into our model, we use a probability density function $q(\mathbf{y}_s, \mathbf{y}^*)$ and define the style discrepancy loss as:
$$\mathcal{L}_{dis}(\theta_{E_y}) = \mathbb{E}_{x_s \sim \mathcal{X}_s}\big[-p_{D_s}(x_s \in \mathcal{X}_t) \log q(\mathbf{y}_s, \mathbf{y}^*)\big], \tag{4}$$
where $q(\mathbf{y}_s, \mathbf{y}^*) = f(d(\mathbf{y}_s, \mathbf{y}^*))$, $f(\cdot)$ is a valid probability density function, and $D_s$ is pre-trained and then fixed. If a sentence $x_s$ has a large $p_{D_s}(x_s \in \mathcal{X}_t)$, incorporating the above loss into the encoder-decoder framework will encourage a large $q(\mathbf{y}_s, \mathbf{y}^*)$ and hence a small $d(\mathbf{y}_s, \mathbf{y}^*)$, which means $\mathbf{y}_s$ will be close to $\mathbf{y}^*$. In our experiments, we instantiate $q(\mathbf{y}_s, \mathbf{y}^*)$ with the standard normal density for simplicity:
$$q(\mathbf{y}_s, \mathbf{y}^*) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{d(\mathbf{y}_s, \mathbf{y}^*)^2}{2}\Big). \tag{5}$$
However, better probability density functions can be used if we have prior knowledge of the style distribution. With Equation 5, the style discrepancy loss can be equivalently minimized as:
$$\mathcal{L}_{dis}(\theta_{E_y}) = \mathbb{E}_{x_s \sim \mathcal{X}_s}\big[p_{D_s}(x_s \in \mathcal{X}_t)\, d(\mathbf{y}_s, \mathbf{y}^*)^2\big]. \tag{6}$$
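Equation 6 reduces to a weighted squared distance, as in this sketch (our illustration; the D_s outputs are assumed to be precomputed and detached since D_s is frozen).

```python
import torch

def style_discrepancy_loss(y_s, y_star, p_target):
    """Sketch of Equation 6, the form obtained with the standard-normal choice of q.
    y_s:      (batch, style_dim) style representations of source sentences.
    y_star:   (style_dim,)       shared target style representation y*.
    p_target: (batch,)           p_Ds(x_s in X_t) from the pre-trained, frozen D_s.
    Sentences that D_s already judges close to the target style are pushed to have
    a small distance d(y_s, y*)."""
    d = torch.norm(y_s - y_star.unsqueeze(0), p=2, dim=1)  # Equation 3
    return (p_target * d.pow(2)).mean()
```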

Note that $D_s$ is not jointly trained in our model. The reason is that, if we integrated it into the end-to-end training, we might start with a $D_s$ of low accuracy; our model would then be inclined to optimize a wrong style discrepancy loss for many epochs and get stuck in a poor local optimum.

4.3 Cycle Consistency Loss

Inspired by He et al. (2016) and Zhu et al. (2017), we require that a sentence transferred by the generator $G$ preserve the content of its original sentence, and thus it should have the capacity to recover the original sentence in a cyclic manner. For a sample $x_s \in \mathcal{X}_s$ with its transferred sentence $\tilde{x}_s$ having the target style $\mathbf{y}^*$, we encode $\tilde{x}_s$ and combine its content $\tilde{\mathbf{z}}_s$ with the original style $\mathbf{y}_s$ for decoding. We expect that, with high probability, the original sentence $x_s$ is generated. For a sample $x_t \in \mathcal{X}_t$, although we do not aim to change its language style in our task, we can still compute its cycle consistency loss as an additional regularization. We first choose an arbitrary style $\mathbf{y}_s$ obtained from a sentence in $\mathcal{X}_s$ and transfer $x_t$ into this style. Next, we feed the generated sentence $\tilde{x}_t$ into the encoder-decoder model with the style $\mathbf{y}^*$, and the original sentence $x_t$ should be generated. Formally, the cycle consistency loss is:
$$\mathcal{L}_{cyc}(\theta_{E_z}, \theta_{E_y}, \theta_G) = \mathbb{E}_{x_s \sim \mathcal{X}_s}\big[-\log p_G(x_s \mid E_z(\tilde{x}_s), \mathbf{y}_s)\big] + \mathbb{E}_{x_t \sim \mathcal{X}_t}\big[-\log p_G(x_t \mid E_z(\tilde{x}_t), \mathbf{y}^*)\big]. \tag{7}$$
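The source-side half of Equation 7 can be read as the following pipeline (a hedged sketch: `E_z`, `E_y`, `G.generate`, and `G.log_prob` are hypothetical callables standing in for the modules above, and back-propagation through the discrete intermediate sentence is glossed over).

```python
def cycle_loss_source_side(x_s, E_z, E_y, G, y_star):
    """Sketch of the first term of Equation 7: transfer x_s to the target style,
    re-encode the transferred sentence, and ask G to reproduce x_s from that
    content together with the original style y_s. All callables are hypothetical."""
    z_s = E_z(x_s)
    y_s = E_y(x_s, from_target_domain=False)
    x_tilde_s = G.generate(z_s, y_star)       # transferred sentence in the target style
    z_tilde_s = E_z(x_tilde_s)                # content of the transferred sentence
    return -G.log_prob(x_s, z_tilde_s, y_s)   # -log p_G(x_s | E_z(x_tilde_s), y_s)
```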


4.4 Full Objective

An illustration of our basic model with the style discrepancy loss is shown in Figure 1(a), and the full model with the cycle consistency loss is shown in Figure 1(b). To summarize, the full loss function of our model is:
$$\mathcal{L}(\theta_{E_z}, \theta_{E_y}, \theta_G, \theta_D) = \mathcal{L}_{rec} - \lambda_1 \mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{cyc} + \lambda_3 \mathcal{L}_{dis}, \tag{8}$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$ are parameters balancing the relative importance of the different loss terms. The overall training objective is a minimax game played among the encoders $E_z$ and $E_y$, the generator $G$, and the discriminator $D$:
$$\min_{E_z, E_y, G} \max_{D} \mathcal{L}(\theta_{E_z}, \theta_{E_y}, \theta_G, \theta_D). \tag{9}$$

We implement the encoder $E_z$ as an RNN whose last hidden state is the content representation, and the style encoder $g(x)$ as a CNN whose last-layer output is the style representation. The generator $G$ is an RNN that takes the concatenation of the content and style representations as its initial hidden state. The discriminator $D$ and the pre-trained discriminator $D_s$ used in the style discrepancy loss are CNNs with a network structure similar to that of $E_y$, followed by a sigmoid output layer.
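Putting Equations 8 and 9 together, one optimization round can be sketched as alternating updates (our illustration; `model`, its loss methods, and the two optimizers are hypothetical wrappers around the components described above, with `opt_disc` holding only D's parameters and `opt_main` only those of E_z, E_y, and G; the lambda values follow Section 5.2).

```python
def training_step(batch_s, batch_t, model, opt_main, opt_disc, lambdas=(1.0, 1.0, 5.0)):
    """Sketch of one optimization round for Equations 8 and 9."""
    l1, l2, l3 = lambdas

    # (1) Discriminator step: maximizing L w.r.t. D amounts to minimizing lambda1 * L_adv.
    l_adv = model.adversarial_loss(batch_s, batch_t)
    opt_disc.zero_grad()
    (l1 * l_adv).backward()
    opt_disc.step()

    # (2) Encoder/generator step:
    #     minimize L_rec - lambda1 * L_adv + lambda2 * L_cyc + lambda3 * L_dis.
    l_rec = model.reconstruction_loss(batch_s, batch_t)
    l_adv = model.adversarial_loss(batch_s, batch_t)   # recompute on a fresh graph
    l_cyc = model.cycle_consistency_loss(batch_s, batch_t)
    l_dis = model.style_discrepancy_loss(batch_s)
    loss = l_rec - l1 * l_adv + l2 * l_cyc + l3 * l_dis
    opt_main.zero_grad()                               # also clears grads left from step (1)
    loss.backward()
    opt_main.step()
    return loss.item()
```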

5 Experiments

5.1 Datasets

Yelp: Raw data are from the Yelp Dataset Challenge Round 10, which consists of restaurant reviews on Yelp. Generally, reviews rated with 4 or 5 stars are considered positive, those with 1 or 2 stars negative, and those with 3 stars neutral. For positive and negative reviews, we use the processed data released by Shen et al. (2017), which contain 250k negative sentences and 350k positive sentences. For neutral reviews, we follow steps similar to Shen et al. (2017) to process and select the data. We first select neutral reviews (rated with 3 stars and categorized with the keyword 'restaurant') and filter out those whose length exceeds 15 or is less than 3 (a sketch of this filter is given after this subsection). Then, the data selection method of Moore and Lewis (2010) is used to ensure a large enough vocabulary overlap between the neutral data and the data in Shen et al. (2017). Afterwards, we sample 500k sentences from the resulting dataset as the neutral data. We use the positive data as the target style domain. Based on the three classes of data, we construct two source datasets with multiple styles:

• Positive+Negative (Pos+Neg): we add different numbers of positive sentences (50k, 100k, 150k) into the negative data so that the source domain contains data with two sentiments.
• Neutral+Negative (Neu+Neg): we combine neutral sentences (50k, 100k, 150k) and negative data together as the source data.

We consider the Neu+Neg dataset harder to learn from than the Pos+Neg dataset: for Pos+Neg, one can use a pre-trained classifier to filter out some positive data so that most of the source data have the same style and the model of Shen et al. (2017) can work, whereas the neutral data cannot be removed in this way. Moreover, much real-world data may carry a neutral sentiment, and we want to see whether such sentences can be transferred well.

Chat: We use sentences from a real Chinese dialog dataset as the source domain. Users chat with various personalized language styles, which are not easily classified into one of the three sentiments as in Yelp. Romantic sentences are collected from several online novel websites and filtered by human annotators. The dataset has 400k romantic sentences and 800k general sentences. Our task is to transfer the dialog sentences to a romantic style, characterized by the selected romantic sentences.

Shakespeare: We experiment on revising modern text in the Shakespearean style at the sentence level, as in Mueller et al. (2017). Following their experimental setup, we collect 29,388 sentences authored by Shakespeare and 54,800 sentences from non-Shakespeare-authored works. The length of all the sentences ranges from 3 to 15.

Each of the above three datasets is divided into three non-overlapping parts, which are respectively used for the style transfer model, the pre-trained discriminator Ds, and the evaluation classifier used in Section 5.3. Each part is further divided into training, test, and validation sets. One exception is that Ds for Shakespeare is trained on a subset of the data used for training the style transfer model, because the Shakespeare dataset is small. Statistics of the data for training and evaluating the style transfer model are given in the supplementary material.
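As an illustration of the Yelp neutral-review selection step described above, a sketch of the rating and length filter follows (the raw-review field names `stars`, `categories`, and `text` are assumptions about the Yelp Challenge schema, not taken from the paper; the Moore and Lewis (2010) selection step is not shown).

```python
def keep_neutral_review(review: dict) -> bool:
    """Sketch of the neutral-review filter for Yelp: keep 3-star restaurant reviews
    whose tokenized length is between 3 and 15. Field names are assumptions."""
    tokens = review["text"].lower().split()
    return (
        review["stars"] == 3
        and "restaurant" in review.get("categories", "").lower()
        and 3 <= len(tokens) <= 15
    )
```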


5.2 Compared Methods and Configurations

We compare our method with Shen et al. (2017), the state-of-the-art language style transfer model for non-parallel data, which we refer to as the Style Transfer Baseline (STB). As described in Section 2 and Section 4, STB is built upon an auto-encoder framework. It focuses on transferring sentences from one style to the other, with the source and target language styles represented by two embedding vectors, and relies on adversarial training to align the content spaces of the two domains. For a fair comparison, we keep the configurations of the modules in STB, such as the encoder, decoder, and discriminator, the same as ours.

We implement our model using TensorFlow (Abadi et al., 2016). We use GRUs as the encoder and generator cells in our encoder-decoder framework. Dropout (Srivastava et al., 2014) is applied in the GRUs with a dropout probability of 0.5. Throughout our experiments, we set the dimensions of the word embedding, content representation, and style representation to 200, 1000, and 500 respectively. For the style encoder g(x), we follow the CNN architecture of Kim (2014) and use filter sizes of 200 × {1, 2, 3, 4, 5} with 100 feature maps each, so that the resulting output layer is of size 500, i.e., the dimension of the style representation. The pre-trained discriminator Ds is implemented similarly to g(x) but uses filter sizes of 200 × {2, 3, 4, 5} with 250 feature maps each. The testing accuracy of the pre-trained Ds is 95.23% for Yelp, 87.60% for Chat, and 87.60% for Shakespeare. We further set the balancing parameters λ1 = λ2 = 1 and λ3 = 5, and train the model using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 10^-4. All input sentences are padded to the same length: 20 for Yelp and Shakespeare and 35 for Chat. Furthermore, when training the classifiers, we use the pre-trained GloVe word embeddings (Pennington et al., 2014) for Yelp and Shakespeare, and Chinese word embeddings trained on a large amount of Chinese news data for Chat.
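The style encoder configuration above can be written down as the following PyTorch-style sketch (our illustration following Kim (2014), not the authors' TensorFlow code): five 1-D convolutions over 200-dimensional word embeddings with window widths 1 to 5 and 100 feature maps each, max-pooled and concatenated into the 500-dimensional style vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNStyleEncoder(nn.Module):
    """Sketch of the style encoder g(x) with the sizes given in Section 5.2."""

    def __init__(self, emb_dim=200, widths=(1, 2, 3, 4, 5), maps_per_width=100):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, maps_per_width, kernel_size=w) for w in widths
        )

    def forward(self, emb):                    # emb: (batch, seq_len, emb_dim)
        x = emb.transpose(1, 2)                # -> (batch, emb_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)        # (batch, 500) style representation
```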

5.3 Evaluation Setup

Following Shen et al. (2017), we use a model-based evaluation metric. Specifically, we use a pre-trained evaluation classifier to classify whether a transferred sentence has the correct style. The evaluation classifier is implemented with the same network structure as the discriminator Ds. The testing accuracy of the evaluation classifiers is 95.36% for Yelp, 87.05% for Chat, and 88.70% for Shakespeare. We repeat the training three times for each experimental setting and report the mean accuracy on the testing data with the standard deviation.
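The model-based metric amounts to classifier accuracy on the transferred outputs, as in this small sketch (the `classifier` callable and the label convention are assumptions for illustration).

```python
def transfer_accuracy(transferred_sentences, classifier, target_label=1):
    """Sketch of the model-based metric: the fraction of transferred sentences that
    a pre-trained evaluation classifier assigns to the target style."""
    predictions = [classifier(s) for s in transferred_sentences]
    correct = sum(1 for p in predictions if p == target_label)
    return correct / max(len(predictions), 1)
```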

5.4 Experiment Roadmap

We perform a set of experiments on Yelp, which is also used in Shen et al. (2017), to systematically investigate the effectiveness of our model:

1. We validate the usefulness of the proposed style discrepancy loss and the cycle consistency loss in our setting (Sec 5.5.1).
2. We vary the difficulty level of the source data and verify the robustness of our model (Sec 5.5.2).
3. We look into some transferred sentences and analyze successful cases and failed cases (Sec 5.5.3).
4. We perform human evaluations to evaluate the overall quality of the transferred sentences and check whether the results are consistent with those using the model-based evaluation (Sec 5.5.4).

After conducting this thorough study of our model on Yelp, we apply our model to Chat and Shakespeare, two real datasets in which the source domain has various language styles.

5.5 Experiments and Analysis on Yelp

5.5.1 Ablation Study

To investigate the effectiveness of the style discrepancy loss, we train our full model with the style discrepancy loss removed, on the Pos+Neg dataset only, whose source domain contains both positive and negative sentences. When the model converges, we find that the cycle consistency loss is much larger than that obtained by the full model in the same setting. We also manually check that almost all sentences fail to transfer their styles with this model. This indicates that the proposed style discrepancy loss is an indispensable part of our model.

#positive samples used   STB             STB (with Cyc)   Ours (without Cyc)   Ours
50k                      0.904 ± 0.033   0.912 ± 0.017    0.890 ± 0.027        0.943 ± 0.007
100k                     0.800 ± 0.043   0.846 ± 0.031    0.862 ± 0.010        0.940 ± 0.005
150k                     0.535 ± 0.086   0.678 ± 0.033    0.631 ± 0.168        0.934 ± 0.005

Table 2: Testing accuracies on Yelp with Pos+Neg source data.


We also validate the effectiveness of the cycle consistency loss.¹ Specifically, we compare two versions of both STB and our model, one with the cycle consistency loss and one without. We vary the number of positive sentences in the source domain; results are shown in Table 2. It can be seen that incorporating the cycle consistency loss consistently improves the performance of both STB and our proposed model.

5.5.2 Difficulty Level of Training Data

Pos+Neg as Source Data: We compare STB and our proposed model on the first dataset, Pos+Neg, using the results in Table 2. As the number of positive sentences in the source data increases, the average performance of both versions of STB decreases drastically. This is reasonable because STB introduces a discriminator to align the sentences from the target domain back to the source domain, and when the source domain contains more positive samples, it is hard to find a good alignment to the source domain. Meanwhile, the performance of our model, even the basic one without the cycle consistency loss, does not fluctuate much as the number of positive samples increases, showing that our model is not that sensitive to a source domain containing more than one sentiment. Overall, our model with the cycle consistency loss performs best.

#Neutral samples   STB (with Cyc)   Ours
50k                0.929 ± 0.011    0.946 ± 0.001
100k               0.937 ± 0.018    0.939 ± 0.000
150k               0.926 ± 0.019    0.936 ± 0.006

Table 3: Testing accuracies on Yelp with Neu+Neg source data.

Neu+Neg as Source Data: The dataset Pos+Neg is not so challenging because we can use a pre-trained discriminator, similar to Ds in our model, to remove the samples classified as positive with high probabilities, so that only sentences with a less positive sentiment remain in the source domain. Thus, we test on our second dataset, Neu+Neg. In this setting, in case some positive sentences exist in the neutral reviews, when STB is trained we use the same pre-trained discriminator as in our model to filter out samples classified as positive with probabilities larger than 0.9. In comparison, our model uses all the data, since it naturally allows for data with styles similar to the target style.

¹ Note that our proposed cycle consistency loss can be similarly added to STB.

Experimental results in Table 3 show that, with the same amount of neutral data mixed into the source domain, our model performs better than STB (with Cyc) and is relatively stable across multiple runs.

Limited Sentences in Target Domain: In real applications, there may be only a small amount of data in the target domain. To simulate this scenario, we limit the amount of target data (randomly sampled from the positive data) used for training, and evaluate the robustness of the compared methods. Table 4 shows the experimental results. It is surprising to see that both methods obtain relatively steady accuracies with different numbers of target samples. Yet, our model surpasses STB (with Cyc) in all cases.

#Target samples used   STB (with Cyc)   Ours
100k                   0.742 ± 0.009    0.894 ± 0.010
150k                   0.756 ± 0.020    0.907 ± 0.003
200k                   0.735 ± 0.022    0.906 ± 0.012

Table 4: Testing accuracies on Yelp with different numbers of target samples used.

5.5.3 Case Study

We manually examine the generated sentences for a detailed study. Overall, our full model can generate grammatically correct positive reviews without changing the original content in more cases than the other methods. In Table 5, we present some example results of the various methods. We can see that when the original sentence is simple, as in the first example, all models can transfer the sentence successfully. However, when the original sentence is complex, both versions of STB and our basic model (without the cycle consistency loss) fail to generate fluent sentences, while our full model still succeeds. One minor problem with our model is that it may use a wrong tense in transferred sentences (e.g., the last sentence transferred by our model), which, however, does not influence the sentence style and meaning much.

5.5.4 Human Evaluation

The model-based evaluation metric is inadequate for measuring whether a transferred sentence preserves the content of the source sentence (Fu et al., 2017). Therefore, we rely on human evaluations to assess model performance in content preservation.


Original Sentence    and just not very good .
STB                  but i was very good .
STB (with Cyc)       and just always very good .
Ours (without Cyc)   and just always very good .
Ours                 and i love it .

Original Sentence    i have tried to go to them twice silly me .
STB                  i 'm going to anyone when they need to .
STB (with Cyc)       i 've been to go here for years out .
Ours (without Cyc)   i have recommend to anyone 's your home needs .
Ours                 i have tried the place and it was great .

Original Sentence    i am so thankful to be out of this place .
STB                  i am so impressed to the experience i had .
STB (with Cyc)       i am always impressed to see for a family .
Ours (without Cyc)   i am so grateful for being on this place again .
Ours                 i am so happy to have found this place .

Original Sentence    service was okay but joseph is just rude .
STB                  service was well , but well .
STB (with Cyc)       service was friendly and everyone is .
Ours (without Cyc)   service was quick , and just funny .
Ours                 service is great and food was great .

Original Sentence    they were very loud and made noise throughout the night .
STB                  they were very good and made my own water .
STB (with Cyc)       they were very good and made out on the game .
Ours (without Cyc)   they were very nice and made the other time .
Ours                 they are very friendly and made you feel comfortable .

Table 5: Example sentences on Yelp transferred into a positive sentiment.

We randomly select 200 test samples from Yelp and perform human evaluations to estimate the overall quality of the transferred sentences, rated from 1 (failed), 2 (tolerable), 3 (satisfying), to 4 (perfect), by jointly considering content preservation, sentiment modification, and fluency.

We hire five annotators to evaluate the results. Since we have 9 settings and 24 different methods in total across Tables 2-4, we select one of the settings on Yelp due to a limited budget. Table 6 shows the averaged scores of the annotators' evaluations. As can be seen, when all three aspects are considered jointly, our model is better than the other methods. This result is also consistent with the automatic evaluation.

           STB             STB (with Cyc)   Ours (without Cyc)   Ours
Overall    2.352 ± 0.288   2.545 ± 0.206    2.771 ± 0.290        2.805 ± 0.314

Table 6: Human evaluation on Yelp when 150k positive sentences are added to the source domain (row 3 in Table 2).

#Target samples used   STB (with Cyc)   Ours
10k                    0.880 ± 0.016    0.962 ± 0.002
50k                    0.927 ± 0.012    0.980 ± 0.003
100k                   0.940 ± 0.005    0.968 ± 0.002
150k                   0.943 ± 0.004    0.965 ± 0.003

Table 7: Testing accuracies on Chat with different numbers of target samples used.

5.6 Experiments and Analysis on Chat

As in the Yelp experiments, we vary the number of target sentences to test the robustness of the compared methods. The experimental results are shown in Table 7. Several observations can be made. First, STB (with Cyc) obtains a relatively low performance with only 10k target samples; as more target samples are used, its performance increases. Second, our model achieves a high accuracy even with only 10k target samples and remains stable in all cases. Thus, our model achieves better performance as well as stronger robustness on Chat. Due to space limitations, we present a few examples in Table 19 in the supplementary material. We find that our model generally succeeds in transferring the sentence into a romantic style, with some romantic phrases used.

5.7 Experiments and Analysis on Shakespeare

The testing accuracies are shown in Table 8. We also present some example sentences in Table 9. Compared with STB, our model can generate sentences that are more fluent and more likely to have the correct target style. For example, the sentences transferred by our model use words such as 'sir', 'master', and 'lustre', which are common in Shakespearean works. However, we find that in this set of experiments, both STB and our model tend to generate short sentences and change the content of source sentences in more cases than on the Yelp and Chat datasets. We conjecture this is caused by the scarcity of training data. Sentences in the Shakespearean style form a vocabulary of 8,559 words, but almost 60% of these words appear fewer than 10 times. On the other hand, the source domain contains 19,962 words, yet only 5,211 words are common to the two vocabularies. Thus, aligned words/phrases may not exist in the dataset.

#Target samples used   STB (with Cyc)   Ours
21,888                 0.956 ± 0.014    0.965 ± 0.009

Table 8: Testing accuracies on Shakespeare.

6 Conclusion

We have presented an encoder-decoder framework for language style transfer.


Original Sentence   i should never have thought of such a thing .
STB (with Cyc)      i shall not have to for you .
Ours                i will never be thee , sir .

Original Sentence   do n't try to make any stupid moves .
STB (with Cyc)      i think you have to your .
Ours                do you walk , the master ?

Original Sentence   where were the hardships she had expected ?
STB (with Cyc)      what is it is it ?
Ours                where are the lustre here ?

Table 9: Example non-Shakespeare sentences transferred into a Shakespearean language style.

It allows for the use of non-parallel data, where the source data have various unknown language styles. Each sentence is encoded into two latent representations, one corresponding to its content disentangled from the style and the other representing the style only. By recombining the content with the target style, we can decode a sentence aligned in the target domain. Specifically, we propose two loss functions, the style discrepancy loss and the cycle consistency loss, to adequately constrain the encoding and decoding functions. The style discrepancy loss enforces a properly encoded style representation, while the cycle consistency loss ensures that style-transferred sentences can be transferred back to their original sentences. Experimental results on three tasks demonstrate that our proposed method outperforms previous style transfer methods.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2017. Style transfer in text: Exploration and evaluation. arXiv preprint arXiv:1711.06861.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In Proceedings of the International Conference on Machine Learning, pages 1587–1596.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Tejas D. Kulkarni, William F. Whitney, Pushmeet Kohli, and Josh Tenenbaum. 2015. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547.

Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 994–1003.

Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437.

Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477.

Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. 2017. Deep photo style transfer. arXiv preprint arXiv:1703.07511.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 220–224.


Jonas Mueller, David Gifford, and Tommi Jaakkola. 2017. Sequence to better sequence: Continuous revision of combinatorial structures. In Proceedings of the International Conference on Machine Learning, pages 2536–2544.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 379–389.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems.

Heung-Yeung Shum, Xiaodong He, and Di Li. 2018. From Eliza to XiaoIce: Challenges and opportunities with social chatbots. CoRR.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the International Conference on Computer Vision.


Supplementary Materials

This supplementary material contains the following: (1) Tables 10–18 show statistics of the data used in the experiments; (2) Table 19 shows some transferred sentences from the testing data of Chat.

                  Training   Test    Validation
Positive          240417     40000   20000
Negative          151026     40000   20000

Table 10: Statistics of Yelp for the style transfer model.

                  Training   Test    Validation
Romantic          207312     40000   40000
General           514460     40000   40000

Table 11: Statistics of Chat for the style transfer model.

                  Training   Test    Validation
Shakespeare       21888      1000    2000
Non-Shakespeare   43800      1000    2000

Table 12: Statistics of Shakespeare for the style transfer model.

                  Training   Test    Validation
Positive          75000      5000    2500
Negative          37500      5000    2500

Table 13: Statistics of Yelp for the discriminator Ds.

                  Training   Test    Validation
Positive          37500      1500    2250
Negative          18750      1500    2250

Table 14: Statistics of Yelp for the evaluation classifier.

                  Training   Test    Validation
Romantic          100000     10000   10000
General           200000     10000   10000

Table 15: Statistics of Chat for the discriminator Ds.

                  Training   Test    Validation
Romantic          50000      5000    5000
General           100000     5000    5000

Table 16: Statistics of Chat for the evaluation classifier.

                  Training   Test    Validation
Shakespeare       3500       500     500
Non-Shakespeare   7000       500     500

Table 17: Statistics of Shakespeare for the discriminator Ds.

                  Training   Test    Validation
Shakespeare       3500       500     500
Non-Shakespeare   7000       500     500

Table 18: Statistics of Shakespeare for the evaluation classifier.


Original Sentence   回眸一笑 就 好                      (It is enough to look back and smile)
STB (with Cyc)      回眸一笑 就 好 了                   (It would be just fine to look back and smile)
Ours                回眸一笑 , 勿念 。                  (Look back and smile, please do not miss me.)

Original Sentence   得过且过 吧 !                      (Just live with it!)
STB (with Cyc)      想不开 吧 , 我 的 吧 。            (I just take things too hard. *)
Ours                爱到深处 , 随遇而安 。             (Love to the depths, enjoy myself wherever I am.)

Original Sentence   自己 的 幸福 给 别人 了             (Give up your happiness to others)
STB (with Cyc)      自己 的 幸福 给 别人 , 你 的 。     (Give up your happiness to others. *)
Ours                自己 的 幸福 是 自己 , 自己 的 。   (Leave some happiness to yourself, yourself.)

Table 19: Example sentences on Chat transferred into a romantic style. English translations are given in parentheses (* denotes that the sentence has grammar mistakes in Chinese).