Learning to Encode Text as Human-Readable Summaries using Generative Adversarial Networks

Yau-Shian Wang
National Taiwan University
[email protected]

Hung-Yi Lee
National Taiwan University
[email protected]

Abstract

Auto-encoders compress input data into a latent-space representation and reconstruct the original data from the representation. This latent representation is not easily interpreted by humans. In this paper, we propose training an auto-encoder that encodes input text into human-readable sentences, and unsupervised abstractive summarization is thereby achieved. The auto-encoder is composed of a generator and a reconstructor. The generator encodes the input text into a shorter word sequence, and the reconstructor recovers the generator input from the generator output. To make the generator output human-readable, a discriminator restricts the output of the generator to resemble human-written sentences. By taking the generator output as the summary of the input text, abstractive summarization is achieved without document-summary pairs as training data. Promising results are shown on both English and Chinese corpora.

1 Introduction

When it comes to learning data representations, a popular approach involves the auto-encoder architecture, which compresses the data into a latent representation without supervision. In this paper we focus on learning text representations. Because text is a sequence of words, a sequence-to-sequence (seq2seq) auto-encoder (Li et al., 2015; Kiros et al., 2015) is usually used, in which an RNN encodes the input sequence into a fixed-length representation, after which another RNN decodes the original input sequence from this representation.

Although the latent representation learned by the seq2seq auto-encoder can be used in downstream applications, it is usually not human-readable. In this work, we use comprehensible natural language as the latent representation of the input source text in an auto-encoder architecture. This human-readable latent representation is shorter than the source text; in order to reconstruct the source text, it must reflect the core idea of the source text. Intuitively, the latent representation can be considered a summary of the text, so unsupervised abstractive summarization is thereby achieved.

The idea of using human-comprehensible language as a latent representation has been explored for text summarization, but only in a semi-supervised scenario. Previous work (Miao and Blunsom, 2016) uses a prior distribution from a pre-trained language model to constrain the generated sequence to natural language. However, to teach the compressor network to generate text summaries, the model is trained using labeled data. In contrast, in this work we need no labeled data to learn the representation.

As shown in Fig. 1, the proposed model is composed of three components: a generator, a discriminator, and a reconstructor. Together, the generator and reconstructor form a text auto-encoder. The generator acts as an encoder, generating the latent representation from the input text. Instead of using a vector as the latent representation, however, the generator generates a word sequence much shorter than the input text. From this shorter text, the reconstructor reconstructs the original input of the generator. By minimizing the reconstruction error, the generator learns to generate short text segments that contain the main information of the original input. We use the seq2seq model for both the generator and the reconstructor because both have input and output sequences of different lengths.

However, it is very possible that the generator's output word sequence can only be processed and recognized by the reconstructor but is not readable by humans. Here, instead of regularizing the generator output with a pre-trained language model (Miao and Blunsom, 2016), we borrow from adversarial auto-encoders (Makhzani et al., 2015) and CycleGAN (Zhu et al., 2017) and introduce a third component, the discriminator, to regularize the generator's output word sequence. The discriminator and the generator form a generative adversarial network (GAN) (Goodfellow et al., 2014). The discriminator discriminates between the generator output and human-written sentences, and the generator produces output as similar as possible to human-written sentences to confuse the discriminator. Within the GAN framework, the discriminator teaches the generator how to create human-like summary sentences as a latent representation. However, due to the non-differentiable nature of discrete distributions, generating discrete sequences with GANs is challenging. To tackle this problem, in this work we propose a new method for language generation with GANs.

By achieving unsupervised abstractive text summarization, the machine can extract the core idea of documents without supervision. This approach has many potential applications. For example, the output of the generator can be used for downstream tasks like document classification and sentiment classification. In this study, we evaluate the results on an abstractive text summarization task. The output word sequence of the generator is regarded as the summary of the input text. The model is learned from a set of documents without summaries. As most documents are not paired with summaries, for example movie reviews or lecture recordings, this technique makes it possible to learn a summarizer that generates summaries for these documents. The results show that the generator generates summaries of reasonable quality on both English and Chinese corpora.

Figure 1: Proposed model. Given a long text, the generator produces a shorter text as a summary. The generator is learned by minimizing the reconstruction loss together with the reconstructor and by making the discriminator regard its output as human-written text.

2 Related Work

Abstractive Text Summarization
Recent model architectures for abstractive text summarization basically use the sequence-to-sequence (Sutskever et al., 2014) framework in combination with various novel mechanisms. One popular mechanism is attention (Bahdanau et al., 2015), which has been shown to be helpful for summarization (Nallapati et al., 2016; Rush et al., 2015; Chopra et al., 2016). It is also possible to directly optimize evaluation metrics such as ROUGE (Lin, 2004) with reinforcement learning (Ranzato et al., 2016; Paulus et al., 2017; Bahdanau et al., 2016). The hybrid pointer-generator network (See et al., 2017) selects words from the original text with a pointer (Vinyals et al., 2015) or from the whole vocabulary with a trained weight. In order to eliminate repetition, a coverage vector (Tu et al., 2016) can be used to keep track of attended words, and coverage loss (See et al., 2017) can be used to encourage the model to focus on diverse words. While most papers focus on supervised learning with novel mechanisms, in this paper we explore unsupervised training models.

GAN for Language Generation
In this paper, we borrow the idea of GAN to make the generator output human-readable. The major challenge in applying GAN to sentence generation is the discrete nature of natural language. To generate a word sequence, the generator usually has non-differentiable parts such as argmax or other sampling functions, which cause the original GAN to fail.

In (Gulrajani et al., 2017), instead of feeding a discrete word sequence, the authors directly feed the generator output layer to the discriminator. This method works because they use the earth mover's distance for GAN as proposed in (Arjovsky et al., 2017), which is able to evaluate the distance between a discrete and a continuous distribution. SeqGAN (Yu et al., 2017) tackles the sequence generation problem with reinforcement learning. Here, we refer to this approach as adversarial REINFORCE. However, the discriminator only measures the quality of the whole sequence, so the rewards are extremely sparse and the rewards assigned to all generation steps are the same. MC search (Yu et al., 2017) is proposed to evaluate the approximate reward at each time step, but this method suffers from high time complexity. Following this idea, Li et al. (2017) propose a partial-evaluation approach to evaluate the expected reward at each time step. In this paper, we propose the self-critical adversarial REINFORCE algorithm as another way to evaluate the expected reward at each time step. The performance of the original WGAN and the proposed adversarial REINFORCE is compared in the experiments.

3 Proposed Method

The overview of the proposed model is shown in Fig. 2. The model is composed of three components: generator $G$, discriminator $D$, and reconstructor $R$. Both $G$ and $R$ are seq2seq hybrid pointer-generator networks (See et al., 2017), which can either copy words from the encoder input text via pointing or generate words from the vocabulary. Both take a word sequence as input and output a sequence of word distributions. Discriminator $D$, on the other hand, takes a sequence as input and outputs a scalar. The model is learned from a set of documents $x$ and human-written sentences $y^{real}$.

To train the model, a training document $x = \{x_1, x_2, ..., x_t, ..., x_T\}$, where $x_t$ represents a word, is fed to $G$, which outputs a sequence of word distributions $G(x) = \{y_1, y_2, ..., y_n, ..., y_N\}$, where $y_n$ is a distribution over all words in the lexicon. We then sample a word $y^s_n$ from each distribution $y_n$, obtaining a word sequence $y^s = \{y^s_1, y^s_2, ..., y^s_N\}$ according to $G(x)$. We feed the sampled word sequence $y^s$ to the reconstructor $R$, which outputs another sequence of word distributions $\hat{x}$. The reconstructor $R$ reconstructs the original text $x$ from $y^s$. That is, we seek a reconstructor output $\hat{x}$ that is as close to the original text $x$ as possible; hence the loss for training the reconstructor, $R_{loss}$, is defined as

$$R_{loss} = \sum_{k=1}^{K} l_s(x, \hat{x}) \qquad (1)$$

where the reconstruction loss $l_s(x, \hat{x})$ is the cross-entropy loss computed between the reconstructor output sequence $\hat{x}$ and the source text $x$, or the negative conditional log-likelihood of the source text $x$ given the word sequence $y^s$ sampled from $G(x)$. The reconstructor output sequence $\hat{x}$ is teacher-forced by the source text $x$. The subscript $s$ in $l_s(x, \hat{x})$ indicates that $\hat{x}$ is reconstructed from $y^s$. $K$ is the number of training documents, and (1) is the summation of the cross-entropy loss over all training documents $x$.
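As a concrete illustration of how $l_s(x, \hat{x})$ can be computed, the sketch below implements teacher-forced cross-entropy in PyTorch. It is only a minimal sketch under assumed tensor shapes and a hypothetical `reconstructor` interface; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(reconstructor, y_sample, x, pad_id=0):
    """Teacher-forced cross-entropy l_s(x, x_hat) per document.

    y_sample: LongTensor [B, N]  -- summary words sampled from G(x)
    x:        LongTensor [B, T]  -- source documents (reconstruction targets)
    Returns a tensor of shape [B] with one loss value per document.
    """
    # The reconstructor reads the sampled summary and is teacher-forced with x:
    # at step t it conditions on x[:, :t] and predicts x[:, t].
    # `reconstructor` and its keyword argument are hypothetical placeholders.
    logits = reconstructor(y_sample, teacher_input=x[:, :-1])   # [B, T-1, V]
    targets = x[:, 1:]                                          # [B, T-1]
    nll = F.cross_entropy(
        logits.transpose(1, 2), targets,
        ignore_index=pad_id, reduction="none")                  # [B, T-1]
    return nll.sum(dim=1)
```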

In the proposed model, the generator $G$ and reconstructor $R$ form an auto-encoder. However, the reconstructor $R$ does not directly take the generator output distribution $G(x)$ as input.¹ Instead, the reconstructor takes a sampled discrete sequence $y^s$ as input. Due to the non-differentiable property of discrete sequences, we apply the REINFORCE algorithm, which is described in Section 4.

In addition to reconstruction, we need the discriminator $D$ to discriminate between the real sequence $y^{real}$ and the generated sequence $y^s$, so as to regularize the generated sequence to follow the distribution of human-written summaries. $D$ learns to give $y^{real}$ higher scores while giving $y^s$ lower scores. The loss for training the discriminator $D$ is denoted $D_{loss}$; this is further described in Section 5. $G$ learns to minimize the reconstruction error $R_{loss}$ while maximizing the loss of the discriminator $D$ by generating a summary sequence $y^s$ that $D$ cannot distinguish from real human-written sentences. The loss for the generator, $G_{loss}$, is

$$G_{loss} = \alpha R_{loss} - D'_{loss} \qquad (2)$$

where $D'_{loss}$ is closely related to $D_{loss}$, but not necessarily the same,² and $\alpha$ is a hyper-parameter. After obtaining the optimal generator by minimizing (2), we use it to generate summaries.

Generator $G$ and discriminator $D$ together form a GAN. We use two different adversarial training methods to train $D$ and $G$; as shown in Fig. 2, these two methods have their own discriminators, discriminator 1 and discriminator 2. Discriminator 1 takes the generator output layer $G(x)$ as input, whereas discriminator 2 takes the sampled discrete word sequence $y^s$ as input. The two methods are described in Sections 5.1 and 5.2, respectively.

4 Minimizing Reconstruction Error

Because discrete sequences are non-differentiable, we use the REINFORCE algorithm. The generator is treated as an agent whose reward given the source text $x$ is $-l_s(x, \hat{x})$. Maximizing the reward is equivalent to minimizing the reconstruction loss $R_{loss}$ in (1). However, the reconstruction loss varies widely from sample to sample, and thus the rewards to the generator are not stable either. Hence we add a baseline to reduce this variance.

¹ We found that if the reconstructor $R$ directly takes $G(x)$ as input, the generator $G$ learns to put the information about the input text in the distribution of $G(x)$, making it difficult to sample meaningful sentences from $G(x)$.

² $D'_{loss}$ has different formulations in different approaches; this will become clear in Sections 5.1 and 5.2.


Figure 2: Architecture of the proposed model. The generator and reconstructor networks are seq2seq hybrid pointer-generator networks, but for simplicity we omit the pointer and attention parts.

We apply self-critical sequence training (Rennie et al., 2017); the modified reward $r^R(x, \hat{x})$ from the reconstructor $R$ with the baseline for the generator is

$$r^R(x, \hat{x}) = -l_s(x, \hat{x}) - (-l_a(x, \hat{x}) - b) \qquad (3)$$

where $-l_a(x, \hat{x}) - b$ is the baseline. $l_a(x, \hat{x})$ is the same cross-entropy reconstruction loss as $l_s(x, \hat{x})$, except that $\hat{x}$ is obtained from $y^a$ instead of $y^s$. Here $y^a$ is the word sequence $\{y^a_1, y^a_2, ..., y^a_n, ..., y^a_N\}$, where $y^a_n$ is selected with the argmax function from the generator output distribution $y_n$. Because in the early training stage the sequence $y^s$ barely yields a higher reward than the sequence $y^a$, to encourage exploration we introduce a second baseline score $b$, which gradually decreases to zero. The generator is then updated using the REINFORCE algorithm with reward $r^R(x, \hat{x})$ to minimize $R_{loss}$.
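To make the reward computation concrete, the sketch below evaluates $r^R(x, \hat{x})$ from the two reconstruction losses, with the exploration baseline $b$ decayed toward zero (the appendix states it goes from 0.25 to zero within 10,000 updates; the linear schedule here is an assumption). The function names are illustrative rather than the authors' code.

```python
def exploration_baseline(step, b0=0.25, decay_steps=10000):
    """Baseline b in (3): decays from b0 to 0 over the first decay_steps
    generator updates (a linear schedule is assumed here)."""
    return max(0.0, b0 * (1.0 - step / decay_steps))

def self_critical_reward(ls, la, step):
    """r^R(x, x_hat) = -l_s - (-l_a - b).

    ls: reconstruction loss when R reads the *sampled* summary y^s
    la: reconstruction loss when R reads the *argmax* summary y^a
    A positive reward means sampling beat greedy decoding by more than b.
    """
    b = exploration_baseline(step)
    return -ls - (-la - b)
```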

5 GAN Training

With adversarial training, the generator learns to produce sentences as similar to human-written sentences as possible. Here, we conduct experiments on two methods of language generation with GAN. In Section 5.1 we directly feed the generator output probability distributions to the discriminator and use a Wasserstein GAN (WGAN) with a gradient penalty. In Section 5.2, we explore adversarial REINFORCE, which feeds sampled discrete word sequences to the discriminator and uses the discriminator's assessment of sequence quality as a reward signal for the generator.

5.1 Method 1: Wasserstein GAN

In the lower left of Fig. 2, the discriminator of this method is shown as discriminator 1, $D_1$. $D_1$ is a deep CNN with residual blocks, which takes a sequence of word distributions as input and outputs a score. The discriminator loss $D_{loss}$ is

$$D_{loss} = \frac{1}{K}\sum_{k=1}^{K} D_1(G(x^{(k)})) - \frac{1}{K}\sum_{k=1}^{K} D_1(y^{real(k)}) + \beta_1 \frac{1}{K}\sum_{k=1}^{K} \big( \lVert \nabla_{y^{i(k)}} D_1(y^{i(k)}) \rVert - 1 \big)^2,$$

where $K$ denotes the number of training examples in a batch, and $k$ denotes the $k$-th example. The last term is the gradient penalty (Gulrajani et al., 2017). We interpolate between the generator output layer $G(x)$ and the real sample $y^{real}$, and apply the gradient penalty to the interpolated sequence $y^i$. $\beta_1$ determines the gradient penalty scale. In Equation (2), for WGAN, the generator maximizes $D'_{loss}$:

$$D'_{loss} = \frac{1}{K}\sum_{k=1}^{K} D_1(G(x^{(k)})). \qquad (4)$$
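For reference, the discriminator objective above can be sketched as a standard WGAN-GP update (Gulrajani et al., 2017): interpolate between the generator's output distributions and one-hot encodings of the real sentences $y^{real}$, then penalize interpolated points whose gradient norm deviates from 1. The one-hot encoding of real data and the tensor shapes below are assumptions for illustration; this is not the authors' released code.

```python
import torch

def wgan_gp_d_loss(D1, g_out, y_real_onehot, beta1=10.0):
    """Discriminator 1 loss: mean D1(G(x)) - mean D1(y_real) + beta1 * GP.

    g_out:         [B, N, V] word distributions from the generator output layer
    y_real_onehot: [B, N, V] one-hot encodings of real human-written sentences
    """
    d_fake = D1(g_out).mean()
    d_real = D1(y_real_onehot).mean()

    # Interpolate between fake and real sequences for the gradient penalty.
    eps = torch.rand(g_out.size(0), 1, 1, device=g_out.device)
    y_interp = (eps * g_out + (1.0 - eps) * y_real_onehot).requires_grad_(True)
    grad = torch.autograd.grad(
        outputs=D1(y_interp).sum(), inputs=y_interp, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()

    return d_fake - d_real + beta1 * gp
```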

5.2 Method 2: Self-Critic Adversarial REINFORCE

In this section, we describe the proposed adversarial REINFORCE method in detail. The core idea is to use an LSTM discriminator to evaluate the current quality of the generated sequence $\{y^s_1, y^s_2, ..., y^s_i\}$ at each time step $i$. Because the generator knows, at each step, whether the generated sentence improves or worsens compared with the previous step, it can locate the problematic generation step in a long sequence and thus fix the problem more easily.

5.2.1 Discriminator 2

As shown in Fig. 2, discriminator 2, $D_2$, is a unidirectional LSTM network which takes a discrete word sequence as input. At time step $i$, given the input word $y^s_i$, it predicts the current score $s_i$ based on the sequence $\{y_1, y_2, ..., y_i\}$. The score is viewed as the quality of the current sequence. An example from a discriminator regularized by weight clipping (Arjovsky et al., 2017) is shown in Fig. 3.

Figure 3: When the second "arrested" appears, the sentence becomes ungrammatical, and the discriminator determines that this example comes from the generator. Hence, after this time step, it outputs low scores.

To compute the discriminator loss $D_{loss}$, we average the scores $\{s_1, s_2, ..., s_N\}$ over the whole sequence $y^s$ to yield

$$D_2(y^s) = \frac{1}{N}\sum_{n=1}^{N} s_n,$$

where $N$ denotes the generated sequence length. Then, the loss of the discriminator is

$$D_{loss} = \frac{1}{K}\sum_{k=1}^{K} D_2(y^{s(k)}) - \frac{1}{K}\sum_{k=1}^{K} D_2(y^{real(k)}) + \beta_2 \frac{1}{K}\sum_{k=1}^{K} \big( \lVert \nabla_{y^{i(k)}} D_2(y^{i(k)}) \rVert - 1 \big)^2,$$

where, as in the previous section, the last term is the gradient penalty. With this loss, the discriminator attempts to determine as quickly as possible whether the current sequence is real or fake: the earlier the time step at which the discriminator correctly distinguishes the sequence, the lower its loss.
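A minimal sketch of such a per-step discriminator is shown below: a unidirectional LSTM reads the word sequence and emits a score $s_i$ at every step, and $D_2(y)$ is the mean of these scores. The embedding size and output head are assumptions for illustration (the appendix only specifies a single-layer unidirectional LSTM with hidden size 512); this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class StepDiscriminator(nn.Module):
    """Discriminator 2: scores the sequence prefix {y_1, ..., y_i} at each step i."""

    def __init__(self, vocab_size, emb_dim=300, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)  # unidirectional
        self.score = nn.Linear(hidden, 1)

    def forward(self, y):
        """y: LongTensor [B, N] -> per-step scores s [B, N] and D2(y) [B]."""
        h, _ = self.lstm(self.embed(y))   # [B, N, hidden]; h_i sees only y_1..y_i
        s = self.score(h).squeeze(-1)     # per-step scores s_i
        return s, s.mean(dim=1)           # D2(y) = (1/N) * sum_n s_n
```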

5.2.2 Self-Critical Generator

Since we feed a discrete sequence $y^s$ to the discriminator, the gradient from the discriminator cannot directly back-propagate to the generator. Here, we use the policy gradient method. At time step $i$, we use the score $s_{i-1}$ from the discriminator at time step $i-1$ as a self-critical baseline. The reward $r^D_i$ evaluates whether the quality of the sequence at time step $i$ is better or worse than that at time step $i-1$. The generator reward $r^D_i$ from $D_2$ is

$$r^D_i = \begin{cases} s_i & \text{if } i = 1 \\ s_i - s_{i-1} & \text{otherwise.} \end{cases}$$

However, some sentences may be judged as bad at an earlier time step but as good at later time steps, and vice versa. Hence we use the discounted expected reward with discount factor $\gamma$ to calculate the discounted reward $d_i$ at time step $i$ as

$$d_i = \sum_{j=i}^{N} \gamma^{\,j-i}\, r^D_j.$$

The adversarial REINFORCE term related to the discriminator, $D'_{loss}$ in (2), is

$$D'_{loss} = -\mathbb{E}_{y^s_i \sim p_G(y^s_i \mid y^s_1, ..., y^s_{i-1}, x)}[d_i]. \qquad (5)$$

We use the likelihood-ratio trick to approximate the gradient to minimize (5).
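Putting Sections 5.2.1 and 5.2.2 together, a sketch of the resulting generator update is given below: per-step rewards are the score differences $s_i - s_{i-1}$, discounted returns $d_i$ are accumulated from the end of the sequence, and the likelihood-ratio (policy-gradient) loss weights each sampled word's log-probability by its return. Tensor shapes are assumed for illustration; this is not the authors' code.

```python
import torch

def self_critic_generator_loss(log_probs, scores, gamma=0.99):
    """REINFORCE loss approximating (5).

    log_probs: [B, N] log p_G(y^s_i | y^s_<i, x) of the sampled summary words
    scores:    [B, N] per-step discriminator scores s_i from D2
    """
    # Per-step reward: r_1 = s_1, r_i = s_i - s_{i-1} for i > 1.
    prev = torch.cat([torch.zeros_like(scores[:, :1]), scores[:, :-1]], dim=1)
    r = scores - prev

    # Discounted return d_i = sum_{j >= i} gamma^(j-i) * r_j, right to left.
    d = torch.zeros_like(r)
    running = torch.zeros_like(r[:, 0])
    for i in reversed(range(r.size(1))):
        running = r[:, i] + gamma * running
        d[:, i] = running

    # Likelihood-ratio trick: gradients flow only through log_probs.
    return -(log_probs * d.detach()).sum(dim=1).mean()
```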

6 Experiment

Our model was evaluated on the English and Chinese Gigaword datasets and on the CNN/Daily Mail dataset. The experiments in Sections 6.1, 6.2, and 6.4 were conducted on English Gigaword, while the experiments in Sections 6.3 and 6.6 were conducted on the CNN/Daily Mail and Chinese Gigaword datasets, respectively. We used ROUGE (Lin, 2004) as our evaluation metric. During testing, when using the generator to generate summaries, we used beam search with a beam size of 5, and we eliminated repetition. We provide the details of the implementation and corpus pre-processing in Appendices A and B, respectively.

Before jointly training the whole model, we pre-trained the three major components (generator, discriminator, and reconstructor) separately. First, we pre-trained the generator in an unsupervised manner so that the generator would be able to somewhat grasp the semantic meaning of the source text. The details of the pre-training are in Appendix C. We pre-trained the discriminator and reconstructor, respectively, with the pre-trained generator's output to ensure that these two critic networks provide good feedback to the generator.

6.1 English Gigaword

The English Gigaword is a sentence summarization dataset which contains the first sentence of each article and its corresponding headline. The preprocessed corpus contains 3.8M training pairs and 400K validation pairs.


Task | Labeled | Methods | R-1 | R-2 | R-L
(A) Supervised | 3.8M | (A-1) Supervised training on generator | 33.19 | 14.21 | 30.50
(A) Supervised | 3.8M | (A-2) (Rush et al., 2015)= | 29.76 | 11.88 | 26.96
(A) Supervised | 3.8M | (A-3) (Chopra et al., 2016)= | 33.78 | 15.97 | 31.15
(A) Supervised | 3.8M | (A-4) (Zhou et al., 2017)= | 36.15 | 17.54 | 33.63
(B) Trivial baseline | 0 | (B-1) Lead-8 | 21.86 | 7.66 | 20.45
(C) Unsupervised | 0 | (C-1) Pre-trained generator | 21.26 | 5.60 | 18.89
(C) Unsupervised | 0 | (C-2) WGAN | 28.09 | 9.88 | 25.06
(C) Unsupervised | 0 | (C-3) Adversarial REINFORCE | 28.11 | 9.97 | 25.41
(D) Semi-supervised | 10K | (D-1) WGAN | 29.17 | 10.54 | 26.72
(D) Semi-supervised | 10K | (D-2) Adversarial REINFORCE | 30.01 | 11.57 | 27.61
(D) Semi-supervised | 500K | (D-3) (Miao and Blunsom, 2016)= | 30.14 | 12.05 | 27.99
(D) Semi-supervised | 500K | (D-4) WGAN | 32.50 | 13.65 | 29.67
(D) Semi-supervised | 500K | (D-5) Adversarial REINFORCE | 33.33 | 14.18 | 30.48
(D) Semi-supervised | 1M | (D-6) (Miao and Blunsom, 2016)= | 31.09 | 12.79 | 28.97
(D) Semi-supervised | 1M | (D-7) WGAN | 33.18 | 14.19 | 30.69
(D) Semi-supervised | 1M | (D-8) Adversarial REINFORCE | 34.21 | 15.16 | 31.64
(E) Transfer learning | 0 | (E-1) Pre-trained generator | 21.49 | 6.28 | 19.34
(E) Transfer learning | 0 | (E-2) WGAN | 25.11 | 7.94 | 23.05
(E) Transfer learning | 0 | (E-3) Adversarial REINFORCE | 27.15 | 9.09 | 24.11

Table 1: Average F1 ROUGE scores on English Gigaword. R-1, R-2, and R-L refer to ROUGE-1, ROUGE-2, and ROUGE-L, respectively. Results marked with = are obtained from the corresponding papers. In part (A), the model was trained with supervision. In row (B-1), we select the article's first eight words as its summary. Part (C) shows the results obtained without paired data. In part (D), we trained our model with a small amount of labeled data. In part (E), we pre-trained the generator on CNN/Daily Mail and used the summaries from CNN/Daily Mail as real data for the discriminator.

We trained our model on part of, or all of, the 3.8M training set as unpaired data. For a fair comparison with previous work, the following experiments were evaluated on the same 2K testing set as (Rush et al., 2015; Miao and Blunsom, 2016). We used the sentences in article headlines as real data for the discriminator.³ As shown in the following experiments, the headlines can even come from another set of documents unrelated to the training documents.

The results on English Gigaword are shown in Table 1. WGAN and adversarial REINFORCE refer to the adversarial training methods described in Sections 5.1 and 5.2, respectively. Results trained with the full labeled data are in part (A). In row (A-1), we trained our generator with supervised training. Compared with previous work (Zhou et al., 2017), we used a simpler model and a smaller vocabulary size. We did not try to achieve state-of-the-art results because the focus of this work is unsupervised learning, and the proposed approach is independent of the summarization model used. In row (B-1), we simply took the first eight words of a document as its summary.

³ Instead of using general sentences as real data for the discriminator, we chose sentences from headlines because they have their own unique distribution.

The result for the generator pre-trained with the method described in Appendix C is shown in row (C-1). In part (C), we directly took the sentences in the Gigaword summaries as the training data for the discriminator. Compared with the pre-trained generator and the trivial baseline, the proposed unsupervised approaches (rows (C-2) and (C-3)) show a clear improvement. We provide a real example in Fig. 4; more examples can be found in Appendix D.

6.2 Semi-Supervised Learning

In semi-supervised training, the generator was pre-trained with the few available labeled examples. During training, we conducted teacher-forcing on the generator with labeled data every several unsupervised updates. With 10K, 500K, and 1M labeled examples, teacher-forcing was conducted every 25, 5, and 3 unsupervised updates, respectively. In teacher-forcing, given the source text as input, the generator was forced to predict the human-written summary of the source text. Teacher-forcing can be regarded as a regularization of unsupervised training that prevents the generator from producing unreasonable summaries of the source text. We found that if we teacher-forced the generator too frequently, it would overfit the training data, since we used only very little labeled data in semi-supervised training.

The performance of the semi-supervised model on English Gigaword with respect to the amount of available labeled data is shown in Table 1, part (D). We compared our results with (Miao and Blunsom, 2016), the previous state-of-the-art method on the semi-supervised summarization task under the same amount of labeled data. With both 500K and 1M labeled examples, our method performed better. Furthermore, with only 1M labeled examples, adversarial REINFORCE even outperformed supervised training in Table 1 (A-1), which used the whole 3.8M labeled set.

Figure 4: Real examples produced by the methods referred to in Table 1. The proposed methods generate summaries that grasp the core idea of the articles.

6.3 CNN/Daily Mail dataset

The CNN/Daily Mail dataset is a long text summarization dataset composed of news articles paired with summaries. We evaluated our model on this dataset because it is a popular benchmark, and we want to know whether the proposed model works on long input and output sequences. The details of corpus pre-processing can be found in Appendix B. In unsupervised training, to prevent the model from directly matching the input articles to their corresponding summaries, we split the training pairs into two equal sets: one set supplied only articles and the other supplied only summaries.

The results are shown in Table 2. For the supervised approaches in part (A), although our seq2seq model is similar to (See et al., 2017), the smaller vocabulary size (we did not tackle out-of-vocabulary words), simpler model architecture, and shorter output length of generated summaries lead to a performance gap between our model and the scores reported in (See et al., 2017). Compared to the lead-3 baseline in part (B), which takes the first three sentences of articles as summaries, the seq2seq models fall behind. This is because news writers often put the most important information in the first few sentences, so even the best abstractive summarization models only slightly beat the lead-3 baseline on ROUGE scores. However, during pre-training and training we did not assume that the most important sentences are the first few sentences.

We observed that our unsupervised model yields a decent ROUGE-1 score but lower ROUGE-2 and ROUGE-L scores. This is probably because our generated sequences are shorter than the ground truth and our vocabulary size is small. Another reason is that the generator is good at selecting the most important words from the articles but sometimes fails to combine them into reasonable sentences, since it is still difficult for GANs to generate long sequences. In addition, since the reconstructor only evaluates the reconstruction loss of the whole sequence, the reconstruction reward for the generator becomes extremely sparse as the generated sequence grows longer. However, compared to the pre-trained generator (rows (C-2), (C-3) vs. (C-1)), our model still improved the ROUGE scores. A real example of a generated summary can be found in Appendix D, Fig. 11.

6.4 Transfer Learning

The experiments conducted up to this point required headlines that were unpaired with the documents but in the same domain to train the discriminator. In this subsection, we generate summaries for English Gigaword (target domain), but the summaries for the discriminator are from the CNN/Daily Mail dataset (source domain).

The results of transfer learning are shown in Table 1, part (E). Row (E-1) is the result of the pre-trained generator; the poor pre-training result indicates that the data distributions of the two datasets are quite different.


Task | Methods | R-1 | R-2 | R-L
(A) Supervised | (A-1) Supervised training on our generator | 38.89 | 13.74 | 29.42
(A) Supervised | (A-2) (See et al., 2017)= | 39.53 | 17.28 | 36.38
(B) | Lead-3 baseline (See et al., 2017)= | 40.34 | 17.70 | 36.57
(C) Unsupervised | (C-1) Pre-trained generator | 29.86 | 5.14 | 14.66
(C) Unsupervised | (C-2) WGAN | 35.14 | 9.43 | 21.04
(C) Unsupervised | (C-3) Adversarial REINFORCE | 35.51 | 9.38 | 20.98

Table 2: F1 ROUGE scores on the CNN/Daily Mail dataset. In row (B), the first three sentences were taken as summaries. Part (C) shows the results obtained without paired data. The results marked with = are obtained directly from the corresponding papers.

Methods | R-1 | R-2 | R-L
(A) Training with paired data (supervised) | 49.62 | 34.10 | 46.42
(B) Lead baseline (See et al., 2017) | 30.08 | 18.24 | 27.74
(C) Unsupervised: (C-1) Pre-trained generator | 28.36 | 16.73 | 26.48
(C) Unsupervised: (C-2) WGAN | 38.15 | 24.60 | 35.27
(C) Unsupervised: (C-3) Adversarial REINFORCE | 41.25 | 26.54 | 37.76

Table 3: F1 ROUGE scores on Chinese Gigaword. In row (B), we selected the article's first fifteen words as its summary. Part (C) shows the results obtained without paired data.

We find that using sentences from another dataset yields lower ROUGE scores on the target testing set (parts (E) vs. (C)) due to the mismatched word distributions between the summaries of the source and target domains. However, the discriminator still regularizes the generated word sequence. After unsupervised training, the model improved the ROUGE scores of the pre-trained model (rows (E-2), (E-3) vs. (E-1)) and also surpassed the trivial baseline in part (B).

6.5 GAN Training

In this section, we discuss the performance of the two GAN training methods. As shown in Table 1, on English Gigaword our proposed adversarial REINFORCE method performed better than WGAN. However, in Table 2, our proposed method was slightly outperformed by WGAN. In addition, we find that training with WGAN converges faster. Because WGAN directly evaluates the distance between the continuous distribution from the generator and the discrete distribution from real data, the distribution is sharpened at an early stage of training. This causes the generator to converge to a relatively poor solution. On the other hand, when training with REINFORCE, the generator keeps seeking the network parameters that better fool the discriminator. We believe that training GANs for language generation with this method is worth exploring.

6.6 Chinese Gigaword

The Chinese Gigaword is a long text summarization dataset composed of paired headlines and news. Unlike the input news in English Gigaword, the news in Chinese Gigaword consists of several sentences. The results are shown in Table 3. Row (A) lists the results of using 1.1M document-summary pairs to directly train the generator without the reconstructor and discriminator; this is the upper bound of the proposed approach. In row (B), we simply took the first fifteen words of a document as its summary. The number of words was chosen to optimize the evaluation metrics. Part (C) shows the results obtained in the unsupervised scenario without paired data. The discriminator took the summaries in the training set as real data. We show the results of the pre-trained generator in row (C-1); rows (C-2) and (C-3) are the results for the two GAN training methods, respectively. We find that despite the performance gap between the unsupervised and supervised methods (rows (C-2), (C-3) vs. (A)), the proposed method yields much better performance than the trivial baseline (rows (C-2), (C-3) vs. (B)).

7 Conclusion and Future Work

Using GAN, we propose a model that encodes text as a human-readable summary, learned without document-summary pairs. Promising results are obtained on both Chinese and English corpora. In future work, we hope to use extra discriminators to control the style and sentiment of the generated summaries.


References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016. An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. HLT-NAACL.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial networks. arXiv preprint arXiv:1406.2661.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. NIPS.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. ACL.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: ACL Workshop.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. EMNLP.

Ramesh Nallapati, Bowen Zhou, Cicero Nogueira dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. EMNLP.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence level training with recurrent neural networks. ICLR.

Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. CVPR.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. EMNLP.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. ACL.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. NIPS.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. ACL.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. NIPS.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. AAAI.

Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. 2017. Selective encoding for abstractive sentence summarization. ACL.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593.


A Implementation

Network Architecture. The model architectures of the generator and reconstructor are almost the same except for the lengths of the input and output sequences. We adapt the model architecture of our generator and reconstructor from (See et al., 2017), who used a hybrid pointer network for text summarization. The hybrid pointer networks of the generator and reconstructor are each composed of two one-layer unidirectional LSTMs as encoder and decoder, respectively, with a hidden layer size of 600. Since we use two adversarial training methods, there are two discriminators with different model architectures. In Section 5.1, the discriminator is composed of four residual blocks with 512 hidden dimensions, while in Section 5.2 we use a single one-layer unidirectional LSTM with a hidden size of 512 as our discriminator.

Details of Training. In all experiments except Section 6.4, we set the weight $\alpha$ in (2) controlling $R_{loss}$ to 25. In Section 6.4, to prevent the generator from overfitting to sentences from CNN/Daily Mail summaries, we set the weight $\alpha$ to 50, larger than in the other experiments. We find that if the value of $\alpha$ is too large, the generator starts to generate output unlike human-written sentences. On the other hand, if the value of $\alpha$ is too small, the sentences generated by the generator sometimes become unrelated to the generator's input text. For all the experiments, the baseline $b$ in (3) gradually decreases from 0.25 to zero within 10,000 generator updates.

We set the weight $\beta_1$ of the gradient penalty in Section 5.1 to 10, and used the RMSProp optimizer with learning rates of 0.00001 and 0.001 on the generator and discriminator, respectively. In Section 5.2.1, the weight $\beta_2$ of the gradient penalty term was 1.0, and we again used the RMSProp optimizer with learning rates of 0.00001 and 0.001 on the generator and discriminator, respectively. It is also feasible to apply weight clipping in discriminator training, but the gradient penalty trick performed better.

B Corpus Pre-processing

• English Gigaword: We used the script of (Rush et al., 2015) to construct our training and testing datasets. The vocabulary size was set to 15K in all experiments.

• CNN/Daily Mail: We obtained 287,227 training pairs, 13,368 validation pairs, and 11,490 testing pairs identical to (See et al., 2017) by using the scripts provided by (See et al., 2017). To make our model easier to train, during training and testing we truncated input articles to 250 tokens (original articles have 781 tokens on average) and restricted the length of generated summaries (original summaries have 56 tokens on average) to 50 tokens. The vocabulary size was set to 15K.

• Chinese Gigaword: The Chinese Gigaword is a long text summarization dataset composed of 2.2M paired headlines and news articles. We preprocessed the raw data as follows. First, we selected the 4K most frequent Chinese characters to form our vocabulary. We filtered out headline-news pairs with excessively long or short news segments, or that contained too many out-of-vocabulary Chinese characters, yielding 1.1M headline-news pairs, from which we randomly selected 5K pairs as our testing set, 5K pairs as our validation set, and the remaining pairs as our training set. During training and testing, the generator took the first 80 Chinese characters of the source text as input.

C Model Pre-training

As we found that different pre-training methods for the generator dramatically influenced final performance in all of the experiments, it was important to find a proper unsupervised pre-training method to help the machine grasp semantic meaning. The summarization tasks on the two datasets are different: one is sentence summarization, while the other is long text summarization. Therefore, we used different pre-training strategies for the two datasets, as described below.

• CNN/Daily Mail: The CNN/Daily Mail is a long text summarization dataset in which the source text consists of several sentences. Given the previous $i-1$ sentences $sent_0, sent_1, ..., sent_{i-1}$ from the source text, the generator predicted the next four sentences $sent_i, sent_{i+1}, ..., sent_{i+3}$ in the source text as its pre-training target. If more than 40% of the words in the target sentences $sent_i, sent_{i+1}, ..., sent_{i+3}$ did not appear in the given text, we filtered out this pre-training pair. This pre-training method allowed the generator to capture the important semantic meaning of the source text. Although the first few sentences of articles in CNN/Daily Mail contain the main information of the articles, we hope to provide a more general pre-training method that does not make any assumption about the dataset and can be easily applied to other datasets.

• Chinese Gigaword: The pre-training method for Chinese Gigaword was similar to CNN/Daily Mail, except that the generator predicted the next sentence instead of the next four consecutive sentences.

• English/Chinese Gigaword: As the source text of English Gigaword is made up of only one sentence, it is not feasible to split the last sentence from the source text; hence the previous pre-training method used on Chinese Gigaword is not appropriate for this dataset. To properly initialize the model, we randomly selected 6 to 11 consecutive words in the source text, after which we randomly swapped 70% of the words in the source text. Given text with incorrect word arrangement, the generator predicted the selected words in the correct arrangement. We pre-trained in this way because we expect the generator to be initialized with a rough language model (a sketch of this data construction appears after this list). In Chinese Gigaword we also conducted experiments on pre-training in this manner, but the results were not as good as those shown in part (C) of Table 3. In addition, we also used the retrieved paired data in row (B-1) of Table 1 to pre-train the generator on English Gigaword. However, pre-training the generator with this method did not yield results better than those in Table 1.

• Transfer Learning: Before unsupervised training, the generator was pre-trained with paired data from the CNN/Daily Mail dataset. However, the characteristics of the two datasets are different: in English Gigaword, the articles are short and the summaries consist of only one sentence, while in the CNN/Daily Mail dataset the articles are extremely long and the summaries consist of several sentences. To overcome these differences, during pre-training we took the first 35-45 words in each CNN/Daily Mail article as the generator input, and the generator randomly predicted one of the sentences of the article's summary. In addition, we used single sentences from CNN/Daily Mail summaries as real data for the discriminator instead of full summaries.
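The Gigaword pre-training target described above (pick 6 to 11 consecutive words as the prediction target and shuffle 70% of the source words) can be constructed roughly as follows. This is a minimal sketch of one way to read that description; the exact sampling and swapping procedure is not specified in the paper, so the details below are assumptions.

```python
import random

def make_pretraining_pair(words, frac_swapped=0.7, span_min=6, span_max=11):
    """Build one (corrupted source, target span) pair for Gigaword pre-training.

    words: list of tokens of the source sentence.
    Returns (corrupted, target), where target is a 6-11 word span of the original
    sentence in its correct order and corrupted has ~70% of positions permuted.
    """
    n = len(words)
    span_len = random.randint(span_min, min(span_max, n))
    start = random.randint(0, n - span_len)
    target = words[start:start + span_len]

    # Permute ~70% of the positions to destroy the original word order.
    chosen = random.sample(range(n), int(frac_swapped * n))
    permuted = chosen[:]
    random.shuffle(permuted)
    corrupted = words[:]
    for src, dst in zip(chosen, permuted):
        corrupted[dst] = words[src]
    return corrupted, target
```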


D Examples

Figures 5-10: Additional examples of generated summaries.


Figure 11: An example of a generated summary on CNN/Daily Mail.