Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4766–4777, November 16–20, 2020. ©2020 Association for Computational Linguistics


TLDR: Extreme Summarization of Scientific Documents

Isabel Cachola†  Kyle Lo†  Arman Cohan†  Daniel S. Weld†‡

†Allen Institute for AI  ‡Paul G. Allen School of Computer Science & Engineering, University of Washington

    {isabelc,kylel,armanc,danw}@allenai.org

Abstract

We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden. We propose CATTS, a simple yet effective learning strategy for generating TLDRs that exploits titles as an auxiliary training signal. CATTS improves upon strong baselines under both automated metrics and human evaluations. Data and code are publicly available at https://github.com/allenai/scitldr.

1 Introduction

We introduce TLDR1 generation for scientific papers. An alternative to abstracts, TLDRs focus on the key aspects of the paper, such as its main contributions, eschewing nonessential background or methodological details. Given the increasing pace of publication (Van Noorden, 2014) and the resulting difficulty in keeping up with the literature, TLDRs can enable readers to quickly discern a paper's key points and decide whether it's worth reading. The goal of existing work in summarization of scientific documents is to generate abstracts or provide complementary summaries to abstracts (Collins et al., 2017; Cohan et al., 2018; Chandrasekaran et al., 2019; Yasunaga et al., 2019). In contrast, TLDR

1 TLDR is an acronym that stands for "too long; didn't read," which is often used in online informal discussion (e.g., Twitter or Reddit) about scientific papers. For visual clarity, we omit the semi-colon.

Abstract: While many approaches to make neural networks more fathomable have been proposed, they are restricted to interrogating the network with input data. [...] In this work, we propose neural persistence, a complexity measure for neural network architectures based on topological data analysis on weighted stratified graphs. [...]

Intro: [...] In this work, we present the following contributions: We introduce neural persistence, a novel measure for characterizing the structural complexity of neural networks that can be efficiently computed. [...]

Conclusion: [...] However, this did not yield an early stopping measure because it was never triggered, thereby suggesting that neural persistence captures salient information that would otherwise be hidden among all the weights of a network [...]

TLDR: We develop a new topological complexity measure for deep neural networks and demonstrate that it captures their salient properties.


Figure 1: An example TLDR of a scientific paper. A TLDR is typically composed of salient information (indicated by colored spans) found in the abstract, intro, and conclusion sections of a paper.

generation seeks to produce an extreme (single sentence) summary (Narayan et al., 2018) given the entire paper. Further, TLDR generation is a challenging natural language generation task. Writing a TLDR of a scientific paper requires expert background knowledge and understanding of complex domain-specific language to identify the salient aspects of the paper, while maintaining faithfulness to the source and correctness of the written summary. An example TLDR is provided in Figure 1.

To facilitate the study of TLDR generation, we introduce SCITLDR, a new dataset of 5,411 TLDRs of computer science papers. SCITLDR is built from a combination of TLDRs written by authors of submissions on OpenReview2 and TLDRs derived by a novel annotation protocol that asks domain experts to rewrite peer review comments for that submission. Having multiple gold summaries per paper is especially important for evaluation when there is

2 https://openreview.net/



variability in human-written gold summaries (Zechner, 1996; Harman and Over, 2004).

In addition to establishing strong extractive and abstractive summarization baselines using Transformer-based (Vaswani et al., 2017) models, we present CATTS (Controlled Abstraction for TLDRs with Title Scaffolding), a simple yet effective learning strategy for TLDR generation. CATTS incorporates ideas from scaffold tasks for multitask learning (Swayamdipta et al., 2018a; Cohan et al., 2019) and control codes in conditional language generation (Keskar et al., 2019) to address the problem of data scarcity in the highly-specialized scientific domain. In particular, CATTS exploits titles as an auxiliary, naturally-occurring training signal by training the model to generate both titles and TLDRs indicated by control codes. We show that CATTS applied to BART (Lewis et al., 2020), a state-of-the-art summarization model, results in performance improvement in both automated metrics and human evaluation. Our contributions are summarized below:

1. We introduce TLDR generation, a new form of extreme summarization, for scientific papers. With extensive analysis of properties of TLDRs, we provide insight into the types of information and amount of variability in human-written TLDRs.

2. We release SCITLDR, a new multi-target dataset of 5,411 TLDRs over 3,229 scientific papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while avoiding the burden of reading the full paper.

3. We establish strong baselines on SCITLDR and improve them with CATTS, a simple yet effective learning strategy for generating TLDRs that uses titles as an auxiliary training signal.

4. We perform extensive analysis and human evaluation of system-generated TLDRs, focusing on informativeness and factual correctness.

2 Dataset construction

Overview  We introduce SCITLDR, a new multi-target dataset of 5,411 TLDRs over 3,229 scientific papers in the computer science domain.3 The training set contains 1,992 papers, each with a single gold TLDR. The dev and test sets contain 619 and 618 papers each, with 1,452 and 1,967 TLDRs, respectively. This is unlike the majority of existing

3 See Appendix Table 9 for the full venue breakdown.

Peer review: The paper proposes variance regularizing adversarial learning (VRAL), a new method for training GANs. The motivation is to ensure that the gradient for the generator does not vanish. [...] The discriminator itself is trained through two additional meta-discriminators. Are the meta-discriminators really necessary? Have you tried matching moments or using other methods [...]

Derived TLDR: The paper proposes variance regularizing adversarial learning for training gans to ensure that the gradient for the generator does not vanish.


Figure 2: Example of a reviewer comment rewritten as a TLDR (best viewed in color). A peer review comment often begins with a summary of the paper which annotators use to compose a TLDR. Annotators are trained to preserve the original reviewer's wording when possible (indicated by colored spans), and to avoid using any excess details or criticism.

summarization datasets that assume only one gold summary for a given document.

As evidenced by earlier work in summarization evaluation (Cohan and Goharian, 2016), variability in human-written summaries (Zechner, 1996; Harman and Over, 2004) can negatively impact the reliability of automated summarization metrics like Rouge (Lin, 2004).4 Considering only one gold TLDR for each paper as a basis of automated evaluation might result in inaccurate system quality assessment because content that might appear in a TLDR can have large variability. In addition, having multiple gold summaries for each document enables performing more in-depth analysis and thorough evaluation (Nenkova and Passonneau, 2004).

To address this, SCITLDR contains TLDRs written from the perspective of the author ("TLDR-Auth") and TLDRs written from the perspective of the peer reviewer ("TLDR-PR"). We describe these two types of TLDRs in the following paragraphs.

Collecting TLDR-Auth pairs  Scholar-written TLDRs of scientific papers are available on various online platforms. On OpenReview.org, a publicly available scientific reviewing platform, authors submit TLDRs of their papers that summarize the main content for both reviewers and other interested scholars. Scholars also share TLDRs on social media platforms, such as Twitter and Reddit.

We use the OpenReview API5 to collect pairs of papers and author-written TLDRs, along with the

4 While Rouge is capable of handling multiple targets for a given document, most summarization datasets are single target. See Table 1.

5 https://github.com/openreview/openreview-py



Dataset | Number of documents | Avg. words in document | Avg. words in summary | Compression ratio | % novel words | Multi-target

Non-scientific documents
DUC (Over, 2003) | 624 | 441 | 11 | 40.1 | 30.0 | yes
NYTimes (Sandhaus, 2008) | 655K | 549 | 40 | 13.7 | 20.1 | no
DailyMail (Hermann et al., 2015) | 220K | 653 | 55 | 11.9 | 17.0 | no
CNN (Hermann et al., 2015) | 93K | 760 | 46 | 16.5 | 16.8 | no
XSUM (Narayan et al., 2018) | 226K | 431 | 23 | 18.7 | 35.8 | no
Newsroom (Grusky et al.) | 1.32M | 659 | 27 | 24.4 | 26.0 | no
BigPatent (Sharma et al., 2019) | 1.34M | 3.6K | 117 | 30.5 | 13.6 | no

Scientific documents
CLPubSum (Collins et al., 2017) | 10.3K | 8.2K | 226 | 36.5 | 7.7 | no
PubMed (Cohan et al., 2018) | 133K | 3K | 203 | 14.9 | 10.5 | no
ArXiv (Cohan et al., 2018) | 215K | 4.9K | 220 | 22.5 | 8.3 | no
SciSummNet† (Yasunaga et al., 2019) | 1.0K | 4.7K | 150 | 31.2 | 7.4 | no
TalkSumm‡ (Lev et al., 2019) | 1.7K | 4.8K | 965 | 5.0 | 16.5 | no
SCITLDR (ours) | 3.2K | 5K | 21 | 238.1 | 15.2 | yes

Table 1: Comparison of SCITLDR to existing summarization datasets. (i) SCITLDR provides multiple summary targets, unlike other recent summarization datasets. (ii) SCITLDR requires both extreme compression and abstraction, as evidenced by the compression ratio and novelty (% of summary words not in the source document), especially when compared with other scientific summarization datasets. †SciSummNet data was later included in the CL-SciSumm shared task and dataset (Jaidka et al., 2018; Chandrasekaran et al., 2019), which has an additional 40 manually annotated documents and statistics similar to SciSummNet. ‡Unlike the other summarization datasets presented here, TalkSumm is an automatically-constructed dataset for training; the TalkSumm-supervised model in Lev et al. (2019) was evaluated using CL-SciSumm (Jaidka et al., 2018).

full-text PDFs6 of those papers. We use the S2ORC pipeline (Lo et al., 2020) to convert PDFs to structured, machine-readable full text. We then split the papers randomly into the previously-mentioned train, dev, and test sets; each paper at this point has an associated author-written gold TLDR.

Rewriting peer reviews into TLDR-PR pairs  Scaling up data collection in a specialized scientific domain is costly and challenging. To sidestep this problem, we use a novel annotation protocol that exploits natural summaries in peer review comments. Assuming the typical peer reviewer has carefully scrutinized the source paper and provided a faithful summary in their comment (often in the first paragraph), domain experts can rewrite these comments into TLDRs.

For this task, we recruit 28 undergraduate computer science students from the University of Washington with self-reported experience in reading scientific papers. Each recruited student received one hour of one-on-one writing training and then was asked to work independently. Annotators were only

6 A small fraction of those papers (< 5%) did not have an available PDF file, so we could not parse their full body text. These are still included in the dataset, as it is possible to generate a TLDR from an abstract alone.

shown the first 128 words of a sampled7 peer review comment. They were instructed to keep their TLDRs between 15 and 25 words (similar to the length of an author-written TLDR) and to skip reviews that did not contain a summary or whose content they did not understand. They were also instructed to use the original language in the review, when possible. We manually assessed every written summary, discarding TLDRs that did not adhere to the guidelines, and allowed 20/28 students who performed well to continue work beyond the first hour. Students were compensated at the local median hourly wage of $20 USD per hour. Refer to Appendix §F for the full annotation instructions. Figure 2 contains an example of a peer review and its corresponding TLDR-PR. We discuss differences between TLDR-PR and TLDR-Auth throughout Section 3.

3 Dataset analysis

3.1 Compression and abstractiveness

Table 1 compares SCITLDR with other summarization datasets in both scientific and non-scientific domains. We observe that SCITLDR has short summaries, like XSUM and NewsRoom, with long

7 Multiple peer review comments can be available for each paper on OpenReview. We focused on ensuring that each paper in dev and test had at least one TLDR-PR.


source documents, like BigPatent and the other scientific-domain datasets. This results in a much higher compression ratio compared with existing datasets. Summarization in higher compression settings is challenging as it requires capturing more precisely the salient aspects of the document (Grusky et al.).

Following Narayan et al. (2018) and Grusky et al., we measure abstractiveness (or novelty) by the percentage of words in the summary that do not appear in the source document. We observe that SCITLDR is more abstractive compared with other scientific domain datasets but less abstractive compared with non-scientific domain datasets. We also observe that SCITLDR is smaller in comparison to automatically collected datasets, such as XSUM and ArXiv, but is larger in comparison to other manually collected datasets, such as SciSummNet.
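To make the novelty statistic concrete, the sketch below computes it for a single summary–document pair; the lowercased whitespace tokenization is an assumption, since the exact preprocessing behind the Table 1 numbers is not specified here.

```python
# Sketch: abstractiveness ("novelty") as the percentage of summary tokens
# that never appear in the source document. Tokenization here is a plain
# lowercase whitespace split (an assumption, not the paper's exact setup).

def novelty(summary: str, source: str) -> float:
    summary_tokens = summary.lower().split()
    source_vocab = set(source.lower().split())
    if not summary_tokens:
        return 0.0
    novel = sum(1 for tok in summary_tokens if tok not in source_vocab)
    return 100.0 * novel / len(summary_tokens)

# Dataset-level figures (e.g., 15.2 for SCITLDR in Table 1) would be averages
# of this per-example value over the whole corpus.
```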

3.2 Information content

We analyze the information content of TLDRs using an approach motivated by the nugget-based summarization evaluation framework of Nenkova and Passonneau (2004). In a similar manner, we asked two computer science researchers to read through a collection of TLDRs to define a comprehensive set of categories of the types of information present in TLDRs, which we refer to as nuggets.8 We also label each TLDR with all represented nuggets. Table 2 presents this categorization, along with example phrases and nugget occurrence frequencies in SCITLDR. For simplicity, we use the category codes defined in the table (with brackets) to reference specific categories.

Most TLDRs contain between two and four nuggets (never all six), and will provide some indication of their subject area (A) and the paper's contributions (C). In fact, these are the most frequently co-occurring nuggets, appearing in 63% of TLDR-Auth and 71% of TLDR-PR. TLDR-Auth tend to include results or scientific/theoretical findings (R) and often signal the value of their work (V) by describing their contributions as novel or their results as strong or state-of-the-art. In contrast, TLDR-PR focus more on articulating the problems the paper addresses (P). Interestingly, TLDR-PR place less emphasis on R and V in favor of further methodological details in the paper (D). More details about nuggets are in Appendix §A.

8 While we adopt the term "nugget" for convenience, we recognize that nuggets traditionally correspond to factoids, while here they correspond to discourse roles (Teufel, 1999).

Category | Example phrase | % of TLDRs (AUTH / PR)
[A]rea, field or topic of study | reinforcement learning, dependency parsing | 85.6 / 90.8
[P]roblem or motivation | mode collapse, catastrophic forgetting | 29.0 / 32.9
Mode of [C]ontribution | method, dataset, proof, theorem | 68.4 / 76.3
[D]etails or description | graph convolution operations with dynamically computed graphs | 43.4 / 57.9
[R]esults or findings | improved performance on ImageNet, simple defenses work on MNIST but not CIFAR | 29.0 / 17.1
[V]alue or significance | novel, state-of-the-art, simple yet effective, easily applicable | 23.7 / 7.9

Table 2: Example categories (or nuggets) of information a TLDR might contain. Proportion of TLDRs containing each nugget estimated on 76 randomly sampled gold papers (each with its TLDR-Auth and a sampled TLDR-PR). Percentages do not sum to one because each TLDR can contain multiple nuggets.

3.3 Variability in TLDRs

To explore variability in our human-written summaries, we examine differences between TLDRs written by authors (TLDR-Auth) and TLDRs derived from the perspective of a peer reviewer (TLDR-PR).

Lexical variation  First, we note that TLDR-Auth are on average 18.9 words long, while TLDR-PR are slightly longer on average at 22.9 words. Despite similarities in length, the 1-, 2-, and 3-gram mean Jaccard indices between TLDR-Auth and TLDR-PR are 15.0%, 2.5%, and 0.7%, respectively, indicating extremely little lexical overlap between the two sources of TLDRs. We can also observe through qualitative examples in Figure 3 how TLDR-Auth and TLDR-PR can differ greatly, even when they contain the same information content.
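The overlap statistic above can be approximated with the sketch below, which computes the mean n-gram Jaccard index over paired TLDR-Auth and TLDR-PR summaries; the tokenization is again a simplifying assumption rather than the exact procedure used for the reported 15.0% / 2.5% / 0.7% figures.

```python
# Sketch: mean n-gram Jaccard index between paired TLDR-Auth and TLDR-PR.

def ngrams(tokens, n):
    # Set of n-grams (as tuples) in a token sequence.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: str, b: str, n: int) -> float:
    A, B = ngrams(a.lower().split(), n), ngrams(b.lower().split(), n)
    if not A and not B:
        return 0.0
    return len(A & B) / len(A | B)

def mean_jaccard(pairs, n):
    """pairs: list of (tldr_auth, tldr_pr) strings for the same paper.
    Returns the mean Jaccard index as a percentage."""
    return 100.0 * sum(jaccard(a, p, n) for a, p in pairs) / len(pairs)
```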

Abstractiveness  TLDR-PR are more abstractive, with a novelty score of 20.2% compared with 9.6% for TLDR-Auth, where novelty is computed as the percentage of words in the TLDR not in the source paper. This is not unexpected because TLDR-PR are derived from peer review comments, which themselves have already gone through one stage of abstraction.



TLDR-Auth: The authors propose a framework to learn a good policy through imitation learning from a noisy demonstration set via meta-training a demonstration suitability assessor.
TLDR-PR: Contributes a maml based algorithm for imitation learning which automatically determines if provided demonstrations are "suitable".

TLDR-Auth: The authors evaluate the effectiveness of having auxiliary discriminative tasks performed on top of statistics of the posterior distribution learned by variational autoencoders to enforce speaker dependency.
TLDR-PR: Propose an autoencoder model to learn a representation for speaker verification using short-duration analysis windows.


Figure 3: Two example TLDR-Auth and TLDR-PR pairs with colored spans corresponding to nuggets in Table 2 – A, P, C, D. On top, we see TLDRs can have substantial lexical variation despite covering similar information content. On bottom, we naturally see even more variation when the information content differs.

4 CATTS

We introduce CATTS (Controlled Abstraction for TLDRs with Title Scaffolding), a simple yet effective method for learning to generate TLDRs. Our approach addresses two main challenges: (1) the limited size of the training data and (2) the need for domain knowledge in order to write high-quality gold TLDRs. To address these challenges, we propose using titles of scientific papers as additional generation targets. As titles often contain key information about a paper, we hypothesize that training a model to generate titles will allow it to learn how to locate salient information in the paper that will also be useful for generating TLDRs. In addition, all papers have a title, and thus we have an abundant supply of paper-title pairs for training.

Incorporating auxiliary scaffold tasks via multitask learning has been studied before for improving span-labeling and text classification (Swayamdipta et al., 2018b; Cohan et al., 2019). Similar to multitask learning, training on heterogeneous data annotated with control codes has been shown to improve controlled generation in autoregressive language models (Keskar et al., 2019; ElSahar et al., 2020; Sudhakar et al., 2019; Li et al., 2020). In fact, it has been shown effective for generating biomedical abstracts (Sybrandt and Safro, 2020). We demonstrate that control codes can be used to effectively incorporate scaffold tasks (e.g. title generation) for denoising autoencoders like BART (Lewis et al., 2020).

In order to use title generation as a scaffold task for TLDR generation, we propose shuffling

[Figure 4 diagram: paper-title pairs (from arXiv) and paper-TLDR pairs (from SCITLDR) are shuffled, control codes are appended to each source, and the combined data is used to train BART.]

Figure 4: Training regimen for CATTS.

SCITLDR with a title generation dataset, then appending each source with the control codes ⟨|TLDR|⟩ and ⟨|TITLE|⟩, respectively. This allows the parameters of the model to learn to generate both TLDRs and titles. This process is visualized in Figure 4. At generation time, the appropriate control code is appended to the source. Additionally, up-sampling particular tasks can be viewed as applying task-specific weights, similar to weighting losses in multitask learning setups.
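A minimal sketch of this data construction is given below. The control-code strings, the `build_catts_data` helper, and the (source, target) tuple format are illustrative assumptions about how the mixing, code-appending, and up-sampling described above could be implemented; the actual preprocessing used with BART/fairseq may differ.

```python
import random

# Sketch of the CATTS training-data construction: mix paper-TLDR pairs with
# paper-title pairs, append a control code to each source, up-sample the TLDR
# task, and shuffle so the two tasks are interleaved during training.

TLDR_CODE, TITLE_CODE = "<|TLDR|>", "<|TITLE|>"   # illustrative token strings

def build_catts_data(tldr_pairs, title_pairs, upsample=1):
    """tldr_pairs / title_pairs: lists of (source_text, target_text)."""
    examples = []
    for src, tgt in tldr_pairs * upsample:        # up-sample TLDR instances
        examples.append((src + " " + TLDR_CODE, tgt))
    for src, tgt in title_pairs:
        examples.append((src + " " + TITLE_CODE, tgt))
    random.shuffle(examples)                      # interleave the two tasks
    return examples

# At inference time, append TLDR_CODE to the source to request a TLDR.
```

Setting `upsample` to roughly `len(title_pairs) // len(tldr_pairs)` makes the two tasks contribute comparable numbers of examples, consistent with the up-sampling-as-task-weighting view described above.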

5 Experiments

5.1 Baselines

We establish baselines for TLDR generation on SCITLDR using state-of-the-art extractive and abstractive summarization models.

Extractive methods  We consider both unsupervised and supervised extractive methods. For our unsupervised baseline, we use PACSUM (Zheng and Lapata, 2019), an extension of TextRank (Mihalcea and Tarau, 2004) that uses BERT (Devlin et al., 2019) as a sentence encoder. For our supervised baselines, we use BERTSUMEXT (Liu and Lapata, 2019), which uses BERT as a sentence encoder augmented with inter-sentence Transformer layers to capture interactions, and MatchSum (Zhong et al., 2020), which uses a BERT Siamese network to score whole summaries.

Abstractive methods  Since TLDRs often contain information spread across multiple sentences, we expect abstractive summarization methods to produce strong results for this task. We focus on BART (Lewis et al., 2020), a Transformer-based denoising autoencoder for pretraining sequence-to-sequence models. We use BART-large, which achieves state-of-the-art results in summarization on XSUM. We additionally use BART-large fine-tuned on XSUM, hypothesizing that the task of extreme summarization of news articles might transfer to TLDR generation on SCITLDR.


Oracle  We define a sentence-level extractive oracle: Given a paper and its multiple gold TLDRs, it selects the single sentence in the document with the highest Rouge overlap for each gold TLDR. Then it returns the single sentence that yields the maximum Rouge across all gold TLDRs. This sets an upper bound on the performance of the sentence-level extractive methods under our multi-target evaluation (Section 5.4). Our full text oracle achieves 54.5 Rouge-1, 30.6 Rouge-2, and 45.0 Rouge-L on the test set.
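The oracle can be sketched as follows. The `rouge_score` package is used purely for illustration and may not match the ROUGE configuration behind the reported numbers; selecting by Rouge-1 F1 is likewise an assumption.

```python
from rouge_score import rouge_scorer

# Sketch of the sentence-level extractive oracle: return the single source
# sentence with the highest Rouge-1 F1 against any of the paper's gold TLDRs.

_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def oracle_sentence(sentences, gold_tldrs):
    best_sent, best_score = None, -1.0
    for sent in sentences:
        for gold in gold_tldrs:
            score = _scorer.score(gold, sent)["rouge1"].fmeasure
            if score > best_score:
                best_sent, best_score = sent, score
    return best_sent, best_score
```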

5.2 Input space

The input space is the context provided to the model when generating TLDRs.

Abstract-only  Since the vast majority of scientific papers do not have open-access full text (Lo et al., 2020), it is worth considering the setting in which we generate TLDRs for papers given only their abstracts as input. The average length of an abstract is 159 words and the resulting compression ratio is 7.6.

AIC  Previous studies have found that the most salient information in a paper for writing a summary is often found in the abstract, introduction, and conclusion (AIC) sections (Sharma et al., 2019). An important consequence of this is the ability to substantially reduce computational costs9 (Schwartz et al., 2019) by supplying only these sections as context. The average combined length of these contexts is 993 words and the resulting compression ratio is 47.3, which is still higher than other datasets surveyed in Table 1.

Comparing oracle results in Table 3, we see that increasing the input space from abstract-only to AIC improves Rouge-1 by +4.7. Yet the AIC oracle is only 2.1 Rouge-1 lower than the full-text oracle, even though the full text contains roughly five times more text.

5.3 Training and implementation details

All experiments use Titan V or V100 GPUs. We experiment on abstract-only and AIC input spaces. Best hyperparameters for the models are selected based on dev set Rouge-1. Supervised models like BERTSUMEXT and BART are trained on SCITLDR and the best model checkpoint is chosen using dev set loss. See Appendix §D for additional parameter tuning details of all models.

9 Especially for methods that rely on O(n^2) inter-sentence comparisons or wrappers that extend Transformer-based methods to long contexts.

Extractive Methods  For PACSUM, BERTSUMEXT and MatchSum we use the original code released by the authors. The first two use BERT-base and the last one uses RoBERTa-base (Liu et al., 2019). For MatchSum in the AIC input space, following the authors, we use BERTSUMEXT to first extract 7 highly scoring sentences as the input to MatchSum.10 Sentence segmentation is performed using ScispaCy (Neumann et al., 2019), and models select a single sentence as their prediction. We use the default hyperparameters for PACSUM.

Abstractive Methods  We experiment with BART-large and BART-large finetuned on XSUM, using the Fairseq (Ott et al., 2019) implementation and the released XSUM weights. We apply the CATTS training method to these two models, using an additional 20K paper-title pairs from arXiv for title generation.11 We up-sample TLDR instances to match the size of the title scaffold data.12 For simplicity, we refer to these as BART, BARTXSUM, CATTS and CATTSXSUM, respectively. For all models, we use a learning rate of 3e-5, an update frequency of 1, and max tokens per batch of 1024,13 chosen through manual tuning. We tune the decoder for all models via grid search over five length penalties between 0.2 and 1.0 and seven beam sizes from 2 to 8.
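The decoder tuning amounts to a small grid search scored on dev-set Rouge-1. In the sketch below, `generate_tldr` and `rouge1_fn` are hypothetical helpers standing in for the fine-tuned model's beam-search decoding and the Rouge computation; they are not actual fairseq APIs, and the exact length-penalty grid points are assumed.

```python
import itertools

# Sketch: grid search over decoding hyperparameters, scored by dev-set Rouge-1.

LENGTH_PENALTIES = [0.2, 0.4, 0.6, 0.8, 1.0]   # assumed grid points
BEAM_SIZES = range(2, 9)                        # beam sizes 2..8

def tune_decoder(model, dev_set, generate_tldr, rouge1_fn):
    """dev_set: list of (source_text, gold_tldrs). `generate_tldr` and
    `rouge1_fn` are hypothetical callables supplied by the caller."""
    best_cfg, best_r1 = None, -1.0
    for lenpen, beam in itertools.product(LENGTH_PENALTIES, BEAM_SIZES):
        preds = [generate_tldr(model, src, beam=beam, lenpen=lenpen)
                 for src, _ in dev_set]
        r1 = sum(rouge1_fn(pred, golds)
                 for pred, (_, golds) in zip(preds, dev_set)) / len(dev_set)
        if r1 > best_r1:
            best_cfg, best_r1 = (lenpen, beam), r1
    return best_cfg, best_r1
```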

5.4 Evaluation

Automated evaluation  Following recent work on extreme summarization (Narayan et al., 2018; Lewis et al., 2020), we use Rouge-1, Rouge-2, and Rouge-L (Lin, 2004) as our automated metrics. As discussed in Section 2, we have multiple target summaries available per paper. To exploit this during evaluation, we calculate the Rouge score of the system-generated TLDR with respect to each of the gold TLDRs for the corresponding paper (including its TLDR-Auth and all of its TLDRs-PR) individually. We take the maximum Rouge score over these gold TLDRs as the final Rouge score for that paper. An alternative approach to aggregating scores would be to take the mean, but due to the large variability among gold TLDRs (Section 3.3), the mean would penalize a system for not matching every reference even when it matches one of them well; we therefore report the max.
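A sketch of this multi-target aggregation is given below; the per-metric maximum and the use of the `rouge_score` package are illustrative assumptions rather than the paper's exact evaluation code.

```python
from rouge_score import rouge_scorer

# Sketch: score a system TLDR against every gold TLDR for the paper
# (TLDR-Auth and all TLDR-PRs) and keep the maximum per metric.

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                   use_stemmer=True)

def max_rouge(system_tldr, gold_tldrs):
    best = {m: 0.0 for m in ("rouge1", "rouge2", "rougeL")}
    for gold in gold_tldrs:
        scores = _scorer.score(gold, system_tldr)
        for m in best:
            best[m] = max(best[m], scores[m].fmeasure)
    return best
```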

10 In the abstract-only setting, MatchSum takes the full context.

11 Includes all papers on arXiv with at least one of the following tags: CS.CL, CS.CV, CS.LG, CS.AI, CS.NE, and STAT.ML, and that have introduction and conclusion sections identified by S2ORC (Lo et al., 2020).

12 While this up-sampling may suggest that CATTS trains on more TLDRs than BART, we allow BART training for up to 20 epochs and it quickly overfits within a few epochs.

13 Fairseq reports an "average batch size" of 36, which is a consequence of adaptive batching of examples based on the update frequency and max tokens per batch.


Method | Abstract-only R1 / R2 / RL | AIC R1 / R2 / RL
Oracle | 47.7 / 24.7 / 38.5 | 52.4 / 29.0 / 42.9
PACSUM (Zheng and Lapata, 2019) | 19.3 / 4.0 / 15.1 | 28.7 / 9.8 / 21.9
BERTSUMEXT (Liu and Lapata, 2019) | 38.5 / 16.6 / 30.5 | 36.2 / 14.7 / 28.5
MatchSum (Zhong et al., 2020) | 42.7 / 20.0 / 34.0 | 38.6 / 16.4 / 30.1
BART (Lewis et al., 2020) | 43.3 / 20.8 / 35.0 | 42.9 / 20.8 / 35.1
BARTXSUM (Lewis et al., 2020) | 42.5 / 21.1 / 34.9 | 43.7 / 21.4 / 36.0
CATTS (Ours) | 43.8 / 20.9 / 35.5 | †44.9 / †22.6 / †37.3
CATTSXSUM (Ours) | †44.3 / 21.3 / 35.9 | 44.6 / 21.7 / 36.5

Table 3: Test set max Rouge scores of extractive and abstractive baselines and CATTS. We use † to indicate CATTS variants that significantly improve over their corresponding BART baselines (see Section 6.1 for the significance testing procedure).


Method | Abstract-only % novel words | Abstract-only avg. # words | AIC % novel words | AIC avg. # words
BART | 2.9% | 20.9 | 1.3% | 20.4
BARTXSUM | 3.7% | 18.4 | 1.1% | 18.9
CATTS | 5.5% | 19.1 | 5.3% | 18.4
CATTSXSUM | 5.8% | 19.7 | 4.5% | 19.7

Table 5: Lexical features of system-generated TLDRs.

Method | R1 (Δ) | R2 (Δ) | RL (Δ)
BART | 44.9 (+1.6) | 22.6 (+1.8) | 37.1 (+2.1)
BARTXSUM | 44.8 (+1.1) | 21.8 (+0.4) | 36.4 (+0.4)
CATTS | 44.9 (+0.0) | 21.9 (-0.7) | 36.6 (-0.7)
CATTSXSUM | 45.7 (+1.1) | 23.0 (+1.7) | 37.1 (+1.2)

Table 6: Oracle input space experiments. Δ is the difference between the oracle-input result and the model's best performance (across abstract-only and AIC) from Table 3.

Abstractive results  Abstractive methods are not limited to choosing exact sentences. For a given abstractive baseline, BART or BARTXSUM, our CATTS learning strategy results in improvements in both abstract-only and AIC settings. Comparing CATTS variants with their corresponding BART baselines, we observe that in the abstract-only setting, CATTS and CATTSXSUM achieve +0.5 and +1.8 Rouge-1, respectively. In the AIC setting, CATTS and CATTSXSUM achieve +2.0 and +0.9 Rouge-1, respectively. We use the two-sided paired t-test against a null hypothesis of no difference to assess these differences. To address the issue of multiple hypothesis testing over Rouge scores, we perform a Holm-Bonferroni (Holm, 1979)16 correction for determining significant p-values in Table 3.
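A sketch of this testing procedure is given below, assuming per-paper Rouge scores are available for each CATTS-vs-BART comparison and an alpha of 0.05; the exact family of comparisons corrected together is an assumption.

```python
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

# Sketch: two-sided paired t-tests on per-paper Rouge scores (CATTS variant
# vs. its BART baseline), followed by a Holm-Bonferroni correction across the
# family of comparisons.

def holm_corrected_pvalues(score_pairs, alpha=0.05):
    """score_pairs: list of (catts_scores, bart_scores), each a list of
    per-paper Rouge values for one metric/setting comparison."""
    pvals = [ttest_rel(catts, bart).pvalue for catts, bart in score_pairs]
    reject, corrected, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return reject, corrected
```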

6.2 Human evaluation

We perform our human evaluation on BARTXSUM and CATTSXSUM using the AIC input space on 51 sampled papers. In this setting, we have both chosen the strongest baseline and controlled for XSUM pretraining. From Table 4, we see that CATTSXSUM is more informative than BARTXSUM and is comparable to gold TLDR-Auth, though still less informative than TLDR-PR.

In addition to informativeness, we also evaluate the content accuracy of generated TLDRs as explained in Section 5.4. We report no difference in correctness between BARTXSUM and CATTSXSUM. We observe 42 ties, 10 cases where BARTXSUM is more correct, and 12 cases where CATTSXSUM is more

16 Using the P.ADJUST library in R (R Core Team, 2018).

correct. Both models average a rating of 2.5 (scoring between partially accurate and mostly correct).

6.3 Analysis

How abstractive are the generations?  From Table 5, we observe: (1) BART variants are less abstractive than CATTS variants. (2) Initial training on XSUM might influence models to be slightly less abstractive. (3) BART variants are more abstractive in the abstract-only setting than in the longer AIC setting, while CATTS seems to have the same level of abstractiveness regardless of input space.

How long are the generations?  From Table 5, we see the systems all generate TLDRs of similar length to the average length reported in Table 1.

How important is using the full text?  To analyze whether one can improve abstractive model performance by improving the input space selection (compared to just using AIC), we define an oracle input space. That is, for each TLDR, we select sentences from the full text that maximize Rouge-1 with the gold TLDRs-Auth17 and select the top sentences to match the length of AIC. Repeating the experiments in Section 5 with this input source, we observe some performance improvement across models (Table 6).

Qualitative example  Table 7 contains system generations on the same paper (alongside the gold TLDRs). Curiously, despite both achieving the same Rouge-1, the generated TLDRs are quite different. BARTXSUM focuses on the methodological contribution while CATTSXSUM focuses on a scientific finding. The "two hidden layer" detail by BARTXSUM is from the paper introduction and the "defining the appropriate sampling distributions" from CATTSXSUM is from the conclusion.18

7 Related work

Transformers for summarization  Transformer-based models have achieved strong results in extractive and abstractive summarization. PACSUM (Zheng and Lapata, 2019) combines BERT sentence representation with unsupervised text ranking; MatchSum (Zhong et al., 2020) uses a Siamese BERT model to score the entire summary instead of a single extraction; and Liu and Lapata (2019)

17 Only TLDR-Auth exists for all papers. TLDR-PR are available only in dev and test.

18 See original paper: https://openreview.net/pdf?id=SkGT6sRcFX



BARTXSUM: We propose a principled approach to weight initialisation that allows the construction of infinite-width networks with more than two hidden layers.
CATTSXSUM: We study the initialisation requirements of infinite-width networks and show that the main challenge for constructing them is defining the appropriate sampling distributions for the weights.

TLDR-Auth: We propose a method for the construction of arbitrarily deep infinite-width networks, based on which we derive a novel weight initialisation scheme for finite-width networks and demonstrate its competitive performance.
TLDR-PR: Proposes a weight initialization approach to enable infinitely deep and infinite-width networks with experimental results on small datasets.

Table 7: Examples of system generations. BARTXSUM and CATTSXSUM both achieve Rouge-1 of 40.7 on this paper. Colored spans indicate text overlap.

show that BERT is effective for both extractive and abstractive summarization. Zhang et al. (2019); Bi et al. (2020) introduce new pretraining objectives that improve generation. Sequence-to-sequence models (Raffel et al., 2020; Lewis et al., 2020; Bao et al., 2020) have state-of-the-art performance on XSUM (Narayan et al., 2018), a dataset for extreme summarization of news articles. SCITLDR is a new form of extreme summarization focused on scientific papers.

Scientific document summarization  Most work in summarization of scientific papers has focused on longer summaries (i.e. 150-200 words). Existing datasets include CSPubSum for extractive summarization (Collins et al., 2017), ArXiv and PubMed for abstract generation (Cohan et al., 2018), and the SciSummNet (Yasunaga et al., 2019) and CL-SciSumm (Jaidka et al., 2018; Chandrasekaran et al., 2019) datasets, which incorporate citation contexts into human-written summaries. TalkSumm (Lev et al., 2019) uses recordings of conference talks to create a distantly-supervised training set for the CL-SciSumm task.

Modeling approaches in scientific document summarization include models that exploit citation contexts (Qazvinian et al., 2013; Cohan and Goharian, 2015, 2017; Zerva et al., 2020), automated survey generation (Mohammad et al., 2009; Jha et al., 2015; Fabbri et al., 2018; Wang et al., 2018), and other techniques focusing on exploiting the unique properties of scientific documents such as long length and structure (Conroy and Davis, 2017; Nikolov et al., 2018; Cohan et al., 2018; Xiao and Carenini, 2019). Yet, such methods have not been

studied in the setting of extreme summarization (i.e. short target summaries, high compression, high abstraction), and SCITLDR is the first dataset to facilitate such research.

8 Conclusion

We introduce TLDR generation for scientific papers, and release SCITLDR, a multi-target dataset of TLDR-paper pairs. We also present CATTS, a simple yet effective learning strategy for improving TLDR generation that exploits auxiliary training signal from paper titles. We show that our approach improves over strong modeling baselines.

Existing methods for scientific document summarization often make use of properties unique to those papers, like sections, citation contexts or scientific discourse roles. Future work can examine how best to incorporate these properties to improve TLDR generation models. Additionally, while our experiments are limited to abstract-only and AIC input spaces, we provide the full text of the source papers to support research into using longer input contexts. Furthermore, the multiple target summaries in SCITLDR reflect diverse perspectives and can be used to support summarization research into training and evaluation techniques previously unavailable with existing datasets. Finally, the idea of a TLDR can differ between academic disciplines, and we leave such exploration open for future work.

Acknowledgments

We thank the Semantic Scholar Research team and John Bohannon and Oleg Vasilyev from Primer for helpful feedback and discussions. This work was supported in part by NSF Convergence Accelerator award 1936940, NSF RAPID award 2040196, ONR grant N00014-18-1-2193, and the University of Washington WRF/Cable Professorship.

References

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiulei Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. ArXiv, abs/2002.12804.

Bin Bi, Chenliang Li, Chen Wu, Ming Yan, and Wei Wang. 2020. PALM: Pre-training an autoencoding and autoregressive language model for context-conditioned generation. ArXiv, abs/2004.07159.


Muthu Kumar Chandrasekaran, Michihiro Yasunaga, Dragomir Radev, Dayne Freitag, and Min-Yen Kan. 2019. Overview and results: CL-SciSumm shared task 2019. In Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries (BIRNDL).

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. 2019. Structural scaffolds for citation intent classification in scientific publications. In NAACL-HLT.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In NAACL-HLT.

Arman Cohan and Nazli Goharian. 2015. Scientific article summarization using citation-context and article's discourse structure. In EMNLP.

Arman Cohan and Nazli Goharian. 2016. Revisiting summarization evaluation for scientific articles. ArXiv, abs/1604.00400.

Arman Cohan and Nazli Goharian. 2017. Scientific document summarization via citation contextualization and scientific discourse. International Journal on Digital Libraries, 19:287–303.

Ed Collins, Isabelle Augenstein, and Sebastian Riedel. 2017. A supervised approach to extractive summarisation of scientific papers. CoNLL, abs/1706.03946.

John M. Conroy and Sashka Davis. 2017. Section mixture models for scientific document summarization. IJDL, 19:305–322.

John M. Conroy, Judith D. Schlesinger, and Dianne P. O'Leary. 2011. Nouveau-ROUGE: A novelty metric for update summarization. Computational Linguistics, 37(1):1–8.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805.

Hady ElSahar, Maximin Coavoux, Matthias Gallé, and Jos Rozen. 2020. Self-supervised and controlled multi-document opinion summarization. ArXiv, abs/2004.14754.

Alexander Fabbri, Irene Li, Prawat Trairatvorakul, Yijiao He, Weitai Ting, Robert Tung, Caitlin Westerfield, and Dragomir Radev. 2018. TutorialBank: A manually-collected corpus for prerequisite chains, survey extraction and resource recommendation. In ACL.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In NAACL-HLT.

Donna Harman and Paul Over. 2004. The effects of human variation in DUC summarization evaluation. In Text Summarization Branches Out, pages 10–17, Barcelona, Spain. Association for Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Kokil Jaidka, Muthu Kumar Chandrasekaran, Sajal Rustagi, and Min-Yen Kan. 2018. Insights from CL-SciSumm 2016: the faceted scientific document summarization shared task. IJDL, 19(2-3):163–171.

Rahul Jha, Reed Coke, and Dragomir R. Radev. 2015. Surveyor: A system for generating coherent survey articles for scientific topics. In AAAI.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. ArXiv, abs/1909.05858.

Guy Lev, Michal Shmueli-Scheuer, Jonathan Herzig, Achiya Jerbi, and David Konopnicki. 2019. TalkSumm: A dataset and scalable annotation method for scientific paper summarization based on conference talks. In ACL.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL.

Kun Li, Chengbo Chen, Xiaojun Quan, Qing Ling, and Yan Song. 2020. Conditional augmentation for aspect term extraction via masked sequence-to-sequence generation. ArXiv, abs/2004.14769.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In EMNLP-IJCNLP.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S. Weld. 2020. S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of ACL.



Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In EMNLP.

Saif M. Mohammad, Bonnie J. Dorr, Melissa Egan, Ahmed Hassan Awadallah, Pradeep Muthukrishnan, Vahed Qazvinian, Dragomir R. Radev, and David M. Zajic. 2009. Using citations to generate surveys of scientific paradigms. In HLT-NAACL.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP, pages 1797–1807.

Ani Nenkova and Rebecca J. Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In NAACL.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing.

Nikola I. Nikolov, Michael Pfeiffer, and Richard H. R. Hahnloser. 2018. Data-driven summarization of scientific articles. ArXiv, abs/1804.08875.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In NAACL-HLT, Demonstrations.

Paul Over. 2003. An introduction to DUC 2003: Intrinsic evaluation of generic news text summarization systems. In Proceedings of the Document Understanding Conference 2003.

Vahed Qazvinian, Dragomir R. Radev, Saif M. Mohammad, Bonnie J. Dorr, David M. Zajic, Michael Whidby, and Taesun Moon. 2013. Generating extractive summaries of scientific paradigms. J. Artif. Intell. Res., 46:165–201.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. LDC Catalog No. LDC2008T19.

Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. Green AI.

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. In ACL.

Akhilesh Sudhakar, Bhargav Upadhyay, and Arjun Maheswaran. 2019. "Transforming" delete, retrieve, generate approach for controlled text style transfer. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3269–3279, Hong Kong, China. Association for Computational Linguistics.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018a. Syntactic scaffolds for semantic structures. In EMNLP.

Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018b. Syntactic scaffolds for semantic structures. In EMNLP.

Justin Sybrandt and Ilya Safro. 2020. CBAG: Conditional biomedical abstract generation. ArXiv, abs/2002.05637.

S. Teufel. 1999. Argumentative zoning: Information extraction from scientific text.

Richard Van Noorden. 2014. Global scientific output doubles every nine years. Nature news blog.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.

Jie Wang, Chengzhi Zhang, Mengying Zhang, and Sanhong Deng. 2018. CitationAS: A tool of automatic survey generation based on citation content. Journal of Data and Information Science, 3(2):20–37.

Wen Xiao and Giuseppe Carenini. 2019. Extractive summarization of long documents by combining global and local context. In EMNLP-IJCNLP.

Michihiro Yasunaga, Jungo Kasai, Rui Zhang, Alexander Richard Fabbri, Irene Li, Dan Friedman, and Dragomir R. Radev. 2019. ScisummNet: A large annotated corpus and content-impact models for scientific paper summarization with citation networks. In AAAI.

Klaus Zechner. 1996. Fast generation of abstracts from general domain text corpora by extracting relevant sentences. In Proceedings of the 16th Conference on Computational Linguistics (Volume 2), pages 986–989.

Chrysoula Zerva, Minh-Quoc Nghiem, Nhung T. H. Nguyen, and S. Ananiadou. 2020. Cited text span identification for scientific summarisation using pre-trained encoders. Scientometrics, pages 1–29.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. ArXiv, abs/1912.08777.



Hao Zheng and Mirella Lapata. 2019. Sentence centrality revisited for unsupervised summarization. In ACL.

Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2020. Extractive summarization as text matching. In ACL.

