Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 390–400, Lisbon, Portugal, 17-21 September 2015. © 2015 Association for Computational Linguistics.

Scientific Article Summarization Using Citation-Context and Article's Discourse Structure

Arman Cohan
Information Retrieval Lab, Computer Science Department, Georgetown University
[email protected]

Nazli Goharian
Information Retrieval Lab, Computer Science Department, Georgetown University
[email protected]

Abstract

We propose a summarization approach for scientific articles which takes advantage of citation-context and the document discourse model. While citations have been previously used in generating scientific summaries, they lack the related context from the referenced article and therefore do not accurately reflect the article's content. Our method overcomes the problem of inconsistency between the citation summary and the article's content by providing context for each citation. We also leverage the scientific article's inherent discourse for producing better summaries. We show that our proposed method effectively improves over existing summarization approaches (greater than 30% improvement over the best performing baseline) in terms of ROUGE scores on the TAC2014 scientific summarization dataset. While the dataset we use for evaluation is in the biomedical domain, most of our approaches are general and therefore adaptable to other domains.

1 Introduction

Due to the expanding rate at which articles are being published in each scientific field, it has become difficult for researchers to keep up with the developments in their respective fields. Scientific summarization aims to alleviate this problem by providing readers with a concise and informative representation of the contributions or findings of an article. Scientific summarization differs from general summarization in three main aspects (Teufel and Moens, 2002). First, scientific papers are usually much longer than general articles (e.g., newswire). Second, in scientific summarization, the goal is typically to provide a technical summary of the paper which includes important findings, contributions or impacts of a paper to the community. Finally, scientific papers follow a natural discourse. A common organization for a scientific paper is one in which the problem is first introduced, followed by the description of hypotheses, methods, experiments, findings, and finally results and implications. Scientific summarization was recently further motivated by the TAC2014 biomedical summarization track (Text Analysis Conference, http://www.nist.gov/tac/2014), in which they planned to investigate this problem in the domain of biomedical science.

There are currently two types of approaches towards scientific summarization. The first is the article's abstract. While abstracts provide a general overview of the paper, they cannot be considered an accurate scientific summary by themselves. That is due to the fact that not all the contributions and impacts of the paper are included in the abstract (Elkiss et al., 2008). In addition, the stated contributions are those that the authors deem important, while they might be less important to the scientific community. Moreover, contributions are stated in a general and less focused fashion. These problems motivated the other form of scientific summaries, i.e., citation-based summaries. A citation-based summary is a summary which is formed by utilizing a set of citations to a referenced article (Qazvinian and Radev, 2008; Qazvinian et al., 2013). This set of citations has previously been indicated to be a good representation of the important findings and contributions of the article. Contributions stated in the citations are usually more focused than the abstract and contain additional information that is not in the abstract (Elkiss et al., 2008).

However, citations may not accurately represent




the content of the referenced article, as they are biased towards the viewpoint of the citing authors. Moreover, citations may address a contribution or a finding regarding the referenced article without referring to the assumptions and data under which it was obtained.

The problem of inconsistency in the degree of certainty with which findings are expressed between the citing article and the referenced article has also been reported (De Waard and Maat, 2012). Therefore, citations by themselves lack the related "context" from the original article. We call the textual spans in the reference article that reflect the citation the citation-context. Figure 1 shows an example of the citation-context in the reference article (green) for a citation in the citing article (blue).

We propose an approach to overcome the aforementioned shortcomings of existing scientific summaries. Specifically, we extract the citation-context in the reference article for each citation. Then, by using the discourse facets of the citations as well as the community structure of the citation-contexts, we extract candidate sentences for the summary. The final summary is formed by maximizing both the novelty and the informativeness of the sentences in the summary. We evaluate and compare our methods against several well-known summarization methods. Evaluation results on the TAC2014 dataset show that our proposed methods can effectively improve over the well-known existing summarization approaches; we obtained greater than 30% improvement over the highest performing baseline in terms of mean ROUGE scores.

2 Related work

Document summarization is a relatively well-studied area, and various types of approaches for document summarization have been proposed in the past twenty years.

Latent Semantic Analysis (LSA) was first used in text summarization by Gong and Liu (2001). Other variations of LSA-based summarization approaches have later been introduced (Steinberger and Jezek, 2004; Steinberger et al., 2005; Lee et al., 2009; Ozsoy et al., 2010). Summarization approaches based on topic modeling and Bayesian models have also been explored (Vanderwende et al., 2007; Haghighi and Vanderwende, 2009; Celikyilmaz

Citing article:

... The general impression that has emerged is that transformation of human cells by Ras requires the inactivation of both the pRb and p53 pathways, typically achieved by introducing DNA tumor virus oncoproteins such as SV40 large tumor antigen (T-Ag) or human papillomavirus E6 and E7 proteins (Serrano et al., 1997). To address this question, we have been investigating the ...

Reference article (Serrano et al., 1997):

... continued to incorporate BrdU and proliferate following introduction of H-ras V12. In agreement with previous reports (66 and 60), both p53-/- and p16-/- MEFs expressing H-ras V12 displayed features of oncogenic transformation (e.g., refractile morphology, loss of contact inhibition), which were apparent almost immediately after H-ras V12 was transduced (data not shown). These results indicate that p53 and p16 are essential for ras-induced arrest in MEFs, and that inactivation of either p53 or p16 alone is sufficient to circumvent arrest. In REF52 and IMR90 fibroblasts, a different approach was ...

Figure 1: The blue highlighted span in the citing article (top) shows the citation text, followed by the citation marker (pink span). For this citation, the citation-context is the green highlighted span in the reference article (bottom). The text spans outside the scope of the citation text and citation-context are not highlighted.

and Hakkani-Tur, 2010; Ritter et al., 2010; Celikyilmaz and Hakkani-Tur, 2011; Ma and Nakagawa, 2013; Li and Li, 2014). In these approaches, the content/topic distribution in the final summary is estimated using a graphical probabilistic model. Some approaches have viewed summarization as an optimization task solved by linear programming (Clarke and Lapata, 2008; Berg-Kirkpatrick et al., 2011; Woodsend and Lapata, 2012). Many works have viewed the summarization problem as a supervised classification problem in which several features are used to predict the inclusion of document sentences in the summary. Variations of supervised models have been utilized for summary generation, such as maximum entropy (Osborne, 2002), HMM (Conroy et al., 2011), CRF (Galley, 2006; Shen et al., 2007; Chali and Hasan, 2012), SVM (Xie and Liu, 2010), logistic regression (Louis et al., 2010) and reinforcement learning (Rioux et al., 2014). Problems with supervised models in the context of summarization include the need for a large amount of annotated data, and domain dependency.

Graph-based models have shown promising results for text summarization. In these approaches, the goal is to find the most central sentences in the document by constructing a graph in which nodes are sentences and edges are similarities between these sentences. Examples of these techniques include LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), and the work by Paul et al. (2010). Maximizing the novelty and preventing



the redundancy in a summary is addressed by greedy content selection (Carbonell and Goldstein, 1998; Guo and Sanner, 2010; Lin et al., 2010). The rhetorical structure of documents has also been investigated for automatic summarization. In this line of work, dependency and discourse parsing based on Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) is used for analyzing the structure of the documents (Hirao et al., 2013; Kikuchi et al., 2014; Yoshida et al., 2014). Summarization based on rhetorical structure is better suited for shorter documents and is highly dependent on the quality of the discourse parser that is used; training the discourse parser requires a large amount of training data in the RST framework.

Scientific article summarization was first studied by Teufel and Moens (2002), who trained a supervised Naive Bayes classifier to select informative content for the summary. Later, Elkiss et al. (2008) argued for the benefits of citations to scientific work analysis. Cohan et al. (2015) use a search-oriented approach for finding the parts of the reference paper relevant to citations. Qazvinian and Radev (2008) and Qazvinian et al. (2013) use citations to an article to construct its summary. More specifically, they perform hierarchical agglomerative clustering on citations to maximize purity and select the most central sentences from each cluster for the final summary. Our work is closest to Qazvinian and Radev (2008), with the difference that they only make use of citations. While citations are useful for summarization, relying solely on them might not accurately capture the original context of the referenced paper. That is, the generated summary lacks the appropriate evidence to reflect the content of the original paper, such as the circumstances, data and assumptions under which certain findings were obtained. We address this shortcoming by leveraging the citation-context and the inherent discourse model of scientific articles.

3 The summarization approach

Our scientific summary generation algorithm is composed of four steps: (1) extracting the citation-context, (2) grouping citation-contexts, (3) ranking the sentences within each group, and (4) selecting the sentences for the final summary. We assume that the citation text (the text span in the citing article that references another article) in each citing article is already known. We describe each step in the following sub-sections. Our proposed method generates a summary of an article with the premise that the article has a number of citations to it. We call the article that is being referenced the "reference article". We note that we tokenized the articles' text into sentences using the punkt unsupervised sentence boundary detection algorithm (Kiss and Strunk, 2006), which we modified to also account for biomedical abbreviations. For the rest of the paper, "sentence" refers to the units output by the sentence boundary detection algorithm, whereas "text span" (or, in short, "span") can consist of multiple sentences.
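The boundary-detection step can be illustrated with a dependency-free sketch. This is not punkt itself (punkt is available via NLTK and learns abbreviations unsupervised); it is a hypothetical rule-based stand-in showing how a supplied abbreviation list, analogous to the authors' biomedical additions, suppresses false sentence boundaries. The abbreviation set is an assumption for illustration:

```python
import re

# Assumed abbreviation list; the paper's actual list covers biomedical terms.
ABBREVIATIONS = {"e.g", "i.e", "al", "fig", "vs", "approx"}

def split_sentences(text):
    """Split on sentence-final punctuation followed by a capitalized word,
    unless the token before the punctuation is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        before = text[start:match.start()].split()
        last = before[-1].lower().rstrip(".") if before else ""
        if last in ABBREVIATIONS:
            continue  # not a boundary; keep accumulating the sentence
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences
```

A learned tokenizer such as punkt makes these abbreviation decisions statistically rather than from a fixed list, which is why the authors only needed to supplement it with domain abbreviations.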

3.1 Extracting the citation-context

As described in section 2, one problem with existing citation-based summarization approaches is that they lack the context of the referenced paper. Therefore, our goal is to leverage the citation-context in the reference article to correctly reflect the reference paper. To find citation-contexts, we consider each citation as an n-gram vector and use the vector space model for locating the relevant text spans in the reference article. More specifically, given a citation c, we return the ranked list of text spans r1, r2, ..., rn which have the highest similarity to c. We call the retrieved text spans reference spans. These reference spans essentially form the context for each citation. The similarity function is the cosine similarity between the pivoted normalized vectors. We evaluated four different approaches for forming the citation vector.

1. All terms in the citation except for stopwords, numeric values and citation markers, i.e., names of authors or numbered citations. An example of a citation marker is shown in Figure 1.

2. Terms with high inverse document frequency (idf). Idf values of terms have been shown to be a good estimate of term informativeness.

3. Concepts that are represented through noun phrases in the citation. For example, in the following, which is part of a citation: "... typically achieved by introducing DNA tumor virus oncoproteins such as ...", the phrase "DNA tumor virus oncoproteins" is a noun phrase.



4. Biomedical concepts and noun phrases expanded by related biomedical concepts: this formation is specific to the biomedical domain. It selects biomedical concepts and noun phrases in the citation and uses related biomedical terminology to expand the citation vector. We used Metamap [1], a tool for mapping free-form text to UMLS [2] concepts, for extracting biomedical concepts from the citation text. For expanding the citation vector using the related biomedical terminology, we used the SNOMED CT [3] ontology, by which we added synonyms of the concepts in the citation text to the citation vector.
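As a concrete illustration of the first vector-formation method, the sketch below builds a citation vector from all terms minus stopwords, numbers and parenthetical citation markers, then ranks reference spans by cosine similarity. It is a simplified stand-in for the paper's retrieval step: it uses raw term frequencies rather than the pivoted normalized weighting the authors describe, and the stopword list and marker-stripping regex are assumptions:

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "to", "a", "by", "that", "is", "for"}  # toy list

def citation_vector(text):
    """All terms minus stopwords, numbers and (Author, year) citation markers."""
    text = re.sub(r"\([^)]*\d{4}[^)]*\)", " ", text)  # drop parenthetical markers
    terms = (t.lower() for t in re.findall(r"[A-Za-z][\w-]*", text))
    return Counter(t for t in terms if t not in STOPWORDS)

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_reference_spans(citation, spans):
    """Return reference spans ranked by cosine similarity to the citation."""
    c = citation_vector(citation)
    scored = [(cosine(c, citation_vector(s)), s) for s in spans]
    return [s for score, s in sorted(scored, reverse=True) if score > 0]
```

The top-ranked spans returned here play the role of the "reference spans" that form the citation-context in the rest of the pipeline.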

3.2 Grouping the citation-contexts

After identifying the context for each citation, we use them to form the summary. To capture various important aspects of the reference article, we form groups of citation-contexts that are about the same topic. We use the following two approaches for forming these groups:

Community detection - We want to find diverse key aspects of the reference article. We form the graph of extracted reference spans in which nodes are sentences and edges are similarities between sentences. As the similarity function, we use cosine similarity between the tf-idf vectors of the sentences. Similar to (Qazvinian and Radev, 2008), we want to find subgraphs, or communities, whose intra-connectivity is high but whose inter-connectivity is low. This quality is captured by the modularity measure of the graph (Newman, 2006; Newman, 2012). Graph modularity quantifies the density of the subgraphs in comparison with the density of a graph with randomly distributed edges, and is defined as follows:

Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v \times k_w}{2m} \right] \delta(c_v, c_w)

where A_{vw} is the weight of the edge (v, w); k_v is the degree of vertex v; c_v is the community of vertex v; \delta is the Kronecker delta function; and 2m = \sum_{vw} A_{vw} is the normalization factor.
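The modularity of a candidate partition can be computed directly from this definition. The sketch below assumes the reference-span graph is given as a symmetric weighted adjacency structure (dict of dicts); node and community names are illustrative:

```python
from itertools import product

def modularity(adj, community):
    """Q = (1/2m) * sum_{v,w} [A_vw - k_v*k_w/(2m)] * delta(c_v, c_w).

    adj: symmetric dict-of-dicts of edge weights A_vw;
    community: mapping from node to community label.
    """
    nodes = list(adj)
    two_m = sum(adj[v].get(w, 0.0) for v, w in product(nodes, nodes))
    degree = {v: sum(adj[v].values()) for v in nodes}
    q = 0.0
    for v, w in product(nodes, nodes):
        if community[v] == community[w]:  # Kronecker delta term
            q += adj[v].get(w, 0.0) - degree[v] * degree[w] / two_m
    return q / two_m
```

A Louvain-style algorithm (Blondel et al., 2008) greedily moves nodes between communities whenever the move yields a positive change in this quantity.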

While the general problem of precisely partitioning the graph into highly dense communities that optimize the modularity is computationally prohibitive (Brandes et al., 2008), many heuristic algorithms have been proposed with reasonable results. To extract communities from the graph of reference spans, we use the algorithm proposed by Blondel et al. (2008), which is a simple yet accurate and efficient community detection algorithm. Specifically, communities are built in a hierarchical fashion. At first, each node belongs to a separate community. Then nodes are assigned to new communities if there is a positive gain in modularity. This process is applied iteratively until no further improvement in modularity is possible.

[1] http://metamap.nlm.nih.gov/
[2] Unified Medical Language System - a compendium of controlled vocabularies in the biomedical sciences, http://www.nlm.nih.gov/research/umls
[3] http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html

Discourse model - Each scientific article follows a natural discourse model. In this method, instead of finding communities to capture different important aspects of the paper, we try to select reference spans based on the discourse model of the paper. The discourse model comprises the following facets: "hypothesis", "method", "results", "implication", "discussion" and "data-set-used". The goal is to ideally include reference spans from each of these discourse facets of the article in the summary, to correctly capture all aspects of the article. We use a one-vs-rest SVM supervised model with a linear kernel to classify the reference spans into their respective discourse facets. Training was done on both the citations and the reference spans, since empirical evaluation showed marginal improvements upon including the reference spans in addition to the citation itself. We use unigram and verb features with tf-idf weighting to train the classifier.

3.3 Ranking model

To identify the most representative sentences of each group, we require a measure of sentence importance. We consider the sentences in a group as a graph and rank nodes based on their importance: an important node is a node that has many connections with other nodes. There are various ways of measuring the centrality of nodes, such as node degree, betweenness, closeness and eigenvectors. Here, we opt for eigenvectors, and we find the most central sentences in each group by using the "power method" (Erkan and Radev, 2004), which iteratively updates the eigenvector until convergence.
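A minimal sketch of this ranking step: the pairwise similarity matrix of a group's sentences is row-normalized into a transition matrix M, and p = M^T p is iterated until p stabilizes at the principal eigenvector (the stationary distribution), whose entries serve as centrality scores. The similarity values below are illustrative:

```python
def power_method(sim, iterations=200, tol=1e-12):
    """Eigenvector centrality of sentences via power iteration.

    sim: symmetric matrix (list of lists) of pairwise sentence similarities.
    Each row is normalized to sum to 1; all-zero rows fall back to uniform.
    """
    n = len(sim)
    rows = []
    for row in sim:
        s = sum(row)
        rows.append([x / s for x in row] if s else [1.0 / n] * n)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        new_p = [sum(rows[i][j] * p[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p
```

For a connected, non-bipartite similarity graph this converges to scores proportional to each sentence's total similarity mass, so the highest-scoring sentence is the group's most central one.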



3.4 Selecting the sentences for the final summary

After scoring and ranking the sentences in each group, identified either by the discourse model or by the community detection algorithm, we employ two strategies for generating the summary within the summary length threshold.

• Iterative: We select top sentences iteratively from each group until we reach the summary length threshold. That is, we first pick the top sentence from all groups and, if the threshold is not met, we select the second sentence, and so forth. In the discourse-based method, the following ordering is used for selecting sentences from groups: "hypothesis", "method", "results", "implication" and "discussion". In the community detection method, no pre-determined order is specified.

• Novelty: We employ a greedy strategy similar to MMR (Carbonell and Goldstein, 1998), in which sentences from each group are selected based on the following scoring formula:

score(S) \stackrel{def}{=} \lambda \, Sim_1(S, D) - (1 - \lambda) \, Sim_2(S, \text{Summary})

where, for each sentence S, the score is a linear interpolation of the similarity of the sentence with all other sentences (Sim_1) and the similarity of the sentence with the sentences already in the summary (Sim_2), and \lambda is a constant. We empirically set \lambda = 0.7 and selected the top 3 central sentences from each group as the candidates for the final summary.
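The novelty strategy can be sketched as follows, under the assumption (as in classic MMR) that Sim_2 against the growing summary is taken as the maximum similarity to any already-selected sentence; sentence indices and similarity values are illustrative:

```python
def select_novel(candidates, sim_to_doc, sim, lam=0.7, k=3):
    """Greedily pick up to k sentences, trading informativeness (Sim1,
    here sim_to_doc) against redundancy with the partial summary (Sim2,
    here the max similarity to any already-selected sentence)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * sim_to_doc[i] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With lam = 0.7 the selection leans towards informativeness but still skips a highly informative sentence when it nearly duplicates one already chosen.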

4 Experimental setup

4.1 Data

We used the TAC2014 biomedical summarization dataset for the evaluation of our proposed method. The TAC2014 benchmark contains 20 topics, each of which consists of one reference article and several articles that have citations to that reference article (statistics of the dataset are shown in Table 1). All articles are biomedical papers published by Elsevier. For each topic, 4 experts in the biomedical domain have written a scientific summary of length not exceeding 250 words for the reference article. The data also contains annotated citation texts as well as the discourse facets; the latter were used to build the supervised discourse model. The distribution of discourse facets is shown in Table 2.

Table 1: Dataset statistics

                                                                 mean      std
# of topics (reference articles)                                 20        0
# of gold summaries for each topic                               4         0
# of citing articles in each topic                               15.65     2.70
# of citations to the reference article in each citing article   1.57      1.17
Length of summaries (words)                                      235.64    31.24
Length of articles (words)                                       9759.86   2199.48

Table 2: Distribution of annotated discourse facets

Discourse facet   count
Hypothesis        21
Method            155
Results           490
Implication       140
Discussion        446

4.2 Baselines

We compared existing well-known and widely-used approaches discussed in section 2 with our approach and evaluated their effectiveness for scientific summarization. The first three approaches use the scientific article's text, and the last approach uses the citations to the article for generating the summary.

• LSA (Steinberger and Jezek, 2004) - The LSA summarization method is based on singular value decomposition. In this method, a term-document index A is created in which the values correspond to the tf-idf values of terms in the document. Then Singular Value Decomposition, a dimension reduction approach, is applied to A. This yields a singular value matrix Σ and a singular vector matrix V^T. The top singular vectors are selected from V^T iteratively until the length of the summary reaches a predefined threshold.

• LexRank (Erkan and Radev, 2004) - LexRank uses a measure called centrality to find the most representative sentences in given sets of sentences. It finds the most central sentences by updating the score of each sentence using an algorithm based on the PageRank random walk ranking model (Page et al., 1999). More specifically, the centrality score of each sentence is represented by a centrality vector p which is updated iteratively through the following equation, using a method called the "power method":

p = A^T p

where the matrix A is based on the similarity matrix B of the sentences:

A = [d U + (1 - d) B]



in which U is a square matrix with all values equal to 1/N, and d is a parameter called the damping factor. We set d to 0.1, which is the default suggested value.

• MMR (Carbonell and Goldstein, 1998) - In Maximal Marginal Relevance (MMR), sentences are greedily ranked according to a score based on their relevance to the document and the amount of redundant information they carry. It scores sentences based on the maximization of a linear interpolation of relevance to the document and diversity:

MMR(S, D) \stackrel{def}{=} \lambda \, Sim_1(S, D) - (1 - \lambda) \, Sim_2(S, \text{Summary})

where S is the sentence being evaluated, D is the document being summarized, Sim_1 and Sim_2 are similarity functions, Summary is the summary formed by the previously selected sentences, and \lambda is a parameter. We used cosine similarity as the similarity function and set \lambda to 0.3, 0.5 and 0.7 to observe the effect of informativeness vs. novelty.

• Citation summary (Qazvinian and Radev, 2008) -

In this approach, a network of citations is built, and citations are clustered to maximize purity (Zhao and Karypis, 2001) and mutual information. These clusters are then used to generate the final summary by selecting the top central sentences from each cluster in a round-robin fashion. Our approach is similar to this work in that they also use centrality scores on citation network clusters. Since they only focus on citations, comparison of our approach with this work gives better insight into how beneficial our use of citation-context and the article's discourse model can be in generating scientific summaries.

5 Results and discussions

5.1 Evaluation metrics

We use the ROUGE evaluation metrics, which have shown consistent correlation with manually evaluated summarization scores (Lin, 2004). More specifically, we use ROUGE-L, ROUGE-1 and ROUGE-2 to evaluate and compare the quality of the summaries generated by our system. While ROUGE-N focuses on n-gram overlaps, ROUGE-L uses the longest common subsequence to measure the quality of the summary. ROUGE-N, where N is the n-gram order, is defined as follows:

\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Gold summaries}\}} \sum_{W \in S} f_{match}(W)}{\sum_{S \in \{\text{Gold summaries}\}} \sum_{W \in S} f(W)}

where W is an n-gram, f(\cdot) is the count function, and f_{match}(\cdot) is the maximum number of n-grams co-occurring in the generated summary and in the set of gold summaries. For a candidate summary C with n words and a gold summary S with u sentences, ROUGE-L is defined as follows:

\text{ROUGE-L}_{rec} = \frac{\sum_{i=1}^{u} LCS_{\cup}(r_i, C)}{\sum_{i=1}^{u} |r_i|}

\text{ROUGE-L}_{prec} = \frac{\sum_{i=1}^{u} LCS_{\cup}(r_i, C)}{n}

where LCS_{\cup}(\cdot, \cdot) is the longest common subsequence (LCS) score of the union of LCSs between gold sentence r_i and the candidate summary C. The ROUGE-L F-score is the harmonic mean of precision and recall.
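The two metrics can be sketched directly from these definitions. The snippet below is a simplified, single-candidate illustration: the official ROUGE toolkit additionally applies options such as stemming and jackknifing, and ROUGE-L uses the union-LCS LCS_∪ rather than the plain LCS building block computed here:

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, gold_summaries, n=1):
    """ROUGE-N: n-gram matches (clipped by the candidate's counts) over the
    total number of n-grams in the gold summaries."""
    cand = ngram_counts(candidate.lower().split(), n)
    match = total = 0
    for gold in gold_summaries:
        g = ngram_counts(gold.lower().split(), n)
        match += sum(min(c, cand[ng]) for ng, c in g.items())
        total += sum(g.values())
    return match / total if total else 0.0

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists
    (the plain-LCS building block of ROUGE-L)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

For example, the candidate "the cat sat" scores 2/3 in ROUGE-1 against the gold "the cat ran", since two of the three gold unigrams are matched.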

5.2 Comparison between summarizers

We generated two sets of summaries using the methods and baselines described in the previous sections. We consider short summaries of length 100 words and longer summaries of length 250 words (which corresponds to the length threshold of the gold summaries). We also considered the oracle's performance by averaging the ROUGE scores of all human summaries, calculated by evaluating each human summary against the others in each topic. For the 100-word summaries, since we did not have gold summaries of that length, we considered the first 100 words of each gold summary. Figure 2 shows box-and-whisker plots of the ROUGE scores. For each metric, the scores of each summarizer in comparison with the baselines are shown for 100-word and 250-word summaries. The citation-contexts for all the methods were identified by the citation text vector method, which uses the citation text except for numeric values, stop words and citation markers (the first method in section 3.1). In section 5.3, we analyze the effect of the various citation-context extraction methods that we discussed in section 3



Figure 2: ROUGE-1, ROUGE-2 and ROUGE-L scores for different summarization approaches. The chartreuse (yellowish green) box shows the oracle, green boxes show the proposed summarizers and blue boxes show the baselines. From left: Oracle; Citation-Context-Comm-It: community detection on citation-context followed by iterative selection; Citation-Context-Community-Div: community detection on citation-context followed by relevance and diversification in sentence selection; Citation-Context-Discourse-Div: discourse model on citation-context followed by relevance and diversification; Citation-Context-Discourse-It: discourse model on citation-context followed by iterative selection; Citation Summ.: citation summary; MMR 0.3: maximal marginal relevance with λ = 0.3.



on the final summary. The name of each of our methods is shortened by the following convention: [summarization approach]-[sentence selection strategy]. The summarization approach is based either on community detection (Citation-Context-Comm) or on the discourse model of the article (Citation-Context-Disc), and the sentence selection strategy can be iterative (It) or by relevance and diversification (Div).

We can clearly observe that our proposed methods achieve encouraging results in comparison with the existing baselines. Specifically, for 100-word short summaries the discourse-based method (with 34.6% mean ROUGE-L improvement over the best baseline), and for 250-word summaries the community-based method (with 3.5% mean ROUGE-L improvement over the best baseline), are the best performing methods. We observe relative consistency between the different ROUGE scores for each summarization approach. Grouping citation-contexts based on the discourse structure and based on the communities shows comparable results. The community detection approach is thus effectively able to identify diverse aspects of the article. The discourse model of the scientific article is also able to diversify the selection of citation-contexts for the final summary. These results confirm our hypotheses that using the citation-context along with the discourse model of scientific articles can help produce better summaries.

Comparing the performance of the methods on individual topics showed that the citation-context methods outperform all other methods on most of the topics (65% of all topics).

While the discourse approach shows encouraging results, we attribute its limitation in achieving higher ROUGE scores to the classification errors that we observed in the intrinsic classification evaluation. In evaluating the performance of several classifiers, linear SVM achieved the highest performance, with an accuracy of 0.788 against human annotations. Many of the citations cannot be assigned to exactly one of the discourse facets of the paper, and thus some classification errors are inevitable. This is also observable in the disagreements between the annotators in labeling, as reported by Cohan et al. (2014). This fact influences the diversification and, ultimately, the summarization quality.

Among the baseline summarization approaches, LexRank performs relatively well; for short summaries, its performance is the best among the baselines. This is expected, since LexRank tries to find the most central sentences. When the summary is short, its main idea is usually captured by finding the most representative sentence, which LexRank can effectively do. However, the sentences that it chooses are usually about the same topic; hence the diversity present in the gold summaries is not captured. This becomes more visible in the 250-word summaries. Our discourse-based method overcomes this problem by including important content from diverse discourse facets (34.6% mean ROUGE-L improvement for 100-word summaries and 13.9% improvement for 250-word summaries). The community-based approach achieves the same diversification effect in an unsupervised fashion by forming citation-context communities (27.16% mean ROUGE-L improvement for 100-word summaries and 14.9% improvement for 250-word summaries).
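The centrality computation behind the LexRank baseline (Erkan and Radev, 2004) amounts to power iteration over a row-normalized sentence-similarity matrix. A minimal sketch follows; the toy similarity matrix and damping factor are illustrative assumptions, not values from our experiments.

```python
def lexrank(sim, d=0.85, iters=50):
    """Power iteration over a row-normalized sentence-similarity matrix,
    in the spirit of LexRank; returns a centrality score per sentence."""
    n = len(sim)
    norm = []
    for row in sim:
        s = sum(row)
        norm.append([v / s for v in row] if s else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - d) / n + d * sum(scores[j] * norm[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores

# Toy 3-sentence example: sentence 0 is strongly similar to both others,
# so it is the most central one.
sim = [
    [0.0, 0.8, 0.7],
    [0.8, 0.0, 0.1],
    [0.7, 0.1, 0.0],
]
scores = lexrank(sim)
```

Selecting only the highest-scoring sentences tends to pick near-duplicates about the same topic, which is exactly the diversity limitation discussed above.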

The citation-based summarization baseline has average performance among the baseline methods, confirming that relying only on the citations is not optimal for scientific summarization. While the LSA approach performs relatively well, we observe lower scores for all variations of the MMR approach. We attribute the low performance of MMR to its suboptimal greedy selection of sentences from relatively long scientific articles.

By comparing the two sentence selection approaches (i.e., iterative and diversification-relevance), we observe that, while for shorter summaries the diversification-based method performs better, for longer summaries the results of the two methods are comparable. This is because, when the length threshold is smaller, the iterative approach may fail to select the best representative sentences from all the groups: it essentially selects one sentence from each group until the length threshold is met, and consequently misses some aspects. The diversification method, in contrast, selects sentences that maximize the gain in informativeness while also contributing to the novelty of the summary. In longer summaries, due to the larger threshold, the iterative approach is able to select the top sentences from each group, enabling it to reflect different aspects of the paper; it therefore performs comparably to the diversification approach. This outcome is expected because the number of groups is small: the discourse method has 5 discourse facets, and the community method detects 5.2 communities on average, so iterative selection can cover most of these groups within a 250-word summary.

Figure 3: Comparison of the effect of different citation-context extraction methods on the quality of the final summary.
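The two selection strategies can be sketched as follows; the Jaccard overlap measure, relevance scores, and word budget are illustrative assumptions, not our exact implementation.

```python
def overlap(a, b):
    """Jaccard word overlap between two sentences (a crude similarity proxy)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def iterative_select(groups, scores, budget):
    """Iterative strategy: round-robin over the groups, taking the
    top-scoring unused sentence from each until the word budget is met."""
    summary, used = [], 0
    pools = [sorted(g, key=scores.get, reverse=True) for g in groups]
    progressed = True
    while progressed:
        progressed = False
        for pool in pools:
            while pool:
                s = pool.pop(0)
                if used + len(s.split()) <= budget:
                    summary.append(s)
                    used += len(s.split())
                    progressed = True
                    break
    return summary

def diversified_select(cands, scores, budget, lam=0.5):
    """Diversification strategy: MMR-style greedy choice balancing relevance
    against novelty w.r.t. the summary so far (Carbonell and Goldstein, 1998)."""
    chosen, remaining, used = [], list(cands), 0
    while remaining:
        best = max(remaining,
                   key=lambda s: lam * scores[s]
                   - (1 - lam) * max((overlap(s, c) for c in chosen),
                                     default=0.0))
        remaining.remove(best)
        if used + len(best.split()) <= budget:
            chosen.append(best)
            used += len(best.split())
    return chosen

cands = ["method improves rouge scores",
         "method improves rouge results",
         "dataset covers biomedical domain"]
scores = {cands[0]: 0.9, cands[1]: 0.85, cands[2]: 0.5}
groups = [[cands[0], cands[1]], [cands[2]]]
# Under an 8-word budget, both strategies skip the near-duplicate second
# sentence: the iterative one because it moves on to the next group, the
# diversified one because the novelty penalty outweighs its relevance.
```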

5.3 Analysis of strategies for citation-context extraction

Figure 3 shows ROUGE-L results for 250-word summaries using the different citation-context extraction approaches described in Section 3.1. All the approaches achieve relatively comparable performance; using the citation text alone for extracting the context is almost as effective as the other methods. The keywords approach, which uses terms with high idf values to locate the context, achieves slightly higher ROUGE-L precision but has the lowest recall. This is expected, since the keywords approach chooses only informative terms for extracting citation-contexts, and thus misses terms that may not be keywords by themselves but help convey meaning. Noun phrases have the highest mean F-score, suggesting that noun phrases are good indicators of important concepts in scientific text; we attribute their high recall to the fact that most important concepts are captured by selecting only noun phrases. Interestingly, introducing biomedical concepts and expanding the citation vector with related concepts does not improve the performance: this approach achieves relatively higher recall but lower mean precision. While capturing domain concepts along with noun phrases helps improve the performance, adding related concepts to the citation vector causes drift from the original context as expressed in the reference article, and therefore some decline in performance is incurred.
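The keywords strategy can be sketched as ranking the citation's terms by idf over a background collection and keeping the highest-idf terms to locate the cited context. In this sketch, the add-one smoothing, the cutoff `top_k`, and the toy collection are illustrative assumptions, not the paper's exact settings.

```python
import math
from collections import Counter

def idf_keywords(background_docs, citation, top_k=3):
    """Rank the citation's terms by smoothed idf over a background
    collection and keep the top terms as context-locating keywords."""
    n = len(background_docs)
    df = Counter()
    for doc in background_docs:
        df.update(set(doc.lower().split()))
    idf = {t: math.log((1 + n) / (1 + df[t]))
           for t in set(citation.lower().split())}
    return sorted(idf, key=idf.get, reverse=True)[:top_k]

docs = ["the method was evaluated on the dataset",
        "the results of the method were reported",
        "the baseline used the same dataset"]
keywords = idf_keywords(docs, "the proposed method improves summarization")
# Rare, informative terms win; frequent terms like "the" are ranked last.
```

As discussed above, such a keyword filter is precise but drops function-like terms that carry meaning in context, which explains its low recall.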

6 Conclusion

We proposed a pipeline approach for summarization of scientific articles which takes advantage of the article's inherent discourse model and citation-contexts extracted from the reference article.¹ Our approach addresses the lack of context in existing citation-based summarization approaches. We achieved improvements over several well-known summarization approaches on the TAC2014 biomedical summarization dataset: in all cases we improved over the baselines, and in some cases we obtained greater than 30% improvement in mean ROUGE scores over the best performing baseline. While the dataset we use for evaluation is in the biomedical domain, most of our approaches are general and therefore adaptable to other scientific domains.

Acknowledgments

The authors would like to thank the three anonymous reviewers for their valuable feedback and comments. This research was partially supported by the National Science Foundation (NSF) under grant CNS-1204347.

¹ Code can be found at: https://github.com/acohan/scientific-summ


References

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 481–490. Association for Computational Linguistics.

Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.

Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Gorke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. 2008. On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2):172–188.

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, pages 335–336. ACM.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2010. A hybrid hierarchical model for multi-document summarization. In ACL, pages 815–824. Association for Computational Linguistics.

Asli Celikyilmaz and Dilek Hakkani-Tur. 2011. Discovery of topically coherent sentences for extractive summarization. In ACL: HLT, Volume 1, pages 491–499. Association for Computational Linguistics.

Yllias Chali and Sadid A. Hasan. 2012. Query-focused multi-document summarization: Automatic data annotations and supervised learning approaches. Nat. Lang. Eng., 18(1):109–145, January.

James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: an integer linear programming approach. J. Artif. Int. Res., 31(1):399–429, March.

Arman Cohan, Luca Soldaini, and Nazli Goharian. 2014. Towards citation-based summarization of biomedical literature. In Proceedings of the Text Analysis Conference (TAC '14).

Arman Cohan, Luca Soldaini, and Nazli Goharian. 2015. Matching citation text and cited spans in biomedical literature: a search-oriented approach. In Proceedings of the 2015 NAACL-HLT, pages 1042–1048. Association for Computational Linguistics.

John M. Conroy, Judith D. Schlesinger, Jeff Kubina, Peter A. Rankel, and Dianne P. O'Leary. 2011. CLASSY 2011 at TAC: Guided and multi-lingual summaries and evaluation metrics. In Proceedings of the Text Analysis Conference.

Anita De Waard and Henk Pander Maat. 2012. Epistemic modality and knowledge attribution in scientific discourse: A taxonomy of types and overview of features. In Proceedings of the Workshop on Detecting Structure in Scholarly Discourse, pages 47–55. Association for Computational Linguistics.

Aaron Elkiss, Siwei Shen, Anthony Fader, Gunes Erkan, David States, and Dragomir Radev. 2008. Blind men and elephants: What do citation summaries tell us about a research article? Journal of the American Society for Information Science and Technology, 59(1):51–62.

Gunes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res. (JAIR), 22(1):457–479.

Michel Galley. 2006. A skip-chain conditional random field for ranking meeting utterances by importance. In EMNLP, pages 364–372. Association for Computational Linguistics.

Yihong Gong and Xin Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 19–25. ACM.

Shengbo Guo and Scott Sanner. 2010. Probabilistic latent maximal marginal relevance. In SIGIR, pages 833–834. ACM.

Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In NAACL-HLT, pages 362–370. Association for Computational Linguistics.

Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda, and Masaaki Nagata. 2013. Single-document summarization as a tree knapsack problem. In EMNLP, pages 1515–1520.

Yuta Kikuchi, Tsutomu Hirao, Hiroya Takamura, Manabu Okumura, and Masaaki Nagata. 2014. Single document summarization based on nested tree structure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 2, pages 315–320.

Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.

Ju-Hong Lee, Sun Park, Chan-Min Ahn, and Daeho Kim. 2009. Automatic generic document summarization based on non-negative matrix factorization. Information Processing & Management, 45(1):20–34.

Jiwei Li and Sujian Li. 2014. A novel feature-based bayesian model for query focused multi-document summarization. Transactions of the Association for Computational Linguistics, 1:89–98.

Jimmy Lin, Nitin Madnani, and Bonnie J. Dorr. 2010. Putting the user in the loop: interactive maximal marginal relevance for query-focused summarization. In NAACL-HLT, pages 305–308. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81.

Annie Louis, Aravind Joshi, and Ani Nenkova. 2010. Discourse indicators for content selection in summarization. In Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 147–156. Association for Computational Linguistics.

Tengfei Ma and Hiroshi Nakagawa. 2013. Automatically determining a proper length for multi-document summarization: A bayesian nonparametric approach. In EMNLP, pages 736–746. Association for Computational Linguistics.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. Association for Computational Linguistics.

Mark E. J. Newman. 2006. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582.

M. E. J. Newman. 2012. Communities, modules and large-scale structure in networks. Nature Physics, 8(1):25–31.

Miles Osborne. 2002. Using maximum entropy for sentence extraction. In Proceedings of the ACL-02 Workshop on Automatic Summarization, Volume 4, pages 1–8. Association for Computational Linguistics.

Makbule Gulcin Ozsoy, Ilyas Cicekli, and Ferda Nur Alpaslan. 2010. Text summarization of Turkish texts using latent semantic analysis. In COLING, pages 869–876. Association for Computational Linguistics.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web.

Michael Paul, ChengXiang Zhai, and Roxana Girju. 2010. Summarizing contrastive viewpoints in opinionated text. In EMNLP, pages 66–76. Association for Computational Linguistics.

Vahed Qazvinian and Dragomir R. Radev. 2008. Scientific paper summarization using citation summary networks. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pages 689–696. Association for Computational Linguistics.

Vahed Qazvinian, Dragomir R. Radev, Saif Mohammad, Bonnie J. Dorr, David M. Zajic, Michael Whidby, and Taesun Moon. 2013. Generating extractive summaries of scientific paradigms. J. Artif. Intell. Res. (JAIR), 46:165–201.

Cody Rioux, Sadid A. Hasan, and Yllias Chali. 2014. Fear the REAPER: A system for automatic multi-document summarization with reinforcement learning. In EMNLP, pages 681–690. Association for Computational Linguistics.

Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised modeling of Twitter conversations. In NAACL-HLT, HLT '10, pages 172–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, and Zheng Chen. 2007. Document summarization using conditional random fields. In IJCAI, volume 7, pages 2862–2867.

Josef Steinberger and Karel Jezek. 2004. Using latent semantic analysis in text summarization and summary evaluation. In Proc. ISIM '04, pages 93–100.

Josef Steinberger, Mijail A. Kabadjov, Massimo Poesio, and Olivia Sanchez-Graillet. 2005. Improving LSA-based summarization with anaphora resolution. In EMNLP-HLT, pages 1–8. Association for Computational Linguistics.

Simone Teufel and Marc Moens. 2002. Summarizing scientific articles: Experiments with relevance and rhetorical status. Comput. Linguist., 28(4):409–445, December.

Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing & Management, 43(6):1606–1618.

Kristian Woodsend and Mirella Lapata. 2012. Multiple aspect summarization using integer linear programming. In EMNLP, pages 233–243. Association for Computational Linguistics.

Shasha Xie and Yang Liu. 2010. Improving supervised learning for meeting summarization using sampling and regression. Computer Speech & Language, 24(3):495–514. Emergent Artificial Intelligence Approaches for Pattern Recognition in Speech and Language Processing.

Yasuhisa Yoshida, Jun Suzuki, Tsutomu Hirao, and Masaaki Nagata. 2014. Dependency-based discourse parser for single-document summarization. In EMNLP, pages 1834–1839, Doha, Qatar, October. Association for Computational Linguistics.

Ying Zhao and George Karypis. 2001. Criterion functions for document clustering: Experiments and analysis. Technical report, Citeseer.
