Calculating LLR Topic Signatures with Dependency Relations for Automatic Text Summarization
Prescott P. Klassen
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science
University of Washington
2012
Program Authorized to Offer Degree: Computational Linguistics
University of Washington
Abstract
Calculating LLR Topic Signatures with Dependency Relations for Automatic Text Summarization
Prescott P. Klassen
Chair of the Supervisory Committee: Professor Fei Xia
Linguistics
Topic Signatures based on Log Likelihood Ratio (LLR) values have been a staple of Au-
tomatic Text Summarization since originally proposed over a decade ago. In my thesis
I propose an alternate method for counting information units and calculating the fore-
ground and background probabilities for LLR calculations based on the participation
of an information unit in dependency relations generated from a sentence rather than
the sentence itself. I develop a generic text summarization system based on the Text
Analysis Conference shared task guidelines and data in order to compare the proposed
method of counting with the standard approach in the context of an applied task. Each
counting method and unit of information definition is run as an experiment on TAC
2010 and TAC 2011 topic-based document collections and evaluated against human
model summaries using the ROUGE statistical measure of n-gram overlap. Although the
results of the experiments are inconclusive, the topic signatures generated by the two
approaches are different in the information units they contain. I conclude that an al-
ternate evaluation framework and a semi-abstractive approach leveraging dependency
relations themselves for summary generation are possible areas for future work and
research.
TABLE OF CONTENTS

List of Figures
List of Tables

Chapter 1: Introduction

Chapter 2: Literature Review
2.1 Overview of Summarization Systems
    2.1.1 Single or Multiple Document
    2.1.2 Extractive or Abstractive
    2.1.3 Generic or Query-focused
    2.1.4 General or Domain Specific
    2.1.5 Initial or Update
2.2 Early Extractive Systems
2.3 Frequency-based Approaches to Sentence Extraction
    2.3.1 Word Probability
    2.3.2 Term Frequency/Inverse Document Frequency
    2.3.3 Log Likelihood Ratio
2.4 Abstractive approaches to text summarization
    2.4.1 SUMMONS
    2.4.2 RALI-DIRO at TAC 2010
    2.4.3 Human EXtraction for TAC: HexTac
2.5 Summarization Shared Tasks
    2.5.1 Document Understanding Conference
    2.5.2 Text Analysis Conference
2.6 DUC/TAC Evaluation Frameworks
    2.6.1 Pyramid
    2.6.2 Readability/Fluency and Responsiveness
    2.6.3 Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
    2.6.4 Basic Elements (BE)
2.7 Natural Language Processing Software Libraries

Chapter 3: Methodology
3.1 Topic Signatures
3.2 Calculating LLR with Dependency Relations
3.3 TAC 2011 Guided Summarization Task
    3.3.1 TAC Cycle
3.4 Development and Testing Data
3.5 Evaluation
3.6 System Design
    3.6.1 Pre-processing
    3.6.2 Stanford coreNLP Annotation Pipeline
    3.6.3 Information Unit Extraction
    3.6.4 LLR Calculation
    3.6.5 Sentence Selection and Ranking
    3.6.6 Summary Generation

Chapter 4: Implementation
4.1 Systems Overview
4.2 System I
    4.2.1 System I Evaluation
4.3 System II
    4.3.1 System II Evaluation
4.4 System III
    4.4.1 System III Evaluation
4.5 Comparison of Systems
4.6 Conclusion

Chapter 5: Experiments and Results
5.1 Experiment Design
5.2 TAC 2010 ROUGE Average F-measure Results
5.3 TAC 2011 ROUGE Average F-measure Results

Chapter 6: Conclusion and Future Work
6.1 Conclusion
6.2 Future Work

Appendix A
A.1 TAC 2010 and 2011 Experiments: Example Summaries and Topic Signatures
    A.1.1 TAC 2010 Summary A Experiments
    A.1.2 TAC 2010 Summary B Experiments
    A.1.3 TAC 2011 Summary A Experiments
    A.1.4 TAC 2011 Summary B Experiments
LIST OF FIGURES

2.1 Basic Summarization Process
3.1 System Program Flow
LIST OF TABLES

2.1 Sample aspects from TAC annual shared-tasks
3.1 Comparison of simple sentence-based probabilities calculated by word count and by participation in dependency role count
3.2 Text Analysis Conference 2011 Schedule
3.3 TAC Data
4.1 System I: Application components
4.2 System I Evaluation
4.3 System II: Application Components
4.4 System II evaluation
4.5 System III evaluation
4.6 System Comparison: TAC 2010 Summary A/B ROUGE average F-measure
4.7 System Comparison: TAC 2011 Summary A/B ROUGE average F-measure
5.1 Description of Experiments
5.2 Experiment Results: TAC 2010 Summary A ROUGE Average F-measures
5.3 Experiment Results: TAC 2010 Summary B ROUGE Average F-measures
5.4 Experiment Results: TAC 2011 Summary A ROUGE Average F-measures
5.5 Experiment Results: TAC 2011 Summary B ROUGE Average F-measures
A.1 Description of Experiments
A.2 TAC 2010 summary D1024F-A: best performing unit of information definition
A.3 TAC 2010 summary D1024F-A: run 79 topic signature (top 20)
A.4 TAC 2010 summary D1024F-A: run 80 topic signature (top 20)
A.5 TAC 2010 summary D1024F-A: run 81 topic signature (top 20)
A.6 Comparison of results for all TAC 2010 D1024F-A Summaries
A.7 TAC 2010 summary D1023E-A: worst performing unit of information definition
A.8 TAC 2010 summary D1023E-A: run 82 topic signature (top 20)
A.9 Comparison of results for all TAC 2010 D1023E-A Summaries
A.10 TAC 2010 summary D1002A-B: best performing unit of information definition
A.11 TAC 2010 summary D1002A-B: run 77 topic signature (top 20)
A.12 Comparison of results for all TAC 2010 D1002A-B Summaries
A.13 TAC 2010 summary D1030F-B: worst performing unit of information definition
A.14 TAC 2010 summary D1030F-B: run 78 topic signature (top 20)
A.15 Comparison of results for all TAC 2010 D1030F-B Summaries
A.16 TAC 2011 summary D1126E-A: best performing unit of information definition
A.17 TAC 2011 summary D1126E-A: run 84 topic signature (top 20)
A.18 Comparison of results for all TAC 2011 D1126E-A Summaries
A.19 TAC 2011 summary D1117C-A: worst performing unit of information definition
A.20 TAC 2011 summary D1117C-A: run 80 topic signature (top 20)
A.21 TAC 2011 summary D1117C-A: run 82 topic signature (top 20)
A.22 TAC 2011 summary D1117C-A: run 84 topic signature (top 20)
A.23 Comparison of results for all TAC 2011 D1117C-A Summaries
A.24 TAC 2011 summary D1120D-B: best performing unit of information definition
A.25 TAC 2011 summary D1120D-B: run 79 topic signature (top 20)
A.26 TAC 2011 summary D1120D-B: run 81 topic signature (top 20)
A.27 Comparison of results for all TAC 2011 D1120D-B Summaries
A.28 TAC 2011 summary D1112C-B: worst performing unit of information definition
A.29 TAC 2011 summary D1112C-B: run 77 topic signature (top 20)
ACKNOWLEDGMENTS
I would like to thank Dr. Fei Xia and Dr. Scott Farrar for their advice, patience,
and encouragement throughout the process of researching and writing my thesis. I
would also like to acknowledge David Brodbeck, our system administrator, for always
being available to help with systems and software issues, Joyce Parvi and Mike Furr
for all their help with administrative tasks, and Dr. Emily Bender for designing a
degree program that inspired me to make the life-changing transition from industry to
academia.
Chapter 1
INTRODUCTION
For over 50 years, computers have been used to automatically generate summaries of
text documents. One of the earliest systems, developed in 1958 by H.P. Luhn at IBM to
improve the quality of document abstracts, employed a statistical approach to the gen-
eration of summaries for scientific and technical journal articles (Luhn, 1958). It was
the first system to use the frequency of a word in a document as a measure of its impor-
tance as a descriptor of the document’s overall topic. Luhn demonstrated that words
within a high/low threshold could be used to rank the overall descriptiveness of sen-
tences in a document. Top-ranking sentences could then be automatically assembled
into a summary of the document. Luhn’s early approach established the fundamental
tasks of automatic extractive text summarization. Many new statistical approaches
for ranking and selecting sentences have been implemented in recent years, as well
as complex sentence editing and “smoothing” techniques to improve the readability of
assembled sentences, but at their core, the majority of systems extract sentences from
one or more documents to automatically create summaries.
An example of a more complex state-of-the-art, yet fundamentally extractive sys-
tem, is the Clustering, Linguistics, and Statistics for Summarization Yield (CLASSY)
system developed by the Institute for Defense Analysis (IDA)/Center for Computing
Sciences. The CLASSY system has been an annual participant in the Summariza-
tion track of the Text Analysis Conference (TAC) sponsored by the National Institute
of Standards and Technology (NIST) and its previous annual workshop, the Docu-
ment Understanding Conference (DUC). It is an example of an automatic text summa-
rization system that has been incrementally improved over a period of ten years. Each year
it has performed either at the top or close to the top of all participating automated
summarization systems submitted to the annual DUC/TAC evaluations. CLASSY is
a multi-component solution that can be decomposed into seven different logical mod-
ules: (1) Complex data preparation using corpus-specific techniques, (2) Query term
selection and expansion, (3) Signature term selection using LLR and a significantly
large background corpus, (4) Sentence scoring using an approximate oracle, (5) Pro-
jection of term-sentence matrices against the base summary to reduce redundancy for
update summaries, (6) Redundancy removal and sentence selection via LSI/L1-QR al-
gorithm followed by an Integer Linear Program (ILP), and (7) Sentence ordering using
an approximate Traveling Salesperson Program (TSP).
In my thesis I compare two methods of calculating Log Likelihood Ratio (LLR) for
ranking and selecting sentences for extraction within the framework of a generic ex-
tractive text summarization system. Both approaches rely on topic signatures (Lin and
Hovy, 2000) to summarize a multi-document corpus. A topic signature is made up of
units of information that are considered statistically more likely to occur within a set of
documents about the same topic than in a larger, more general set of documents. The
statistical measure used to determine topic signatures is LLR (Dunning, 1993) which
is also referred to as G2 in statistical literature (Moore, 2004). In text summarization,
a topic signature is used to weight units of information in sentences, sentences are
ranked based on the aggregate score of the weighted units of information they contain,
and the best sentences are extracted to form a summary. In the original application of
topic signatures, units of information were defined simply as words.
The two approaches that I compare differ in the method of counting units of infor-
mation in the calculation of an LLR. The first follows closely the approach described
by (Nenkova and McKeown, 2011) where foreground and background counts of units
of information are calculated by counting units as they occur in the sentences of the
documents in the corpus. The second is a novel approach I propose that counts units of
information as they occur in sentence-based dependency relations rather than in the
sentences themselves. The sentence-based dependency relations are created by the
Stanford coreNLP dependency parser and are represented in collapsed and propagated
form (Marneffe et al., 2006). I compare and contrast the two methods for counting us-
ing multiple definitions of information units including words, lemmatized case-neutral
words, lemmatized case-neutral words combined with a part-of-speech tag, and lem-
matized case-neutral words combined with a generalized part-of-speech tag restricted
to nouns, verbs, and adjectives.
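To make the two counting schemes concrete, the sketch below (Python, illustrative only) contrasts them on hand-built data. The token lists, dependency triples, and function names are hypothetical; in the actual systems the relations come from the Stanford coreNLP collapsed and propagated dependency parses.

from collections import Counter

def sentence_based_counts(sentences):
    # Standard counting: a unit (here a lowercased token) is counted once for
    # every occurrence in a sentence.
    counts = Counter()
    for tokens in sentences:                     # each sentence is a token list
        counts.update(t.lower() for t in tokens)
    return counts

def dependency_based_counts(dependency_parses):
    # Proposed counting: a unit is counted once for every dependency relation
    # it participates in, whether as governor or dependent.
    counts = Counter()
    for relations in dependency_parses:          # one list of triples per sentence
        for rel, governor, dependent in relations:
            counts[governor.lower()] += 1
            counts[dependent.lower()] += 1
    return counts

# A word that participates in many relations is counted more often under the
# dependency-based scheme than under the sentence-based scheme.
sents = [["The", "quake", "destroyed", "the", "old", "bridge"]]
deps = [[("nsubj", "destroyed", "quake"), ("dobj", "destroyed", "bridge"),
         ("det", "quake", "The"), ("det", "bridge", "the"),
         ("amod", "bridge", "old")]]
print(sentence_based_counts(sents)["destroyed"])    # 1
print(dependency_based_counts(deps)["destroyed"])   # 2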
The software systems used to compare the two counting methods were designed
based on the guided summarization task guidelines of the 2011 Text Analysis Con-
ference (TAC). They were developed and tested using TAC 2010 and TAC 2011 data.
Three generic text summarization systems were built and evaluated before a final
system was selected as the platform to compare the two counting methods and ex-
periment with multiple information unit definitions. The summaries generated by
System I were evaluated by both human assessors and automatic evaluation tools as
part of the official TAC 2011 lifecycle. The final set of summaries generated by System
II and System III were evaluated based solely on statistical measures of their over-
lap with human model summaries using the Recall-Oriented Understudy for Gisting
Evaluation (ROUGE) automatic evaluation tool and the 2010 and 2011 Text Analysis
Conference evaluation data.
Although the results of the experiments are inconclusive, the topic signatures gen-
erated by the two approaches are different in the information units they contain. I con-
clude that an alternate evaluation framework and a semi-abstractive approach lever-
aging dependency relations themselves for summary generation are possible areas for
future work and research.
My thesis is organized into six chapters: introduction, literature review, method-
ology, implementation, experiments and results, and conclusion and future work. In
my literature review I provide a high-level overview of the field of automatic text sum-
marization beginning with definitions and a brief review of early systems. I then de-
scribe select extractive statistical approaches with a focus on frequentist measures,
contrast the extractive approach with the challenge of abstractive text summarization,
and highlight select abstractive systems. To situate my generic extractive
systems, I provide an overview of the recent history of text summarization shared
tasks, the Document Understanding Conference (DUC) and the Text Analysis Con-
ference (TAC) as well as a summary of the evaluation frameworks and corpora that
have emerged from these conferences. I conclude with a survey of natural language
processing tool suites that enable the rapid development of text processing systems
and underpin the generic extractive summarization systems I developed for this the-
sis. The methodology chapter describes the statistical methods, algorithms, data, and
evaluation framework that I used to compare my two strategies as well as a descrip-
tion of the logical architecture, program flow, and open source components I integrated
into my solution. In the implementation chapter I detail the tactics and decisions that
were required when I built out my three systems and how these implementation de-
cisions required changes to my original design. The experiments and results chapter
describes the design, execution, and results of the experiments I set up to compare the
two strategies and multiple definitions of information units. The thesis concludes with
a conclusion and future work chapter that summarizes the results and conclusions of
my experiments and describes possible future work, focusing on the integration of ab-
stractive techniques to enhance the linguistic quality of the summarization system.
An appendix includes sample summaries and topic signatures for the best and worst
runs for individual topics in the experiments on TAC 2010 and 2011 data.
Chapter 2
LITERATURE REVIEW
The literature for automatic text summarization is extensive and covers many sub-
domains. In my literature review I will focus primarily on extractive and frequentist
approaches. I begin by defining important terms and providing a high-level overview of text
summarization including a brief description of early systems and their impact on the
domain. I then describe select extractive statistical approaches with a focus on fre-
quentist measures, contrast the extractive approach with the challenge of
abstractive text summarization, and highlight select abstractive systems. To situate
the generic extractive systems I built for my thesis, I provide an overview of the recent
history of text summarization shared tasks, the Document Understanding Conference
(DUC) and the Text Analysis Conference (TAC) as well as a summary of the evaluation
frameworks and corpora that have emerged from these conferences. I conclude with a
survey of natural language processing tool suites that enable the rapid development of
text processing systems and underpin the generic extractive summarization system I
developed for this thesis.
2.1 Overview of Summarization Systems
In her 1998 self-described “call to arms”, Sparck Jones defines a summary as a “re-
ductive transformation of source text to summary text through content reduction by
selection and/or generalization on what is important in the source” (Jones, 1998). She
outlines summarization as a three-step process: (1) interpretation of the source text
into a source representation, (2) transformation of the source representation into a
summary representation, (3) generation of the summary text from the summary
representation. Although most systems decompose these three process stages into many more
subtasks and modules, the high-level process model described by Sparck Jones and
her definition of a summary can be applied to most automatic text summarization sys-
tems. They are differentiated by how they implement the process model, the amount
and scope of source text, the type of summary they create, and how they generate the
final text of the summary representation.
Automatic summarization systems can be categorized on multiple dimensions in-
cluding: (1) single or multiple document, (2) extractive or abstractive, (3) generic or
query-focused, (4) general or domain specific, (5) initial or update.
2.1.1 Single or Multiple Document
The first summarization systems were built to summarize a single document. As more
documents became digitized and access to large collections of text came online in the
1990s, summarization was expanded to multiple documents. An early news aggregator,
SUMMONS (SUMMarizing Online NewS articles), created a summary from a series
of Associated Press and Reuters newswire articles and is one of the few abstractive
systems in the automatic text summarization literature (McKeown and Radev, 1995).
Single document summarization is still a difficult task. In 2003 DUC retired the
single document summarization task because no system was able to beat the baseline
of the first sentence of an article (Nenkova and McKeown, 2011).
2.1.2 Extractive or Abstractive
The majority of summarization systems that have been developed are extractive. Ex-
tractive summarizers identify the best candidate sentences and use those sentences,
sometimes exactly as they appear in the original document, to create a summary. Cur-
rent state-of-the-art extractive summarizers usually employ some transformational
strategies to smooth or enhance the readability of the generated summary by prun-
ing, editing, or replacing parts of the extracted sentences. An abstractive summarizer,
like a human summarizer, will generate sentences based on an understanding of what
information is most important in the documents. Words and phrases in an abstrac-
tive summary may come from sources other than the original documents, like a
lexical database, ontology, or language generation templates.
2.1.3 Generic or Query-focused
A generic summary is one that is generated without any query or context. The only
guide to the content of the summary is the input sentences. Many early multiple
document summarizers produced generic summaries. Query-focused summaries are
typically bounded by a query and are similar to open-ended question and answering
systems. Queries are typically one or more sentences of natural language. For exam-
ple, the DUC guided summarization task for 2007 was query-focused. The query was
defined as a topic title accompanied by one or more interrogative sentences concerning
specific aspects of the topic. Category-based aspect-oriented queries have been part of
the TAC guided summarization task since 2009. Topics are sorted into broad-based
categories (for example: Accidents or Natural Disasters) which have a template of as-
pects that are expected to be covered in the summary. One of the evaluation metrics
for aspect-oriented summaries is a measure of responsiveness, how many aspects were
covered in the summary. Table 2.1 features samples of aspects from TAC shared tasks.
Table 2.1: Sample aspects from TAC annual shared-tasks
Aspect Definition
WHAT what happened
WHEN date, time, other temporal placement markers
WHERE physical location
2.1.4 General or Domain Specific
Domain specific summarization systems focus on a specific domain like medical re-
search or scientific articles. Often domain specific knowledge bases or ontologies are
incorporated into the system to provide additional information to assist in sentence
selection or generation.
2.1.5 Initial or Update
An update summary provides only new information that was not originally summa-
rized in an initial summary. The update task has been a component of DUC and TAC
since its pilot in DUC 2007.
2.2 Early Extractive Systems
The Automatic Creation of Literature Abstracts (Luhn, 1958) by H.P. Luhn is con-
sidered one of the earliest articles to explore using statistical measures and computer
software to automate the summary of a text document (Nenkova and McKeown, 2011).
Luhn describes an extractive summarization system for scientific journals and techni-
cal articles based on the insight that some words in a document are most descriptive of
its topic, and that sentences that contain those words are the best candidates to extract
to form a summary. He defines descriptive words as those words that occur within the
bounds of a low and high frequency threshold. Luhn makes the argument that words
that occur most frequently in a document are words that occur most frequently in all
documents and are not descriptive of the topic. Pronouns and determiners are exam-
ples of these kinds of frequent words and he excludes them by implementing a stop
word list. Words that occur too infrequently are also not indicative of the topic and are
excluded from the descriptive class. Finally, sentences are ranked based on the number
of descriptive words that occur within five-word clusters within a sentence, with the
highest-ranked sentences selected for inclusion in the summary. Luhn’s approach of using statistical mea-
sures of word frequency to extract sentences for summary generation is the foundation
on which many subsequent automatic text summarization systems were built.
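As a rough illustration of Luhn-style scoring, the sketch below reconstructs the idea in Python. The stop word list, the frequency thresholds, the five-word window, and the cluster score of (number of significant words in the cluster) squared divided by the cluster length are plausible reconstructions, not Luhn's exact procedure.

from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}  # illustrative only

def descriptive_words(documents_tokens, low=2, high=50):
    # Words whose frequency falls between a low and a high threshold
    # (the thresholds here are arbitrary placeholders).
    freq = Counter(t.lower() for doc in documents_tokens for t in doc)
    return {w for w, c in freq.items()
            if low <= c <= high and w not in STOP_WORDS}

def luhn_sentence_score(sentence_tokens, keywords, window=5):
    # Score a sentence by its densest cluster of descriptive words:
    # (significant words in cluster)^2 / cluster length.
    positions = [i for i, t in enumerate(sentence_tokens)
                 if t.lower() in keywords]
    best = 0.0
    for i in range(len(positions)):
        # extend the cluster while consecutive keywords are within `window` words
        j = i
        while j + 1 < len(positions) and positions[j + 1] - positions[j] <= window:
            j += 1
        length = positions[j] - positions[i] + 1
        significant = j - i + 1
        best = max(best, significant ** 2 / length)
    return best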
Edmundson (Edmundson, 1969) expanded on the work of Luhn and introduced the
use of non-word features and training corpora into the development of extractive sum-
marization systems. He defined three features in addition to the number of times a
word occurs in a document to weight sentences: (1) the number of words in the sentence
that occur in the title or the section headings of the document, (2) position of the sen-
tence in the overall document and within a section, (3) the number of words within a
sentence that match a pre-compiled domain-specific list of cue words (Nenkova and
McKeown, 2011). Edmundson also used a corpus of documents and summaries to both
determine feature weights and perform evaluation for his system.
Another early innovator was Paice (Paice, 1990), who addressed reference resolu-
tion, an inherent issue in extractive summary generation. Extractive systems select
the best representative sentences for a topic. The selected sentences are usually not
contiguous, leaving anaphora and cataphora unresolved, affecting the understanding
and readability of the generated summary. Paice proposed a template-based system
that matched exophora to a pre-built list in order to add sentences before or after the
selected sentence. He also described a system that would replace anaphora and cat-
aphora with the appropriate reference, but did not actually implement this solution
(Nenkova and McKeown, 2011).
Figure 2.1 illustrates the basic processes of extractive summarization systems.
Figure 2.1: Basic Summarization Process (Sentence Pre-Processing, Sentence Extraction, Summary Generation)
2.3 Frequency-based Approaches to Sentence Extraction
Frequency-based approaches to sentence extraction are used in many unsupervised
summarization systems. The most basic form of frequency measure, raw frequency, is
biased by the length of the document, so additional, more complex measures of frequency
are typically used. Three frequentist approaches used in summarization are: (1) Word
probability, (2) Term Frequency/Inverse Document Frequency (TF/IDF), (3) Log Likeli-
hood Ratio.
2.3.1 Word Probability
The word probability approach is the simplest measure and is based on the basic prob-
ability of a word w, given the count of the word c(w) and the count of all words N in the input:

p(w) = \frac{c(w)}{N}    (2.1)
The SumBasic system (Nenkova and Vanderwende, 2005) implements word probabil-
ity to assign weight to input sentences. Each sentence S is weighted by the average
probability of the content words p(w) it contains by the formula:
Weight(S_j) = \frac{\sum_{w_i \in S_j} p(w_i)}{|\{w_i \mid w_i \in S_j\}|}    (2.2)
A stop word list is used to filter non-content words from the count. SumBasic selects
the highest scoring sentence containing the highest probable word. The assumption
is that the highest probable content word is indicative of the topic of the document
and a sentence containing this word as well as the highest average probability of other
content words is the best candidate for a summary. Based on the evidence that the
probability of a word occurring twice in a summary is less than the probability of a
word occurring only once in a summary (Nenkova and Vanderwende, 2005), the se-
lected sentence’s content words are re-ranked based on the square of their probability,
reducing the chance that duplicate content words are selected for the summary. The
selection process is repeated using the highest probable content word to rank subse-
quent candidate sentences. The process is repeated until the summary maximum word
length is reached.
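A minimal sketch of the SumBasic selection loop described above, assuming pre-tokenized sentences and a caller-supplied stop word list; the function and parameter names are illustrative rather than taken from the original system.

from collections import Counter

def sumbasic(sentences, max_words=100, stop_words=frozenset()):
    # sentences: list of token lists; returns selected sentences as strings
    tokens = [[t.lower() for t in s if t.lower() not in stop_words]
              for s in sentences]
    n = sum(len(s) for s in tokens)
    p = {w: c / n for w, c in Counter(t for s in tokens for t in s).items()}

    summary, summary_len = [], 0
    remaining = list(range(len(sentences)))
    while remaining and summary_len < max_words:
        # pick the best-scoring sentence among those containing the currently
        # most probable content word (fall back to all remaining sentences)
        top_word = max(p, key=p.get)
        candidates = [i for i in remaining if top_word in tokens[i]] or remaining
        best = max(candidates,
                   key=lambda i: sum(p[w] for w in tokens[i]) / max(len(tokens[i]), 1))
        summary.append(" ".join(sentences[best]))
        summary_len += len(sentences[best])
        remaining.remove(best)
        # squaring the probabilities of used words discourages repetition
        for w in tokens[best]:
            p[w] = p[w] ** 2
    return summary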
The straightforward use of frequency in the SumBasic system performs surpris-
ingly well, ranking statistically among the top systems in DUC 2004 and MSE 2005
(Nenkova and Vanderwende, 2005) using the ROUGE-1 measurement. Improvements
to the SumBasic approach are described in (Yih et al., 2007). The next iteration of the
system combines frequency and position features with a discriminative machine
learning-based algorithm, and replaces heuristic greedy sentence selection with an
optimization process over the complete summary. The
improved system ranks by ROUGE-1 statistically at the top of DUC 2004 and MSE
2005 systems.
2.3.2 Term Frequency/Inverse Document Frequency
Term Frequency/Inverse Document Frequency (TF/IDF) (Salton and Buckley, 1988) is
a long-standing approach in information retrieval and text summarization for statis-
tically measuring the importance of a word in a document based on its proportional
frequency in a corpus (Jurafsky and Martin, 2009). The first component of TF/IDF is
term frequency, a count of a term within a document normalized for document length.
The second component, inverse document frequency, is the log of the total number of
documents in the corpus divided by the number of documents that contain the term.
The product of the two is used as the weight of the term in the corpus. To compensate
for terms that occur in zero or one document in the corpus, one is added to the
denominator of the inverse document frequency to avoid division by zero.
The formula for term frequency is:
tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}    (2.3)
where the number of occurrences of term i in document j, represented by n_{i,j}, is
normalized by the number of terms in the whole document. The inverse document
frequency (Jones, 1972) is defined as:
idf_i = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}    (2.4)
where the total number of documents in the corpus is divided by the number of docu-
ments containing the given term i. Each term in each document is given a TF/IDF score:
(tf/idf)_{i,j} = tf_{i,j} \times idf_i    (2.5)
TF/IDF weights are good indicators of which terms in a document are the most de-
scriptive content words in that document.
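The following sketch (Python, illustrative only) computes per-document TF/IDF weights following Equations 2.3-2.5; the tokenization and lowercasing choices are placeholders, not part of any particular system described here.

import math
from collections import Counter

def tf_idf(documents_tokens):
    # documents_tokens: list of documents, each a list of tokens
    doc_freq = Counter()
    for doc in documents_tokens:
        doc_freq.update(set(t.lower() for t in doc))
    num_docs = len(documents_tokens)

    weights = []
    for doc in documents_tokens:
        counts = Counter(t.lower() for t in doc)
        total = sum(counts.values())
        # term frequency (Eq. 2.3) times inverse document frequency (Eq. 2.4),
        # with one added to the denominator to avoid division by zero
        weights.append({
            term: (count / total) * math.log(num_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return weights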
When a summary is query-focused, a Euclidean Distance or Cosine Similarity mea-
sure can be used to find the sentence vector with the smallest angle distance from
the query. Feature vectors for each document are created using term and TF/IDF-score
feature/value pairs, and each document vector is compared to the query vector using a cosine
similarity measure (Salton et al., 1975):
sim(q, d_j) = \frac{\sum_{i=1}^{N} w_{i,q} \, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,q}^2} \, \sqrt{\sum_{i=1}^{N} w_{i,j}^2}}    (2.6)
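A small sketch of the cosine similarity of Equation 2.6 over sparse term-weight vectors, such as the dictionaries produced by the TF/IDF sketch above; the dictionary representation is an illustrative choice.

import math

def cosine_similarity(query_vec, doc_vec):
    # query_vec, doc_vec: dicts mapping term -> weight (e.g., TF/IDF scores)
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)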
In (Hovy and Lin, 1999) the concept of term signatures is developed for an early ver-
sion of the Summarist system. TF/IDF weighting is used to find 300 signature content
words for 32 categories of documents in 30,000 texts of the Wall St. Journal. The
32 classes of 300 signature terms are then used to classify 2,204 unseen documents
from the same Wall St. Journal corpus using a cosine similarity measure. Precision of
0.69309 and recall of 0.75662 were reported, in line with information retrieval results
at the time. Although a summary generation module is only described and not imple-
mented in the paper, the classifying term signature used to classify the texts could also
be used to select and rank sentences based on smallest cosine angle between vectors of
signature and sentence in a hyperplane.
2.3.3 Log Likelihood Ratio
In his paper describing the challenge of finding an appropriate statistical model of
distribution for natural language processing, Dunning introduces the Log Likelihood
Ratio (LLR) (Dunning, 1993). He argues LLR is a good option for representing sparse
data distributions like that of words in a corpus. He equates counting words in a
corpus to a Bernoulli trial. Each test of a word matching a prototype has a probability
p, and the number of matches in the next n trials is a random variable K with a
binomial distribution:
p(K = k) = \binom{n}{k} p^k (1 - p)^{n-k}    (2.7)
whose mean is np and variance is np(1-p). Dunning demonstrates that if np(1-p) >
5 then the discrete binomial distribution approximates the continuous normal distri-
bution; however, when np(1-p) < 5, and more so when np(1-p) < 1, the error when
approximating using the normal distribution gets larger. The nature of word frequency
is such that many words would occur rarely in a document. Because of this, Dunning
suggests another class of test that does not depend so much on normality, the log like-
lihood ratio. Moore refers to (Dunning, 1993) as introducing the NLP community to
this statistic and additionally labels it the G2 log-likelihood ratio (Moore, 2004).
First implemented for summarization as a statistic for calculating topic signatures
in (Lin and Hovy, 2000), LLR compares two hypotheses about the probability of a word
in a foreground corpus and the probability of the same word in a background corpus.
Hypothesis 1: P(w|I) = P(w|B)
Hypothesis 2: P(w|I) > P(w|B)    (2.8)
where I is the foreground corpus and B is the background corpus. If Hypothesis 2 holds,
then w is descriptive of the foreground corpus. As described above, the probability p of
a word w is a Bernoulli trial and a binomial distribution:
p(K = k) = \binom{n}{k} p^k (1 - p)^{n-k}    (2.9)
The two hypotheses are compared using a likelihood ratio of their probabilities:
\lambda = \frac{L(p, k_1, n_1)\, L(p, k_2, n_2)}{L(p_1, k_1, n_1)\, L(p_2, k_2, n_2)}    (2.10)
where likelihood is calculated 1:
L(p, k, n) = p^k (1 - p)^{n-k}    (2.11)
and the probabilities are defined as
p_1 = \frac{k_1}{n_1}, \quad p_2 = \frac{k_2}{n_2}, \quad p = \frac{k_1 + k_2}{n_1 + n_2}    (2.12)
where k1 is the count of a word within the foreground, n1 is the total number of words
in the foreground, k2 is the count of the word within the background corpus, and n2 is
the total number of words in the background corpus. Dunning reduces the formula to
calculate for −2logλ as:
-2\log\lambda = 2\,[\log L(p_1, k_1, n_1) + \log L(p_2, k_2, n_2) - \log L(p, k_1, n_1) - \log L(p, k_2, n_2)]    (2.13)
which correlates to the statistic chi-squared. The chi-squared distribution model can
be used to establish statistical thresholds for determining topic signatures. In (Lin and
Hovy, 2000) the cut-off weight for −2 log λ was set at 10.83 with confidence level α =
0.001 (chi-squared lookup).
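A sketch of the −2 log λ computation from Equations 2.10-2.13, with the binomial coefficients already canceled; the counts in the example call and the clamping of degenerate probabilities are illustrative implementation choices, not part of Dunning's formulation.

import math

def log_likelihood(p, k, n):
    # log L(p, k, n) = k*log(p) + (n - k)*log(1 - p), per Equation 2.11 without
    # the binomial coefficient (it cancels in the ratio); clamp p to avoid log(0)
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def neg2_log_lambda(k1, n1, k2, n2):
    # -2 log(lambda) per Equation 2.13: k1/n1 are the counts in the foreground
    # (topic) corpus, k2/n2 the counts in the background corpus
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2 * (log_likelihood(p1, k1, n1) + log_likelihood(p2, k2, n2)
                - log_likelihood(p, k1, n1) - log_likelihood(p, k2, n2))

# A unit belongs to the topic signature when -2 log(lambda) exceeds the
# chi-squared cut-off of 10.83 (alpha = 0.001) used by Lin and Hovy (2000).
CUTOFF = 10.83
print(neg2_log_lambda(k1=30, n1=10_000, k2=50, n2=1_000_000) > CUTOFF)   # True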
In (Moore, 2004), the relationship between G2 and Mutual Information is explored
by creating derivations of Dunning’s original formula. The last derivation demon-
strates that G2 equals 2N times the formula for the average Mutual Information of
two random variables. The correlation with mutual information means that G2 has two
important characteristics: (1) it can be used to measure word association independent
of significance, and (2) like mutual information, it is independent of corpus size and can be
used with corpora of varying sizes. See (Moore, 2004) for alternate formulas for α.
1 Because the numerator and denominator have the same binomial coefficients in them, they are canceled out and removed from the formula, greatly simplifying its calculation.
In 2000, a new version of the Summarist topic signature module is implemented us-
ing LLR instead of TF/IDF (Lin and Hovy, 2000). Lin and Hovy compare their TF/IDF
and LLR-based systems and conclude that the LLR is the best performing solution.
Their paper is the first to comprehensively describe Dunning’s LLR measure as it re-
lates to automatic text summarization.
2.4 Abstractive approaches to text summarization
The DUC and TAC evaluation results of the last 11 years underscore the superiority of
human-generated summaries over automated text summaries, especially in regard to
linguistic quality and readability. A human summarizer is able to synthesize informa-
tion and introduce new terms and phrases to achieve a level of abstraction that cannot
be achieved by automated systems. Abstractive summarization systems attempt to
get closer to the quality of human summarizers by incorporating semantics, discourse
theory, and language generation.
2.4.1 SUMMONS
The SUMMONS (SUMMarizing Online NewS articles) news aggregator is an exam-
ple of an early abstractive system for multi-document text summarization. It used
event and activity content templates created for the 1992 ARPA Message Understand-
ing Conference (MUC) to extract data from a series of Associated Press and Reuters
newswire articles. The extracted data was then organized and enriched by a content
planner, which added associated information text from a knowledge base, and passed
this data to a language generator, which created English sentences with the proper
syntax and inflection, resulting in a summary paragraph of one or more sentences
(McKeown and Radev, 1995). SUMMONS was one of the first systems to prepare
summaries for a series of documents and provided a model for future explorations of
abstractive text summarization.
2.4.2 RALI-DIRO at TAC 2010
At TAC 2010, an abstractive system was submitted by a collaboration between the
University of Montreal’s Recherche appliquee en linguistique informatique (RALI) and
Departement d’informatique et de recherche operationnelle (DIRO) (Genest and La-
palme, 2010). Their abstractive system was based on an intermediate representation
of text, between extraction and generation, which they call an information unit (InIt).
They defined an information unit as the smallest element of coherent information that
can be extracted from a sentence. However, given the complexity of sentences, an infor-
mation unit may refer to an sentence as small as a single noun or extend to an entire
clause describing an event. For their TAC 2010 system, they restricted information
units to be subject-verb-object triples that could be extracted from dependency parses
of sentences using the MINIPAR 2 parser. They were able to attach time and location
information for each sentence as properties of the extracted triples. The triples, time
and location information, and original extracted sentences were then used to remove
redundancies in the extracted sentences and generate short concise English sentences
using the SimpleNLG 3 generation system.
2.4.3 Human EXtraction for TAC: HexTac
The HexTac system, a participant in TAC 2009, was designed to set an upper bound
on the extractive summarization approach and serve as one of the baseline systems
that other systems were compared to (Genest et al., 2009). The system was designed as a set of
tools for human summarizers to use to extract entire sentences from a set of texts and
create a 100 word summary. The human summarizers could only use their own judg-
ment to decide on which sentences were the best candidates and could not change any
aspect of the sentence itself. The summaries produced by the HexTac system received
higher scores than all automated text summarization systems for linguistic quality
and overall responsiveness, but still were unable to beat any of the human generated
2 http://webdocs.cs.ualberta.ca/~lindek/minipar.htm
3 http://code.google.com/p/simplenlg/
abstractive model summaries. The system also performed very well in ROUGE eval-
uation, suggesting it could be a candidate approach for creating extractive models for
system comparison, not unlike the abstractive models used in the TAC 2009 evaluation.
The overall conclusion drawn in the HexTac paper is that although the
human extractive summaries were not able to beat human abstractive summaries in
regard to linguistic quality and overall responsiveness, they did perform better than
all automatic summarization systems, indicating that there is still ‘headroom’ for im-
provement in extractive summarization.
2.5 Summarization Shared Tasks
Automatic text summarization has been a featured task in the last 11 years of the
National Institute of Standards and Technology (NIST) sponsored annual workshops,
the Document Understanding Conference (DUC) 2001-2007, and the Text Analysis
Conference (TAC) 2008-2011. The annual workshops have provided a common data
and evaluation framework for participants to develop and compare text summarization
systems.
2.5.1 Document Understanding Conference
The Document Understanding Conference, sponsored by the Advanced Research and
Development Activity (ARDA), and run by the National Institute of Standards and
Technology (NIST), emerged in 2000 out of the need for a common evaluation frame-
work for summarization tasks previously sponsored by programs run independently by
DARPA’s Translingual Information Detection Extraction and Summarization (TIDES),
ARDA’s Advanced Question & Answering Program and NIST’s TREC (Text Retrieval
Conferences). The first DUC conference was held September 13-14, 2001 in New Or-
leans, Louisiana4. Twenty-five groups participated and fifteen sets of summaries were
submitted for evaluation.
DUC followed a consistent annual cycle over its seven-year run: (1) call for participation,
(2) release of test data, (3) submission deadline (usually two weeks following test data
release), (4) return of evaluation results, (5) submission of workshop papers, (6)
workshop (a two-day event), (7) final papers published.

4 http://wwwnlpir.nist.gov/projects/duc/pubs/2001slides/pauls slides/index.htm
Task Guidelines
DUC task guidelines evolved over the lifetime of the conference. During the first year,
both single and multiple document summaries of varying lengths, as well as generic
and query-focused summarization tasks were shared by participants. In its final year,
DUC 2007, the guided summarization task was defined as the creation of a 250 word
text summary of a given topic, a topic statement, and 25 pre-selected topic-related
newswire documents. Simple lists of names, events, dates, etc. were discouraged in
pursuit of fluent and readable summaries consisting of sentences. An optional update
task required a 100 word summary of new information from an additional document
set 25 pre-selected newswire documents. NIST assessors were responsible for select-
ing topic documents from a newswire corpus, defining topic titles, and queries. The
newswire documents were selected from the AQUAINT corpus, which includes Asso-
ciated Press and New York Times articles from 1998-2000 and Xinhua News Agency
articles from 1996-2000.
The DUC series of conferences showcased many advances in multi-document au-
tomated text summarization and led to the development of the manual evaluation
Pyramid framework as well as the automated statistical tools, ROUGE
and BE.
2.5.2 Text Analysis Conference
The Text Analysis Conference was established in 2008 and the goal of the conference
was to re-emphasize and encourage participants to go beyond extractive summariza-
tion approaches and automated statistical measures to explore deeper linguistic analy-
sis. A new aspect-oriented approach reshaped the guided summarization task in 2009,
as well as the addition of two companion tasks: Knowledge Base Population (KBP) and
Recognizing Textual Entailment (RTE).
The TAC 2011 Guided Summarization task required participants to create a one hundred-
word summary of ten newswire documents and a subsequent one hundred-word update
summary of an additional ten newswire documents. Forty-four topic collections, each of two
document sets of ten relevant documents, were divided into five pre-determined cate-
gories of topic:
1. Accidents and Natural Disasters
2. Attacks
3. Health and Safety
4. Endangered Resources
5. Investigations and Trials
Unlike the narrative topic inputs to the DUC summarization task (title and one or
more natural language sentences), TAC defines a set of aspects for each topic category
that guide the summarization task. For example, the aspects of the topic category
Accidents and Natural Disasters are defined as:
1. WHAT: what happened
2. WHEN: date, time, other temporal placement markers
3. WHERE: physical location
4. WHY: reasons for accident/disaster
5. WHO AFFECTED: casualties (death, injury), or individuals otherwise negatively
affected by the accident/disaster
6. DAMAGES: damages caused by the accident/disaster
7. COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts or other
reactions to the accident/disaster.
TAC Cycle
The TAC cycle is similar to DUC: registration, system development, test data release,
submission, evaluation, and workshop. Typically, participants develop systems based
on the previous year’s data, and when the test data are released they have a short period of time
to run their system against unseen test data to produce ‘runs’ of summaries that are
submitted to NIST for evaluation. Summaries are manually evaluated using three
frameworks: Pyramid, Readability/Fluency and Responsiveness. Summaries are also
evaluated using two automated statistical evaluation tools, ROUGE and BE.
2.6 DUC/TAC Evaluation Frameworks
During the series of DUC conferences between 2001-2007, both human and automatic
evaluation frameworks were developed to measure automatic text summarization sys-
tems. These frameworks continue to be employed in the TAC conferences 2008-2011.
The human evaluation frameworks include Pyramid and a simple scale for readabili-
ty/fluency and responsiveness. Automated statistical tools include ROUGE and BE.
2.6.1 Pyramid
The Pyramid method for human assessment of summaries was originally proposed for
DUC 2005 (Passonneau et al., 2005) and has been used every year since in both DUC
and TAC Workshops to evaluate summaries. The Pyramid method relies on human
annotators to define and identify Semantic Content Units (SCUs) within a selection of
human model and automated peer summary submissions.
At the beginning of the TAC evaluation period, NIST assigns four human assessors
to each topic statement and its two sets of ten documents5. The assessors are respon-
sible for creating a one hundred word model summary based on the TAC published
guidelines for creating model summaries6. For the 2010 and 2011 TAC Workshops,
summaries are guided by categories and their aspect, therefore model summary au-
thors create their summaries with the intention of covering all aspects for each cate-
gory in the one hundred word summary.
5 The two document sets represent the ten documents that have been identified as relating to the topic statement for an initial and an update summary.
6 http://www.nist.gov/tac/2011/Summarization/guided summarization.instructions.pdf
Pyramid evaluation requires the creation of SCUs for each topic’s four model sum-
maries. The SCUs represent the most important units of information identified by a
human in a summary. Each SCU is associated with a category aspect, and given a
weight based on how many of the four model summaries it occurs in. A SCU is a se-
mantic label which is associated with one or more contributors from each summary. A
contributor is a continuous or discontinuous string of words that have the same mean-
ing as the semantic label, see (Passonneau et al., 2005) for an example.
During the peer summary evaluation phase, the NIST assessor will score each peer
summary for the number of SCUs it contains. The repetition of SCUs does not increase
or decrease the score. SCUs are counted only once. The final Pyramid score for a peer
summary equals the sum of the weights of SCUs divided by the maximum possible
sum of SCUs7.
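The sketch below (Python, illustrative only) shows that score computation, assuming SCU identifiers, their weights, and the average model SCU count are already available from annotation; the names and numbers in the example are invented.

def pyramid_score(peer_scu_ids, scu_weights, average_model_scu_count):
    # Sum of the weights of the SCUs found in the peer summary (each counted
    # once), divided by the maximum sum achievable with the average number of
    # SCUs found in the model summaries.
    observed = sum(scu_weights[scu] for scu in set(peer_scu_ids))
    # the maximum is obtained by taking the highest-weighted SCUs first
    best_weights = sorted(scu_weights.values(), reverse=True)
    maximum = sum(best_weights[:average_model_scu_count])
    return observed / maximum if maximum else 0.0

# illustrative numbers only: with four model summaries, weights range from 1 to 4
weights = {"scu1": 4, "scu2": 3, "scu3": 2, "scu4": 1, "scu5": 1}
print(pyramid_score(["scu1", "scu3", "scu3"], weights, average_model_scu_count=3))
# (4 + 2) / (4 + 3 + 2) = 0.666...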
2.6.2 Readability/Fluency and Responsiveness
Readability/Fluency and Responsiveness are both evaluated on a scale of 1-5: (1) Very
Poor (2) Poor (3) Barely Acceptable (4) Good and (5) Very Good.
Readability/Fluency captures the grammaticality, non-redundancy, referential clar-
ity, focus, and structure and coherence of a summary. Aspect coverage or information
quality is not to be considered by assessors when scoring a summary for readabili-
ty/fluency. Responsiveness is a mixture of aspect coverage and readability; entities and
events relating to categories and aspects are important, but cannot simply be injected
into the summary without considerations of readability/fluency.
2.6.3 Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
ROUGE8 was inspired by an investigation of using BLEU and NIST to evaluate the
quality of automatic summaries compared to human model summaries in (Lin and
Hovy, 2003). The ROUGE tool was subsequently developed and first featured in DUC
7 Determined by the average number of SCUs in the model summaries.
8 http://berouge.com/default.aspx
2004 (Lin, 2004). ROUGE compares automatically generated summaries to human
model summaries by looking at overlapping features like n-grams, word sequences,
and word pairs. Options at run-time to the ROUGE script configure the unit of overlap
that is measured and compared in the summaries. The two that are automatically run
for all systems during the evaluation phase of TAC are ROUGE-2, which is bigram
based, and ROUGE-SU4, which is also bigram based but allows for a maximum 4-
token gap between grams.
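The sketch below computes a plain bigram-overlap F-measure against a single model summary to illustrate what ROUGE-2 measures; it is not the official toolkit, which adds stemming, stop word handling, multi-model jackknifing, and bootstrap confidence intervals.

from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def rouge_2_f(peer_tokens, model_tokens):
    # Bigram-overlap recall, precision, and F-measure against one model summary
    peer, model = Counter(bigrams(peer_tokens)), Counter(bigrams(model_tokens))
    overlap = sum(min(c, peer[g]) for g, c in model.items())
    recall = overlap / max(sum(model.values()), 1)
    precision = overlap / max(sum(peer.values()), 1)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)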
2.6.4 Basic Elements (BE)
In an effort to improve upon the ROUGE automatic evaluation tool, Hovy et al. in-
troduce a new evaluation tool for DUC 2005, Basic Elements (BE)9 (Hovy et al., 2005).
BE tackles the problem of measurement unit for automatic comparison between sum-
maries. Basic Elements uses shallow syntactic information to construct increasingly
larger units starting with a single word. Summaries are then compared using the
Basic Elements units.
2.7 Natural Language Processing Software Libraries
The availability of open source natural language processing libraries for multiple oper-
ating system platforms and programming languages supports the rapid development
of custom NLP processing pipelines with robust and tested off-the-shelf modules, sig-
nificantly reducing development time. In the following sections I describe two general
purpose software libraries for natural language processing, GATE and the Stanford
coreNLP package.
General Architecture for Text Engineering
The General Architecture for Text Engineering (GATE)10 from the University of Sheffield
is a multi-platform open source natural language processing suite of tools written in
9 http://www.isi.edu/publications/licensed-sw/BE/index.html
10 http://gate.ac.uk/
Java (Cunningham et al., 2002). First released in 2000, GATE has evolved into an
embedded framework (Java libraries) that can be used to create custom NLP systems,
an Integrated Development Environment that provides a visual programming inter-
face for developing NLP solutions, and a cloud service for distributed service-oriented
software systems. GATE’s embedded framework consists of an extensive library of
NLP-oriented classes and methods in Java. It provides object structures for common
NLP resources, like documents, annotations, corpora, lexical databases, and ontolo-
gies and an execution pipeline that hosts one or more processing modules. GATE’s
logical architecture of resources, pipelines, and processing modules is similar to that of
Apache UIMA (Unstructured Information Management Architecture)11. It includes a
set of classes to integrate UIMA modules directly into a GATE pipeline and a strategy
to include GATE plug-ins into a UIMA pipeline. The component model in GATE is
called CREOLE (Collection of REusable Objects for Language Engineering).
ANNIE (a Nearly New Information Extraction system) is a default set of CREOLE
plugins that form a generic pipeline for linguistic annotation which consists of the
following components:
Document Reset Processing Resource: a component that removes any existing
annotation sets from a GATE Document Language Resource
English Tokenizer: a tokenizer that splits text into simple tokens (numbers, punctu-
ation and words of different types) and pre-processes tokens using a JAPE trans-
ducer to provide additional features to the ANNIE PoS Tagger.
Gazetteer: a list-driven Named Entity tagger
Sentence Splitter: a sentence splitter that preprocesses sentences for the ANNIE
PoS Tagger
Part of Speech Tagger: a modified version of the Brill Tagger
Named Entity Transducer: Additional JAPE-based heuristics for Named Entity recog-
nition
Orthomatcher: an Orthographic Named Entity co-reference tagger
11 http://uima.apache.org/
Stemmer: a rudimentary stemming algorithm
Anaphor Resolver: a co-reference resolution module that finds antecedents for peo-
ple, and optionally other Named Entities such as locations
GATE can export a default stand-off XML representation of its Annotation collections.
Stanford CoreNLP
The Stanford CoreNLP12 package is a comprehensive Natural Language Processing
library written in Java. Unlike GATE, the Stanford CoreNLP library does not include
an integrated development environment or an off-the-shelf plug-in architecture. It is
designed as a general framework that composes easily with end-to-end applications.
It is a fusion of a selection of Stanford standalone NLP tools into an integrated pro-
gramming and execution environment in Java. The central component in all Stanford
CoreNLP solutions is the pipeline of annotators. These modules produce a stand-off
XML representation of annotations over input text. Each annotator either adds to or
builds on previous annotations to produce an aggregate stand-off representation that
can be serialized into stand-off XML.
An example of a default pipeline of annotators for the Stanford coreNLP includes:
Tokenizer: a Penn Treebank-style tokenizer extended for noisy web input
Sentence splitter: sentence splitter that can be extended by parameters. By default
uses end-of-line.
Part-of-Speech Tagger: integration of the latest release of the standalone Stanford
POS Tagger
Lemmatization: generates word lemmas for all tokens
Named Entity Recognition: integration of standalone extendable NER services. By
default PERSON, LOCATION, ORGANIZATION, MISC, DATE/TIME.
Dependency Parser: integration of the latest release of the standalone parser pro-
viding full syntactic information including constituents and dependencies.
12http://nlp.stanford.edu/software/corenlp.shtml
Co-reference Resolver: integration of standalone coreference resolution services.
Chapter 3
METHODOLOGY
In this chapter, I outline the methodology I followed to compare two methods of calcu-
lating LLR. The first is the standard method of calculating LLR used to rank sentences
for extractive text summarization described in (Nenkova and McKeown, 2011). The
second is an alternate method I propose based on dependency relations. I elaborate
on the motivation for my approach and describe the overall structure, program flow,
statistical methods, and evaluation framework I selected in the design of the systems
I built to compare the methods.
3.1 Topic Signatures
Topic signatures have been and continue to be used to select and rank candidate sen-
tences for automatic text summarization since being introduced by Lin and Hovy in
1999 (Hovy and Lin, 1999). Log Likelihood Ratio (LLR), LLR with cut-off (LLR-C),
and LLR with cut-off and query-focused (LLR-CQ) are three statistical measures of
frequency used to calculate topic signatures for summarization. The formula for calcu-
lating LLR is from (Dunning, 1993).
\lambda = \frac{L(p, k_1, n_1)\,L(p, k_2, n_2)}{L(p_1, k_1, n_1)\,L(p_2, k_2, n_2)} \quad (3.1)
where likelihood is calculated 1:
L(p, k, n) = p^k (1 - p)^{n-k} \quad (3.2)
1Because the numerator and denominator contain the same binomial coefficients, they cancel out and are removed from the formula, greatly simplifying its calculation
and the probabilities are defined as
p_1 = \frac{k_1}{n_1}, \quad p_2 = \frac{k_2}{n_2}, \quad p = \frac{k_1 + k_2}{n_1 + n_2} \quad (3.3)
where k1 is the count of an information unit (typically a word) within the foreground,
n1 is the total number of information units in the foreground, k2 is the count of the in-
formation unit within the background corpus, and n2 is the total number of information
units in the background corpus. The information units in the foreground can be ranked
based on their LLR values. Information units with the highest LLR values are consid-
ered the most descriptive of the topic of the foreground corpus and are collectively de-
scribed as the topic’s signature (Hovy and Lin, 1999). LLR has been demonstrated to be
the best measure of descriptiveness for greedy sentence-by-sentence multi-document
summarization (Nenkova and McKeown, 2011).
Log Likelihood Ratio with Cut-off (LLR-C) assigns a cut-off value for LLR-ranked
information units. Based on the restatement of Equation 3.1 in (Dunning, 1993),
−2logλ can be correlated to the chi-squared statistic.
-2\log\lambda = 2\,[\log L(p_1, k_1, n_1) + \log L(p_2, k_2, n_2) - \log L(p, k_1, n_1) - \log L(p, k_2, n_2)] \quad (3.4)
In (Lin and Hovy, 2000) the cut-off weight for −2 log λ was set at 10.83 with confidence
level α = 0.001 (chi-squared lookup). An LLR-C approach uses the cut-off value to
determine the information units that constitute a topic signature; when the topic
signature is then applied to weight sentences, each information unit within a sentence
is assigned a value of either one or zero. The sentences with the highest aver-
age score, within a minimum and maximum length threshold, are considered the best
candidate sentences for extraction.
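As an illustration of Equations 3.1 through 3.4, the following minimal sketch in Java (an
illustration only, not the code of the system described later) computes −2 log λ for a single
information unit in log space; the counts supplied in the main method are hypothetical.

/**
 * A minimal sketch of the -2 log lambda computation from Equations 3.1-3.4.
 * k1/n1 are the foreground counts and k2/n2 the background counts for one
 * information unit; the binomial coefficient is omitted because it cancels
 * between the numerator and denominator (see footnote 1).
 */
public class LogLikelihoodRatioSketch {

    // log of L(p, k, n) = p^k (1 - p)^(n - k), guarding against log(0)
    static double logL(double p, double k, double n) {
        double term1 = (k > 0) ? k * Math.log(p) : 0.0;
        double term2 = (n - k > 0) ? (n - k) * Math.log(1 - p) : 0.0;
        return term1 + term2;
    }

    /** Returns -2 log lambda; values above 10.83 pass the LLR-C cut-off. */
    static double minusTwoLogLambda(double k1, double n1, double k2, double n2) {
        double p1 = k1 / n1;
        double p2 = k2 / n2;
        double p  = (k1 + k2) / (n1 + n2);
        return 2 * (logL(p1, k1, n1) + logL(p2, k2, n2)
                  - logL(p,  k1, n1) - logL(p,  k2, n2));
    }

    public static void main(String[] args) {
        // hypothetical counts: 12 of 5,000 foreground units, 40 of 2,000,000 background units
        System.out.println(minusTwoLogLambda(12, 5_000, 40, 2_000_000));
    }
}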
The Log Likelihood Ratio with Cut-off and Query-focused (LLR-CQ) approach uses
a query-filtered topic signature to assign weights to information units in sentences. A
query is transformed into a set of information units (typically words). The weight of a
sentence is then calculated by assigning a one to each information unit that appears in
both the query and the topic signature, and a zero otherwise.
3.2 Calculating LLR with Dependency Relations
In this thesis I propose an alternate method for calculating the probabilities used in
the formula for LLR. The standard method for calculating the probabilities for LLR is
described in the preceding section and Equation 3.3. Units of information are counted
as they occur within the sentences of a topic-focused corpus of documents (foreground)
and a more general non-topic-focused corpus of documents (background). I propose an
alternate method for counting units of information based on the dependency relations
derived from sentences rather than the sentence itself. In this new method, a unit of
information is counted for each dependency relation it participates in either as a de-
pendent or a governor. This alternate method is motivated by the observation that in
a collapsed and propagated dependency representation of a sentence, certain informa-
tion units participate in multiple relations (Marneffe et al., 2006). Below is an example
of collapsed and propagated dependency relations for the sentence, Bills on ports and
immigration were submitted by Senator Brownback, Republican of Kansas2.
nsubjpass(submitted, Bills)
auxpass(submitted, were)
agent(submitted, Brownback)
nn(Brownback, Senator)
appos(Brownback, Republican)
prep_of(Republican, Kansas)
prep_on(Bills, ports)
conj_and(ports, immigration)
prep_on(Bills, immigration)
Given the definition of probabilities for LLR calculation in Equation 3.3, the new
method requires a restatement of counts that make up each of the probabilities in the
formula. k1 is the count of the number of dependency relations an information unit
participates in within the collection of dependency relations generated from the foreground
corpus. n1 is the total number of dependency-role participations of all information units
in the foreground. k2 is the count of the number of dependency relations an information
unit participates in within the collection of dependency relations generated from the
background corpus, and n2 is the total number of dependency-role participations of all
information units in the background. My hypothesis is that counting the number of
dependency roles that a unit of information participates in could contribute to a more
descriptive topic signature3.
2http://nlp.stanford.edu/software/stanford-dependencies.shtml
Given the sample sentence and generated dependency relations above, the proba-
bilities for the sentence's information units based on the standard method and my
proposed method are compared in Table 3.1:
Table 3.1: Comparison of simple sentence-based probabilities calculated by word count
and by participation in dependency role count

Count              Bills   on     ports   and    immigration   were    submitted
word               1/13    1/13   1/13    1/13   1/13          1/13    1/13
dependency role    1/6     -      1/9     -      1/9           1/18    1/6

Count              by      Senator   Brownback   of     Republican   Kansas
word               1/13    1/13      1/13        1/13   1/13         1/13
dependency role    -       1/18      1/6         -      1/9          1/18
In the sentence above, the proposed dependency relation count assigns higher proba-
bility to information units that participate in more relations.
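To make the proposed counting rule concrete, the short Java sketch below (an illustration
only, not the system's code) tallies how many dependency roles each word of the example
sentence fills; the three-of-eighteen participation count for Bills reproduces the 1/6
probability in Table 3.1.

import java.util.HashMap;
import java.util.Map;

/** Counts one participation per dependency relation, for governor and dependent alike. */
public class DependencyRoleCounter {

    // each entry is {relation, governor, dependent}, taken from the example above
    static final String[][] RELATIONS = {
        {"nsubjpass", "submitted",  "Bills"},
        {"auxpass",   "submitted",  "were"},
        {"agent",     "submitted",  "Brownback"},
        {"nn",        "Brownback",  "Senator"},
        {"appos",     "Brownback",  "Republican"},
        {"prep_of",   "Republican", "Kansas"},
        {"prep_on",   "Bills",      "ports"},
        {"conj_and",  "ports",      "immigration"},
        {"prep_on",   "Bills",      "immigration"},
    };

    public static void main(String[] args) {
        Map<String, Integer> roleCounts = new HashMap<>();
        for (String[] rel : RELATIONS) {
            roleCounts.merge(rel[1], 1, Integer::sum); // governor participation
            roleCounts.merge(rel[2], 1, Integer::sum); // dependent participation
        }
        int totalRoles = RELATIONS.length * 2;         // 18 role slots in this example
        // "Bills" fills 3 of the 18 roles, giving the probability 1/6 shown in Table 3.1
        System.out.println(roleCounts.get("Bills") + "/" + totalRoles);
    }
}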
The definition of unit of information for calculating LLR may also have an impact
on the effectiveness of topic signatures. I defined four types of information unit that I
planned to use for experiments comparing the two methods of counting. Definitions
for the four types of information unit are:
3Collapsed and propagated dependency relations produced by the Stanford coreNLP package are de-scribed in (Marneffe et al., 2006) and in a short overview on http://nlp.stanford.edu/software/stanford-dependencies.shtml
1. a word
2. a case-neutral lemmatized word
3. a case-neutral lemmatized word combined with a part-of-speech tag
4. a case-neutral lemmatized word combined with a generalized part-of-speech tag
restricted to nouns, verbs, and adjectives.
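The sketch below illustrates how the four definitions above might be derived from a single
annotated token; the method names are illustrative assumptions, and the N/V/ADJ
generalization anticipates the normalization used in the implementation described in
Chapter 4.

/** Illustrative mapping from one annotated token to the four information unit definitions. */
public class InformationUnitDefinitions {

    // definition 1: the word itself
    static String word(String surface) {
        return surface;
    }

    // definition 2: a case-neutral lemmatized word
    static String lemma(String lemma) {
        return lemma.toLowerCase();
    }

    // definition 3: case-neutral lemma combined with the Penn Treebank tag
    static String lemmaWithPos(String lemma, String posTag) {
        return posTag + "_" + lemma.toLowerCase();
    }

    // definition 4: case-neutral lemma combined with a generalized tag,
    // restricted to nouns, verbs, and adjectives (null means the token is discarded)
    static String lemmaWithGeneralizedPos(String lemma, String posTag) {
        String generalized = posTag.startsWith("NN") ? "N"
                           : posTag.startsWith("VB") ? "V"
                           : posTag.startsWith("JJ") ? "ADJ"
                           : null;
        return generalized == null ? null : generalized + "_" + lemma.toLowerCase();
    }
}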
In order to compare the two counting methods in the context of an applied task, I
planned to take advantage of an existing evaluation framework, data, and well-defined
shared task for summarization. I based the design of my system on
the guided summarization task guidelines of TAC 20114 and participated in the 2011
TAC shared task.
3.3 TAC 2011 Guided Summarization Task
The TAC 2011 Guided Summarization task was to create a one-hundred-word summary
of ten newswire documents and a subsequent one-hundred-word update summary of ten
additional newswire documents. The newswire documents were pre-selected by NIST
assessors and assigned to topic collections. Forty-four topic collections of two document
sets of ten relevant documents were divided into five pre-determined categories of topic:
(1) Accidents and Natural Disasters, (2) Attacks, (3) Health and Safety, (4) Endangered
Resources, and (5) Investigations and Trials.
The NIST assessors for TAC 2011 defined a set of aspects for each topic category
that are intended to guide participants in the summarization task. For example, the
aspects of the topic category Accidents and Natural Disasters are defined as:
1. WHAT: what happened
2. WHEN: date, time, other temporal placement markers
3. WHERE: physical location
4. WHY: reasons for accident/disaster
4http://www.nist.gov/tac/2011/Summarization/index.html
5. WHO AFFECTED: casualties (death, injury), or individuals otherwise negatively
affected by the accident/disaster
6. DAMAGES: damages caused by the accident/disaster
7. COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts, or other
reactions to the accident/disaster.
For my thesis, I wanted to compare methods of counting units of information only.
For that reason, I did not feel it was necessary to extract novelty information for the
update task or to use aspects to guide summarization; a generic text summarization
system could still be evaluated using TAC model summaries and the ROUGE statistical
evaluation tools. The summaries I intended to generate for both the initial and update
topic-focused document collections were therefore simply generic summaries.
3.3.1 TAC Cycle
The TAC 2011 cycle was similar to previous years: registration, system development,
test data release, submission, evaluation, and workshop.
Table 3.2: Text Analysis Conference 2011 Schedule
Date Milestone
June 3 (2011) Deadline for TAC 2011 track registration
July 1 Release of test data
July 17 Deadline for participants’ submissions
Sept 7 Release of individual evaluated results
Nov 14-15 TAC 2011 Workshop
3.4 Development and Testing Data
I planned to follow the training and testing cycle of the TAC conference, where previous
years’ data are used to develop and train a system in preparation for testing against the
current year’s data when it is released by NIST. In a typical conference cycle, the data
are released and participants are given a short period of days to test their system and
submit their summaries to NIST for evaluation. Evaluation results for all participants
and a comparison between teams are then published in the month following.
New for TAC 2011 was the availability of an alternate version of the source data,
called clean data. The clean data version of the training and testing data was created
by the CLASSY team from the IDA/Center for Computing Sciences and the University
of Maryland, a long-time participant in DUC and TAC Workshops. Their sentence
pre-processing module is mature and very good at identifying correct sentence splits
and removing noise from the LDC AQUAINT and AQUAINT-2 newswire source data
collections. The only caveat for using the clean data format was an eight-day delay
after the official NIST TAC testing data release date. The clean data was released on
July 8, 2011. I chose to use the clean data versions of both the TAC 2010 and TAC
2011 data.
Table 3.3: Text Analysis Conference Data

         TAC       Corpus                      LDC Catalog Number
Train    TAC 2010  LDC AQUAINT-2               LDC2008T25 (a)
Test     TAC 2011  TAC 2010 KBP Source Data    LDC2010E12 (b)

(a) http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2008T25
(b) http://projects.ldc.upenn.edu/kbp/data/
The LDC AQUAINT-2 corpus consists of newswire articles from Agence France Presse,
Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post
News Service, Inc., New York Times, and Xinhua News Agency from the years 2004-
2006. The TAC 2010 KBP Source Data consists of newswire articles from New York
Times, Associated Press, and Xinhua News Agency from the years 2007-2008.
3.5 Evaluation
During development, the summaries generated by the system were to be evaluated
statistically using ROUGE. In the test phase of the TAC cycle, the system-generated
runs of summaries would additionally be evaluated by NIST assessors using manual
measures of readability/fluency, responsiveness, and the Pyramid method. Due to the
prohibitive cost in time and human effort, my plan was to evaluate any subsequent
runs post-TAC solely using ROUGE.
3.6 System Design
I selected the Stanford coreNLP package as the central natural language processing
component for my generic text summarization system. It provides a collection of an-
notations for input text sentences, including delimited word tokens, split sentences,
parts-of-speech, lemmatization, named entities, coreferences and dependencies. Addi-
tional modules pre-process the TAC document sets, prepare sentences for the coreNLP
pipeline, create information units from annotations post-pipeline processing, count in-
formation units for sentences, topics, and the overall corpus, calculate LLR values,
score, rank, and select sentences for summary generation. The following sections de-
scribe each of the steps of the program flow and system design. See Figure 3.1 for a
visual representation of the program flow and module composition of the system de-
sign.
[Figure 3.1: System Program Flow. The figure shows the module composition of the
system: Import from Clean Data; the Stanford CoreNLP pipeline (Tokenization, Sentence
Splitting, POS Tagging, Morphological Analysis, NE Recognition, Syntactic Parsing,
Coref Resolution); Feature Extraction; LLR Calculation; Sentence Selection and Ranking;
and Summary Generation.]
3.6.1 Pre-processing
The clean version of the 2010 and 2011 TAC data are packaged as a collection of XML
files with a one-to-one correspondence with the original topic file from the TAC data
set. The XML files contain tags and attributes that delimit meta-information, article
titles, and a post-sentence split list of sentences categorized with three possible values:
negative one for sentences that are considered to be “noise” in the article and are not
considered legitimate sentences, zero for heading/title sentences, and one for sentences
that are considered valid. The pre-processing module iterates through the TAC clean
data set and creates a new file of filtered sentences, metadata, and heading/titles.
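A minimal sketch of this filtering step is given below; the element and attribute names
("sentence", "category") are illustrative assumptions, since the actual clean data schema is
defined by the CLASSY team.

import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Reads a clean-data XML file and keeps only sentences whose category flag is 1 (valid). */
public class CleanDataFilterSketch {
    static List<String> validSentences(String xmlPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlPath);
        NodeList nodes = doc.getElementsByTagName("sentence");   // hypothetical element name
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element s = (Element) nodes.item(i);
            // -1 = noise, 0 = heading/title, 1 = valid
            if ("1".equals(s.getAttribute("category"))) {        // hypothetical attribute name
                kept.add(s.getTextContent().trim());
            }
        }
        return kept;
    }
}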
3.6.2 Stanford coreNLP Annotation Pipeline
The annotation pipeline module is responsible for creating the appropriate represen-
tation of sentences for input to the Stanford coreNLP pipeline, and for initiating, executing,
and outputting the results of the coreNLP annotators. The annotators in the coreNLP
pipeline are: tokenization, sentence splitting, part-of-speech tagging, lemmatization,
named entity recognition, dependency parsing, and co-reference resolution. The output
of this module is a file containing the stand-off XML representation of the annotations
that result from running the pipeline.
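The sketch below shows how such a pipeline can be configured and run through the coreNLP
API; it is a simplified illustration (file handling and the TAC-specific input format are
omitted), and exact method signatures may differ between coreNLP versions.

import java.io.PrintWriter;
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

/** Configures the annotators used by the system and writes the stand-off XML output. */
public class AnnotationPipelineSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("annotators",
                "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String text = "Bills on ports and immigration were submitted "
                    + "by Senator Brownback, Republican of Kansas.";
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // serialize the stand-off XML representation of all annotations
        pipeline.xmlPrint(document, new PrintWriter("annotations.xml"));
    }
}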
3.6.3 Information Unit Extraction
The annotation file produced as the output of the coreNLP module provides the nec-
essary information to create information units for calculating topic signatures. Infor-
mation units are extracted from the annotation file and output in sentence/info unit
pairings as well as file-scoped collections that reflect the count of the information unit
within the overall file, across files in the topic document sets, and across the entire
corpus.
3.6.4 LLR Calculation
The output files from the previous module are used to calculate LLR values for each
information unit in the topic and overall corpus. These values are then output as
topic signature files. The formula for LLR calculation is in a simplified form from the
original in (Dunning, 1993) and was adapted from the formula found in (Piao et al.,
2005).
2\,[(a \log a) + (b \log b) + (c \log c) + (d \log d) + (n \log n)
   - ((a+b)\log(a+b)) - ((a+c)\log(a+c))
   - ((b+d)\log(b+d)) - ((c+d)\log(c+d))] \quad (3.5)
where a is the number of occurrences of the information unit in the topic, b is the num-
ber of occurrences of the information unit in the background corpus, c is the number of
occurrences of all other information units in the topic, d is the number of occurrences
of all other information units in the background corpus, and n is the total number
of all information units in both the topic and background corpus. The results of the
LLR calculations are individual topic signature files with a list of information units
and their LLR values.
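A minimal sketch of Equation 3.5 follows (the counts in the main method are hypothetical);
the x log x terms are guarded so that 0 log 0 is treated as zero.

/** The simplified contingency-table form of LLR in Equation 3.5, with n = a + b + c + d. */
public class ContingencyLlrSketch {

    static double xLogX(double x) {
        return (x > 0) ? x * Math.log(x) : 0.0;
    }

    static double llr(double a, double b, double c, double d) {
        double n = a + b + c + d;
        return 2 * (xLogX(a) + xLogX(b) + xLogX(c) + xLogX(d) + xLogX(n)
                  - xLogX(a + b) - xLogX(a + c)
                  - xLogX(b + d) - xLogX(c + d));
    }

    public static void main(String[] args) {
        // hypothetical counts: the unit occurs 12 times in the topic and 40 times in the
        // background; the topic holds 4,988 other unit occurrences, the background 1,999,960
        System.out.println(llr(12, 40, 4_988, 1_999_960));
    }
}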
3.6.5 Sentence Selection and Ranking
After all LLR calculations have been made for each topic signature, the topic signature
is used to weight and rank sentences. Each sentence in each topic document set is
scored based on the number of topic signature information units it contains and their
cumulative values. Sentences are ranked across all documents in the topic document
set and output in a sorted candidate sentence file.
3.6.6 Summary Generation
For each topic, the highest ranked candidate sentences are selected to form a summary,
filtered by a basic token-based redundancy measure and the maximum length of 100
words. A running total of the number of words in the summary as well as a hash set
of tokens of candidate sentences already chosen are used to filter candidate sentences.
If adding a sentence would exceed the length of 100 words, it is skipped and the next
sentence is considered. The process continues until only an acceptable gap remains
between the size of the summary and the maximum 100 words.
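A minimal sketch of this greedy assembly step is shown below; the 0.7 token-overlap
threshold is an assumption for illustration, since only a basic token-based redundancy
measure is specified above.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Greedily adds ranked sentences, skipping redundant or over-length candidates. */
public class GreedySummarySketch {
    static final int MAX_WORDS = 100;

    static List<String> buildSummary(List<String> rankedSentences) {
        List<String> summary = new ArrayList<>();
        Set<String> seenTokens = new HashSet<>();
        int wordCount = 0;
        for (String sentence : rankedSentences) {
            String[] tokens = sentence.trim().split("\\s+");
            if (wordCount + tokens.length > MAX_WORDS) {
                continue;                       // would exceed 100 words; try the next sentence
            }
            int overlap = 0;
            for (String t : tokens) {
                if (seenTokens.contains(t.toLowerCase())) overlap++;
            }
            if (overlap > 0.7 * tokens.length) {
                continue;                       // too redundant with already selected sentences
            }
            summary.add(sentence);
            wordCount += tokens.length;
            for (String t : tokens) seenTokens.add(t.toLowerCase());
        }
        return summary;
    }
}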
Chapter 4
IMPLEMENTATION
In this chapter I describe the implementation of three generic text summarization sys-
tems I built in the course of my thesis work. The chapter highlights the differences
between the three systems and how they diverge in implementation from the original
system design described in my methodology chapter. I changed the design of my system
and experiments after completing System I and participating in the TAC shared-task
life cycle. System II and System III are very different from System I, and because of
these differences they cannot really be compared with it based on the ROUGE scores of
the summaries they generated. System III represented the final version of my original design
and was used to conduct experiments on the two approaches to counting using the
four definitions of information units. These experiments and results are discussed in
Chapter 5.
4.1 Systems Overview
System I was developed in the period before the release of the TAC 2011 test data and
was used to generate two runs of summaries for official submission to TAC on July
17, 2011. It was further re-factored post-TAC evaluation for additional experiments
as System II. Based on a review of both System I and II, implementation errors and
design flaws were identified and fixed and a third system was built. Because of the
differences between System I and the other two systems, the ROUGE scores of the
summaries it generated cannot really be compared to System II and System III.
All three systems were developed primarily in Java on Linux, using the Stanford
coreNLP package for annotation, Bash scripts for preparing and executing runs, and
Condor for parallelizing jobs based on data segmentation. The version of the Stanford
coreNLP package incorporated into System I was 1.1, released on 2011-06-08. System
II and III incorporated version 1.2 released on 2011-09-18.
The Stanford Parser Annotator in the coreNLP package is version 1.6.9. The PCFG
parser and factored parser are explained in depth in (Klein and Manning, 2003a) and
(Klein and Manning, 2003b). The English Stanford Dependencies representation out-
put by the dependency parser is described in (Marneffe et al., 2006). The Part-Of-
Speech Tagger is version 3.0.4 and is described in (Toutanova et al., 2003). Stanford
Named Entity Recognizer (NER) is version 1.2.2 and is described in (Finkel et al.,
2005). The Stanford Deterministic Coreference Resolution System is described in (Lee
et al., 2011) and (Raghunathan et al., 2010).
4.2 System I
System I was realized as nine individual Java applications. The TAC 2010 and 2011
data was segmented into 92 and 88 topic-based collections, which enabled each appli-
cation to iterate over the TAC data as 92 and 88 parallel jobs on a Condor computing
cluster.
Table 4.1: System I: Application components
App Description
A01 filter clean data files, extract sentences, and output line-delimited sentence files
A02 create topic batch file lists
A03 Stanford coreNLP pipeline(tokenize, ssplit, pos, lemma, ner, parse, dcoref)
A04 Part-of-speech + lemma info unit extraction
A05 Dependency info unit extraction
A06 Build info unit counts
A07 Calculate LLR for info units
A08 Part-of-speech + lemma info unit generate summary
A09 Dependency info unit generate summary
Sentence Pre-processing
The clean data version of TAC 2010 and 2011 data was used for training and testing
data. Pre-processing was reduced to simply selecting the correctly categorized type of
sentence from the clean data representation (negative one and zero are ignored, one
is considered a candidate), harvesting a small amount of metadata for each file and
sentence and then serializing to a line-delimited sentence file for further processing.
No filters are used for document noise or additional sentence validation. App 01 was
used as a standalone application to create the line-delimited sentence file for each
article in each topic article document set (A and B) and App 02 was used to create file
lists for each topic to batch process with Stanford coreNLP package.
Sentence Annotation and Processing with the Stanford CoreNLP Package
A custom Stanford coreNLP pipeline applies individual coreNLP Annotators to each
sentence in each document in the corpus for both initial and update summary collec-
tions. The Annotations created by the pipeline of Annotators are serialized for each
document. The XML version of the Annotations is output by document in order to gen-
erate and access the co-reference information provided by the Stanford Deterministic
Coreference Resolution System across all the sentences in a document.
The CoreNLP options used were:
tokenize, ssplit (sentence splitter), pos (part-of-speech tagger),
lemma (stemmer), ner (Name Entity Recognizer),
parse (dependency parser), dcoref (coreference resolver)
App 03 fulfilled this process in the program flow and was realized as a Bash script that
called the Stanford coreNLP package from the command line providing the topic file
list as an argument and an output directory path for annotation output files.
Information Unit Extraction
App 04, App 05, and App 06 extract information units from the previous application’s
annotation files and create counts of the units for each sentence, topic, and the overall
corpus. These counts are used for LLR calculation and sentence scoring in subsequent
modules. The feature count pairs are serialized to a line-delimited file for further
processing.
The following Penn Treebank style parts-of-speech annotations are selected exclu-
sively by the system for information unit creation. All other labeled tokens are dis-
carded.
CD, FW, IN, JJ, JJS, JJR, NN, NNP, NNS, NNPS, NPS, RB, RBR, RBS, R, SYM, TO,
VBD, VBN, VBG, VBP, VB, VBZ
The following verbs are also filtered:
is, are, were, be, have, could, shall, should, may, might, must, will, would,
go, goes, do, does, use, used, take, make, made, did, been, said, say, know
Named entity types output by default from the Stanford Named Entity Recognizer
include:
PERSON, ORGANIZATION, LOCATION, DATE, MONEY, MISC
Tokens for the parts-of-speech + lemma information units are:
PART-OF-SPEECH_LEMMA, NER-TYPE_LEMMA
Information units based on dependencies include:
DEP_PART-OF-SPEECH_LEMMA, GOV_PART-OF-SPEECH_LEMMA
DEP_NER-TYPE_LEMMA, GOV_NER-TYPE_LEMMA
GOV_PART-OF-SPEECH_LEMMA_RELATION-TYPE_DEP_PART-OF-SPEECH_LEMMA
GOV_NER-TYPE_LEMMA_RELATION-TYPE_DEP_NER-TYPE_LEMMA
GOV_PART-OF-SPEECH_LEMMA_RELATION-TYPE_DEP_NER-TYPE_LEMMA
GOV_NER-TYPE_LEMMA_RELATION-TYPE_DEP_PART-OF-SPEECH_LEMMA
All co-references are disambiguated using document-wide co-reference annotations
created by the Stanford Deterministic Coreference Resolution System, and output in
the canonical form above depending on what entity they originally reference, whether
it is also identified as a named entity and what dependency relationship it participates
in.
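As an illustration, a dependency-based information unit of the form shown above might be
assembled as in the following sketch; the field values, delimiter, and helper name are
illustrative and not the system's actual code.

/** Builds a GOV_..._RELATION_DEP_... style token from annotation fields. */
public class DependencyInfoUnitSketch {
    static String dependencyUnit(String govPosOrNer, String govLemma,
                                 String relation,
                                 String depPosOrNer, String depLemma) {
        return String.join("_",
                govPosOrNer, govLemma.toLowerCase(),
                relation,
                depPosOrNer, depLemma.toLowerCase());
    }

    public static void main(String[] args) {
        // e.g. agent(submitted, Brownback), with Brownback recognized as a PERSON entity
        System.out.println(dependencyUnit("VBN", "submit", "agent", "PERSON", "brownback"));
        // prints: VBN_submit_agent_PERSON_brownback
    }
}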
LLR Calculations
An LLR calculation module iterates over the document sentence information unit/count
files in order to calculate LLR for each term. The LLR values for each sentence
are serialized to a line-delimited file for further processing. App 07 calculates LLR and
outputs topic signature files for each topic.
Sentence Selection and Summary Generation
Sentences are ranked by the aggregate topic signature values of all of a sentence's
information units. Selected sentences are filtered for noise, removing any artifacts that
were not caught by the clean data process but had been observed in the development data.
Sentences of fewer than seven or more than fifty words are automatically discarded, as are
sentences that share more than seventy percent of their information units with already
selected sentences. This filter is a simple brute-force comparison between the sentences'
unit of information collections. Sentence selection continues until a threshold minimum of
ninety-three words or maximum of one hundred words is reached (sentences are excluded
if they are longer than the delta between the selected sentences' word count and the
maximum of one hundred words).
4.2.1 System I Evaluation
Two runs, 8 and 29, generated by System I were submitted to TAC for evaluation. They
were differentiated by the counting methods they employed. Run 8 used the standard
count of information units in the sentences of the foreground and background corpus.
Run 29 used the count of the number of times information units participated in depen-
dency relations generated by the sentences of the foreground and background corpus.
However, the information units defined for both runs were not really comparable. They
were both based conceptually on a case-neutral lemmatized word combined with a re-
stricted part of speech, but were not realized as such in the implementation. For run
8, the Penn Treebank part-of-speech tag was combined with a case-neutral lemma
of the word as the basic unit of information, but entirely different information units
were used for named entities and for the dependency structures in Run 29. These errors
among others were remedied in System II.
In Tables 4.2, 4.4, and 4.5 below, the two TAC baseline runs, 1 and 2, are included.
Baseline 1 was created by using the first 100 words from the most recent newswire
article in the summary document set. Baseline 2 was created with the open source
off-the-shelf summarizer, MEAD1.
Table 4.2: System I Evaluation
RUN RGE-1 RGE-2 RGE-SU4 RUN RGE-1 RGE-2 RGE-SU4
A 8 0.3104 0.0591 0.0970 B 8 0.3079 0.0554 0.0970
A 29 0.3328 0.0749 0.1122 B 29 0.3196 0.0687 0.1054
A 1 0.3184 0.0673 0.1046 B 1 0.3054 0.0590 0.0983
A 2 0.3597 0.0964 0.1309 B 2 0.3207 0.0666 0.1035
4.3 System II
The second version of the generic text summarization system was re-factored into a
single Java application with a configuration file and command line arguments that ex-
ecute different component functionality. The new version of the system integrated the
Stanford coreNLP package from within the Java application through coreNLP Applica-
tion Programming Interfaces (API)s and reused methods that were redundant across
applications in the original collection of applications. All input and output files were
represented in XML, which made file readers and writers standard for all data and
intermediary files and made possible combination files that included the original TAC
clean data XML and the annotation tags output from the Stanford coreNLP pipeline.
System II takes fifteen different sets of command line arguments to enable fifteen in-
dependent modules of functionality. All fifteen stages of the summarization process
are executed using Bash scripts with topic-scoped arguments enabling parallelization
across the topic document sets on a Condor computing cluster.
1http://www.summarization.com/mead/
Significant changes were made to the way information units were represented in
System II, which made the two runs it generated more comparable.
Table 4.3: System II: Application Components
App Description
A01 convert clean data to sentences
A02 convert sentences to annotations
A03 combine sentence and annotation files
A04 create info units
A05 calculate topic info unit totals
A06 calculate corpus info unit totals
A07 calculate LLR topic signatures
A08 create dependency info units
A09 calculate dependency topic info unit totals
A10 calculate dependency corpus info unit totals
A11 calculate dependency LLR topic signatures
A12 calculate sentence LLR scores
A13 calculate sentence dependency LLR scores
A14 create summaries
A15 create dependency summaries
Sentence Pre-processing
System II differs from System I at the sentence pre-processing stage by outputting an
XML file that includes additional meta-information, a heading sentence, and a raw text
line delimited tag for calling the Stanford coreNLP package API with a line-delimited
multi-sentence input String argument, enabling co-reference resolution across sen-
tences. A drawback of System I was its separation of the original sentence files from
the annotation XML. Rules for re-applying whitespace to punctuation and for
reconstructing the original sentence were required in System I. The new version of
the system represents the filtered TAC clean data information to be combined with the
annotation file by merging the XML representations in a subsequent stage.
Sentence Annotation and Processing with the Stanford CoreNLP Package
System II uses a new version of the coreNLP package (version 1.2 – released 2011-
09-14) 2 and no longer calls the package on the command line. The coreNLP package
is integrated into the Java application itself and uses the coreNLP API for initiating,
executing, and outputting an annotation pipeline.
Information Unit Extraction
System II further restricts parts-of-speech and normalizes the Penn Treebank tags to:
N for all nouns, V for all verbs, and ADJ for all adjectives.
FW, JJ, JJS, JJR, NN, NNP, NNS, NNPS, NPS, VBD, VBN, VBG, VBP, VB, VBZ
Verbs are no longer explicitly filtered. The LLR topic signature will de-emphasize
verbs that occur across the corpus. Named entity types are not output by the system
as explicit named entity tokens. System II no longer counts named entities in its
calculations. Tokens for the parts-of-speech + lemma information units are:
PART-OF-SPEECH_LEMMA
Information units based on dependencies are no longer explicit but are normalized to
part-of-speech + lemma. The token is counted for each dependency relation it partici-
pates in.
A bug in the new version of the Stanford coreNLP package occasionally gives an
unreachable index for representative mentions. Coreference resolution was disabled
for System II.
2http://nlp.stanford.edu/software/corenlp.shtml
LLR Calculations
The LLR calculation was changed in System II and now implements the classic formula
from (Dunning, 1993).
-2\log\lambda = 2\,[\log L(p_1, k_1, n_1) + \log L(p_2, k_2, n_2) - \log L(p, k_1, n_1) - \log L(p, k_2, n_2)] \quad (4.1)
which correlates to the chi-squared statistic. The chi-squared distribution model can
be used to establish statistical thresholds for determining topic signatures. In (Lin and
Hovy, 2000) the cut-off weight for −2 log λ was set at 10.83 with confidence level α =
0.001 (chi-squared lookup). System II uses this version of LLR with cut-off (LLR-C) to
determine which information units are descriptive and should be included in the topic
signature.
Sentence Selection and Summary Generation
In System I, sentences were ranked based on the aggregate LLR values of the informa-
tion units they contain. In System II, LLR-C is used and sentence information units
are simply given a value of 1 if they are part of the topic signature or 0 if they are not.
Sentences are then ranked by their cumulative score of equally valued topic signature
information units.
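A minimal sketch of this scoring rule follows; the data structures and names are
illustrative, not System II's code.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Builds the LLR-C topic signature with the 10.83 cut-off and scores sentences against it. */
public class CutoffScorerSketch {
    static final double CUTOFF = 10.83;   // chi-squared value for alpha = 0.001

    static Set<String> topicSignature(Map<String, Double> llrValues) {
        Set<String> signature = new HashSet<>();
        for (Map.Entry<String, Double> e : llrValues.entrySet()) {
            if (e.getValue() > CUTOFF) {
                signature.add(e.getKey());
            }
        }
        return signature;
    }

    static int score(List<String> sentenceUnits, Set<String> signature) {
        int score = 0;
        for (String unit : sentenceUnits) {
            if (signature.contains(unit)) {
                score++;                   // each topic-signature unit contributes 1, others 0
            }
        }
        return score;
    }
}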
4.3.1 System II Evaluation
The following were the ROUGE results for the two sets of summaries generated by
System II, runs 75 and 76, against the TAC 2011 corpus. Runs 75 and 76 were defined
by the same counting strategies and definition of unit of information as System I runs
8 and 29.
Table 4.4: System II evaluation
RUN RGE-1 RGE-2 RGE-SU4 RUN RGE-1 RGE-2 RGE-SU4
A 75 0.31109 0.06944 0.10810 B 75 0.27632 0.04991 0.08623
A 76 0.28334 0.05263 0.09151 B 76 0.26238 0.04184 0.07853
A 1 0.3184 0.0673 0.1046 B 1 0.3054 0.0590 0.0983
A 2 0.3597 0.0964 0.1309 B 2 0.3207 0.0666 0.1035
4.4 System III
System II was intended to simplify the generic text summarization system and provide
a consistent representation of information units used in both counting approaches so
they could be truly compared. System III has the same program flow and module
composition as System II. The sections below describe the implementation differences
between the two systems.
LLR Algorithm
Minor errors were discovered in the first two versions of the LLR algorithm used in Sys-
tem I and System II. The new version of the LLR algorithm in System III reversed
the System II design and returned to a raw aggregate LLR-C number for topic signa-
ture inclusion rather than the LLR-CQ approach of System II. It also incorporated a
smoothing +1 count to overcome 0 background corpus counts for information units that
only existed within the foreground corpus. All calculations were reduced to natural log
(ln) additions and subtractions to improve performance.
Sentence Selection
The redundancy measure in System III was changed to use only information units,
not surface string/token comparisons. Basic sentence filtering was adjusted to a mini-
mum token count of 10, a maximum of 100, and an acceptable delta of 10 (summaries can
be between 90 and 100 tokens).
Sentence Cleaning
In System III sentences are filtered for location + UTC slug lines that only occur in
TAC 2011 data.
4.4.1 System III Evaluation
The following were the ROUGE results for Runs 77 and 78 against the TAC 2011 corpus.
Runs 77 and 78 were defined by the same counting strategies and definitions of unit of
information as System I runs 8 and 29.
Table 4.5: System III evaluation
RUN RGE-1 RGE-2 RGE-SU4 RUN RGE-1 RGE-2 RGE-SU4
A 77 0.34924 0.09224 0.12629 B 77 0.31139 0.06056 0.09846
A 78 0.33460 0.07910 0.11352 B 78 0.30408 0.05452 0.09479
A 1 0.3184 0.0673 0.1046 B 1 0.3054 0.0590 0.0983
A 2 0.3597 0.0964 0.1309 B 2 0.3207 0.0666 0.1035
4.5 Comparison of Systems
The following tables compare System II and System III average F-measure ROUGE
results for the two counting approaches based on a common definition of information
unit.
Table 4.6: System Comparison: TAC 2010 Summary A/B ROUGE average F-measure
System 1 2 SU4 System 1 2 SU4
II-A 0.31109 0.06944 0.10810 II-B 0.27632 0.04991 0.08623
III-A 0.34924 0.09224 0.12629 III-B 0.31139 0.06056 0.09846
Table 4.7: System Comparison: TAC 2011 Summary A/B ROUGE average F-measure
System 1 2 SU4 System 1 2 SU4
II-A 0.28334 0.05263 0.09151 II-B 0.26238 0.04184 0.07853
III-A 0.33460 0.07910 0.11352 III-B 0.30408 0.05452 0.09479
4.6 Conclusion
Based on the system comparison, System III outperforms System II in all four topic
collections. System III was selected to run experiments to compare the two counting
methods and four definitions of information units. Experiments and results are dis-
cussed in the next chapter.
Chapter 5
EXPERIMENTS AND RESULTS
In this chapter I describe the design of the experiments I implemented for compar-
ing the two methods of counting I contrasted in Chapter 3 and discuss their results. In
Chapter 3, I proposed an alternate method for calculating LLR using the count of an in-
formation unit’s participation in dependency relations generated from sentences in the
foreground and background corpus. The new method differs from the standard method
of simply counting an information unit’s number of occurrences in the sentences of a
foreground and background corpus. Table 3.1 in Chapter 3 contrasts the probabilities
of a sentence’s information units based on the two different methods of counting. The
different probabilities for a simple sentence suggest that a count used to calculate LLR
based on dependency relations may boost the count of important information units and
contribute to a more descriptive topic signature.
To contrast the two methods of counting, I designed a series of experiments using
the Guided Summarization Task guidelines, data, and evaluation framework for TAC
2010 and TAC 2011. The goal of the experiments was to test the hypothesis within the
context of an established task, data, and evaluation framework.
In the first applications of LLR in text summarization, described in (Hovy and Lin,
1999) and (Lin and Hovy, 2000), information units were defined simply as words. I in-
cluded the definition of an information unit as a word as a baseline in my experiments
and included three other definitions of information unit, increasing in abstraction and
restriction. If the results of the more restricted definitions of units of information were
the same or better than words themselves, the restricted method would be preferred
due to the reduced number of information units that are counted and used to calculate
LLR. System performance would improve because the amount of memory and number
of calculations would be reduced.
The four different definitions of information unit used to compare the two counting
methods were:
1. a word
2. a case-neutral lemmatized word
3. a case-neutral lemmatized word combined with a part-of-speech tag
4. a case-neutral lemmatized word combined with a generalized part-of-speech tag
restricted to nouns, verbs, and adjectives.
5.1 Experiment Design
The following table describes the type of counting method and unit of information
definition used to produce a collection of summaries for TAC 2010 and TAC 2011 data.
The standard count label refers to the standard method of counting information units
in a topic by their occurrence in the sentences of the documents in the foreground
and background corpus. The dependency count label refers to the proposed alternate
method of counting information units in a topic by their participation in the sentence-
based dependency relations of the documents in the foreground and background corpus.
Table 5.1: Description of Experiments
RUN ID Type of Count Definition of Unit of Information
ID 77 standard case-neutral lemmatized word combined with part-of-speech
restricted to nouns, verbs, and adjectives
ID 78 dependency case-neutral lemmatized word combined with part-of-speech
restricted to nouns, verbs, and adjectives
ID 79 standard word
ID 80 dependency word
ID 81 standard case-neutral lemmatized word
ID 82 dependency case-neutral lemmatized word
ID 83 standard case-neutral lemmatized word combined with part-of-speech
ID 84 dependency case-neutral lemmatized word combined with part-of-speech
Included in the results tables are the two baseline runs created by TAC NIST assessors
for TAC 2010 and 2011. ID 1 Baseline was created by using the first 100 words from the
most recent newswire article in the summary document set. ID 2 Baseline was created
with the open source off-the-shelf summarizer, MEAD1.
5.2 TAC 2010 ROUGE Average F-measure Results
The row highlighted in bold represents the best performing counting method and unit
of information definition.
Table 5.2: Experiment Results: TAC 2010 Summary A ROUGE Average F-measures
RUN ID 1 2 SU4
ID 77 0.31877 0.06858 0.10403
ID 78 0.31444 0.06540 0.10002
ID 79 0.31537 0.06827 0.10367
ID 80 0.31814 0.06895 0.10348
ID 81 0.32644 0.06960 0.10542
ID 82 0.30948 0.06343 0.09893
ID 83 0.31770 0.06809 0.10409
ID 84 0.31491 0.06321 0.10076
ID 1 0.29531 0.05651 0.09029
ID 2 0.29861 0.06077 0.09361
The best performing run for topic A summaries in the TAC 2010 data collection was run
81. This run used a standard method for counting and defined its unit of information
as a case-neutral lemmatized word. Examples of summaries and topic signatures for
the best and worst performing runs against individual TAC 2010 Summary A topics are
featured in Section A.1.1 of Appendix A.
The row highlighted in bold represents the best performing counting method and
unit of information definition.
1http://www.summarization.com/mead/
Table 5.3: Experiment Results: TAC 2010 Summary B ROUGE Average F-measures
RUN ID 1 2 SU4
ID 77 0.30453 0.05869 0.09600
ID 78 0.29659 0.05326 0.09098
ID 79 0.30185 0.05895 0.09423
ID 80 0.30025 0.05599 0.09387
ID 81 0.30052 0.05880 0.09469
ID 82 0.29690 0.05348 0.09152
ID 83 0.29844 0.05706 0.09317
ID 84 0.29657 0.05176 0.09079
ID 1 0.29087 0.05634 0.09273
ID 2 0.30331 0.06443 0.09923
The best performing run for topic B summaries in the TAC 2010 data collection was run
77. This run used a standard method for counting and defined its unit of information
as a case-neutral lemmatized word combined with part-of-speech restricted to nouns,
verbs, and adjectives. Examples of summaries and topic signatures for the best and
worst performing runs against individual TAC 2010 Summary B topics are featured in
Section A.1.2 of Appendix A.
5.3 TAC 2011 ROUGE Average F-measure Results
The row highlighted in bold represents the best performing counting method and unit
of information definition.
Table 5.4: Experiment Results: TAC 2011 Summary A ROUGE Average F-measures
RUN ID 1 2 SU4
ID 77 0.34924 0.09224 0.12629
ID 78 0.33460 0.07910 0.11352
ID 79 0.35633 0.09160 0.12797
ID 80 0.34098 0.08069 0.11740
ID 81 0.33841 0.08550 0.11957
ID 82 0.33945 0.07870 0.11556
ID 83 0.35215 0.08878 0.12534
ID 84 0.33564 0.08151 0.11683
ID 1 0.3184 0.0673 0.1046
ID 2 0.3597 0.0964 0.1309
The best performing run for topic A in the TAC 2011 data collection was run 79. This
run used a standard method for counting and defined its unit of information as a word.
Examples of summaries and topic signatures for the best and worst performing runs
against individual TAC 2011 Summary A topics are featured in Section A.1.3 of Ap-
pendix A.
The row highlighted in bold represents the best performing counting method and
unit of information definition.
Table 5.5: Experiment Results: TAC 2011 Summary B ROUGE Average F-measures
RUN ID 1 2 SU4
ID 77 0.31139 0.06056 0.09846
ID 78 0.30408 0.05452 0.09479
ID 79 0.30617 0.05993 0.09788
ID 80 0.30223 0.05859 0.09659
ID 81 0.30961 0.06417 0.10081
ID 82 0.30736 0.06066 0.09882
ID 83 0.30476 0.05841 0.09702
ID 84 0.30843 0.05779 0.09645
ID 1 0.3054 0.0590 0.0983
ID 2 0.3207 0.0666 0.1035
The best performing run for topic B in the TAC 2011 data collection was run 77. This
run used a standard method for counting and defined its unit of information as a case-
neutral lemmatized word combined with part-of-speech restricted to nouns, verbs, and
adjectives. Examples of summaries and topic signatures for the best and worst per-
forming runs against individual TAC 2011 Summary B topics is features in Section
A.1.4 of Appendix A.
Chapter 6
CONCLUSION AND FUTURE WORK
In this chapter I discuss conclusions about the comparison of the standard method for
counting information units for LLR calculations and my proposed alternate method
for counting information units for LLR calculations based on dependency relations. I
derive my comparisons from the results of experiments described in Chapter 5. I also
critique the overall design of my experiments and discuss how they can be improved
in a future work section.
6.1 Conclusion
In Chapter 3, I proposed an alternate counting method for calculating LLR for topic
signatures. Instead of the standard count of the number of times an information unit
occurs within the sentences of a foreground and background corpus, my alternative
method counted the number of times an information unit participated in either depen-
dent or governor roles in the dependency relations generated from sentences in the
foreground and background corpus. In the experiments I designed, I compared the
two methods using data from the TAC 2010 and TAC 2011 Guided Summarization
Tasks. The two methods were combined with four definitions of information unit and
contrasted by the results of a ROUGE evaluation of their n-gram overlap with human
generated model summaries. Table 5.1 in Chapter 5 describes the individual exper-
iments that were run on TAC data. Tables 5.2, 5.3, 5.4, and 5.5 list the ROUGE-1,
ROUGE-2, and ROUGE-SU4 average F-measure results of the evaluation of the sum-
maries generated by System III on TAC 2010 and 2011 data.
The results of experiments listed in Chapter 5 indicate minimal differences between
the two counting methods and are inconclusive regarding which of the unit of infor-
mation definitions was most effective in generating summaries. The best-performing
runs all used the standard method for counting information units; however, the dif-
ference between these runs and all other runs was small. For example, in Table
5.5, the best run, 77, which used a standard approach to counting, had an average
F-measure ROUGE-1 score of 0.31139, ROUGE-2 score of 0.06056, and ROUGE-SU4
score of 0.09846. The run that used the same unit of information definition but the
proposed dependency-based counting method, run 78, had an average F-measure ROUGE-1
score of 0.30408, ROUGE-2 score of 0.05452, and ROUGE-SU4 score of 0.09479. All
of the average F-measure experiments tabulated in Chapter 5 have similar results. The
standard approach to counting outperforms the dependency-based approach, but the
difference between them is small.
In two of the four series of experiments, whose results are listed in Table 5.3 and
Table 5.5, the best performing run relied on the definition of unit of information as a case-
neutral lemmatized word combined with part-of-speech restricted to nouns, verbs, and
adjectives. In the other two, Table 5.2 and Table 5.4, the best performing runs used a word
and a case-neutral lemmatized word, respectively. Again, the differences between the
results of the experiments across all the definitions of unit of information were minimal.
In Appendix A, the best and worst performing runs against individual topics are
listed for each of the four series of experiments, including the summary and topic sig-
nature they produced. In some cases, like the best performing runs for topic D1024F-A,
multiple runs have the same average F-measure, the exact same summary, but slightly
different topic signatures. For each of the best and worst performing runs for a topic,
all of the other runs are listed in a table, for example see Table A.23, to compare their
scores. These tables demonstrate that many runs, which have different topic signa-
tures, actually produce very similar if not exactly the same summary. Given the lim-
ited size of the final summary, one hundred words, and the small number of sentences
within each corpus, different topic signatures may end up selecting the same or similar
sentences due to the lack of variety in the corpus. A larger summary, like the 250-word
summaries used in previous DUC shared tasks, may be a better measure of counting
strategy and definition of unit of information.
6.2 Future Work
Extractive summarization systems weight and rank sentences in single or multiple
documents and then extract the best candidate sentences to form a summary. Al-
though most extractive systems employ some heuristics to ‘smooth’, ‘prune’, or ‘edit’
the extracted sentences and assembled summary, most of the words and phrases in
the summary originate directly from the original text. A much more difficult problem
is that of automatic abstractive summarization, where the machine needs to abstract
over the original information in single or multiple documents and generate new sen-
tences using words or phrases that may never have been in the original texts. The
field of automatic abstractive summarization is much less mature and there are very
few systems that have been developed to solve this problem. The abstractive approach
requires research into semantic representation, inference, and natural language gen-
eration. The majority of summarization systems have instead chosen to focus on ex-
tractive data-driven approaches (Erkan and Radev, 2004).
The topic signatures generated by the two different counting methods in this the-
sis were different from each other, but did not necessarily result in different 100-word
summaries when they were applied in System III experiments. One of the constraints
of the current system is that, although dependency relations are used to calculate
LLR topic signatures, the LLR values are still used to extract complete sentences
from a fairly small collection of topic documents. A future avenue of research might be
to extract specific dependency relations based on the topic signatures in an expanded
foreground corpus and generate sentences from the dependency relations themselves.
A semi-abstractive approach featured in the TAC 2010 Workshop (Genest and La-
palme, 2010) employed dependency relation tuples and additional linguistic informa-
tion from a shallow NLP pipeline as input to a natural language generation tool.
A promising direction towards an even more abstractive approach could be the re-
alization of a semi-abstractive summarization system integrating shallow and deep
processing. Leveraging the output of a deep processing infrastructure either in par-
allel or within a shallow NLP pipeline offers the opportunity to integrate multiple
facets of linguistic information. A possible initial step would be to use Minimal Recursion
Semantics (MRS) structures rather than surface strings to represent both the query and
the corpus, and to apply a frequentist approach to extractive sentence selection, like an
LLR topic-signature approach, with features based on this semantic representation. Given
the limitations of deep processing parsers with respect to ill-formed or ungrammatical
sentences, the integration of a shallow NLP pipeline would allow a fall back to a shallow
representation for complete coverage. Finally, MRS-based deep processing generation
tools could be used to create the final summary from the assembled MRS structures.
This novel approach is supported
by earlier work on using MRS-based structures in Question Answering (Dridan,
2006) and semantic search for scientific articles (Schafer et al., 2011).
BIBLIOGRAPHY
Baldwin, B. and Ross, A. (2001). Baldwin language technology’s DUC summarization
system. In Proceedings of Document Understanding Conference.
Barzilay, R. and Lapata, M. (2008). Modeling local coherence: An entity-based ap-
proach. Computational Linguistics, 34(1):1–34.
Blake, C., Kampov, J., Orphanides, A. K., West, D., and Lown, C. (2007). UNC-CH at
DUC 2007: Query expansion, lexical simplification and sentence selection strategies
for Multi-Document summarization. In Proceedings of Document Understanding
Conference.
Boros, E., Kantor, P. B., and Neu, D. J. (2001). A clustering based approach to creating
multi-document summaries. In Proceedings of Document Understanding Conference.
Bosma, W. (2005). Query-based summarization using rhetorical structure theory. In
15th Meeting of CLIN, LOT, Leiden, pages 29–44.
Brunn, M., Chali, Y., and Pinchak, C. J. (2001). Text summarization using lexical
chains. In Proceedings of Document Understanding Conference.
Carenini, G. and Cheung, J. C. K. (2008). Extractive vs. NLG-based abstractive sum-
marization of evaluative text: The effect of corpus controversiality. In Proceedings of
the Fifth International Natural Language Generation Conference, pages 33–41.
Conroy, J. M., Schlesinger, J. D., and O’Leary, D. P. (2007). Classy 2007 at DUC 2007.
In Proceedings of Document Understanding Conference.
Conroy, J. M., Schlesinger, J. D., Rankel, P. A., and O'Leary, D. P. (2010). Guiding
CLASSY toward more responsive summaries. In Proceedings of the Text Analysis
Conference.
Copeck, T., Inkpen, D., Kazantseva, A., Kennedy, A., Kipp, D., Nastase, V., and Sz-
pakowicz, S. (2006). Leveraging DUC. In Proceedings of Document Understanding
Conference.
Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V. (2002). GATE: A Frame-
work and Graphical Development Environment for Robust NLP Tools and Applica-
tions. In Proceedings of the 40th Anniversary Meeting of the Association for Compu-
tational Linguistics (ACL’02).
Dridan, R. (2006). Using minimal recursion semantics in Japanese question answering.
PhD thesis, University of Melbourne Melbourne, Australia.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence.
Computational linguistics, 19(1):61–74.
Edmundson, H. P. (1969). New methods in automatic extracting. J. ACM, 16:264–285.
Erkan, G. and Radev, D. R. (2004). LexRank: graph-based lexical centrality as salience
in text summarization. In Proceedings of Document Understanding Conference.
Favre, B., Bechet, F., Bellot, P., Boudin, F., El-Beze, M., Gillard, L., Lapalme, G., and
Torres-Moreno, J. (2006). The LIA-Thales summarization system at DUC-2006. In
Proceedings of Document Understanding Conference.
Favre, B., Gillard, L., Torres-Moreno, J., Boudin, F., Bechet, F., and El-Beze, M. (2007).
The LIA summarization system at DUC-2007. In Proceedings of Document Under-
standing Conference.
Filatova, E. and Hatzivassiloglou, V. (2004). A formal model for information selection
in multi-sentence text extraction. In Proceedings of the 20th international conference
on Computational Linguistics, COLING ’04, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Filippova, K. (2010). Multi-sentence compression: Finding shortest paths in word
graphs. In Proceedings of the 23rd International Conference on Computational Lin-
guistics, pages 322–330.
Finkel, J., Grenager, T., and Manning, C. (2005). Incorporating non-local information
into information extraction systems by gibbs sampling. In Proceedings of the 43rd
Annual Meeting on Association for Computational Linguistics, pages 363–370.
Fung, P. and Ngai, G. (2006). One story, one flow: Hidden markov story models for mul-
tilingual multidocument summarization. ACM Trans. Speech Lang. Process., 3:1–16.
Galley, M. (2006). A skip-chain conditional random field for ranking meeting utter-
ances by importance. In Proceedings of the 2006 Conference on Empirical Methods in
Natural Language Processing, EMNLP ’06, pages 364–372, Stroudsburg, PA, USA.
Association for Computational Linguistics.
Genest, P. and Lapalme, G. (2010). Text generation for abstractive summarization. In
Proceedings of the Third Text Analysis Conference, Gaithersburg, Maryland, USA.
National Institute of Standards and Technology.
Genest, P., Lapalme, G., and Yousfi-Monod, M. (2009). Hextac: the creation of a manual
extractive run. In Proceedings of the Second Text Analysis Conference, Gaithersburg,
Maryland, USA. National Institute of Standards and Technology.
Gupta, S., Nenkova, A., and Jurafsky, D. (2007). Measuring importance and query
relevance in topic-focused multi-document summarization. In Proceedings of the
45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions,
pages 193–196.
Harabagiu, S., Lacatusu, F., and Hickl, A. (2006). Answering complex questions with
random walk models. In Proceedings of the 29th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 220–227,
Seattle, Washington, USA.
Hennig, L. (2009). Topic-based multi-document summarization with probabilistic la-
tent semantic analysis. In Recent Advances in Natural Language Processing, pages
144–149.
Hickl, A., Roberts, K., and Lacatusu, F. (2007). LCC’s GISTexter at DUC 2007: Ma-
chine reading for update summarization. In Proceedings of Document Understand-
ing Conference.
Hovy, E. and Lin, C.-Y. (1999). Automated text summarization in summarist. In Ad-
vances in Automatic Text Summarization, pages 82–94.
Hovy, E., Lin, C.-Y., and Zhou, L. (2005). Evaluating DUC 2005 using Basic Elements.
In Proceedings of DUC-2005.
Jones, K. S. (1972). A statistical interpretation of term specificity and its application
in retrieval. Journal of Documentation, 28:11–21.
Jones, K. S. (1998). Automatic summarising: Factors and directions. In Advances in
Automatic Text Summarization, pages 1–12. MIT Press.
Jurafsky, D. and Martin, J. H. (2009). Speech and language processing: an introduction
to natural language processing, computational linguistics, and speech recognition.
Pearson Prentice Hall, Upper Saddle River, N.J., 2nd ed edition.
Katragadda, R. and Varma, V. (2009). Query-focused summaries or query-biased sum-
maries? In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages
105–108.
Klein, D. and Manning, C. D. (2003a). Accurate unlexicalized parsing. In Proceedings
of the 41st Annual Meeting on Association for Computational Linguistics - Volume
1, ACL ’03, pages 423–430, Stroudsburg, PA, USA. Association for Computational
Linguistics.
Klein, D. and Manning, C. D. (2003b). Fast exact inference with a factored model for
natural language parsing. In Advances in Neural Information Processing Systems 15
(NIPS), pages 3–10. MIT Press.
Knight, K. and Marcu, D. (2000). Statistics-based summarization - step one: sentence
compression. In Proceedings of AAAI/IAAI, pages 703–710.
Lee, H., Peirsman, Y., Chang, A., Chambers, N., Surdeanu, M., and Jurafsky, D. (2011).
Stanford’s multi-pass sieve coreference resolution system at the CoNLL-2011 shared
task. In Proceedings of the CoNLL-2011 Shared Task.
Lin, C. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceed-
ings of the Workshop on Text Summarization Branches Out (WAS 2004).
Lin, C. and Hovy, E. (2000). The automated acquisition of topic signatures for text
summarization. In Proceedings of the 18th conference on Computational linguistics-
Volume 1, pages 495–501.
Lin, C. and Och, F. J. (2004a). Automatic evaluation of machine translation quality
using longest common subsequence and skip-bigram statistics. In Proceedings of the
42nd Annual Meeting of the Association for Computational Linguistics.
Lin, C. and Och, F. J. (2004b). ORANGE: a method for evaluating automatic evaluation
metrics for machine translation. In Proceedings of the 20th international conference
on Computational Linguistics.
Lin, C.-Y. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-
occurrence statistics. In Proceedings of the 2003 Conference of the North American
Chapter of the Association for Computational Linguistics on Human Language Tech-
nology - Volume 1, NAACL ’03, pages 71–78, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Luhn, H. (1958). The automatic creation of literature abstracts. IBM Journal of Re-
search and Development, 2(2):159–165.
Manning, C. D. and Schutze, H. (1999). Foundations of statistical natural language
processing. MIT Press, Cambridge, Mass.
Marneffe, M.-C. de, MacCartney, B., and Manning, C. D. (2006). Generating typed depen-
dency parses from phrase structure parses. In LREC 2006.
McKeown, K. and Radev, D. R. (1995). Generating summaries of multiple news articles.
In Proceedings of the 18th annual international ACM SIGIR conference on Research
and development in information retrieval, SIGIR ’95, pages 74–82, New York, NY,
USA. ACM.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the
ACM, 38:39–41.
Molla, D. and Wan, S. (2006). Macquarie University at DUC 2006: Question answering
for summarisation. In Proceedings of Document Understanding Conference.
Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Pro-
cessing, pages 333–340.
Nenkova, A. and McKeown, K. (2011). Automatic summarization. Foundations and
Trends in Information Retrieval, 5(2-3):103–233.
Nenkova, A. and Vanderwende, L. (2005). The impact of frequency on summarization.
Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101.
Paice, C. D. (1990). Constructing literature abstracts by computer: techniques and
prospects. Inf. Process. Manage., 26:171–186.
Passonneau, R. J., Nenkova, A., McKeown, K., and Sigelman, S. (2005). Applying the
Pyramid method in DUC 2005. In Proceedings of the 2005 DUC Workshop.
Piao, S. S., Rayson, P., Archer, D., and McEnery, T. (2005). Comparing and combining
a semantic tagger and a statistical tool for MWE extraction. Comput. Speech Lang.,
19(4):378–397.
Pingali, P., Varma, V., and Katragadda, R. (2007). IIIT hyderabad at DUC 2007. In
Proceedings of Document Understanding Conference.
Radev, D. R., Blair-Goldensohn, S., and Zhang, Z. (2001). Experiments in single and
multi-document summarization using MEAD. In Proceedings of Document Under-
standing Conference.
Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D.,
and Manning, C. (2010). A multi-pass sieve for coreference resolution. In Proceedings
of EMNLP 2010.
Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text re-
trieval. Information Processing and Management, pages 513–523.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic index-
ing. Commun. ACM, 18:613–620.
Schafer, U., Kiefer, B., Spurk, C., Steffen, J., and Wang, R. (2011). The ACL Anthology
Searchbench. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages
7–13.
Yih, W.-t., Goodman, J., Vanderwende, L., and Suzuki, H. (2007). Multi-document
summarization by maximizing informative content-words. In Proceedings of IJCAI-07
(the 20th International Joint Conference on Artificial Intelligence).
Toutanova, K., Brockett, C., Gamon, M., Jagarlamudi, J., Suzuki, H., and Vander-
wende, L. (2007). The PYTHY summarization system: Microsoft Research at DUC
2007. In Proceedings of Document Understanding Conference.
Toutanova, K., Klein, D., Manning, C., and Singer, Y. (2003). Feature-rich part-of-
speech tagging with a cyclic dependency network. In Proceedings of the 2003 Confer-
ence of the North American Chapter of the Association for Computational Linguistics
on Human Language Technology-Volume 1, pages 173–180.
Vanderwende, L., Banko, M., and Menezes, A. (2004). Event-centric summary genera-
tion. In Proceedings of Document Understanding Conference.
Vanderwende, L., Suzuki, H., Brockett, C., and Nenkova, A. (2007). Beyond sumbasic:
Task-focused summarization with sentence simplification and lexical expansion. Inf.
Process. Manage., 43:1606–1618.
Zhou, Q., Sun, L., and Lu, Y. (2006). ISCAS at DUC 2006. In Proceedings of Document
Understanding Conference.
Appendix A
A.1 TAC 2010 and 2011 Experiments: Example Summaries and Topic Signatures
The following sections provide example summaries and topic signatures for the best
and worst performing runs of counting method and unit of information definition ex-
periments on TAC 2010 and 2011 data.
A.1.1 TAC 2010 Summary A Experiments
The following table describes the counting method and unit-of-information definition
used to produce each collection of summaries for the TAC 2010 and TAC 2011 data.
The standard label refers to the standard method of counting information units in
a topic by their occurrence in the sentences of the documents in the foreground and
background corpora. The dependency label refers to the proposed method of counting
information units in a topic by their participation in the dependency relations generated
from the sentences of the documents in the foreground and background corpora; a
sketch of the two schemes follows Table A.1.
Table A.1: Description of Experiments
RUN ID Type of Count Definition of Unit of Information
ID 77 standard case-neutral lemmatized word combined with part-of-speech, restricted to nouns, verbs, and adjectives
ID 78 dependency case-neutral lemmatized word combined with part-of-speech, restricted to nouns, verbs, and adjectives
ID 79 standard word
ID 80 dependency word
ID 81 standard case-neutral lemmatized word
ID 82 dependency case-neutral lemmatized word
ID 83 standard case-neutral lemmatized word combined with part-of-speech
ID 84 dependency case-neutral lemmatized word combined with part-of-speech
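To make the distinction concrete, below is a minimal sketch of the two counting schemes
in Python. It is an illustration under assumed input formats (token lists per sentence,
and (head, relation, dependent) triples per parsed sentence), not the system's actual
implementation.

from collections import Counter

def standard_counts(sentences):
    # Standard counting: one count per occurrence of a unit in a sentence.
    # `sentences` is assumed to be a list of sentences, each a list of
    # information units (e.g., words or lemma+POS strings).
    counts = Counter()
    for units in sentences:
        counts.update(units)
    return counts

def dependency_counts(parsed_sentences):
    # Dependency counting: one count per dependency relation a unit
    # participates in, whether as head or as dependent.
    # `parsed_sentences` is assumed to be a list of sentences, each a list
    # of (head, relation, dependent) triples from a dependency parser.
    counts = Counter()
    for relations in parsed_sentences:
        for head, _relation, dependent in relations:
            counts[head] += 1
            counts[dependent] += 1
    return counts

Under the dependency scheme a unit that occurs once in a sentence can be counted
several times if it participates in several relations, which is one source of the differences
between the topic signatures reported below.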
Best Performing Run for an Individual Topic
The summary with the highest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2010 summary A comparison was produced by runs 79, 80, and
81 for topic D1024F-A.
Table A.2: TAC 2010 summary D1024F-A: best performing unit of information defini-
tion
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 79 0.50782 0.21628 0.22931
ID 80 0.50782 0.21628 0.22931
ID 81 0.50782 0.21628 0.22931
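Throughout this appendix, the best and worst runs for a topic are selected by the
average of the three ROUGE F-measures. A minimal illustration of that ranking, using
the values from Table A.2 (the dictionary and helper below are illustrative wiring, not
the evaluation pipeline):

def mean_rouge(scores):
    # Unweighted mean of the ROUGE-1, ROUGE-2, and ROUGE-SU4 F-measures.
    return sum(scores) / len(scores)

# F-measures for topic D1024F-A, copied from Table A.2.
run_scores = {
    "ID 79": (0.50782, 0.21628, 0.22931),
    "ID 80": (0.50782, 0.21628, 0.22931),
    "ID 81": (0.50782, 0.21628, 0.22931),
}
best_run = max(run_scores, key=lambda run: mean_rouge(run_scores[run]))
# Each of the three tied runs averages to approximately 0.3178.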
All three runs produced the same summary:
Clinton said the missiles hit terrorist camps in Afghanistan run by Osama
bin Laden, the Saudi millionaire blamed by Washington for the Aug. 7
bombings of U.S. embassies in Kenya and Tanzania, and a factory
linked to bin Laden in Sudan. The letter, sent to council president Danilo
Turk of Slovenia, was intended to lodge Sudan’s formal complaint that
Thursday’s U.S. airstrikes on a Khartoum pharmaceutical factory were a
breach of the U.N. charter and a violation of its sovereignty. U.S. officials
said the factory in Sudan made chemical weapons agents; Sudan
maintains it’s a pharmaceutical plant.
The following tables feature the top 20 information units that make up the topic sig-
natures for runs 79, 80, and 81 for TAC 2010 summary D1024F-A.
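For reference, the LLR scores in the topic-signature tables are log likelihood ratio values
computed from the foreground (topic) and background counts. A minimal sketch of the
score for a single information unit, assuming the standard Dunning-style binomial
formulation (which mirrors the usual topic-signature computation, not necessarily the
exact implementation used here):

import math

def _binom_loglik(k, n, p):
    # Binomial log-likelihood; clamp p away from 0 and 1 to avoid log(0).
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def llr_score(k_fg, n_fg, k_bg, n_bg):
    # -2 log lambda for one information unit.
    # k_fg, n_fg: unit count and total unit count in the foreground corpus.
    # k_bg, n_bg: unit count and total unit count in the background corpus.
    # The counts can come from either counting scheme sketched earlier.
    p_fg = k_fg / n_fg
    p_bg = k_bg / n_bg
    p_all = (k_fg + k_bg) / (n_fg + n_bg)
    return 2.0 * (_binom_loglik(k_fg, n_fg, p_fg)
                  + _binom_loglik(k_bg, n_bg, p_bg)
                  - _binom_loglik(k_fg, n_fg, p_all)
                  - _binom_loglik(k_bg, n_bg, p_all))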
Table A.3: TAC 2010 summary D1024F-A: run 79 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
factory 518.74282 attack 112.17554
Sudan 302.78869 missiles 108.96030
Sudanese 250.42743 weapons 94.77731
U 223.23142 chemical 93.00228
Laden 188.38143 embassies 92.26304
Khartoum 179.53285 plant 88.46446
S 140.13310 strikes 83.79101
bin 128.42394 Kenya 83.79101
missile 121.74675 Osama 82.65559
Clinton 115.95864 American 81.28940
Table A.4: TAC 2010 summary D1024F-A: run 80 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
factory 1179.81867 strikes 209.62639
Laden 524.66512 divert 206.57105
U 433.34317 S 200.20465
embassies 308.50154 Kenya 188.08101
Sudanese 301.38225 bombings 186.00437
Sudan 283.55441 weapons 182.76680
plant 282.03231 Article 181.98718
attack 279.93579 Clinton 181.90102
Lewinsky 249.16919 self-defense 162.23661
missiles 226.77123 Khartoum 153.54951
Table A.5: TAC 2010 summary D1024F-A: run 81 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
FACTORY 494.94632 S 139.67224
SUDAN 302.78869 PHARMACEUTICAL 130.97561
SUDANESE 250.42743 EL-BASHIR 126.55758
MISSILE 230.58770 CLINTON 115.46943
U 223.23142 STRIKE 112.66561
LADEN 182.86320 WEAPON 87.14052
KHARTOUM 179.53285 KENYA 83.79101
BIN 167.62697 OSAMA 82.65559
EMBASSY 154.88831 AMERICAN 79.55654
ATTACK 148.31698 AFGHANISTAN 78.56720
Table A.6: Comparison of results for all TAC 2010 D1024F-A Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.46626 0.17844 0.20916
ID 78 0.46398 0.17016 0.21354
ID 79 0.50782 0.21628 0.22931
ID 80 0.50782 0.21628 0.22931
ID 81 0.50782 0.21628 0.22931
ID 82 0.46642 0.17509 0.22399
ID 83 0.46659 0.19632 0.21830
ID 84 0.46398 0.17016 0.21354
Worst Performing Run for an Individual Topic
The summary with the lowest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2010 summary A comparison was produced by run 82 for topic
D1023E-A.
Table A.7: TAC 2010 summary D1023E-A: worst performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 82 0.14962 0.01511 0.04313
The summary for D1023E-A run 82:
Grant Hill, Tim Duncan, Kevin Garnett, Gary Payton, Tim Hardaway, Steve
Smith, Tom Gugliotta, Allan Houston and Vin Baker have been chosen
as the first nine members of the 2000 U.S. Olympics team, The Associated
Press learned today. Grant Hill, Tim Duncan, Kevin Garnett, Gary Payton,
Tim Hardaway, Steve Smith, Tom Gugliotta, Allan Houston and Vin Baker
have been chosen as the first nine members of the 2000 U.S. Olympics
team, The Associated Press learned today. Both avalanches rushed down
the Alps to the Galtuer resort nestling in the Paznauntal valley after
4 p.m. local time (1500 GMT).
The following table features the top 20 information units that make up the topic sig-
nature for run 82 for TAC 2010 summary D1023E-A.
Table A.8: TAC 2010 summary D1023E-A: run 82 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
AVALANCHE 699.09858 ALBANIAN 141.72979
SNOW 574.26471 DIGGING 137.07591
RESORT 210.21897 AUSTRIAN 136.61398
GALTUER 186.29164 TODAY 132.10176
CLINTON 166.62981 CHALET 126.45269
HILL 161.08630 BURY 115.30729
KOSOVO 159.53228 RESCUER 114.60409
THUNDER 154.19949 FLY 113.97014
SNOWSLIDE 151.25360 TREATY 111.27681
AUSTRIA 142.63933 SCHOENHERR 109.95527
Table A.9: Comparison of results for all TAC 2010 D1023E-A Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.19647 0.01272 0.05177
ID 78 0.16129 0.00501 0.03653
ID 79 0.19395 0.02799 0.05694
ID 80 0.20205 0.00775 0.05390
ID 81 0.20611 0.01285 0.05493
ID 82 0.14962 0.01511 0.04313
ID 83 0.12082 0.00260 0.03128
ID 84 0.20205 0.00775 0.05390
A.1.2 TAC 2010 Summary B Experiments
Best Performing Run for an Individual Topic
The summary with the highest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2010 summary B comparison was produced by run 77 for topic
D1002A-B.
Table A.10: TAC 2010 summary D1002A-B: best performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.43678 0.17807 0.20875
The summary for D1002A-B run 77:
An appellate court ordered the trial of the four officers accused of killing
Amadou Diallo to be moved to Albany County, ruling on Thursday that a fair
trial would be impossible in the Bronx because of ‘‘the public clamor’’
about the case. The decision by a state appellate court to move the
criminal trial of four New York City police officers charged with the
killing of Amadou Diallo to Albany County seems unjustified. Jury
selection is scheduled to begin Jan. 31 in the trial of the four police
officers charged with killing Amadou Diallo, an unarmed West African
immigrant.
The following table features the top 20 information units that make up the topic sig-
nature for run 77 for TAC 2010 summary D1002A-B.
Table A.11: TAC 2010 summary D1002A-B: run 77 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
N DIALLO 673.04948 N COURT 106.02233
N OFFICER 567.29590 ADJ FAIR 100.11662
N ALBANY 365.32632 N AMADOU 99.73738
N BRONX 360.79015 V FIRE 88.25419
N TRIAL 291.21263 N YORK 86.79814
N SHOOTING 159.36965 ADJ APPELLATE 75.56220
N POLICE 139.78544 ADJ UNARMED 75.56213
N SHARPTON 132.12847 N LAWYER 66.05793
N SHOT 108.54161 N BULLET 64.49419
N CARROLL 107.82134 N JUSTICE 62.92783
Table A.12: Comparison of results for all TAC 2010 D1002A-B Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.43678 0.17807 0.20875
ID 78 0.32241 0.07682 0.10517
ID 79 0.41721 0.15810 0.18686
ID 80 0.41721 0.15810 0.18686
ID 81 0.42768 0.19314 0.20896
ID 82 0.41721 0.15810 0.18686
ID 83 0.42768 0.19314 0.20896
ID 84 0.41721 0.15810 0.18686
Worst Performing Run for an Individual Topic
The summary with the lowest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2010 summary B comparison was produced by run 78 for topic
D1030F-B.
Table A.13: TAC 2010 summary D1030F-B: worst performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 78 0.13486 0.00000 0.02921
The summary for D1030F-B run 78:
These parents’ stories echo those of thousands of others who have recently
discovered age-old folk remedies, often with the recommendation of family
doctors who are adding herbal remedies _ for example, echinacea to stave
off colds and flu, chamomile or lavender to treat colic, calendula to soothe
diaper rash and ginger root to quell queasy little stomachs _ to their
disease-fighting arsenal. _ chamomile tea to calm frazzled nerves and
relieve stomach cramps _ ginger root, grated and simmered in water, to
prevent nausea from a bout of stomach flu or motion sickness and to help
children fall asleep.
The following table features the top 20 information units that make up the topic sig-
nature for run 78 for TAC 2010 summary D1030F-B.
Table A.14: TAC 2010 summary D1030F-B: run 78 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
N EPHEDRON 552.52603 N EFFECT 130.80751
N HERB 441.21402 N ECHINACEA 129.28822
ADJ HERBAL 305.48807 N STIMULANT 126.12133
N SUPPLEMENT 263.11803 V REGULATE 124.64983
N BELT 224.22809 N MEDICINE 117.59554
V TAKE 187.06863 N DIGGING 116.95930
N MEDICATION 181.25910 N PEDIATRICIAN 114.25856
N REMEDY 174.97739 N BOOK 113.33318
N STROKE 170.12477 N TEA 111.07923
N WORKOUT 168.19437 N PRODUCT 110.44459
Table A.15: Comparison of results for all TAC 2010 D1030F-B Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.20759 0.01279 0.05291
ID 78 0.13486 0.00000 0.02921
ID 79 0.20698 0.01007 0.04739
ID 80 0.20698 0.01007 0.04739
ID 81 0.20759 0.01279 0.05291
ID 82 0.20759 0.01279 0.05291
ID 83 0.17067 0.00269 0.03705
ID 84 0.20698 0.01007 0.04739
A.1.3 TAC 2011 Summary A Experiments
Best Performing Run for an Individual Topic
The summary with the highest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2011 summary A comparison was produced by run 84 for topic
D1126E-A.
Table A.16: TAC 2011 summary D1126E-A: best performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 84 0.49082 0.19098 0.20702
The summary for topic D1126E-A run 84:
President Bush on Sunday made a valedictory visit to Iraq, the country that
will largely define his legacy, but the trip will more likely be remembered
for the unscripted moment when an Iraqi journalist hurled his shoes at
Bush’s head and denounced him on live television as a "dog" who had
delivered death and sorrow here from nearly six years of war. Muntazer
al-Zaidi jumped up as Bush held a press conference with Iraqi Prime
Minister Nuri al-Maliki, shouted "It is the farewell kiss, you dog" and
threw his footwear.
The following table features the top 20 information units that make up the topic sig-
nature for run 84 for TAC 2011 summary D1126E-A.
Table A.17: TAC 2011 summary D1126E-A: run 84 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
NNP BUSH 747.21575 NNS SHOE 196.69451
NN SHOE 386.26835 NN PRESIDENT 193.84583
NNP AL-MALIKI 317.02511 NN REPORTER 187.67958
JJ IRAQI 288.63369 NN TRIP 182.89375
VBD THROW 266.15993 VBD DUCK 168.88735
NNP IRAQ 264.28265 NN CONFERENCE 161.78289
NN AGREEMENT 236.96789 NN JOURNALIST 149.18060
NNP BAGHDAD 235.83468 NN KISS 148.40481
NNS TROOPS 211.61704 NNS JOURNALIST 142.91354
VBD HURL 207.45424 NN SIZE 142.67232
Table A.18: Comparison of results for all TAC 2011 D1126E-A Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.41604 0.18481 0.20386
ID 78 0.38168 0.10540 0.13775
ID 79 0.41604 0.18481 0.20386
ID 80 0.41604 0.18481 0.20386
ID 81 0.41604 0.18481 0.20386
ID 82 0.42105 0.13165 0.16223
ID 83 0.41604 0.18481 0.20386
ID 84 0.49082 0.19098 0.20702
Worst Performing Run for an Individual Topic
The summary with the lowest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2011 summary A comparison was produced by runs 80, 82, and
84 for topic D1117C-A.
Table A.19: TAC 2011 summary D1117C-A: worst performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 80 0.15056 0.02028 0.03395
ID 82 0.15056 0.02028 0.03395
ID 84 0.15056 0.02028 0.03395
The summary for D1117C-A runs 80, 82, and 84 is:
Becoming the first senior officer fired over the poor treatment of wounded
soldiers, Major General George Weightman "was informed this morning
that the senior army leadership had lost trust and confidence in the
commander’s leadership abilities to address needed solutions for
soldier-outpatient care at Walter Reed Army Medical Center," the army
said in a statement. But as far back as 2003, the commander of Walter
Reed, Lt. Gen. Kevin Kiley, who is now the Army’s top medical officer,
was told that soldiers who were wounded in Iraq and Afghanistan
were languishing and lost on the grounds, according to interviews.
The following tables feature the top 20 information units that make up the topic sig-
natures for runs 80, 82, and 84 for TAC 2011 summary D1117C-A.
Table A.20: TAC 2011 summary D1117C-A: run 80 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
Reed 782.82771 care 247.92184
Army 658.97329 facility 217.90137
Walter 468.65233 veterans 181.02575
soldiers 427.09281 wounded 180.00340
conditions 363.76941 Kiley 179.38107
Post 339.17713 mold 170.12644
Building 301.23550 Priest 163.38436
Gates 289.25123 Cody 157.46524
Center 286.72261 fix 153.34708
bureaucracy 277.00485 secretary 142.58853
Table A.21: TAC 2011 summary D1117C-A: run 82 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
REED 782.82771 ARMY 218.34815
POST 623.74514 WALTER 217.53469
CENTER 468.65233 SOLDIER 215.72485
SECRETARY 436.37770 FACILITY 205.86814
KILEY 326.78160 GATES 179.38107
MOLD 289.25123 CONDITION 167.66782
PRIEST 280.72443 BUREAUCRACY 163.38436
BUILDING 270.93169 OUTPATIENT 161.47328
CODY 250.89293 CARE 157.46524
VETERAN 248.84959 FIX 146.45523
Table A.22: TAC 2011 summary D1117C-A: run 84 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
NNP REED 1233.00641 VBN RELIEVE 338.54511
NNP ARMY 787.92119 NNS PROBLEM 322.73028
NNP WALTER 678.26920 NNP CENTER 280.95158
NNS SOLDIER 529.87476 NNS CONDITION 258.38487
NNP KILEY 492.97587 NN COMMANDER 241.12500
NN CARE 447.08921 NN OUTPATIENT 228.00529
NNP HARVEY 401.47876 NN COMMAND 201.51052
NNP GATES 376.40115 NNP YOUNG 198.37828
NNP WEIGHTMAN 354.53460 VB FIX 182.33912
NNP POST 349.57793 NN TREATMENT 180.97485
Table A.23: Comparison of results for all TAC 2011 D1117C-A Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.32633 0.07241 0.10706
ID 78 0.16898 0.03058 0.04146
ID 79 0.32866 0.07208 0.10519
ID 80 0.15056 0.02028 0.03395
ID 81 0.32866 0.07208 0.10519
ID 82 0.15056 0.02028 0.03395
ID 83 0.32633 0.07241 0.10706
ID 84 0.15056 0.02028 0.03395
A.1.4 TAC 2011 Summary B Experiments
Best Performing Run for an Individual Topic
The summary with the highest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2011 summary B comparison was produced by runs 79 and 81
for topic D1120D-B.
Table A.24: TAC 2011 summary D1120D-B: best performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 79 0.45609 0.16424 0.19731
ID 81 0.45609 0.16424 0.19731
The summary for D1120D-B runs 79 and 81:
Lake Mead, the vast reservoir for the Colorado River water that sustains
the fast-growing cities of Phoenix and Las Vegas, could lose water
faster than previously thought and run dry within 13 years, according
to a new study by scientists at the Scripps Institution of Oceanography.
The lake, located in Nevada and Arizona, has a 50 percent chance
of becoming unusable by 2021, the scientists say, if the demand for
water remains unchanged and if human-induced climate change
follows climate scientists’ moderate forecasts, resulting in a reduction
in average river flows.
The following tables feature the top 20 information units that make up the topic sig-
natures for runs 79 and 81 for TAC 2011 summary D1120D-B.
Table A.25: TAC 2011 summary D1120D-B: run 79 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
Colorado 294.09002 drought 79.62392
water 205.89267 Utah 71.97667
Lake 181.08896 Arizona 66.15587
Mead 149.12705 dry 59.80377
climate 116.98762 reservoir 58.41668
River 112.31465 flows 51.53275
Barnett 106.15135 Pierce 46.58239
states 93.29591 Reclamation 46.58236
Nevada 89.67553 change 46.40240
Powell 80.49528 West 45.89424
Table A.26: TAC 2011 summary D1120D-B: run 81 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
COLORADO 294.09002 RESERVOIR 79.80372
LAKE 213.34136 UTAH 71.97667
WATER 208.00949 ARIZONA 66.15587
RIVER 159.50094 CHANGE 59.62173
MEAD 149.12705 ENERGY 58.15841
CLIMATE 137.27292 DRY 55.12751
BARNETT 106.15135 RECLAMATION 46.58236
NEVADA 89.67553 ANALYSIS 44.68527
DROUGHT 83.33805 SCRIPPS 43.35238
POWELL 80.49528 CANYON 43.35238
Table A.27: Comparison of results for all TAC 2011 D1120D-B Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.36970 0.09598 0.14078
ID 78 0.28645 0.07562 0.08978
ID 79 0.45609 0.16424 0.19731
ID 80 0.33633 0.05966 0.10031
ID 81 0.45609 0.16424 0.19731
ID 82 0.33633 0.05966 0.10031
ID 83 0.28645 0.07562 0.08978
ID 84 0.32617 0.06386 0.09918
Worst Performing Run for an Individual Topic
The summary with the lowest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2011 summary B comparison was produced by all runs for topic
D1112C-B.
Table A.28: TAC 2011 summary D1112C-B: worst performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.17281 0.00257 0.03875
ID 78 0.17281 0.00257 0.03875
ID 79 0.17281 0.00257 0.03875
ID 80 0.17281 0.00257 0.03875
ID 81 0.17281 0.00257 0.03875
ID 82 0.17281 0.00257 0.03875
ID 83 0.17281 0.00257 0.03875
ID 84 0.17281 0.00257 0.03875
The summary for D1112C-B for all runs:
Along with Romero and McKeown, those killed were sheriff’s Deputy James Tutino,
47, of Simi Valley in Ventura County, who took the commuter train occasionally
to get to work at the Men’s Central Jail in downtown Los Angeles; Elizabeth Hill,
65, of Van Nuys; Manuel Alcala, 51, of West Hills; Julia Bennett, 44, of Simi
Valley; Alonso Caballero of Winnetka; Don Wiley, 58, of Simi Valley; William
Parent, 53, of Canoga Park; Thomas Ormiston, 58, of Northridge, who was was
nearing retirement in a railroad career that began in 1970; and Henry Kilinski,
39, of Orange in Orange County.
The following table features the top 20 information units that make up the topic sig-
nature for run 77 for TAC 2011 summary D1112C-B.
Table A.29: TAC 2011 summary D1112C-B: run 77 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
N ALVAREZ 1175.35689 N MURDER 108.21355
N JURY 296.09226 N JUAN 102.18704
N PENALTY 173.60611 N DEATH 93.54109
N METROLINK 164.22542 N SENTENCING 92.64510
N TRAIN 152.42272 N ROMERO 87.00721
N JUROR 135.76091 N JUDGE 79.44059
N LIFE 127.12369 N MANUEL 79.33694
N POUNDERS 117.80438 N APPEAL 76.23846
N TRACK 115.08398 N PAROLE 73.62145
N DERAILMENT 110.09067 N SUPERIOR 71.68222