Calculating LLR Topic Signatures with Dependency Relations for Automatic Text Summarization
Prescott P. Klassen
A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science
University of Washington
2012
Program Authorized to Offer Degree: Computational Linguistics
University of Washington
Abstract
Calculating LLR Topic Signatures with Dependency Relations for Automatic Text Summarization
Prescott P. Klassen
Chair of the Supervisory Committee: Professor Fei Xia
Linguistics
Topic Signatures based on Log Likelihood Ratio (LLR) values have been a staple of Au-
tomatic Text Summarization since originally proposed over a decade ago. In my thesis
I propose an alternate method for counting information units and calculating the fore-
ground and background probabilities for LLR calculations based on the participation
of an information unit in dependency relations generated from a sentence rather than
the sentence itself. I develop a generic text summarization system based on the Text
Analysis Conference shared task guidelines and data in order to compare the proposed
method of counting with the standard approach in the context of an applied task. Each
counting method and unit of information definition is run as an experiment on TAC
2010 and TAC 2011 topic-based document collections and evaluated against human
model summaries using the ROUGE statistical measure of n-gram overlap. Although the
results of the experiments are inconclusive, the topic signatures generated by the two
approaches are different in the information units they contain. I conclude that an al-
ternate evaluation framework and a semi-abstractive approach leveraging dependency
relations themselves for summary generation are possible areas for future work and
research.
TABLE OF CONTENTS

List of Figures
List of Tables

Chapter 1: Introduction

Chapter 2: Literature Review
2.1 Overview of Summarization Systems
    2.1.1 Single or Multiple Document
    2.1.2 Extractive or Abstractive
    2.1.3 Generic or Query-focused
    2.1.4 General or Domain Specific
    2.1.5 Initial or Update
2.2 Early Extractive Systems
2.3 Frequency-based Approaches to Sentence Extraction
    2.3.1 Word Probability
    2.3.2 Term Frequency/Inverse Document Frequency
    2.3.3 Log Likelihood Ratio
2.4 Abstractive approaches to text summarization
    2.4.1 SUMMONS
    2.4.2 RALI-DIRO at TAC 2010
    2.4.3 Human EXtraction for TAC: HexTac
2.5 Summarization Shared Tasks
    2.5.1 Document Understanding Conference
    2.5.2 Text Analysis Conference
2.6 DUC/TAC Evaluation Frameworks
    2.6.1 Pyramid
    2.6.2 Readability/Fluency and Responsiveness
    2.6.3 Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
    2.6.4 Basic Elements (BE)
2.7 Natural Language Processing Software Libraries

Chapter 3: Methodology
3.1 Topic Signatures
3.2 Calculating LLR with Dependency Relations
3.3 TAC 2011 Guided Summarization Task
    3.3.1 TAC Cycle
3.4 Development and Testing Data
3.5 Evaluation
3.6 System Design
    3.6.1 Pre-processing
    3.6.2 Stanford coreNLP Annotation Pipeline
    3.6.3 Information Unit Extraction
    3.6.4 LLR Calculation
    3.6.5 Sentence Selection and Ranking
    3.6.6 Summary Generation

Chapter 4: Implementation
4.1 Systems Overview
4.2 System I
    4.2.1 System I Evaluation
4.3 System II
    4.3.1 System II Evaluation
4.4 System III
    4.4.1 System III Evaluation
4.5 Comparison of Systems
4.6 Conclusion

Chapter 5: Experiments and Results
5.1 Experiment Design
5.2 TAC 2010 ROUGE Average F-measure Results
5.3 TAC 2011 ROUGE Average F-measure Results

Chapter 6: Conclusion and Future Work
6.1 Conclusion
6.2 Future Work

Appendix A
A.1 TAC 2010 and 2011 Experiments: Example Summaries and Topic Signatures
    A.1.1 TAC 2010 Summary A Experiments
    A.1.2 TAC 2010 Summary B Experiments
    A.1.3 TAC 2011 Summary A Experiments
    A.1.4 TAC 2011 Summary B Experiments
LIST OF FIGURES

2.1 Basic Summarization Process
3.1 System Program Flow
LIST OF TABLES

2.1 Sample aspects from TAC annual shared-tasks
3.1 Comparison of simple sentence-based probabilities calculated by word count and by participation in dependency role count
3.2 Text Analysis Conference 2011 Schedule
3.3 TAC Data
4.1 System I: Application components
4.2 System I Evaluation
4.3 System II: Application Components
4.4 System II evaluation
4.5 System III evaluation
4.6 System Comparison: TAC 2010 Summary A/B ROUGE average F-measure
4.7 System Comparison: TAC 2011 Summary A/B ROUGE average F-measure
5.1 Description of Experiments
5.2 Experiment Results: TAC 2010 Summary A ROUGE Average F-measures
5.3 Experiment Results: TAC 2010 Summary B ROUGE Average F-measures
5.4 Experiment Results: TAC 2011 Summary A ROUGE Average F-measures
5.5 Experiment Results: TAC 2011 Summary B ROUGE Average F-measures
A.1 Description of Experiments
A.2 TAC 2010 summary D1024F-A: best performing unit of information definition
A.3 TAC 2010 summary D1024F-A: run 79 topic signature (top 20)
A.4 TAC 2010 summary D1024F-A: run 80 topic signature (top 20)
A.5 TAC 2010 summary D1024F-A: run 81 topic signature (top 20)
A.6 Comparison of results for all TAC 2010 D1024F-A Summaries
A.7 TAC 2010 summary D1023E-A: worst performing unit of information definition
A.8 TAC 2010 summary D1023E-A: run 82 topic signature (top 20)
A.9 Comparison of results for all TAC 2010 D1023E-A Summaries
A.10 TAC 2010 summary D1002A-B: best performing unit of information definition
A.11 TAC 2010 summary D1002A-B: run 77 topic signature (top 20)
A.12 Comparison of results for all TAC 2010 D1002A-B Summaries
A.13 TAC 2010 summary D1030F-B: worst performing unit of information definition
A.14 TAC 2010 summary D1030F-B: run 78 topic signature (top 20)
A.15 Comparison of results for all TAC 2010 D1030F-B Summaries
A.16 TAC 2011 summary D1126E-A: best performing unit of information definition
A.17 TAC 2011 summary D1126E-A: run 84 topic signature (top 20)
A.18 Comparison of results for all TAC 2011 D1126E-A Summaries
A.19 TAC 2011 summary D1117C-A: worst performing unit of information definition
A.20 TAC 2011 summary D1117C-A: run 80 topic signature (top 20)
A.21 TAC 2011 summary D1117C-A: run 82 topic signature (top 20)
A.22 TAC 2011 summary D1117C-A: run 84 topic signature (top 20)
A.23 Comparison of results for all TAC 2011 D1117C-A Summaries
A.24 TAC 2011 summary D1120D-B: best performing unit of information definition
A.25 TAC 2011 summary D1120D-B: run 79 topic signature (top 20)
A.26 TAC 2011 summary D1120D-B: run 81 topic signature (top 20)
A.27 Comparison of results for all TAC 2011 D1120D-B Summaries
A.28 TAC 2011 summary D1112C-B: worst performing unit of information definition
A.29 TAC 2011 summary D1112C-B: run 77 topic signature (top 20)
ACKNOWLEDGMENTS
I would like to thank Dr. Fei Xia and Dr. Scott Farrar for their advice, patience,
and encouragement throughout the process of researching and writing my thesis. I
would also like to acknowledge David Brodbeck, our system administrator, for always
being available to help with systems and software issues, Joyce Parvi and Mike Furr
for all their help with administrative tasks, and Dr. Emily Bender for designing a
degree program that inspired me to make the life-changing transition from industry to
academia.
Chapter 1
INTRODUCTION
For over 50 years, computers have been used to automatically generate summaries of
text documents. One of the earliest systems, developed in 1958 by H.P. Luhn at IBM to
improve the quality of document abstracts, employed a statistical approach to the gen-
eration of summaries for scientific and technical journal articles (Luhn, 1958). It was
the first system to use the frequency of a word in a document as a measure of its impor-
tance as a descriptor of the document’s overall topic. Luhn demonstrated that words
within a high/low threshold could be used to rank the overall descriptiveness of sen-
tences in a document. Top-ranking sentences could then be automatically assembled
into a summary of the document. Luhn’s early approach established the fundamental
tasks of automatic extractive text summarization. Many new statistical approaches
for ranking and selecting sentences have been implemented in recent years, as well
as complex sentence editing and “smoothing” techniques to improve the readability of
assembled sentences, but at their core, the majority of systems extract sentences from
one or more documents to automatically create summaries.
An example of a more complex state-of-the-art, yet fundamentally extractive sys-
tem, is the Clustering, Linguistics, and Statistics for Summarization Yield (CLASSY)
system developed by the Institute for Defense Analysis (IDA)/Center for Computing
Sciences. The CLASSY system has been an annual participant in the Summariza-
tion track of the Text Analysis Conference (TAC) sponsored by the National Institute
of Standards and Technology (NIST) and its previous annual workshop, the Docu-
ment Understanding Conference (DUC). It is an example of an automatic text summa-
rization system that has been incrementally improved over a period of ten years. Each year
it has performed either at the top or close to the top of all participating automated
summarization systems submitted to the annual DUC/TAC evaluations. CLASSY is
a multi-component solution that can be decomposed into seven different logical mod-
ules: (1) Complex data preparation using corpus-specific techniques, (2) Query term
selection and expansion, (3) Signature term selection using LLR and a significantly
large background corpus, (4) Sentence scoring using an approximate oracle, (5) Pro-
jection of term-sentence matrices against the base summary to reduce redundancy for
update summaries, (6) Redundancy removal and sentence selection via LSI/L1-QR al-
gorithm followed by an Integer Linear Program (ILP), and (7) Sentence ordering using
an approximate Traveling Salesperson Program (TSP).
In my thesis I compare two methods of calculating Log Likelihood Ratio (LLR) for
ranking and selecting sentences for extraction within the framework of a generic ex-
tractive text summarization system. Both approaches rely on topic signatures (Lin and
Hovy, 2000) to summarize a multi-document corpus. A topic signature is made up of
units of information that are considered statistically more likely to occur within a set of
documents about the same topic than in a larger, more general set of documents. The
statistical measure used to determine topic signatures is LLR (Dunning, 1993) which
is also referred to as G2 in statistical literature (Moore, 2004). In text summarization,
a topic signature is used to weight units of information in sentences, sentences are
ranked based on the aggregate score of the weighted units of information they contain,
and the best sentences are extracted to form a summary. In the original application of
topic signatures, units of information were defined simply as words.
The two approaches that I compare differ in the method of counting units of infor-
mation in the calculation of an LLR. The first follows closely the approach described
by (Nenkova and McKeown, 2011) where foreground and background counts of units
of information are calculated by counting units as they occur in the sentences of the
documents in the corpus. The second is a novel approach I propose that counts units of
information as they occur in sentence-based dependency relations rather than in the
sentences themselves. The sentence-based dependency relations are created by the
Stanford coreNLP dependency parser and are represented in collapsed and propagated
form (Marneffe et al., 2006). I compare and contrast the two methods for counting us-
ing multiple definitions of information units including words, lemmatized case-neutral
words, lemmatized case-neutral words combined with a part-of-speech tag, and lem-
matized case-neutral words combined with a generalized part-of-speech tag restricted
to nouns, verbs, and adjectives.
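To make the two counting schemes concrete, the sketch below (Python, illustrative only) contrasts them on hand-built data. The token lists, dependency triples, and function names are hypothetical; in the actual systems the relations come from the Stanford coreNLP collapsed and propagated dependency parses.

from collections import Counter

def sentence_based_counts(sentences):
    # Standard counting: a unit (here a lowercased token) is counted once for
    # every occurrence in a sentence.
    counts = Counter()
    for tokens in sentences:                     # each sentence is a token list
        counts.update(t.lower() for t in tokens)
    return counts

def dependency_based_counts(dependency_parses):
    # Proposed counting: a unit is counted once for every dependency relation
    # it participates in, whether as governor or dependent.
    counts = Counter()
    for relations in dependency_parses:          # one list of triples per sentence
        for rel, governor, dependent in relations:
            counts[governor.lower()] += 1
            counts[dependent.lower()] += 1
    return counts

# A word that participates in many relations is counted more often under the
# dependency-based scheme than under the sentence-based scheme.
sents = [["The", "quake", "destroyed", "the", "old", "bridge"]]
deps = [[("nsubj", "destroyed", "quake"), ("dobj", "destroyed", "bridge"),
         ("det", "quake", "The"), ("det", "bridge", "the"),
         ("amod", "bridge", "old")]]
print(sentence_based_counts(sents)["destroyed"])    # 1
print(dependency_based_counts(deps)["destroyed"])   # 2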
The software systems used to compare the two counting methods were designed
based on the guided summarization task guidelines of the 2011 Text Analysis Con-
ference (TAC). They were developed and tested using TAC 2010 and TAC 2011 data.
Three generic text summarization systems were built and evaluated before a final
system was selected as the platform to compare the two counting methods and ex-
periment with multiple information unit definitions. The summaries generated by
System I were evaluated by both human assessors and automatic evaluation tools as
part of the official TAC 2011 lifecycle. The final set of summaries generated by System
II and System III were evaluated based solely on statistical measures of their over-
lap with human model summaries using the Recall-Oriented Understudy for Gisting
Evaluation (ROUGE) automatic evaluation tool and the 2010 and 2011 Text Analysis
Conference evaluation data.
Although the results of the experiments are inconclusive, the topic signatures gen-
erated by the two approaches are different in the information units they contain. I con-
clude that an alternate evaluation framework and a semi-abstractive approach lever-
aging dependency relations themselves for summary generation are possible areas for
future work and research.
My thesis is organized into six chapters: introduction, literature review, method-
ology, implementation, experiments and results, and conclusion and future work. In
my literature review I provide a high-level overview of the field of automatic text sum-
marization beginning with definitions and a brief review of early systems. I then de-
scribe select extractive statistical approaches with a focus on frequentist measures,
contrast the extractive approach with the challenge of abstractive text summarization,
and highlight select abstractive systems. To situate my generic extractive
systems, I provide an overview of the recent history of text summarization shared
tasks, the Document Understanding Conference (DUC) and the Text Analysis Con-
ference (TAC) as well as a summary of the evaluation frameworks and corpora that
have emerged from these conferences. I conclude with a survey of natural language
processing tool suites that enable the rapid development of text processing systems
and underpin the generic extractive summarization systems I developed for this the-
sis. The methodology chapter describes the statistical methods, algorithms, data, and
evaluation framework that I used to compare my two strategies as well as a descrip-
tion of the logical architecture, program flow, and open source components I integrated
into my solution. In the implementation chapter I detail the tactics and decisions that
were required when I built out my three systems and how these implementation de-
cisions required changes to my original design. The experiments and results chapter
describes the design, execution, and results of the experiments I set up to compare the
two strategies and multiple definitions of information units. The thesis concludes with
a conclusion and future work chapter that summarizes the results and conclusions of
my experiments and describes possible future work, focusing on the integration of ab-
stractive techniques to enhance the linguistic quality of the summarization system.
An appendix includes sample summaries and topic signatures for the best and worst
runs for individual topics in the experiments on TAC 2010 and 2011 data.
Chapter 2
LITERATURE REVIEW
The literature for automatic text summarization is extensive and covers many sub-
domains. In my literature review I will focus primarily on extractive and frequentist
approaches. I begin by defining important terms and providing a high-level overview of text
summarization including a brief description of early systems and their impact on the
domain. I then describe select extractive statistical approaches with a focus on fre-
quentist measures, contrast the extractive approach with the challenge of
abstractive text summarization, and highlight select abstractive systems. To situate
the generic extractive systems I built for my thesis, I provide an overview of the recent
history of text summarization shared tasks, the Document Understanding Conference
(DUC) and the Text Analysis Conference (TAC) as well as a summary of the evaluation
frameworks and corpora that have emerged from these conferences. I conclude with a
survey of natural language processing tool suites that enable the rapid development of
text processing systems and underpin the generic extractive summarization system I
developed for this thesis.
2.1 Overview of Summarization Systems
In her 1998 self-described “call to arms”, Sparck Jones defines a summary as a “re-
ductive transformation of source text to summary text through content reduction by
selection and/or generalization on what is important in the source” (Jones, 1998). She
outlines summarization as a three-step process: (1) interpretation of the source text
into a source representation, (2) transformation of the source representation into a
summary representation, (3) generation of the summary text from the summary
representation. Although most systems decompose these three process stages into many more
subtasks and modules, the high-level process model described by Sparck Jones and
her definition of a summary can be applied to most automatic text summarization sys-
tems. They are differentiated by how they implement the process model, the amount
and scope of source text, the type of summary they create, and how they generate the
final text of the summary representation.
Automatic summarization systems can be categorized on multiple dimensions in-
cluding: (1) single or multiple document, (2) extractive or abstractive, (3) generic or
query-focused, (4) general or domain specific, (5) initial or update.
2.1.1 Single or Multiple Document
The first summarization systems were built to summarize a single document. As more
documents became digitized and access to large collections of text came online in the
1990s, summarization was expanded to multiple documents. An early news aggregator,
SUMMONS (SUMMarizing Online NewS articles), created a summary from a series
of Associated Press and Reuters newswire articles and is one of the few abstractive
systems in the automatic text summarization literature (McKeown and Radev, 1995).
Single document summarization is still a difficult task. In 2003 DUC retired the
single document summarization task because no system was able to beat the baseline
of the first sentence of an article (Nenkova and McKeown, 2011).
2.1.2 Extractive or Abstractive
The majority of summarization systems that have been developed are extractive. Ex-
tractive summarizers identify the best candidate sentences and use those sentences,
sometimes exactly as they appear in the original document, to create a summary. Cur-
rent state-of-the-art extractive summarizers usually employ some transformational
strategies to smooth or enhance the readability of the generated summary by prun-
ing, editing, or replacing parts of the extracted sentences. An abstractive summarizer,
like a human summarizer, will generate sentences based on an understanding of what
information is most important in the documents. Words and phrases in an abstrac-
tive summary may come from sources other than the original documents, like a
lexical database, ontology, or language generation templates.
2.1.3 Generic or Query-focused
A generic summary is one that is generated without any query or context. The only
guide to the content of the summary is the input sentences. Many early multiple
document summarizers produced generic summaries. Query-focused summaries are
typically bounded by a query and are similar to open-ended question and answering
systems. Queries are typically one or more sentences of natural language. For exam-
ple, the DUC guided summarization task for 2007 was query-focused. The query was
defined as a topic title accompanied by one or more interrogative sentences concerning
specific aspects of the topic. Category-based aspect-oriented queries have been part of
the TAC guided summarization task since 2009. Topics are sorted into broad-based
categories (for example: Accidents or Natural Disasters) which have a template of as-
pects that are expected to be covered in the summary. One of the evaluation metrics
for aspect-oriented summaries is a measure of responsiveness, how many aspects were
covered in the summary. Table 2.1 features samples of aspects from TAC shared tasks.
Table 2.1: Sample aspects from TAC annual shared-tasks
Aspect Definition
WHAT what happened
WHEN date, time, other temporal placement markers
WHERE physical location
2.1.4 General or Domain Specific
Domain specific summarization systems focus on a specific domain like medical re-
search or scientific articles. Often domain specific knowledge bases or ontologies are
incorporated into the system to provide additional information to assist in sentence
selection or generation.
2.1.5 Initial or Update
An update summary provides only new information that was not originally summa-
rized in an initial summary. The update task has been a component of DUC and TAC
since its pilot in DUC 2007.
2.2 Early Extractive Systems
The Automatic Creation of Literature Abstracts (Luhn, 1958) by H.P. Luhn is con-
sidered one of the earliest articles to explore using statistical measures and computer
software to automate the summary of a text document (Nenkova and McKeown, 2011).
Luhn describes an extractive summarization system for scientific journals and techni-
cal articles based on the insight that some words in a document are most descriptive of
its topic, and that sentences that contain those words are the best candidates to extract
to form a summary. He defines descriptive words as those words that occur within the
bounds of a low and high frequency threshold. Luhn makes the argument that words
that occur most frequently in a document are words that occur most frequently in all
documents and are not descriptive of the topic. Pronouns and determiners are exam-
ples of these kinds of frequent words and he excludes them by implementing a stop
word list. Words that occur too infrequently are also not indicative of the topic and are
excluded from the descriptive class. Finally, sentences are ranked based on the number
of descriptive words that occur within five-word clusters within a sentence, with the
highest-ranked sentences selected for inclusion in the summary. Luhn’s approach of using statistical mea-
sures of word frequency to extract sentences for summary generation is the foundation
on which many subsequent automatic text summarization systems were built.
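As a rough illustration of Luhn-style scoring, the sketch below reconstructs the idea in Python. The stop word list, the frequency thresholds, the five-word window, and the cluster score of (number of significant words in the cluster) squared divided by the cluster length are plausible reconstructions, not Luhn's exact procedure.

from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}  # illustrative only

def descriptive_words(documents_tokens, low=2, high=50):
    # Words whose frequency falls between a low and a high threshold
    # (the thresholds here are arbitrary placeholders).
    freq = Counter(t.lower() for doc in documents_tokens for t in doc)
    return {w for w, c in freq.items()
            if low <= c <= high and w not in STOP_WORDS}

def luhn_sentence_score(sentence_tokens, keywords, window=5):
    # Score a sentence by its densest cluster of descriptive words:
    # (significant words in cluster)^2 / cluster length.
    positions = [i for i, t in enumerate(sentence_tokens)
                 if t.lower() in keywords]
    best = 0.0
    for i in range(len(positions)):
        # extend the cluster while consecutive keywords are within `window` words
        j = i
        while j + 1 < len(positions) and positions[j + 1] - positions[j] <= window:
            j += 1
        length = positions[j] - positions[i] + 1
        significant = j - i + 1
        best = max(best, significant ** 2 / length)
    return best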
Edmundson (Edmundson, 1969) expanded on the work of Luhn and introduced the
use of non-word features and training corpora into the development of extractive sum-
marization systems. He defined three features in addition to the number of times a
word occurs in a document to weight sentences: (1) the number of words in the sentence
that occur in the title or the section headings of the document, (2) position of the sen-
tence in the overall document and within a section, (3) the number of words within a
sentence that match a pre-compiled domain-specific list of cue words (Nenkova and
McKeown, 2011). Edmundson also used a corpus of documents and summaries to both
determine feature weights and perform evaluation for his system.
Another early innovator was Paice (Paice, 1990), who addressed reference resolu-
tion, an inherent issue in extractive summary generation. Extractive systems select
the best representative sentences for a topic. The selected sentences are usually not
contiguous, leaving anaphora and cataphora unresolved, affecting the understanding
and readability of the generated summary. Paice proposed a template-based system
that matched exophora to a pre-built list in order to add sentences before or after the
selected sentence. He also described a system that would replace anaphora and cat-
aphora with the appropriate reference, but did not actually implement this solution
(Nenkova and McKeown, 2011).
Figure 2.1 illustrates the basic processes of extractive summarization systems.
Figure 2.1: Basic Summarization Process (Sentence Pre-Processing, Sentence Extraction, Summary Generation)
2.3 Frequency-based Approaches to Sentence Extraction
Frequency-based approaches to sentence extraction are used in many unsupervised
summarization systems. The most basic form of frequency measure, raw frequency, is
biased by the length of the document, so additional, more complex measures of frequency
are typically used. Three frequentist approaches used in summarization are: (1) Word
probability, (2) Term Frequency/Inverse Document Frequency (TF/IDF), (3) Log Likeli-
hood Ratio.
2.3.1 Word Probability
The word probability approach is the simplest measure and is based on the basic prob-
ability of a word w, given the count of the word c(w) and the count of all words N in the input:

p(w) = \frac{c(w)}{N}    (2.1)
The SumBasic system (Nenkova and Vanderwende, 2005) implements word probabil-
ity to assign weight to input sentences. Each sentence S is weighted by the average
probability of the content words p(w) it contains by the formula:
Weight(S_j) = \frac{\sum_{w_i \in S_j} p(w_i)}{|\{w_i \mid w_i \in S_j\}|}    (2.2)
A stop word list is used to filter non-content words from the count. SumBasic selects
the highest scoring sentence containing the highest probable word. The assumption
is that the highest probable content word is indicative of the topic of the document
and a sentence containing this word as well as the highest average probability of other
content words is the best candidate for a summary. Based on the evidence that the
probability of a word occurring twice in a summary is less than the probability of a
word occurring only once in a summary (Nenkova and Vanderwende, 2005), the se-
lected sentence’s content words are re-ranked based on the square of their probability,
reducing the chance that duplicate content words are selected for the summary. The
selection process is repeated using the highest probable content word to rank subse-
quent candidate sentences. The process is repeated until the summary maximum word
length is reached.
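A minimal sketch of the SumBasic selection loop described above, assuming pre-tokenized sentences and a caller-supplied stop word list; the function and parameter names are illustrative rather than taken from the original system.

from collections import Counter

def sumbasic(sentences, max_words=100, stop_words=frozenset()):
    # sentences: list of token lists; returns selected sentences as strings
    tokens = [[t.lower() for t in s if t.lower() not in stop_words]
              for s in sentences]
    n = sum(len(s) for s in tokens)
    p = {w: c / n for w, c in Counter(t for s in tokens for t in s).items()}

    summary, summary_len = [], 0
    remaining = list(range(len(sentences)))
    while remaining and summary_len < max_words:
        # pick the best-scoring sentence among those containing the currently
        # most probable content word (fall back to all remaining sentences)
        top_word = max(p, key=p.get)
        candidates = [i for i in remaining if top_word in tokens[i]] or remaining
        best = max(candidates,
                   key=lambda i: sum(p[w] for w in tokens[i]) / max(len(tokens[i]), 1))
        summary.append(" ".join(sentences[best]))
        summary_len += len(sentences[best])
        remaining.remove(best)
        # squaring the probabilities of used words discourages repetition
        for w in tokens[best]:
            p[w] = p[w] ** 2
    return summary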
The straightforward use of frequency in the SumBasic system performs surpris-
ingly well, ranking statistically among the top systems in DUC 2004 and MSE 2005
(Nenkova and Vanderwende, 2005) using the ROUGE-1 measurement. Improvements
to the SumBasic approach are described in (Yih et al., 2007). The next iteration of the
system combines frequency and position features with a discriminative machine
learning-based algorithm, and replaces heuristic greedy sentence selection with an
optimization process over the complete summary. The
improved system ranks by ROUGE-1 statistically at the top of DUC 2004 and MSE
2005 systems.
2.3.2 Term Frequency/Inverse Document Frequency
Term Frequency/Inverse Document Frequency (TF/IDF) (Salton and Buckley, 1988) is
a long-standing approach in information retrieval and text summarization for statis-
tically measuring the importance of a word in a document based on its proportional
frequency in a corpus (Jurafsky and Martin, 2009). The first component of TF/IDF is
term frequency, a count of a term within a document normalized for document length.
The second component, inverse document frequency, is the log of the total number of
documents in the corpus divided by the number of documents that contain the term.
The product of the two is used as the weight of the term in the corpus. To compensate
for terms that occur in zero or one document in the corpus, one is added to the
denominator of the inverse document frequency to avoid division by zero.
The formula for term frequency is:
tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}    (2.3)
where the number of occurrences of term i in document j, represented by n_{i,j}, is
normalized by the number of terms in the whole document. The inverse document
frequency (Jones, 1972) is defined as:
idf_i = \log \frac{|D|}{1 + |\{d : t_i \in d\}|}    (2.4)
where the total number of documents in the corpus is divided by the number of docu-
ments containing the given term i. Each term in each document is given a TF/IDF score:
(tf/idf)_{i,j} = tf_{i,j} \times idf_i    (2.5)
TF/IDF weights are good indicators of which terms in a document are the most de-
scriptive content words in that document.
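The following sketch (Python, illustrative only) computes per-document TF/IDF weights following Equations 2.3-2.5; the tokenization and lowercasing choices are placeholders, not part of any particular system described here.

import math
from collections import Counter

def tf_idf(documents_tokens):
    # documents_tokens: list of documents, each a list of tokens
    doc_freq = Counter()
    for doc in documents_tokens:
        doc_freq.update(set(t.lower() for t in doc))
    num_docs = len(documents_tokens)

    weights = []
    for doc in documents_tokens:
        counts = Counter(t.lower() for t in doc)
        total = sum(counts.values())
        # term frequency (Eq. 2.3) times inverse document frequency (Eq. 2.4),
        # with one added to the denominator to avoid division by zero
        weights.append({
            term: (count / total) * math.log(num_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return weights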
When a summary is query-focused, a Euclidean Distance or Cosine Similarity mea-
sure can be used to find the sentence vector with the smallest angle distance from
the query. Feature vectors for each document are created using term and TF/IDF-score
feature/value pairs, and each document vector is compared to the query vector using a cosine
similarity measure (Salton et al., 1975):
sim(q, d_j) = \frac{\sum_{i=1}^{N} w_{i,q} \, w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,q}^2} \, \sqrt{\sum_{i=1}^{N} w_{i,j}^2}}    (2.6)
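A small sketch of the cosine similarity of Equation 2.6 over sparse term-weight vectors, such as the dictionaries produced by the TF/IDF sketch above; the dictionary representation is an illustrative choice.

import math

def cosine_similarity(query_vec, doc_vec):
    # query_vec, doc_vec: dicts mapping term -> weight (e.g., TF/IDF scores)
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[t] * doc_vec[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)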
In (Hovy and Lin, 1999) the concept of term signatures is developed for an early ver-
sion of the Summarist system. TF/IDF weighting is used to find 300 signature content
words for 32 categories of documents in 30,000 texts of the Wall St. Journal. The
32 classes of 300 signature terms are then used to classify 2,204 unseen documents
from the same Wall St. Journal corpus using a cosine similarity measure. Precision of
0.69309 and recall of 0.75662 were reported, in line with information retrieval results
at the time. Although a summary generation module is only described and not imple-
mented in the paper, the classifying term signature used to classify the texts could also
be used to select and rank sentences based on smallest cosine angle between vectors of
signature and sentence in a hyperplane.
2.3.3 Log Likelihood Ratio
In his paper describing the challenge of finding an appropriate statistical model of
distribution for natural language processing, Dunning introduces the Log Likelihood
Ratio (LLR) (Dunning, 1993). He argues LLR is a good option for representing sparse
data distributions like that of words in a corpus. He equates counting words in a
corpus to a Bernoulli trial. Each test of a word matching a prototype has a probability
p, and the number of matches in the next n trials is a random variable K with a
binomial distribution:
p(K = k) = \binom{n}{k} p^k (1 - p)^{n-k}    (2.7)
whose mean is np and variance is np(1-p). Dunning demonstrates that if np(1-p) >
5 then the discrete binomial distribution approximates the continuous normal distri-
bution; however, when np(1-p) < 5, and more so when np(1-p) < 1, the error when
approximating using the normal distribution gets larger. The nature of word frequency
is such that many words would occur rarely in a document. Because of this, Dunning
suggests another class of test that does not depend so much on normality, the log like-
lihood ratio. Moore refers to (Dunning, 1993) as introducing the NLP community to
this statistic and additionally labels it the G2 log-likelihood ratio (Moore, 2004).
First implemented for summarization as a statistic for calculating topic signatures
in (Lin and Hovy, 2000), LLR compares two hypotheses about the probability of a word
in a foreground corpus and the probability of the same word in a background corpus.
Hypothesis 1: P(w|I) = P(w|B)
Hypothesis 2: P(w|I) > P(w|B)    (2.8)
where I is the foreground corpus and B is the background corpus. If Hypothesis 2 holds,
then w is descriptive of the foreground corpus. As described above, the probability p of
a word w is a Bernoulli trial and a binomial distribution:
p(K = k) = \binom{n}{k} p^k (1 - p)^{n-k}    (2.9)
The two hypotheses are compared using a likelihood ratio of their probabilities:
\lambda = \frac{L(p, k_1, n_1)\, L(p, k_2, n_2)}{L(p_1, k_1, n_1)\, L(p_2, k_2, n_2)}    (2.10)
where likelihood is calculated 1:
L(p, k, n) = p^k (1 - p)^{n-k}    (2.11)
and the probabilities are defined as
p_1 = \frac{k_1}{n_1}, \quad p_2 = \frac{k_2}{n_2}, \quad p = \frac{k_1 + k_2}{n_1 + n_2}    (2.12)
where k1 is the count of a word within the foreground, n1 is the total number of words
in the foreground, k2 is the count of the word within the background corpus, and n2 is
the total number of words in the background corpus. Dunning reduces the formula to
calculate for −2logλ as:
-2\log\lambda = 2\,[\log L(p_1, k_1, n_1) + \log L(p_2, k_2, n_2) - \log L(p, k_1, n_1) - \log L(p, k_2, n_2)]    (2.13)
which correlates to the statistic chi-squared. The chi-squared distribution model can
be used to establish statistical thresholds for determining topic signatures. In (Lin and
Hovy, 2000) the cut-off weight for −2 log λ was set at 10.83 with confidence level α =
0.001 (chi-squared lookup).
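A sketch of the −2 log λ computation from Equations 2.10-2.13, with the binomial coefficients already canceled; the counts in the example call and the clamping of degenerate probabilities are illustrative implementation choices, not part of Dunning's formulation.

import math

def log_likelihood(p, k, n):
    # log L(p, k, n) = k*log(p) + (n - k)*log(1 - p), per Equation 2.11 without
    # the binomial coefficient (it cancels in the ratio); clamp p to avoid log(0)
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def neg2_log_lambda(k1, n1, k2, n2):
    # -2 log(lambda) per Equation 2.13: k1/n1 are the counts in the foreground
    # (topic) corpus, k2/n2 the counts in the background corpus
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2 * (log_likelihood(p1, k1, n1) + log_likelihood(p2, k2, n2)
                - log_likelihood(p, k1, n1) - log_likelihood(p, k2, n2))

# A unit belongs to the topic signature when -2 log(lambda) exceeds the
# chi-squared cut-off of 10.83 (alpha = 0.001) used by Lin and Hovy (2000).
CUTOFF = 10.83
print(neg2_log_lambda(k1=30, n1=10_000, k2=50, n2=1_000_000) > CUTOFF)   # True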
In (Moore, 2004), the relationship between G2 and Mutual Information is explored
by creating derivations of Dunning’s original formula. The last derivation demon-
strates that G2 equals 2N times the formula for the average Mutual Information of
two random variables. The correlation with mutual information means that G2 has two
important characteristics: (1) it can be used to measure word association independent
of significance, and (2) like mutual information, it is independent of corpus size and can be
used with corpora of varying sizes. See (Moore, 2004) for alternate formulas for α.
1 Because the numerator and denominator have the same binomial coefficients in them, they are canceled out and removed from the formula, greatly simplifying its calculation.
In 2000, a new version of the Summarist topic signature module is implemented us-
ing LLR instead of TF/IDF (Lin and Hovy, 2000). Lin and Hovy compare their TF/IDF
and LLR-based systems and conclude that the LLR is the best performing solution.
Their paper is the first to comprehensively describe Dunning’s LLR measure as it re-
lates to automatic text summarization.
2.4 Abstractive approaches to text summarization
The DUC and TAC evaluation results of the last 11 years underscore the superiority of
human-generated summaries over automated text summaries, especially in regard to
linguistic quality and readability. A human summarizer is able to synthesize informa-
tion and introduce new terms and phrases to achieve a level of abstraction that cannot
be achieved by automated systems. Abstractive summarization systems attempt to
get closer to the quality of human summarizers by incorporating semantics, discourse
theory, and language generation.
2.4.1 SUMMONS
The SUMMONS (SUMMarizing Online NewS articles) news aggregator is an exam-
ple of an early abstractive system for multi-document text summarization. It used
event and activity content templates created for the 1992 ARPA Message Understand-
ing Conference (MUC) to extract data from a series of Associated Press and Reuters
newswire articles. The extracted data was then organized and enriched by a content
planner, which added associated information text from a knowledge base, and passed
this data to a language generator, which created English sentences with the proper
syntax and inflection, resulting in a summary paragraph of one or more sentences
(McKeown and Radev, 1995). SUMMONS was one of the first systems to prepare
summaries for a series of documents and provided a model for future explorations of
abstractive text summarization.
2.4.2 RALI-DIRO at TAC 2010
At TAC 2010, an abstractive system was submitted by a collaboration between the
University of Montreal’s Recherche appliquee en linguistique informatique (RALI) and
Departement d’informatique et de recherche operationnelle (DIRO) (Genest and La-
palme, 2010). Their abstractive system was based on an intermediate representation
of text, between extraction and generation, which they call an information unit (InIt).
They defined an information unit as the smallest element of coherent information that
can be extracted from a sentence. However, given the complexity of sentences, an infor-
mation unit may refer to an sentence as small as a single noun or extend to an entire
clause describing an event. For their TAC 2010 system, they restricted information
units to be subject-verb-object triples that could be extracted from dependency parses
of sentences using the MINIPAR 2 parser. They were able to attach time and location
information for each sentence as properties of the extracted triples. The triples, time
and location information, and original extracted sentences were then used to remove
redundancies in the extracted sentences and generate short concise English sentences
using the SimpleNLG 3 generation system.
2.4.3 Human EXtraction for TAC: HexTac
The HexTac system, a participant in TAC 2009, was designed to set an upper bound
on the extractive summarization approach and serve as one of the baseline systems
that other systems were compared to (Genest et al., 2009). The system was designed as a set of
tools for human summarizers to use to extract entire sentences from a set of texts and
create a 100 word summary. The human summarizers could only use their own judg-
ment to decide on which sentences were the best candidates and could not change any
aspect of the sentence itself. The summaries produced by the HexTac system received
higher scores than all automated text summarization systems for linguistic quality
and overall responsiveness, but still were unable to beat any of the human generated
2 http://webdocs.cs.ualberta.ca/~lindek/minipar.htm
3 http://code.google.com/p/simplenlg/
abstractive model summaries. The system also performed very well in ROUGE eval-
uation, suggesting it could be a candidate approach for creating extractive models for
system comparison, not unlike the abstractive models used in the TAC 2009 evaluation.
The overall conclusion drawn in the HexTac paper is that although the
human extractive summaries were not able to beat human abstractive summaries in
regard to linguistic quality and overall responsiveness, they did perform better than
all automatic summarization systems, indicating that there is still ‘headroom’ for im-
provement in extractive summarization.
2.5 Summarization Shared Tasks
Automatic text summarization has been a featured task in the last 11 years of the
National Institute of Standards and Technology (NIST) sponsored annual workshops,
the Document Understanding Conference (DUC) 2001-2007, and the Text Analysis
Conference (TAC) 2008-2011. The annual workshops have provided a common data
and evaluation framework for participants to develop and compare text summarization
systems.
2.5.1 Document Understanding Conference
The Document Understanding Conference, sponsored by the Advanced Research and
Development Activity (ARDA), and run by the National Institute of Standards and
Technology (NIST), emerged in 2000 out of the need for a common evaluation frame-
work for summarization tasks previously sponsored by programs run independently by
DARPA’s Translingual Information Detection Extraction and Summarization (TIDES),
ARDA’s Advanced Question & Answering Program and NIST’s TREC (Text Retrieval
Conferences). The first DUC conference was held September 13-14, 2001 in New Or-
leans, Louisiana4. Twenty-five groups participated and fifteen sets of summaries were
submitted for evaluation.
DUC followed a consistent annual cycle over its seven-year run: (1) call for participation,
(2) release of test data, (3) submission deadline (usually two weeks following test data
release), (4) return of evaluation results, (5) submission of workshop papers, (6)
workshop (a two-day event), (7) final papers published.

4 http://wwwnlpir.nist.gov/projects/duc/pubs/2001slides/pauls slides/index.htm
Task Guidelines
DUC task guidelines evolved over the lifetime of the conference. During the first year,
both single and multiple document summaries of varying lengths, as well as generic
and query-focused summarization tasks were shared by participants. In its final year,
DUC 2007, the guided summarization task was defined as the creation of a 250 word
text summary of a given topic, a topic statement, and 25 pre-selected topic-related
newswire documents. Simple lists of names, events, dates, etc. were discouraged in
pursuit of fluent and readable summaries consisting of sentences. An optional update
task required a 100 word summary of new information from an additional document
set 25 pre-selected newswire documents. NIST assessors were responsible for select-
ing topic documents from a newswire corpus, defining topic titles, and queries. The
newswire documents were selected from the AQUAINT corpus, which includes Asso-
ciated Press and New York Times articles from 1998-2000 and Xinhua News Agency
articles from 1996-2000.
The DUC series of conferences showcased many advances in multi-document au-
tomated text summarization and led to the development of the manual evaluation
Pyramid framework as well as the automated statistical tools, ROUGE
and BE.
2.5.2 Text Analysis Conference
The Text Analysis Conference was established in 2008 and the goal of the conference
was to re-emphasize and encourage participants to go beyond extractive summariza-
tion approaches and automated statistical measures to explore deeper linguistic analy-
sis. A new aspect-oriented approach reshaped the guided summarization task in 2009,
as well as the addition of two companion tasks: Knowledge Base Population (KBP) and
Recognizing Textual Entailment (RTE).
The TAC 2011 Guided Summarization task required participants to create a one hundred-
word summary of ten newswire documents and a subsequent one hundred-word update
summary of an additional ten newswire documents. Forty-four topic collections, each of two
document sets of ten relevant documents, were divided into five pre-determined cate-
gories of topic:
1. Accidents and Natural Disasters
2. Attacks
3. Health and Safety
4. Endangered Resources
5. Investigations and Trials
Unlike the narrative topic inputs to the DUC summarization task (title and one or
more natural language sentences), TAC defines a set of aspects for each topic category
that guide the summarization task. For example, the aspects of the topic category
Accidents and Natural Disasters are defined as:
1. WHAT: what happened
2. WHEN: date, time, other temporal placement markers
3. WHERE: physical location
4. WHY: reasons for accident/disaster
5. WHO AFFECTED: casualties (death, injury), or individuals otherwise negatively
affected by the accident/disaster
6. DAMAGES: damages caused by the accident/disaster
7. COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts or other
reactions to the accident/disaster.
TAC Cycle
The TAC cycle is similar to DUC: registration, system development, test data release,
submission, evaluation, and workshop. Typically, participants develop systems based
on the previous year’s data, and when the test data are released they have a short period of time
to run their system against unseen test data to produce ‘runs’ of summaries that are
submitted to NIST for evaluation. Summaries are manually evaluated using three
frameworks: Pyramid, Readability/Fluency and Responsiveness. Summaries are also
evaluated using two automated statistical evaluation tools, ROUGE and BE.
2.6 DUC/TAC Evaluation Frameworks
During the series of DUC conferences between 2001-2007, both human and automatic
evaluation frameworks were developed to measure automatic text summarization sys-
tems. These frameworks continue to be employed in the TAC conferences 2008-2011.
The human evaluation frameworks include Pyramid and a simple scale for readabili-
ty/fluency and responsiveness. Automated statistical tools include ROUGE and BE.
2.6.1 Pyramid
The Pyramid method for human assessment of summaries was originally proposed for
DUC 2005 (Passonneau et al., 2005) and has been used every year since in both DUC
and TAC Workshops to evaluate summaries. The Pyramid method relies on human
annotators to define and identify Semantic Content Units (SCUs) within a selection of
human model and automated peer summary submissions.
At the beginning of the TAC evaluation period, NIST assigns four human assessors
to each topic statement and its two sets of ten documents5. The assessors are respon-
sible for creating a one hundred word model summary based on the TAC published
guidelines for creating model summaries6. For the 2010 and 2011 TAC Workshops,
summaries are guided by categories and their aspect, therefore model summary au-
thors create their summaries with the intention of covering all aspects for each cate-
gory in the one hundred word summary.
5 The two document sets represent the ten documents that have been identified as relating to the topic statement for an initial and an update summary.
6 http://www.nist.gov/tac/2011/Summarization/guided summarization.instructions.pdf
Pyramid evaluation requires the creation of SCUs for each topic’s four model sum-
maries. The SCUs represent the most important units of information identified by a
human in a summary. Each SCU is associated with a category aspect, and given a
weight based on how many of the four model summaries it occurs in. A SCU is a se-
mantic label which is associated with one or more contributors from each summary. A
contributor is a continuous or discontinuous string of words that have the same mean-
ing as the semantic label, see (Passonneau et al., 2005) for an example.
During the peer summary evaluation phase, the NIST assessor will score each peer
summary for the number of SCUs it contains. The repetition of SCUs does not increase
or decrease the score. SCUs are counted only once. The final Pyramid score for a peer
summary equals the sum of the weights of SCUs divided by the maximum possible
sum of SCUs7.
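The sketch below (Python, illustrative only) shows that score computation, assuming SCU identifiers, their weights, and the average model SCU count are already available from annotation; the names and numbers in the example are invented.

def pyramid_score(peer_scu_ids, scu_weights, average_model_scu_count):
    # Sum of the weights of the SCUs found in the peer summary (each counted
    # once), divided by the maximum sum achievable with the average number of
    # SCUs found in the model summaries.
    observed = sum(scu_weights[scu] for scu in set(peer_scu_ids))
    # the maximum is obtained by taking the highest-weighted SCUs first
    best_weights = sorted(scu_weights.values(), reverse=True)
    maximum = sum(best_weights[:average_model_scu_count])
    return observed / maximum if maximum else 0.0

# illustrative numbers only: with four model summaries, weights range from 1 to 4
weights = {"scu1": 4, "scu2": 3, "scu3": 2, "scu4": 1, "scu5": 1}
print(pyramid_score(["scu1", "scu3", "scu3"], weights, average_model_scu_count=3))
# (4 + 2) / (4 + 3 + 2) = 0.666...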
2.6.2 Readability/Fluency and Responsiveness
Readability/Fluency and Responsiveness are both evaluated on a scale of 1-5: (1) Very
Poor (2) Poor (3) Barely Acceptable (4) Good and (5) Very Good.
Readability/Fluency captures the grammaticality, non-redundancy, referential clar-
ity, focus, and structure and coherence of a summary. Aspect coverage or information
quality is not to be considered by assessors when scoring a summary for readabili-
ty/fluency. Responsiveness is a mixture of aspect coverage and readability; entities and
events relating to categories and aspects are important, but cannot simply be injected
into the summary without considerations of readability/fluency.
2.6.3 Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
ROUGE8 was inspired by an investigation of using BLEU and NIST to evaluate the
quality of automatic summaries compared to human model summaries in (Lin and
Hovy, 2003). The ROUGE tool was subsequently developed and first featured in DUC
7 Determined by the average number of SCUs in the model summaries.
8 http://berouge.com/default.aspx
2004 (Lin, 2004). ROUGE compares automatically generated summaries to human
model summaries by looking at overlapping features like n-grams, word sequences,
and word pairs. Options at run-time to the ROUGE script configure the unit of overlap
that is measured and compared in the summaries. The two that are automatically run
for all systems during the evaluation phase of TAC are ROUGE-2, which is bigram
based, and ROUGE-SU4, which is also bigram based but allows for a maximum 4-
token gap between grams.
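The sketch below computes a plain bigram-overlap F-measure against a single model summary to illustrate what ROUGE-2 measures; it is not the official toolkit, which adds stemming, stop word handling, multi-model jackknifing, and bootstrap confidence intervals.

from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def rouge_2_f(peer_tokens, model_tokens):
    # Bigram-overlap recall, precision, and F-measure against one model summary
    peer, model = Counter(bigrams(peer_tokens)), Counter(bigrams(model_tokens))
    overlap = sum(min(c, peer[g]) for g, c in model.items())
    recall = overlap / max(sum(model.values()), 1)
    precision = overlap / max(sum(peer.values()), 1)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)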
2.6.4 Basic Elements (BE)
In an effort to improve upon the ROUGE automatic evaluation tool, Hovy et al. in-
troduce a new evaluation tool for DUC 2005, Basic Elements (BE)9 (Hovy et al., 2005).
BE tackles the problem of measurement unit for automatic comparison between sum-
maries. Basic Elements uses shallow syntactic information to construct increasingly
larger units starting with a single word. Summaries are then compared using the
Basic Elements units.
2.7 Natural Language Processing Software Libraries
The availability of open source natural language processing libraries for multiple oper-
ating system platforms and programming languages supports the rapid development
of custom NLP processing pipelines with robust and tested off-the-shelf modules, sig-
nificantly reducing development time. In the following sections I describe two general
purpose software libraries for natural language processing, GATE and the Stanford
coreNLP package.
General Architecture for Text Engineering
The General Architecture for Text Engineering (GATE)10 from the University of Sheffield
is a multi-platform open source natural language processing suite of tools written in
9 http://www.isi.edu/publications/licensed-sw/BE/index.html
10 http://gate.ac.uk/
Java (Cunningham et al., 2002). First released in 2000, GATE has evolved into an
embedded framework (Java libraries) that can be used to create custom NLP systems,
an Integrated Development Environment that provides a visual programming inter-
face for developing NLP solutions, and a cloud service for distributed service-oriented
software systems. GATE’s embedded framework consists of an extensive library of
NLP-oriented classes and methods in Java. It provides object structures for common
NLP resources, like documents, annotations, corpora, lexical databases, and ontolo-
gies and an execution pipeline that hosts one or more processing modules. GATE’s
logical architecture of resources, pipelines, and processing modules is similar to that of
Apache UIMA (Unstructured Information Management Architecture)11. It includes a
set of classes to integrate UIMA modules directly into a GATE pipeline and a strategy
to include GATE plug-ins into a UIMA pipeline. The component model in GATE is
called CREOLE (Collection of REusable Objects for Language Engineering).
ANNIE (a Nearly New Information Extraction system) is a default set of CREOLE
plugins that form a generic pipeline for linguistic annotation which consists of the
following components:
Document Reset Processing Resource: a component that removes any existing
annotation sets from a GATE Document Language Resource
English Tokenizer: a tokenizer that splits text into simple tokens (numbers, punctu-
ation and words of different types) and pre-processes tokens using a JAPE trans-
ducer to provide additional features to the ANNIE PoS Tagger.
Gazetteer: a list-driven Named Entity tagger
Sentence Splitter: a sentence splitter that preprocesses sentences for the ANNIE
PoS Tagger
Part of Speech Tagger: a modified version of the Brill Tagger
Named Entity Transducer: Additional JAPE-based heuristics for Named Entity recog-
nition
Orthomatcher: an Orthographic Named Entity co-reference tagger
11 http://uima.apache.org/
Stemmer: a rudimentary stemming algorithm
Anaphor Resolver: a co-reference resolution module that finds antecedents for peo-
ple, and optionally other Named Entities such as locations
GATE can export a default stand-off XML representation of its Annotation collections.
Stanford CoreNLP
The Stanford CoreNLP12 package is a comprehensive Natural Language Processing
library written in Java. Unlike GATE, the Stanford CoreNLP library does not include
an integrated development environment or an off-the-shelf plug-in architecture. It is
designed as a general framework that composes easily with end-to-end applications.
It is a fusion of a selection of Stanford standalone NLP tools into an integrated pro-
gramming and execution environment in Java. The central component in all Stanford
CoreNLP solutions is the pipeline of annotators. These modules produce a stand-off
XML representation of annotations over input text. Each annotator either adds to or
builds on previous annotations to produce an aggregate stand-off representation that
can be serialized into stand-off XML.
An example of a default pipeline of annotators for the Stanford coreNLP includes:
Tokenizer: a Penn Treebank-style tokenizer extended for noisy web input
Sentence splitter: sentence splitter that can be extended by parameters. By default
uses end-of-line.
Part-of-Speech Tagger: integration of the latest release of the standalone Stanford
POS Tagger
Lemmatization: generates word lemmas for all tokens
Named Entity Recognition: integration of standalone extendable NER services. By
default PERSON, LOCATION, ORGANIZATION, MISC, DATE/TIME.
Dependency Parser: integration of the latest release of the standalone parser pro-
viding full syntactic information including constituents and dependencies.
12http://nlp.stanford.edu/software/corenlp.shtml
Co-reference Resolver: integration of standalone coreference resolution services.
Chapter 3
METHODOLOGY
In this chapter, I outline the methodology I followed to compare two methods of calcu-
lating LLR. The first is the standard method of calculating LLR used to rank sentences
for extractive text summarization described in (Nenkova and McKeown, 2011). The
second is an alternate method I propose based on dependency relations. I elaborate
on the motivation for my approach and describe the overall structure, program flow,
statistical methods, and evaluation framework I selected in the design of the systems
I built to compare the methods.
3.1 Topic Signatures
Topic signatures have been and continue to be used to select and rank candidate sen-
tences for automatic text summarization since being introduced by Lin and Hovy in
1999 (Hovy and Lin, 1999). Log Likelihood Ratio (LLR), LLR with cut-off (LLR-C),
and LLR with cut-off and query-focused (LLR-CQ) are three statistical measures of
frequency used to calculate topic signatures for summarization. The formula for calcu-
lating LLR is from (Dunning, 1993).
\lambda = \frac{L(p, k_1, n_1)\,L(p, k_2, n_2)}{L(p_1, k_1, n_1)\,L(p_2, k_2, n_2)} \quad (3.1)
where likelihood is calculated 1:
L(p, k, n) = p^k (1 - p)^{n-k} \quad (3.2)
1Because the numerator and denominator contain the same binomial coefficients, they cancel out and are removed from the formula, greatly simplifying its calculation
and the probabilities are defined as
p_1 = \frac{k_1}{n_1}, \quad p_2 = \frac{k_2}{n_2}, \quad p = \frac{k_1 + k_2}{n_1 + n_2} \quad (3.3)
where k1 is the count of an information unit (typically a word) within the foreground,
n1 is the total number of information units in the foreground, k2 is the count of the in-
formation unit within the background corpus, and n2 is the total number of information
units in the background corpus. The information units in the foreground can be ranked
based on their LLR values. Information units with the highest LLR values are consid-
ered the most descriptive of the topic of the foreground corpus and are collectively de-
scribed as the topic’s signature (Hovy and Lin, 1999). LLR has been demonstrated to be
the best measure of descriptiveness for greedy sentence-by-sentence multi-document
summarization (Nenkova and McKeown, 2011).
Log Likelihood Ratio with Cut-off (LLR-C) assigns a cut-off value for LLR-ranked
information units. Based on the restatement of Equation 3.1 in (Dunning, 1993),
−2logλ can be correlated to the chi-squared statistic.
-2\log\lambda = 2\,[\log L(p_1, k_1, n_1) + \log L(p_2, k_2, n_2) - \log L(p, k_1, n_1) - \log L(p, k_2, n_2)] \quad (3.4)
In (Lin and Hovy, 2000) the cut-off weight for −2 log λ was set at 10.83 with confidence
level α = 0.001 (chi-squared lookup). An LLR-C approach uses the cut-off value to
determine the information units that constitute a topic signature; when the topic
signature is then applied to weight sentences, each information unit within a sentence
is assigned a value of either one or zero. The sentences with the highest aver-
age score, within a minimum and maximum length threshold, are considered the best
candidate sentences for extraction.
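As an illustration of Equations 3.1 through 3.4, the following minimal sketch in Java (an
illustration only, not the code of the system described later) computes −2 log λ for a single
information unit in log space; the counts supplied in the main method are hypothetical.

/**
 * A minimal sketch of the -2 log lambda computation from Equations 3.1-3.4.
 * k1/n1 are the foreground counts and k2/n2 the background counts for one
 * information unit; the binomial coefficient is omitted because it cancels
 * between the numerator and denominator (see footnote 1).
 */
public class LogLikelihoodRatioSketch {

    // log of L(p, k, n) = p^k (1 - p)^(n - k), guarding against log(0)
    static double logL(double p, double k, double n) {
        double term1 = (k > 0) ? k * Math.log(p) : 0.0;
        double term2 = (n - k > 0) ? (n - k) * Math.log(1 - p) : 0.0;
        return term1 + term2;
    }

    /** Returns -2 log lambda; values above 10.83 pass the LLR-C cut-off. */
    static double minusTwoLogLambda(double k1, double n1, double k2, double n2) {
        double p1 = k1 / n1;
        double p2 = k2 / n2;
        double p  = (k1 + k2) / (n1 + n2);
        return 2 * (logL(p1, k1, n1) + logL(p2, k2, n2)
                  - logL(p,  k1, n1) - logL(p,  k2, n2));
    }

    public static void main(String[] args) {
        // hypothetical counts: 12 of 5,000 foreground units, 40 of 2,000,000 background units
        System.out.println(minusTwoLogLambda(12, 5_000, 40, 2_000_000));
    }
}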
The Log Likelihood Ratio with Cut-off and Query-focused (LLR-CQ) approach uses
a query-filtered topic signature to assign weights to information units in sentences. A
query is transformed into a set of information units (typically words). The weight of a
sentence is then calculated by assigning a one to each information unit that appears in
both the query and the topic signature, and a zero otherwise.
3.2 Calculating LLR with Dependency Relations
In this thesis I propose an alternate method for calculating the probabilities used in
the formula for LLR. The standard method for calculating the probabilities for LLR is
described in the preceding section and Equation 3.3. Units of information are counted
as they occur within the sentences of a topic-focused corpus of documents (foreground)
and a more general non-topic-focused corpus of documents (background). I propose an
alternate method for counting units of information based on the dependency relations
derived from sentences rather than the sentence itself. In this new method, a unit of
information is counted for each dependency relation it participates in either as a de-
pendent or a governor. This alternate method is motivated by the observation that in
a collapsed and propagated dependency representation of a sentence, certain informa-
tion units participate in multiple relations (Marneffe et al., 2006). Below is an example
of collapsed and propagated dependency relations for the sentence, Bills on ports and
immigration were submitted by Senator Brownback, Republican of Kansas2.
nsubjpass(submitted, Bills)
auxpass(submitted, were)
agent(submitted, Brownback)
nn(Brownback, Senator)
appos(Brownback, Republican)
prep_of(Republican, Kansas)
prep_on(Bills, ports)
conj_and(ports, immigration)
prep_on(Bills, immigration)
Given the definition of probabilities for LLR calculation in Equation 3.3, the new
method requires a restatement of counts that make up each of the probabilities in the
formula. k1 is the count of the number of dependency relations an information unit
participates in within the collection of dependency relations generated from the foreground
corpus. n1 is the total number of dependency-role participations of all information units
in the foreground. k2 is the count of the number of dependency relations an information
unit participates in within the collection of dependency relations generated from the
background corpus, and n2 is the total number of dependency-role participations of all
information units in the background. My hypothesis is that counting the number of
dependency roles that a unit of information participates in could contribute to a more
descriptive topic signature3.
2http://nlp.stanford.edu/software/stanford-dependencies.shtml
Given the sample sentence and generated dependency relations above, the proba-
bilities for the sentence's information units based on the standard method and my
proposed method are compared in Table 3.1:
Table 3.1: Comparison of simple sentence-based probabilities calculated by word count
and by participation in dependency role count

Count              Bills   on     ports   and    immigration   were    submitted
word               1/13    1/13   1/13    1/13   1/13          1/13    1/13
dependency role    1/6     -      1/9     -      1/9           1/18    1/6

Count              by      Senator   Brownback   of     Republican   Kansas
word               1/13    1/13      1/13        1/13   1/13         1/13
dependency role    -       1/18      1/6         -      1/9          1/18
In the sentence above, the proposed dependency relation count assigns higher proba-
bility to information units that participate in more relations.
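To make the proposed counting rule concrete, the short Java sketch below (an illustration
only, not the system's code) tallies how many dependency roles each word of the example
sentence fills; the three-of-eighteen participation count for Bills reproduces the 1/6
probability in Table 3.1.

import java.util.HashMap;
import java.util.Map;

/** Counts one participation per dependency relation, for governor and dependent alike. */
public class DependencyRoleCounter {

    // each entry is {relation, governor, dependent}, taken from the example above
    static final String[][] RELATIONS = {
        {"nsubjpass", "submitted",  "Bills"},
        {"auxpass",   "submitted",  "were"},
        {"agent",     "submitted",  "Brownback"},
        {"nn",        "Brownback",  "Senator"},
        {"appos",     "Brownback",  "Republican"},
        {"prep_of",   "Republican", "Kansas"},
        {"prep_on",   "Bills",      "ports"},
        {"conj_and",  "ports",      "immigration"},
        {"prep_on",   "Bills",      "immigration"},
    };

    public static void main(String[] args) {
        Map<String, Integer> roleCounts = new HashMap<>();
        for (String[] rel : RELATIONS) {
            roleCounts.merge(rel[1], 1, Integer::sum); // governor participation
            roleCounts.merge(rel[2], 1, Integer::sum); // dependent participation
        }
        int totalRoles = RELATIONS.length * 2;         // 18 role slots in this example
        // "Bills" fills 3 of the 18 roles, giving the probability 1/6 shown in Table 3.1
        System.out.println(roleCounts.get("Bills") + "/" + totalRoles);
    }
}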
The definition of unit of information for calculating LLR may also have an impact
on the effectiveness of topic signatures. I defined four types of information unit that I
planned to use for experiments comparing the two methods of counting. Definitions
for the four types of information unit are:
3Collapsed and propagated dependency relations produced by the Stanford coreNLP package are de-scribed in (Marneffe et al., 2006) and in a short overview on http://nlp.stanford.edu/software/stanford-dependencies.shtml
1. a word
2. a case-neutral lemmatized word
3. a case-neutral lemmatized word combined with a part-of-speech tag
4. a case-neutral lemmatized word combined with a generalized part-of-speech tag
restricted to nouns, verbs, and adjectives.
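The sketch below illustrates how the four definitions above might be derived from a single
annotated token; the method names are illustrative assumptions, and the N/V/ADJ
generalization anticipates the normalization used in the implementation described in
Chapter 4.

/** Illustrative mapping from one annotated token to the four information unit definitions. */
public class InformationUnitDefinitions {

    // definition 1: the word itself
    static String word(String surface) {
        return surface;
    }

    // definition 2: a case-neutral lemmatized word
    static String lemma(String lemma) {
        return lemma.toLowerCase();
    }

    // definition 3: case-neutral lemma combined with the Penn Treebank tag
    static String lemmaWithPos(String lemma, String posTag) {
        return posTag + "_" + lemma.toLowerCase();
    }

    // definition 4: case-neutral lemma combined with a generalized tag,
    // restricted to nouns, verbs, and adjectives (null means the token is discarded)
    static String lemmaWithGeneralizedPos(String lemma, String posTag) {
        String generalized = posTag.startsWith("NN") ? "N"
                           : posTag.startsWith("VB") ? "V"
                           : posTag.startsWith("JJ") ? "ADJ"
                           : null;
        return generalized == null ? null : generalized + "_" + lemma.toLowerCase();
    }
}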
In order to compare the two counting methods in the context of an applied task, I
planned to take advantage of an existing evaluation framework, data, and well-defined
shared task for summarization. I based the design of my system on
the guided summarization task guidelines of TAC 20114 and participated in the 2011
TAC shared task.
3.3 TAC 2011 Guided Summarization Task
The TAC 2011 Guided Summarization task was to create a one-hundred-word summary
of ten newswire documents and a subsequent one-hundred-word update summary of ten
additional newswire documents. The newswire documents were pre-selected by NIST
assessors and assigned to topic collections. Forty-four topic collections of two document
sets of ten relevant documents were divided into five pre-determined categories of topic:
(1) Accidents and Natural Disasters, (2) Attacks, (3) Health and Safety, (4) Endangered
Resources, and (5) Investigations and Trials.
The NIST assessors for TAC 2011 defined a set of aspects for each topic category
that are intended to guide participants in the summarization task. For example, the
aspects of the topic category Accidents and Natural Disasters are defined as:
1. WHAT: what happened
2. WHEN: date, time, other temporal placement markers
3. WHERE: physical location
4. WHY: reasons for accident/disaster
4http://www.nist.gov/tac/2011/Summarization/index.html
5. WHO AFFECTED: casualties (death, injury), or individuals otherwise negatively
affected by the accident/disaster
6. DAMAGES: damages caused by the accident/disaster
7. COUNTERMEASURES: countermeasures, rescue efforts, prevention efforts, or other
reactions to the accident/disaster.
For my thesis, I wanted to compare methods of counting units of information only.
For that reason, I did not feel it was necessary to extract novelty information for the
update task or to use aspects to guide summarization; a generic text summarization
system could still be evaluated using TAC model summaries and the ROUGE statistical
evaluation tools. The summaries I intended to generate for both the initial and update
topic-focused document collections were therefore simply generic summaries.
3.3.1 TAC Cycle
The TAC 2011 cycle was similar to previous years: registration, system development,
test data release, submission, evaluation, and workshop.
Table 3.2: Text Analysis Conference 2011 Schedule
Date Milestone
June 3 (2011) Deadline for TAC 2011 track registration
July 1 Release of test data
July 17 Deadline for participants’ submissions
Sept 7 Release of individual evaluated results
Nov 14-15 TAC 2011 Workshop
3.4 Development and Testing Data
I planned to follow the training and testing cycle of the TAC conference, where previous
years’ data are used to develop and train a system in preparation for testing against the
current year’s data when it is released by NIST. In a typical conference cycle, the data
are released and participants are given a short period of days to test their system and
submit their summaries to NIST for evaluation. Evaluation results for all participants
and a comparison between teams are then published in the month following.
New for TAC 2011 was the availability of an alternate version of the source data,
called clean data. The clean data version of the training and testing data was created
by the CLASSY team from the IDA/Center for Computing Sciences and the University
of Maryland, a long-time participant in DUC and TAC Workshops. Their sentence
pre-processing module is mature and very good at identifying correct sentence splits
and removing noise from the LDC AQUAINT and AQUAINT-2 newswire source data
collections. The only caveat for using the clean data format was an eight-day delay
after the official NIST TAC testing data release date. The clean data was released on
July 8, 2011. I chose to use the clean data versions of both the TAC 2010 and TAC
2011 data.
Table 3.3: Text Analysis Conference Data

         TAC       Corpus                      LDC Catalog Number
Train    TAC 2010  LDC AQUAINT-2               LDC2008T25 (a)
Test     TAC 2011  TAC 2010 KBP Source Data    LDC2010E12 (b)

(a) http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2008T25
(b) http://projects.ldc.upenn.edu/kbp/data/
The LDC AQUAINT-2 corpus consists of newswire articles from Agence France Presse,
Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post
News Service, Inc., New York Times, and Xinhua News Agency from the years 2004-
2006. The TAC 2010 KBP Source Data consists of newswire articles from New York
Times, Associated Press, and Xinhua News Agency from the years 2007-2008.
3.5 Evaluation
During development, the summaries generated by the system were to be evaluated
statistically using ROUGE. In the test phase of the TAC cycle, the system-generated
runs of summaries would additionally be evaluated by NIST assessors using manual
measures of readability/fluency, responsiveness, and the Pyramid method. Due to the
prohibitive cost in time and human effort, my plan was to evaluate any subsequent
runs post-TAC solely using ROUGE.
3.6 System Design
I selected the Stanford coreNLP package as the central natural language processing
component for my generic text summarization system. It provides a collection of an-
notations for input text sentences, including delimited word tokens, split sentences,
parts-of-speech, lemmatization, named entities, coreferences and dependencies. Addi-
tional modules pre-process the TAC document sets, prepare sentences for the coreNLP
pipeline, create information units from annotations post-pipeline processing, count in-
formation units for sentences, topics, and the overall corpus, calculate LLR values,
score, rank, and select sentences for summary generation. The following sections de-
scribe each of the steps of the program flow and system design. See Figure 3.1 for a
visual representation of the program flow and module composition of the system de-
sign.
[Figure 3.1: System Program Flow. The figure shows the module composition of the
system: Import from Clean Data; the Stanford CoreNLP pipeline (Tokenization, Sentence
Splitting, POS Tagging, Morphological Analysis, NE Recognition, Syntactic Parsing,
Coref Resolution); Feature Extraction; LLR Calculation; Sentence Selection and Ranking;
and Summary Generation.]
3.6.1 Pre-processing
The clean version of the 2010 and 2011 TAC data are packaged as a collection of XML
files with a one-to-one correspondence with the original topic file from the TAC data
set. The XML files contain tags and attributes that delimit meta-information, article
titles, and a post-sentence split list of sentences categorized with three possible values:
negative one for sentences that are considered to be “noise” in the article and are not
considered legitimate sentences, zero for heading/title sentences, and one for sentences
that are considered valid. The pre-processing module iterates through the TAC clean
data set and creates a new file of filtered sentences, metadata, and heading/titles.
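A minimal sketch of this filtering step is given below; the element and attribute names
("sentence", "category") are illustrative assumptions, since the actual clean data schema is
defined by the CLASSY team.

import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Reads a clean-data XML file and keeps only sentences whose category flag is 1 (valid). */
public class CleanDataFilterSketch {
    static List<String> validSentences(String xmlPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlPath);
        NodeList nodes = doc.getElementsByTagName("sentence");   // hypothetical element name
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element s = (Element) nodes.item(i);
            // -1 = noise, 0 = heading/title, 1 = valid
            if ("1".equals(s.getAttribute("category"))) {        // hypothetical attribute name
                kept.add(s.getTextContent().trim());
            }
        }
        return kept;
    }
}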
3.6.2 Stanford coreNLP Annotation Pipeline
The annotation pipeline module is responsible for creating the appropriate represen-
tation of sentences for input to the Stanford coreNLP pipeline, and for initiating, executing,
and outputting the results of the coreNLP annotators. The annotators in the coreNLP
pipeline are: tokenization, sentence splitting, part-of-speech tagging, lemmatization,
named entity recognition, dependency parsing, and co-reference resolution. The output
of this module is a file containing the stand-off XML representation of the annotations
that result from running the pipeline.
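The sketch below shows how such a pipeline can be configured and run through the coreNLP
API; it is a simplified illustration (file handling and the TAC-specific input format are
omitted), and exact method signatures may differ between coreNLP versions.

import java.io.PrintWriter;
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

/** Configures the annotators used by the system and writes the stand-off XML output. */
public class AnnotationPipelineSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("annotators",
                "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String text = "Bills on ports and immigration were submitted "
                    + "by Senator Brownback, Republican of Kansas.";
        Annotation document = new Annotation(text);
        pipeline.annotate(document);

        // serialize the stand-off XML representation of all annotations
        pipeline.xmlPrint(document, new PrintWriter("annotations.xml"));
    }
}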
3.6.3 Information Unit Extraction
The annotation file produced as the output of the coreNLP module provides the nec-
essary information to create information units for calculating topic signatures. Infor-
mation units are extracted from the annotation file and output in sentence/info unit
pairings as well as file-scoped collections that reflect the count of the information unit
within the overall file, across files in the topic document sets, and across the entire
corpus.
3.6.4 LLR Calculation
The output files from the previous module are used to calculate LLR values for each
information unit in the topic and overall corpus. These values are then output as
topic signature files. The formula for LLR calculation is in a simplified form from the
original in (Dunning, 1993) and was adapted from the formula found in (Piao et al.,
2005).
2\,[(a \log a) + (b \log b) + (c \log c) + (d \log d) + (n \log n)
   - ((a+b)\log(a+b)) - ((a+c)\log(a+c))
   - ((b+d)\log(b+d)) - ((c+d)\log(c+d))] \quad (3.5)
where a is the number of occurrences of the information unit in the topic, b is the num-
ber of occurrences of the information unit in the background corpus, c is the number of
occurrences of all other information units in the topic, d is the number of occurrences
of all other information units in the background corpus, and n is the total number
of all information units in both the topic and background corpus. The results of the
LLR calculations are individual topic signature files with a list of information units
and their LLR values.
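A minimal sketch of Equation 3.5 follows (the counts in the main method are hypothetical);
the x log x terms are guarded so that 0 log 0 is treated as zero.

/** The simplified contingency-table form of LLR in Equation 3.5, with n = a + b + c + d. */
public class ContingencyLlrSketch {

    static double xLogX(double x) {
        return (x > 0) ? x * Math.log(x) : 0.0;
    }

    static double llr(double a, double b, double c, double d) {
        double n = a + b + c + d;
        return 2 * (xLogX(a) + xLogX(b) + xLogX(c) + xLogX(d) + xLogX(n)
                  - xLogX(a + b) - xLogX(a + c)
                  - xLogX(b + d) - xLogX(c + d));
    }

    public static void main(String[] args) {
        // hypothetical counts: the unit occurs 12 times in the topic and 40 times in the
        // background; the topic holds 4,988 other unit occurrences, the background 1,999,960
        System.out.println(llr(12, 40, 4_988, 1_999_960));
    }
}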
3.6.5 Sentence Selection and Ranking
After all LLR calculations have been made for each topic signature, the topic signature
is used to weight and rank sentences. Each sentence in each topic document set is
scored based on the number of topic signature information units it contains and their
cumulative values. Sentences are ranked across all documents in the topic document
set and output in a sorted candidate sentence file.
3.6.6 Summary Generation
For each topic, the highest ranked candidate sentences are selected to form a summary,
filtered by a basic token-based redundancy measure and the maximum length of 100
words. A running total of the number of words in the summary as well as a hash set
of tokens of candidate sentences already chosen are used to filter candidate sentences.
If adding a sentence would exceed the length of 100 words, it is skipped and the next
sentence is considered. The process continues until only an acceptable gap remains
between the size of the summary and the maximum 100 words.
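A minimal sketch of this greedy assembly step is shown below; the 0.7 token-overlap
threshold is an assumption for illustration, since only a basic token-based redundancy
measure is specified above.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Greedily adds ranked sentences, skipping redundant or over-length candidates. */
public class GreedySummarySketch {
    static final int MAX_WORDS = 100;

    static List<String> buildSummary(List<String> rankedSentences) {
        List<String> summary = new ArrayList<>();
        Set<String> seenTokens = new HashSet<>();
        int wordCount = 0;
        for (String sentence : rankedSentences) {
            String[] tokens = sentence.trim().split("\\s+");
            if (wordCount + tokens.length > MAX_WORDS) {
                continue;                       // would exceed 100 words; try the next sentence
            }
            int overlap = 0;
            for (String t : tokens) {
                if (seenTokens.contains(t.toLowerCase())) overlap++;
            }
            if (overlap > 0.7 * tokens.length) {
                continue;                       // too redundant with already selected sentences
            }
            summary.add(sentence);
            wordCount += tokens.length;
            for (String t : tokens) seenTokens.add(t.toLowerCase());
        }
        return summary;
    }
}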
Chapter 4
IMPLEMENTATION
In this chapter I describe the implementation of three generic text summarization sys-
tems I built in the course of my thesis work. The chapter highlights the differences
between the three systems and how they diverge in implementation from the original
system design described in my methodology chapter. I changed the design of my system
and experiments after completing System I and participating in the TAC shared-task
life cycle. System II and System III are very different from System I, and because of
these differences they cannot really be compared with it based on the ROUGE scores of
the summaries they generated. System III represented the final version of my original design
and was used to conduct experiments on the two approaches to counting using the
four definitions of information units. These experiments and results are discussed in
Chapter 5.
4.1 Systems Overview
System I was developed in the period before the release of the TAC 2011 test data and
was used to generate two runs of summaries for official submission to TAC on July
17, 2011. It was further re-factored post-TAC evaluation for additional experiments
as System II. Based on a review of both System I and II, implementation errors and
design flaws were identified and fixed and a third system was built. Because of the
differences between System I and the other two systems, the ROUGE scores of the
summaries it generated cannot really be compared to System II and System III.
All three systems were developed primarily in Java on Linux, using the Stanford
coreNLP package for annotation, Bash scripts for preparing and executing runs, and
Condor for parallelizing jobs based on data segmentation. The version of the Stanford
coreNLP package incorporated into System I was 1.1, released on 2011-06-08. System
II and III incorporated version 1.2 released on 2011-09-18.
The Stanford Parser Annotator in the coreNLP package is version 1.6.9. The PCFG
parser and factored parser are explained in depth in (Klein and Manning, 2003a) and
(Klein and Manning, 2003b). The English Stanford Dependencies representation out-
put by the dependency parser is described in (Marneffe et al., 2006). The Part-Of-
Speech Tagger is version 3.0.4 and is described in (Toutanova et al., 2003). Stanford
Named Entity Recognizer (NER) is version 1.2.2 and is described in (Finkel et al.,
2005). The Stanford Deterministic Coreference Resolution System is described in (Lee
et al., 2011) and (Raghunathan et al., 2010).
4.2 System I
System I was realized as nine individual Java applications. The TAC 2010 and 2011
data was segmented into 92 and 88 topic-based collections, which enabled each appli-
cation to iterate over the TAC data as 92 and 88 parallel jobs on a Condor computing
cluster.
Table 4.1: System I: Application components
App Description
A01 filter clean data files, extract sentences, and output line-delimited sentence files
A02 create topic batch file lists
A03 Stanford coreNLP pipeline(tokenize, ssplit, pos, lemma, ner, parse, dcoref)
A04 Part-of-speech + lemma info unit extraction
A05 Dependency info unit extraction
A06 Build info unit counts
A07 Calculate LLR for info units
A08 Part-of-speech + lemma info unit generate summary
A09 Dependency info unit generate summary
Sentence Pre-processing
The clean data version of TAC 2010 and 2011 data was used for training and testing
data. Pre-processing was reduced to simply selecting the correctly categorized type of
sentence from the clean data representation (negative one and zero are ignored, one
is considered a candidate), harvesting a small amount of metadata for each file and
sentence and then serializing to a line-delimited sentence file for further processing.
No filters are used for document noise or additional sentence validation. App 01 was
used as a standalone application to create the line-delimited sentence file for each
article in each topic article document set (A and B) and App 02 was used to create file
lists for each topic to batch process with Stanford coreNLP package.
Sentence Annotation and Processing with the Stanford CoreNLP Package
A custom Stanford coreNLP pipeline applies individual coreNLP Annotators to each
sentence in each document in the corpus for both initial and update summary collec-
tions. The Annotations created by the pipeline of Annotators are serialized for each
document. The XML version of the Annotations is output by document in order to gen-
erate and access the co-reference information provided by the Stanford Deterministic
Coreference Resolution System across all the sentences in a document.
The CoreNLP options used were:
tokenize, ssplit (sentence splitter), pos (part-of-speech tagger),
lemma (stemmer), ner (Name Entity Recognizer),
parse (dependency parser), dcoref (coreference resolver)
App 03 fulfilled this process in the program flow and was realized as a Bash script that
called the Stanford coreNLP package from the command line providing the topic file
list as an argument and an output directory path for annotation output files.
Information Unit Extraction
App 04, App 05, and App 06 extract information units from the previous application’s
annotation files and create counts of the units for each sentence, topic, and the overall
corpus. These counts are used for LLR calculation and sentence scoring in subsequent
modules. The feature count pairs are serialized to a line-delimited file for further
processing.
The following Penn Treebank style parts-of-speech annotations are selected exclu-
sively by the system for information unit creation. All other labeled tokens are dis-
carded.
CD, FW, IN, JJ, JJS, JJR, NN, NNP, NNS, NNPS, NPS, RB, RBR, RBS, R, SYM, TO,
VBD, VBN, VBG, VBP, VB, VBZ
The following verbs are also filtered:
is, are, were, be, have, could, shall, should, may, might, must, will, would,
go, goes, do, does, use, used, take, make, made, did, been, said, say, know
Named entity types output by default from the Stanford Named Entity Recognizer
include:
PERSON, ORGANIZATION, LOCATION, DATE, MONEY, MISC
Tokens for the parts-of-speech + lemma information units are:
PART-OF-SPEECH_LEMMA, NER-TYPE_LEMMA
Information units based on dependencies include:
DEP_PART-OF-SPEECH_LEMMA, GOV_PART-OF-SPEECH_LEMMA
DEP_NER-TYPE_LEMMA, GOV_NER-TYPE_LEMMA
GOV_PART-OF-SPEECH_LEMMA_RELATION-TYPE_DEP_PART-OF-SPEECH_LEMMA
GOV_NER-TYPE_LEMMA_RELATION-TYPE_DEP_NER-TYPE_LEMMA
GOV_PART-OF-SPEECH_LEMMA_RELATION-TYPE_DEP_NER-TYPE_LEMMA
GOV_NER-TYPE_LEMMA_RELATION-TYPE_DEP_PART-OF-SPEECH_LEMMA
All co-references are disambiguated using document-wide co-reference annotations
created by the Stanford Deterministic Coreference Resolution System, and output in
the canonical form above depending on what entity they originally reference, whether
it is also identified as a named entity and what dependency relationship it participates
in.
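As an illustration, a dependency-based information unit of the form shown above might be
assembled as in the following sketch; the field values, delimiter, and helper name are
illustrative and not the system's actual code.

/** Builds a GOV_..._RELATION_DEP_... style token from annotation fields. */
public class DependencyInfoUnitSketch {
    static String dependencyUnit(String govPosOrNer, String govLemma,
                                 String relation,
                                 String depPosOrNer, String depLemma) {
        return String.join("_",
                govPosOrNer, govLemma.toLowerCase(),
                relation,
                depPosOrNer, depLemma.toLowerCase());
    }

    public static void main(String[] args) {
        // e.g. agent(submitted, Brownback), with Brownback recognized as a PERSON entity
        System.out.println(dependencyUnit("VBN", "submit", "agent", "PERSON", "brownback"));
        // prints: VBN_submit_agent_PERSON_brownback
    }
}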
LLR Calculations
An LLR calculation module iterates over the document sentence information unit/count
files in order to calculate LLR for each term. The LLR values for each sentence
are serialized to a line-delimited file for further processing. App 07 calculates LLR and
outputs topic signature files for each topic.
Sentence Selection and Summary Generation
Sentences are ranked by the aggregate topic signature values of all of a sentence's
information units. Selected sentences are filtered for noise, removing any artifacts that
were not caught by the clean data process but had been observed in the development data.
Sentences of fewer than seven or more than fifty words are automatically discarded, as are
sentences that share more than seventy percent of their information units with already
selected sentences. This filter is a simple brute-force comparison between the sentences'
unit of information collections. Sentence selection continues until a threshold minimum of
ninety-three words or maximum of one hundred words is reached (sentences are excluded
if they are longer than the delta between the selected sentences' word count and the
maximum of one hundred words).
4.2.1 System I Evaluation
Two runs, 8 and 29, generated by System I were submitted to TAC for evaluation. They
were differentiated by the counting methods they employed. Run 8 used the standard
count of information units in the sentences of the foreground and background corpus.
Run 29 used the count of the number of times information units participated in depen-
dency relations generated by the sentences of the foreground and background corpus.
However, the information units defined for both runs were not really comparable. They
were both based conceptually on a case-neutral lemmatized word combined with a re-
stricted part of speech, but were not realized as such in the implementation. For run
8, the Penn Treebank part-of-speech tag was combined with a case-neutral lemma
of the word as the basic unit of information, but entirely different information units
were used for named entities and for the dependency structures in Run 29. These errors
among others were remedied in System II.
In Tables 4.2, 4.4, and 4.5 below, the two TAC baseline runs, 1 and 2, are included.
Baseline 1 was created by using the first 100 words from the most recent newswire
article in the summary document set. Baseline 2 was created with the open source
off-the-shelf summarizer, MEAD1.
Table 4.2: System I Evaluation
RUN RGE-1 RGE-2 RGE-SU4 RUN RGE-1 RGE-2 RGE-SU4
A 8 0.3104 0.0591 0.0970 B 8 0.3079 0.0554 0.0970
A 29 0.3328 0.0749 0.1122 B 29 0.3196 0.0687 0.1054
A 1 0.3184 0.0673 0.1046 B 1 0.3054 0.0590 0.0983
A 2 0.3597 0.0964 0.1309 B 2 0.3207 0.0666 0.1035
4.3 System II
The second version of the generic text summarization system was re-factored into a
single Java application with a configuration file and command line arguments that ex-
ecute different component functionality. The new version of the system integrated the
Stanford coreNLP package from within the Java application through coreNLP Applica-
tion Programming Interfaces (API)s and reused methods that were redundant across
applications in the original collection of applications. All input and output files were
represented in XML, which made file readers and writers standard for all data and
intermediary files and made possible combination files that included the original TAC
clean data XML and the annotation tags output from the Stanford coreNLP pipeline.
System II takes fifteen different sets of command line arguments to enable fifteen in-
dependent modules of functionality. All fifteen stages of the summarization process
are executed using Bash scripts with topic-scoped arguments enabling parallelization
across the topic document sets on a Condor computing cluster.
1http://www.summarization.com/mead/
Significant changes were made to the way information units were represented in
System II, which made the two runs it generated more comparable.
Table 4.3: System II: Application Components
App Description
A01 convert clean data to sentences
A02 convert sentences to annotations
A03 combine sentence and annotation files
A04 create info units
A05 calculate topic info unit totals
A06 calculate corpus info unit totals
A07 calculate LLR topic signatures
A08 create dependency info units
A09 calculate dependency topic info unit totals
A10 calculate dependency corpus info unit totals
A11 calculate dependency LLR topic signatures
A12 calculate sentence LLR scores
A13 calculate sentence dependency LLR scores
A14 create summaries
A15 create dependency summaries
Sentence Pre-processing
System II differs from System I at the sentence pre-processing stage by outputting an
XML file that includes additional meta-information, a heading sentence, and a raw text
line delimited tag for calling the Stanford coreNLP package API with a line-delimited
multi-sentence input String argument, enabling co-reference resolution across sen-
tences. A drawback of System I was its separation of the original sentence files from
the annotation XML. Rules for re-applying whitespace to punctuation and for
reconstructing the original sentence were required in System I. The new version of
the system represents the filtered TAC clean data information to be combined with the
annotation file by merging the XML representations in a subsequent stage.
Sentence Annotation and Processing with the Stanford CoreNLP Package
System II uses a new version of the coreNLP package (version 1.2 – released 2011-
09-14) 2 and no longer calls the package on the command line. The coreNLP package
is integrated into the Java application itself and uses the coreNLP API for initiating,
executing, and outputting an annotation pipeline.
Information Unit Extraction
System II further restricts parts-of-speech and normalizes the Penn Treebank tags to:
N for all nouns, V for all verbs, and ADJ for all adjectives.
FW, JJ, JJS, JJR, NN, NNP, NNS, NNPS, NPS, VBD, VBN, VBG, VBP, VB, VBZ
Verbs are no longer explicitly filtered. The LLR topic signature will de-emphasize
verbs that occur across the corpus. Named entity types are not output by the system
as explicit named entity tokens. System II no longer counts named entities in its
calculations. Tokens for the parts-of-speech + lemma information units are:
PART-OF-SPEECH_LEMMA
Information units based on dependencies are no longer explicit but are normalized to
part-of-speech + lemma. The token is counted for each dependency relation it partici-
pates in.
A bug in the new version of the Stanford coreNLP package occasionally gives an
unreachable index for representative mentions. Coreference resolution was disabled
for System II.
2http://nlp.stanford.edu/software/corenlp.shtml
LLR Calculations
The LLR calculation was changed in System II and now implements the classic formula
from (Dunning, 1993).
-2\log\lambda = 2\,[\log L(p_1, k_1, n_1) + \log L(p_2, k_2, n_2) - \log L(p, k_1, n_1) - \log L(p, k_2, n_2)] \quad (4.1)
which correlates to the chi-squared statistic. The chi-squared distribution model can
be used to establish statistical thresholds for determining topic signatures. In (Lin and
Hovy, 2000) the cut-off weight for −2 log λ was set at 10.83 with confidence level α =
0.001 (chi-squared lookup). System II uses this version of LLR with cut-off (LLR-C) to
determine which information units are descriptive and should be included in the topic
signature.
Sentence Selection and Summary Generation
In System I, sentences were ranked based on the aggregate LLR values of the informa-
tion units they contain. In System II, LLR-C is used and sentence information units
are simply given a value of 1 if they are part of the topic signature or 0 if they are not.
Sentences are then ranked by their cumulative score of equally valued topic signature
information units.
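A minimal sketch of this scoring rule follows; the data structures and names are
illustrative, not System II's code.

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Builds the LLR-C topic signature with the 10.83 cut-off and scores sentences against it. */
public class CutoffScorerSketch {
    static final double CUTOFF = 10.83;   // chi-squared value for alpha = 0.001

    static Set<String> topicSignature(Map<String, Double> llrValues) {
        Set<String> signature = new HashSet<>();
        for (Map.Entry<String, Double> e : llrValues.entrySet()) {
            if (e.getValue() > CUTOFF) {
                signature.add(e.getKey());
            }
        }
        return signature;
    }

    static int score(List<String> sentenceUnits, Set<String> signature) {
        int score = 0;
        for (String unit : sentenceUnits) {
            if (signature.contains(unit)) {
                score++;                   // each topic-signature unit contributes 1, others 0
            }
        }
        return score;
    }
}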
4.3.1 System II Evaluation
The following were the ROUGE results for the two sets of summaries generated by
System II, runs 75 and 76, against the TAC 2011 corpus. Runs 75 and 76 were defined
by the same counting strategies and definition of unit of information as System I runs
8 and 29.
Table 4.4: System II evaluation
RUN RGE-1 RGE-2 RGE-SU4 RUN RGE-1 RGE-2 RGE-SU4
A 75 0.31109 0.06944 0.10810 B 75 0.27632 0.04991 0.08623
A 76 0.28334 0.05263 0.09151 B 76 0.26238 0.04184 0.07853
A 1 0.3184 0.0673 0.1046 B 1 0.3054 0.0590 0.0983
A 2 0.3597 0.0964 0.1309 B 2 0.3207 0.0666 0.1035
4.4 System III
System II was intended to simplify the generic text summarization system and provide
a consistent representation of information units used in both counting approaches so
they could be truly compared. System III has the same program flow and module
composition as System II. The sections below describe the implementation differences
between the two systems.
LLR Algorithm
Minor errors were discovered in the first two versions of the LLR algorithm used in Sys-
tem I and System II. The new version of the LLR algorithm in System III reversed
the System II design and returned to a raw aggregate LLR-C number for topic signa-
ture inclusion rather than the LLR-CQ approach of System II. It also incorporated a
smoothing +1 count to overcome 0 background corpus counts for information units that
only existed within the foreground corpus. All calculations were reduced to natural log
(ln) additions and subtractions to improve performance.
Sentence Selection
The redundancy measure in System III was changed to use only information units,
not surface string/token comparisons. Basic sentence filtering was adjusted to a mini-
mum token count of 10, a maximum of 100, and an acceptable delta of 10 (summaries can
be between 90 and 100 tokens).
Sentence Cleaning
In System III sentences are filtered for location + UTC slug lines that only occur in
TAC 2011 data.
4.4.1 System III Evaluation
The following were the ROUGE results for Runs 77 and 78 against the TAC 2011 corpus.
Runs 77 and 78 were defined by the same counting strategies and definitions of unit of
information as System I runs 8 and 29.
Table 4.5: System III evaluation
RUN RGE-1 RGE-2 RGE-SU4 RUN RGE-1 RGE-2 RGE-SU4
A 77 0.34924 0.09224 0.12629 B 77 0.31139 0.06056 0.09846
A 78 0.33460 0.07910 0.11352 B 78 0.30408 0.05452 0.09479
A 1 0.3184 0.0673 0.1046 B 1 0.3054 0.0590 0.0983
A 2 0.3597 0.0964 0.1309 B 2 0.3207 0.0666 0.1035
4.5 Comparison of Systems
The following tables compare System II and System III average F-measure ROUGE
results for the two counting approaches based on a common definition of information
unit.
Table 4.6: System Comparison: TAC 2010 Summary A/B ROUGE average F-measure
System 1 2 SU4 System 1 2 SU4
II-A 0.31109 0.06944 0.10810 II-B 0.27632 0.04991 0.08623
III-A 0.34924 0.09224 0.12629 III-B 0.31139 0.06056 0.09846
Table 4.7: System Comparison: TAC 2011 Summary A/B ROUGE average F-measure
System 1 2 SU4 System 1 2 SU4
II-A 0.28334 0.05263 0.09151 II-B 0.26238 0.04184 0.07853
III-A 0.33460 0.07910 0.11352 III-B 0.30408 0.05452 0.09479
4.6 Conclusion
Based on the system comparison, System III outperforms System II in all four topic
collections. System III was selected to run experiments to compare the two counting
methods and four definitions of information units. Experiments and results are dis-
cussed in the next chapter.
Chapter 5
EXPERIMENTS AND RESULTS
In this chapter I describe the design of the experiments I implemented for compar-
ing the two methods of counting I contrasted in Chapter 3 and discuss their results. In
Chapter 3, I proposed an alternate method for calculating LLR using the count of an in-
formation unit’s participation in dependency relations generated from sentences in the
foreground and background corpus. The new method differs from the standard method
of simply counting an information unit’s number of occurrences in the sentences of a
foreground and background corpus. Table 3.1 in Chapter 3 contrasts the probabilities
of a sentence’s information units based on the two different methods of counting. The
different probabilities for a simple sentence suggest that a count used to calculate LLR
based on dependency relations may boost the count of important information units and
contribute to a more descriptive topic signature.
To contrast the two methods of counting, I designed a series of experiments using
the Guided Summarization Task guidelines, data, and evaluation framework for TAC
2010 and TAC 2011. The goal of the experiments was to test the hypothesis within the
context of an established task, data, and evaluation framework.
In the first applications of LLR in text summarization, described in (Hovy and Lin,
1999) and (Lin and Hovy, 2000), information units were defined simply as words. I in-
cluded the definition of an information unit as a word as a baseline in my experiments
and included three other definitions of information unit, increasing in abstraction and
restriction. If the results of the more restricted definitions of units of information were
the same or better than words themselves, the restricted method would be preferred
due to the reduced number of information units that are counted and used to calculate
LLR. System performance would improve because the amount of memory and number
of calculations would be reduced.
The four different definitions of information unit used to compare the two counting
methods were:
1. a word
2. a case-neutral lemmatized word
3. a case-neutral lemmatized word combined with a part-of-speech tag
4. a case-neutral lemmatized word combined with a generalized part-of-speech tag
restricted to nouns, verbs, and adjectives.
5.1 Experiment Design
The following table describes the type of counting method and unit of information
definition used to produce a collection of summaries for TAC 2010 and TAC 2011 data.
The standard count label refers to the standard method of counting information units
in a topic by their occurrence in the sentences of the documents in the foreground
and background corpus. The dependency count label refers to the proposed alternate
method of counting information units in a topic by their participation in the sentence-
based dependency relations of the documents in the foreground and background corpus.
Table 5.1: Description of Experiments
RUN ID Type of Count Definition of Unit of Information
ID 77 standard case-neutral lemmatized word combined with part-of-speech
restricted to nouns, verbs, and adjectives
ID 78 dependency case-neutral lemmatized word combined with part-of-speech
restricted to nouns, verbs, and adjectives
ID 79 standard word
ID 80 dependency word
ID 81 standard case-neutral lemmatized word
ID 82 dependency case-neutral lemmatized word
ID 83 standard case-neutral lemmatized word combined with part-of-speech
ID 84 dependency case-neutral lemmatized word combined with part-of-speech
Included in the results tables are the two baseline runs created by TAC NIST assessors
for TAC 2010 and 2011. ID 1 Baseline was created by using the first 100 words from the
most recent newswire article in the summary document set. ID 2 Baseline was created
with the open source off-the-shelf summarizer, MEAD1.
5.2 TAC 2010 ROUGE Average F-measure Results
The row highlighted in bold represents the best performing counting method and unit
of information definition.
Table 5.2: Experiment Results: TAC 2010 Summary A ROUGE Average F-measures
RUN ID 1 2 SU4
ID 77 0.31877 0.06858 0.10403
ID 78 0.31444 0.06540 0.10002
ID 79 0.31537 0.06827 0.10367
ID 80 0.31814 0.06895 0.10348
ID 81 0.32644 0.06960 0.10542
ID 82 0.30948 0.06343 0.09893
ID 83 0.31770 0.06809 0.10409
ID 84 0.31491 0.06321 0.10076
ID 1 0.29531 0.05651 0.09029
ID 2 0.29861 0.06077 0.09361
The best performing run for topic A summaries in the TAC 2010 data collection was run
81. This run used a standard method for counting and defined its unit of information
as a case-neutral lemmatized word. Examples of summaries and topic signatures for
the best and worst performing runs against individual TAC 2010 Summary A topics are
featured in Section A.1.1 of Appendix A.
The row highlighted in bold represents the best performing counting method and
unit of information definition.
1http://www.summarization.com/mead/
Table 5.3: Experiment Results: TAC 2010 Summary B ROUGE Average F-measures
RUN ID 1 2 SU4
ID 77 0.30453 0.05869 0.09600
ID 78 0.29659 0.05326 0.09098
ID 79 0.30185 0.05895 0.09423
ID 80 0.30025 0.05599 0.09387
ID 81 0.30052 0.05880 0.09469
ID 82 0.29690 0.05348 0.09152
ID 83 0.29844 0.05706 0.09317
ID 84 0.29657 0.05176 0.09079
ID 1 0.29087 0.05634 0.09273
ID 2 0.30331 0.06443 0.09923
The best performing run for topic B summaries in the TAC 2010 data collection was run
77. This run used a standard method for counting and defined its unit of information
as a case-neutral lemmatized word combined with part-of-speech restricted to nouns,
verbs, and adjectives. Examples of summaries and topic signatures for the best and
worst performing runs against individual TAC 2010 Summary B topics are featured in
Section A.1.2 of Appendix A.
5.3 TAC 2011 ROUGE Average F-measure Results
The row highlighted in bold represents the best performing counting method and unit
of information definition.
Table 5.4: Experiment Results: TAC 2011 Summary A ROUGE Average F-measures
RUN ID 1 2 SU4
ID 77 0.34924 0.09224 0.12629
ID 78 0.33460 0.07910 0.11352
ID 79 0.35633 0.09160 0.12797
ID 80 0.34098 0.08069 0.11740
ID 81 0.33841 0.08550 0.11957
ID 82 0.33945 0.07870 0.11556
ID 83 0.35215 0.08878 0.12534
ID 84 0.33564 0.08151 0.11683
ID 1 0.3184 0.0673 0.1046
ID 2 0.3597 0.0964 0.1309
The best performing run for topic A in the TAC 2011 data collection was run 79. This
run used a standard method for counting and defined its unit of information as a word.
Examples of summaries and topic signatures for the best and worst performing runs
against individual TAC 2011 Summary A topics are featured in Section A.1.3 of Ap-
pendix A.
The row highlighted in bold represents the best performing counting method and
unit of information definition.
Table 5.5: Experiment Results: TAC 2011 Summary B ROUGE Average F-measures
RUN ID 1 2 SU4
ID 77 0.31139 0.06056 0.09846
ID 78 0.30408 0.05452 0.09479
ID 79 0.30617 0.05993 0.09788
ID 80 0.30223 0.05859 0.09659
ID 81 0.30961 0.06417 0.10081
ID 82 0.30736 0.06066 0.09882
ID 83 0.30476 0.05841 0.09702
ID 84 0.30843 0.05779 0.09645
ID 1 0.3054 0.0590 0.0983
ID 2 0.3207 0.0666 0.1035
The best performing run for topic B in the TAC 2011 data collection was run 77. This
run used a standard method for counting and defined its unit of information as a case-
neutral lemmatized word combined with part-of-speech restricted to nouns, verbs, and
adjectives. Examples of summaries and topic signatures for the best and worst per-
forming runs against individual TAC 2011 Summary B topics is features in Section
A.1.4 of Appendix A.
Chapter 6
CONCLUSION AND FUTURE WORK
In this chapter I discuss conclusions about the comparison of the standard method for
counting information units for LLR calculations and my proposed alternate method
for counting information units for LLR calculations based on dependency relations. I
derive my comparisons from the results of experiments described in Chapter 5. I also
critique the overall design of my experiments and discuss how they can be improved
in a future work section.
6.1 Conclusion
In Chapter 3, I proposed an alternate counting method for calculating LLR for topic
signatures. Instead of the standard count of the number of times an information unit
occurs within the sentences of a foreground and background corpus, my alternative
method counted the number of times an information unit participated in either depen-
dent or governor roles in the dependency relations generated from sentences in the
foreground and background corpus. In the experiments I designed, I compared the
two methods using data from the TAC 2010 and TAC 2011 Guided Summarization
Tasks. The two methods were combined with four definitions of information unit and
contrasted by the results of a ROUGE evaluation of their n-gram overlap with human
generated model summaries. Table 5.1 in Chapter 5 describes the individual exper-
iments that were run on TAC data. Tables 5.2, 5.3, 5.4, and 5.5 list the ROUGE-1,
ROUGE-2, and ROUGE-SU4 average F-measure results of the evaluation of the sum-
maries generated by System III on TAC 2010 and 2011 data.
The results of experiments listed in Chapter 5 indicate minimal differences between
the two counting methods and are inconclusive regarding which of the unit of infor-
mation definitions was most effective in generating summaries. The best-performing
runs all used the standard method for counting information units; however, the dif-
ference between these runs and all other runs was small. For example, in Table
5.5, the best run, 77, which used a standard approach to counting, had an average
F-measure ROUGE-1 score of 0.31139, ROUGE-2 score of 0.06056, and ROUGE-SU4
score of 0.09846. The run that used the same unit of information definition but the
proposed dependency-based counting method, run 78, had an average F-measure ROUGE-1
score of 0.30408, ROUGE-2 score of 0.05452, and ROUGE-SU4 score of 0.09479. All
of the average F-measure experiments tabulated in Chapter 5 have similar results. The
standard approach to counting outperforms the dependency-based approach, but the
difference between them is small.
In two of the four series of experiments, whose results are listed in Table 5.3 and
Table 5.5, the best performing run relied on the definition of unit of information as a case-
neutral lemmatized word combined with part-of-speech restricted to nouns, verbs, and
adjectives. In the other two, Table 5.2 and Table 5.4, the best performing runs used a word
and a case-neutral lemmatized word, respectively. Again, the differences between the
results of the experiments across all the definitions of unit of information were minimal.
In Appendix A, the best and worst performing runs against individual topics are
listed for each of the four series of experiments, including the summary and topic sig-
nature they produced. In some cases, like the best performing runs for topic D1024F-A,
multiple runs have the same average F-measure, the exact same summary, but slightly
different topic signatures. For each of the best and worst performing runs for a topic,
all of the other runs are listed in a table, for example see Table A.23, to compare their
scores. These tables demonstrate that many runs, which have different topic signa-
tures, actually produce very similar if not exactly the same summary. Given the lim-
ited size of the final summary, one hundred words, and the small number of sentences
within each corpus, different topic signatures may end up selecting the same or similar
sentences due to the lack of variety in the corpus. A larger summary, like the 250-word
summaries used in previous DUC shared tasks, may be a better measure of counting
strategy and definition of unit of information.
6.2 Future Work
Extractive summarization systems weight and rank sentences in single or multiple
documents and then extract the best candidate sentences to form a summary. Al-
though most extractive systems employ some heuristics to ‘smooth’, ‘prune’, or ‘edit’
the extracted sentences and assembled summary, most of the words and phrases in
the summary originate directly from the original text. A much more difficult problem
is that of automatic abstractive summarization, where the machine needs to abstract
over the original information in single or multiple documents and generate new sen-
tences using words or phrases that may never have been in the original texts. The
field of automatic abstractive summarization is much less mature and there are very
few systems that have been developed to solve this problem. The abstractive approach
requires research into semantic representation, inference, and natural language gen-
eration. The majority of summarization systems have instead chosen to focus on ex-
tractive data-driven approaches (Erkan and Radev, 2004).
The topic signatures generated by the two different counting methods in this the-
sis were different from each other, but did not necessarily result in different 100-word
summaries when they were applied in System III experiments. One of the constraints
of the current system is that, although dependency relations are used to calculate
LLR topic signatures, the LLR values are still used to extract complete sentences
from a fairly small collection of topic documents. A future avenue of research might be
to extract specific dependency relations based on the topic signatures in an expanded
foreground corpus and generate sentences from the dependency relations themselves.
A semi-abstractive approach featured in the TAC 2010 Workshop (Genest and La-
palme, 2010) employed dependency relation tuples and additional linguistic informa-
tion from a shallow NLP pipeline as input to a natural language generation tool.
A promising direction towards an even more abstractive approach could be the re-
alization of a semi-abstractive summarization system integrating shallow and deep
processing. Leveraging the output of a deep processing infrastructure either in par-
allel or within a shallow NLP pipeline offers the opportunity to integrate multiple
facets of linguistic information. A possible initial step would be to use Minimal Recursion
Semantics (MRS) structures rather than surface strings to represent both the query and
the corpus, and to apply a frequentist approach to extractive sentence selection, like an
LLR topic-signature approach, with features based on this semantic representation. Given
the limitations of deep processing parsers with respect to ill-formed or ungrammatical
sentences, the integration of a shallow NLP pipeline would allow a fall back to a shallow
representation for complete coverage. Finally, MRS-based deep processing generation
tools could be used to create the final summary from the assembled MRS structures.
This novel approach is supported
by earlier work on using MRS-based structures in Question Answering (Dridan,
2006) and semantic search for scientific articles (Schafer et al., 2011).
BIBLIOGRAPHY
Baldwin, B. and Ross, A. (2001). Baldwin language technology’s DUC summarization
system. In Proceedings of Document Understanding Conference.
Barzilay, R. and Lapata, M. (2008). Modeling local coherence: An entity-based ap-
proach. Computational Linguistics, 34(1):1–34.
Blake, C., Kampov, J., Orphanides, A. K., West, D., and Lown, C. (2007). UNC-CH at
DUC 2007: Query expansion, lexical simplification and sentence selection strategies
for Multi-Document summarization. In Proceedings of Document Understanding
Conference.
Boros, E., Kantor, P. B., and Neu, D. J. (2001). A clustering based approach to creating
multi-document summaries. In Proceedings of Document Understanding Conference.
Bosma, W. (2005). Query-based summarization using rhetorical structure theory. In
15th Meeting of CLIN, LOT, Leiden, pages 29–44.
Brunn, M., Chali, Y., and Pinchak, C. J. (2001). Text summarization using lexical
chains. In Proceedings of Document Understanding Conference.
Carenini, G. and Cheung, J. C. K. (2008). Extractive vs. NLG-based abstractive sum-
marization of evaluative text: The effect of corpus controversiality. In Proceedings of
the Fifth International Natural Language Generation Conference, pages 33–41.
Conroy, J. M., Schlesinger, J. D., and O’Leary, D. P. (2007). Classy 2007 at DUC 2007.
In Proceedings of Document Understanding Conference.
Conroy, J. M., Schlesinger, J. D., Rankel, P. A., and O'Leary, D. P. (2010). Guiding
CLASSY toward more responsive summaries. In Proceedings of the Text Analysis
Conference.
Copeck, T., Inkpen, D., Kazantseva, A., Kennedy, A., Kipp, D., Nastase, V., and Sz-
pakowicz, S. (2006). Leveraging DUC. In Proceedings of Document Understanding
Conference.
Cunningham, H., Maynard, D., Bontcheva, K., and Tablan, V. (2002). GATE: A Frame-
work and Graphical Development Environment for Robust NLP Tools and Applica-
tions. In Proceedings of the 40th Anniversary Meeting of the Association for Compu-
tational Linguistics (ACL’02).
Dridan, R. (2006). Using minimal recursion semantics in Japanese question answering.
PhD thesis, University of Melbourne Melbourne, Australia.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence.
Computational linguistics, 19(1):61–74.
Edmundson, H. P. (1969). New methods in automatic extracting. J. ACM, 16:264–285.
Erkan, G. and Radev, D. R. (2004). LexRank: graph-based lexical centrality as salience
in text summarization. In Proceedings of Document Understanding Conference.
Favre, B., Bechet, F., Bellot, P., Boudin, F., El-Beze, M., Gillard, L., Lapalme, G., and
Torres-Moreno, J. (2006). The LIA-Thales summarization system at DUC-2006. In
Proceedings of Document Understanding Conference.
Favre, B., Gillard, L., Torres-Moreno, J., Boudin, F., Bechet, F., and El-Beze, M. (2007).
The LIA summarization system at DUC-2007. In Proceedings of Document Under-
standing Conference.
Filatova, E. and Hatzivassiloglou, V. (2004). A formal model for information selection
in multi-sentence text extraction. In Proceedings of the 20th international conference
on Computational Linguistics, COLING ’04, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Filippova, K. (2010). Multi-sentence compression: Finding shortest paths in word
graphs. In Proceedings of the 23rd International Conference on Computational Lin-
guistics, pages 322–330.
Finkel, J., Grenager, T., and Manning, C. (2005). Incorporating non-local information
into information extraction systems by gibbs sampling. In Proceedings of the 43rd
Annual Meeting on Association for Computational Linguistics, pages 363–370.
Fung, P. and Ngai, G. (2006). One story, one flow: Hidden markov story models for mul-
tilingual multidocument summarization. ACM Trans. Speech Lang. Process., 3:1–16.
Galley, M. (2006). A skip-chain conditional random field for ranking meeting utter-
ances by importance. In Proceedings of the 2006 Conference on Empirical Methods in
Natural Language Processing, EMNLP ’06, pages 364–372, Stroudsburg, PA, USA.
Association for Computational Linguistics.
Genest, P. and Lapalme, G. (2010). Text generation for abstractive summarization. In
Proceedings of the Third Text Analysis Conference, Gaithersburg, Maryland, USA.
National Institute of Standards and Technology.
Genest, P., Lapalme, G., and Yousfi-Monod, M. (2009). Hextac: the creation of a manual
extractive run. In Proceedings of the Second Text Analysis Conference, Gaithersburg,
Maryland, USA. National Institute of Standards and Technology.
Gupta, S., Nenkova, A., and Jurafsky, D. (2007). Measuring importance and query
relevance in topic-focused multi-document summarization. In Proceedings of the
45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions,
pages 193–196.
Harabagiu, S., Lacatusu, F., and Hickl, A. (2006). Answering complex questions with
random walk models. In Proceedings of the 29th annual international ACM SIGIR
conference on Research and development in information retrieval, pages 220–227,
Seattle, Washington, USA.
Hennig, L. (2009). Topic-based multi-document summarization with probabilistic la-
tent semantic analysis. In Recent Advances in Natural Language Processing, pages
144–149.
Hickl, A., Roberts, K., and Lacatusu, F. (2007). LCC’s GISTexter at DUC 2007: Ma-
chine reading for update summarization. In Proceedings of Document Understand-
ing Conference.
Hovy, E. and Lin, C.-Y. (1999). Automated text summarization in summarist. In Ad-
vances in Automatic Text Summarization, pages 82–94.
Hovy, E., Lin, C.-Y., and Zhou, L. (2005). Evaluating DUC 2005 using Basic Elements.
In Proceedings of DUC-2005.
Jones, K. S. (1972). A statistical interpretation of term specificity and its application
in retrieval. Journal of Documentation, 28:11–21.
Jones, K. S. (1998). Automatic summarising: Factors and directions. In Advances in
Automatic Text Summarization, pages 1–12. MIT Press.
Jurafsky, D. and Martin, J. H. (2009). Speech and language processing: an introduction
to natural language processing, computational linguistics, and speech recognition.
Pearson Prentice Hall, Upper Saddle River, N.J., 2nd ed edition.
Katragadda, R. and Varma, V. (2009). Query-focused summaries or query-biased sum-
maries? In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages
105–108.
Klein, D. and Manning, C. D. (2003a). Accurate unlexicalized parsing. In Proceedings
of the 41st Annual Meeting on Association for Computational Linguistics - Volume
1, ACL ’03, pages 423–430, Stroudsburg, PA, USA. Association for Computational
Linguistics.
Klein, D. and Manning, C. D. (2003b). Fast exact inference with a factored model for
natural language parsing. In Advances in Neural Information Processing Systems 15
(NIPS), pages 3–10. MIT Press.
Knight, K. and Marcu, D. (2000). Statistics-based summarization - step one: sentence
compression. In Proceedings of AAAI/IAAI, pages 703–710.
Lee, H., Peirsman, Y., Chang, A., Chambers, N., Surdeanu, M., and Jurafsky, D. (2011).
Stanford’s multi-pass sieve coreference resolution system at the CoNLL-2011 shared
task. In Proceedings of the CoNLL-2011 Shared Task.
Lin, C. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceed-
ings of the Workshop on Text Summarization Branches Out (WAS 2004).
Lin, C. and Hovy, E. (2000). The automated acquisition of topic signatures for text
summarization. In Proceedings of the 18th conference on Computational linguistics-
Volume 1, pages 495–501.
Lin, C. and Och, F. J. (2004a). Automatic evaluation of machine translation quality
using longest common subsequence and skip-bigram statistics. In Proceedings of the
42nd Annual Meeting of the Association for Computational Linguistics.
Lin, C. and Och, F. J. (2004b). ORANGE: a method for evaluating automatic evaluation
metrics for machine translation. In Proceedings of the 20th international conference
on Computational Linguistics.
Lin, C.-Y. and Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-
occurrence statistics. In Proceedings of the 2003 Conference of the North American
Chapter of the Association for Computational Linguistics on Human Language Tech-
nology - Volume 1, NAACL ’03, pages 71–78, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Luhn, H. (1958). The automatic creation of literature abstracts. IBM Journal of Re-
search and Development, 2(2):159–165.
Manning, C. D. and Schutze, H. (1999). Foundations of statistical natural language
processing. MIT Press, Cambridge, Mass.
Marneffe, M.-C. de, MacCartney, B., and Manning, C. D. (2006). Generating typed depen-
dency parses from phrase structure parses. In LREC 2006.
McKeown, K. and Radev, D. R. (1995). Generating summaries of multiple news articles.
In Proceedings of the 18th annual international ACM SIGIR conference on Research
and development in information retrieval, SIGIR ’95, pages 74–82, New York, NY,
USA. ACM.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the
ACM, 38:39–41.
Molla, D. and Wan, S. (2006). Macquarie University at DUC 2006: Question answering
for summarisation. In Proceedings of Document Understanding Conference.
Moore, R. C. (2004). On log-likelihood-ratios and the significance of rare events. In
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Pro-
cessing, pages 333–340.
Nenkova, A. and McKeown, K. (2011). Automatic summarization. Foundations and
Trends in Information Retrieval, 5(2-3):103–233.
Nenkova, A. and Vanderwende, L. (2005). The impact of frequency on summarization.
Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005-101.
Paice, C. D. (1990). Constructing literature abstracts by computer: techniques and
prospects. Inf. Process. Manage., 26:171–186.
Passonneau, R. J., Nenkova, A., McKeown, K., and Sigelman, S. (2005). Applying the
Pyramid method in DUC 2005. In Proceedings of the 2005 DUC Workshop.
Piao, S. S., Rayson, P., Archer, D., and McEnery, T. (2005). Comparing and combining
a semantic tagger and a statistical tool for MWE extraction. Comput. Speech Lang.,
19(4):378–397.
Pingali, P., Varma, V., and Katragadda, R. (2007). IIIT hyderabad at DUC 2007. In
Proceedings of Document Understanding Conference.
Radev, D. R., Blair-Goldensohn, S., and Zhang, Z. (2001). Experiments in single and
multi-document summarization using MEAD. In Proceedings of Document Under-
standing Conference.
Raghunathan, K., Lee, H., Rangarajan, S., Chambers, N., Surdeanu, M., Jurafsky, D.,
and Manning, C. (2010). A multi-pass sieve for coreference resolution. In Proceedings
of EMNLP 2010.
Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text re-
trieval. Information Processing and Management, pages 513–523.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic index-
ing. Commun. ACM, 18:613–620.
Schafer, U., Kiefer, B., Spurk, C., Steffen, J., and Wang, R. (2011). The ACL Anthology
Searchbench. In Proceedings of the ACL-HLT 2011 System Demonstrations, pages
7–13.
Yih, W.-t., Goodman, J., Vanderwende, L., and Suzuki, H. (2007). Multi-document
summarization by maximizing informative content-words. In Proceedings of IJCAI-07
(the 20th International Joint Conference on Artificial Intelligence).
Toutanova, K., Brockett, C., Gamon, M., Jagarlamudi, J., Suzuki, H., and Vander-
wende, L. (2007). The PYTHY summarization system: Microsoft Research at DUC
2007. In Proceedings of Document Understanding Conference.
Toutanova, K., Klein, D., Manning, C., and Singer, Y. (2003). Feature-rich part-of-
speech tagging with a cyclic dependency network. In Proceedings of the 2003 Confer-
ence of the North American Chapter of the Association for Computational Linguistics
on Human Language Technology-Volume 1, pages 173–180.
Vanderwende, L., Banko, M., and Menezes, A. (2004). Event-centric summary genera-
tion. In Proceedings of Document Understanding Conference.
Vanderwende, L., Suzuki, H., Brockett, C., and Nenkova, A. (2007). Beyond sumbasic:
Task-focused summarization with sentence simplification and lexical expansion. Inf.
Process. Manage., 43:1606–1618.
Zhou, Q., Sun, L., and Lu, Y. (2006). ISCAS at DUC 2006. In Proceedings of Document
Understanding Conference.
Appendix A
A.1 TAC 2010 and 2011 Experiments: Example Summaries and Topic Signatures
The following sections provide example summaries and topic signatures for the best
and worst performing runs of counting method and unit of information definition ex-
periments on TAC 2010 and 2011 data.
A.1.1 TAC 2010 Summary A Experiments
The following table describes the counting method and unit-of-information definition
used to produce each collection of summaries for the TAC 2010 and TAC 2011 data.
The standard label refers to the standard method of counting information units in
a topic by their occurrence in the sentences of the documents in the foreground and
background corpora. The dependency label refers to the proposed method of counting
information units in a topic by their participation in the dependency relations generated
from the sentences of the documents in the foreground and background corpora; a
sketch of the two schemes follows Table A.1.
Table A.1: Description of Experiments
RUN ID Type of Count Definition of Unit of Information
ID 77 standard case-neutral lemmatized word combined with part-of-speech, restricted to nouns, verbs, and adjectives
ID 78 dependency case-neutral lemmatized word combined with part-of-speech, restricted to nouns, verbs, and adjectives
ID 79 standard word
ID 80 dependency word
ID 81 standard case-neutral lemmatized word
ID 82 dependency case-neutral lemmatized word
ID 83 standard case-neutral lemmatized word combined with part-of-speech
ID 84 dependency case-neutral lemmatized word combined with part-of-speech
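To make the distinction concrete, below is a minimal sketch of the two counting schemes
in Python. It is an illustration under assumed input formats (token lists per sentence,
and (head, relation, dependent) triples per parsed sentence), not the system's actual
implementation.

from collections import Counter

def standard_counts(sentences):
    # Standard counting: one count per occurrence of a unit in a sentence.
    # `sentences` is assumed to be a list of sentences, each a list of
    # information units (e.g., words or lemma+POS strings).
    counts = Counter()
    for units in sentences:
        counts.update(units)
    return counts

def dependency_counts(parsed_sentences):
    # Dependency counting: one count per dependency relation a unit
    # participates in, whether as head or as dependent.
    # `parsed_sentences` is assumed to be a list of sentences, each a list
    # of (head, relation, dependent) triples from a dependency parser.
    counts = Counter()
    for relations in parsed_sentences:
        for head, _relation, dependent in relations:
            counts[head] += 1
            counts[dependent] += 1
    return counts

Under the dependency scheme a unit that occurs once in a sentence can be counted
several times if it participates in several relations, which is one source of the differences
between the topic signatures reported below.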
Best Performing Run for an Individual Topic
The summary with the highest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2010 summary A comparison was produced by runs 79, 80, and
81 for topic D1024F-A.
Table A.2: TAC 2010 summary D1024F-A: best performing unit of information defini-
tion
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 79 0.50782 0.21628 0.22931
ID 80 0.50782 0.21628 0.22931
ID 81 0.50782 0.21628 0.22931
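Throughout this appendix, the best and worst runs for a topic are selected by the
average of the three ROUGE F-measures. A minimal illustration of that ranking, using
the values from Table A.2 (the dictionary and helper below are illustrative wiring, not
the evaluation pipeline):

def mean_rouge(scores):
    # Unweighted mean of the ROUGE-1, ROUGE-2, and ROUGE-SU4 F-measures.
    return sum(scores) / len(scores)

# F-measures for topic D1024F-A, copied from Table A.2.
run_scores = {
    "ID 79": (0.50782, 0.21628, 0.22931),
    "ID 80": (0.50782, 0.21628, 0.22931),
    "ID 81": (0.50782, 0.21628, 0.22931),
}
best_run = max(run_scores, key=lambda run: mean_rouge(run_scores[run]))
# Each of the three tied runs averages to approximately 0.3178.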
All three runs produced the same summary:
Clinton said the missiles hit terrorist camps in Afghanistan run by Osama
bin Laden, the Saudi millionaire blamed by Washington for the Aug. 7
bombings of U.S. embassies in Kenya and Tanzania, and a factory
linked to bin Laden in Sudan. The letter, sent to council president Danilo
Turk of Slovenia, was intended to lodge Sudan’s formal complaint that
Thursday’s U.S. airstrikes on a Khartoum pharmaceutical factory were a
breach of the U.N. charter and a violation of its sovereignty. U.S. officials
said the factory in Sudan made chemical weapons agents; Sudan
maintains it’s a pharmaceutical plant.
The following tables feature the top 20 information units that make up the topic sig-
natures for runs 79, 80, and 81 for TAC 2010 summary D1024F-A.
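For reference, the LLR scores in the topic-signature tables are log likelihood ratio values
computed from the foreground (topic) and background counts. A minimal sketch of the
score for a single information unit, assuming the standard Dunning-style binomial
formulation (which mirrors the usual topic-signature computation, not necessarily the
exact implementation used here):

import math

def _binom_loglik(k, n, p):
    # Binomial log-likelihood; clamp p away from 0 and 1 to avoid log(0).
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def llr_score(k_fg, n_fg, k_bg, n_bg):
    # -2 log lambda for one information unit.
    # k_fg, n_fg: unit count and total unit count in the foreground corpus.
    # k_bg, n_bg: unit count and total unit count in the background corpus.
    # The counts can come from either counting scheme sketched earlier.
    p_fg = k_fg / n_fg
    p_bg = k_bg / n_bg
    p_all = (k_fg + k_bg) / (n_fg + n_bg)
    return 2.0 * (_binom_loglik(k_fg, n_fg, p_fg)
                  + _binom_loglik(k_bg, n_bg, p_bg)
                  - _binom_loglik(k_fg, n_fg, p_all)
                  - _binom_loglik(k_bg, n_bg, p_all))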
Table A.3: TAC 2010 summary D1024F-A: run 79 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
factory 518.74282 attack 112.17554
Sudan 302.78869 missiles 108.96030
Sudanese 250.42743 weapons 94.77731
U 223.23142 chemical 93.00228
Laden 188.38143 embassies 92.26304
Khartoum 179.53285 plant 88.46446
S 140.13310 strikes 83.79101
bin 128.42394 Kenya 83.79101
missile 121.74675 Osama 82.65559
Clinton 115.95864 American 81.28940
Table A.4: TAC 2010 summary D1024F-A: run 80 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
factory 1179.81867 strikes 209.62639
Laden 524.66512 divert 206.57105
U 433.34317 S 200.20465
embassies 308.50154 Kenya 188.08101
Sudanese 301.38225 bombings 186.00437
Sudan 283.55441 weapons 182.76680
plant 282.03231 Article 181.98718
attack 279.93579 Clinton 181.90102
Lewinsky 249.16919 self-defense 162.23661
missiles 226.77123 Khartoum 153.54951
Table A.5: TAC 2010 summary D1024F-A: run 81 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
FACTORY 494.94632 S 139.67224
SUDAN 302.78869 PHARMACEUTICAL 130.97561
SUDANESE 250.42743 EL-BASHIR 126.55758
MISSILE 230.58770 CLINTON 115.46943
U 223.23142 STRIKE 112.66561
LADEN 182.86320 WEAPON 87.14052
KHARTOUM 179.53285 KENYA 83.79101
BIN 167.62697 OSAMA 82.65559
EMBASSY 154.88831 AMERICAN 79.55654
ATTACK 148.31698 AFGHANISTAN 78.56720
Table A.6: Comparison of results for all TAC 2010 D1024F-A Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.46626 0.17844 0.20916
ID 78 0.46398 0.17016 0.21354
ID 79 0.50782 0.21628 0.22931
ID 80 0.50782 0.21628 0.22931
ID 81 0.50782 0.21628 0.22931
ID 82 0.46642 0.17509 0.22399
ID 83 0.46659 0.19632 0.21830
ID 84 0.46398 0.17016 0.21354
Worst Performing Run for an Individual Topic
The summary with the lowest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2010 summary A comparison was produced by run 82 for topic
D1023E-A.
Table A.7: TAC 2010 summary D1023E-A: worst performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 82 0.14962 0.01511 0.04313
The summary for D1023E-A run 82:
Grant Hill, Tim Duncan, Kevin Garnett, Gary Payton, Tim Hardaway, Steve
Smith, Tom Gugliotta, Allan Houston and Vin Baker have been chosen
as the first nine members of the 2000 U.S. Olympics team, The Associated
Press learned today. Grant Hill, Tim Duncan, Kevin Garnett, Gary Payton,
Tim Hardaway, Steve Smith, Tom Gugliotta, Allan Houston and Vin Baker
have been chosen as the first nine members of the 2000 U.S. Olympics
team, The Associated Press learned today. Both avalanches rushed down
the Alps to the Galtuer resort nestling in the Paznauntal valley after
4 p.m. local time (1500 GMT).
The following table features the top 20 information units that make up the topic sig-
nature for run 82 for TAC 2010 summary D1023E-A.
Table A.8: TAC 2010 summary D1023E-A: run 82 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
AVALANCHE 699.09858 ALBANIAN 141.72979
SNOW 574.26471 DIGGING 137.07591
RESORT 210.21897 AUSTRIAN 136.61398
GALTUER 186.29164 TODAY 132.10176
CLINTON 166.62981 CHALET 126.45269
HILL 161.08630 BURY 115.30729
KOSOVO 159.53228 RESCUER 114.60409
THUNDER 154.19949 FLY 113.97014
SNOWSLIDE 151.25360 TREATY 111.27681
AUSTRIA 142.63933 SCHOENHERR 109.95527
Table A.9: Comparison of results for all TAC 2010 D1023E-A Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.19647 0.01272 0.05177
ID 78 0.16129 0.00501 0.03653
ID 79 0.19395 0.02799 0.05694
ID 80 0.20205 0.00775 0.05390
ID 81 0.20611 0.01285 0.05493
ID 82 0.14962 0.01511 0.04313
ID 83 0.12082 0.00260 0.03128
ID 84 0.20205 0.00775 0.05390
A.1.2 TAC 2010 Summary B Experiments
Best Performing Run for an Individual Topic
The summary with the highest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2010 summary B comparison was produced by run 77 for topic
D1002A-B.
Table A.10: TAC 2010 summary D1002A-B: best performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.43678 0.17807 0.20875
The summary for D1002A-B run 77:
An appellate court ordered the trial of the four officers accused of killing
Amadou Diallo to be moved to Albany County, ruling on Thursday that a fair
trial would be impossible in the Bronx because of ‘‘the public clamor’’
about the case. The decision by a state appellate court to move the
criminal trial of four New York City police officers charged with the
killing of Amadou Diallo to Albany County seems unjustified. Jury
selection is scheduled to begin Jan. 31 in the trial of the four police
officers charged with killing Amadou Diallo, an unarmed West African
immigrant.
The following table features the top 20 information units that make up the topic sig-
nature for run 77 for TAC 2010 summary D1002A-B.
Table A.11: TAC 2010 summary D1002A-B: run 77 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
N DIALLO 673.04948 N COURT 106.02233
N OFFICER 567.29590 ADJ FAIR 100.11662
N ALBANY 365.32632 N AMADOU 99.73738
N BRONX 360.79015 V FIRE 88.25419
N TRIAL 291.21263 N YORK 86.79814
N SHOOTING 159.36965 ADJ APPELLATE 75.56220
N POLICE 139.78544 ADJ UNARMED 75.56213
N SHARPTON 132.12847 N LAWYER 66.05793
N SHOT 108.54161 N BULLET 64.49419
N CARROLL 107.82134 N JUSTICE 62.92783
Table A.12: Comparison of results for all TAC 2010 D1002A-B Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.43678 0.17807 0.20875
ID 78 0.32241 0.07682 0.10517
ID 79 0.41721 0.15810 0.18686
ID 80 0.41721 0.15810 0.18686
ID 81 0.42768 0.19314 0.20896
ID 82 0.41721 0.15810 0.18686
ID 83 0.42768 0.19314 0.20896
ID 84 0.41721 0.15810 0.18686
Worst Performing Run for an Individual Topic
The summary with the lowest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2010 summary B comparison was produced by run 78 for topic
D1030F-B.
Table A.13: TAC 2010 summary D1030F-B: worst performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 78 0.13486 0.00000 0.02921
The summary for D1030F-B run 78:
These parents’ stories echo those of thousands of others who have recently
discovered age-old folk remedies, often with the recommendation of family
doctors who are adding herbal remedies _ for example, echinacea to stave
off colds and flu, chamomile or lavender to treat colic, calendula to soothe
diaper rash and ginger root to quell queasy little stomachs _ to their
disease-fighting arsenal. _ chamomile tea to calm frazzled nerves and
relieve stomach cramps _ ginger root, grated and simmered in water, to
prevent nausea from a bout of stomach flu or motion sickness and to help
children fall asleep.
The following table features the top 20 information units that make up the topic sig-
nature for run 78 for TAC 2010 summary D1030F-B.
Table A.14: TAC 2010 summary D1030F-B: run 78 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
N EPHEDRON 552.52603 N EFFECT 130.80751
N HERB 441.21402 N ECHINACEA 129.28822
ADJ HERBAL 305.48807 N STIMULANT 126.12133
N SUPPLEMENT 263.11803 V REGULATE 124.64983
N BELT 224.22809 N MEDICINE 117.59554
V TAKE 187.06863 N DIGGING 116.95930
N MEDICATION 181.25910 N PEDIATRICIAN 114.25856
N REMEDY 174.97739 N BOOK 113.33318
N STROKE 170.12477 N TEA 111.07923
N WORKOUT 168.19437 N PRODUCT 110.44459
Table A.15: Comparison of results for all TAC 2010 D1030F-B Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.20759 0.01279 0.05291
ID 78 0.13486 0.00000 0.02921
ID 79 0.20698 0.01007 0.04739
ID 80 0.20698 0.01007 0.04739
ID 81 0.20759 0.01279 0.05291
ID 82 0.20759 0.01279 0.05291
ID 83 0.17067 0.00269 0.03705
ID 84 0.20698 0.01007 0.04739
A.1.3 TAC 2011 Summary A Experiments
Best Performing Run for an Individual Topic
The summary with the highest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2011 summary A comparison was produced by run 84 for topic
D1126E-A.
Table A.16: TAC 2011 summary D1126E-A: best performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 84 0.49082 0.19098 0.20702
The summary for topic D1126E-A run 84:
President Bush on Sunday made a valedictory visit to Iraq, the country that
will largely define his legacy, but the trip will more likely be remembered
for the unscripted moment when an Iraqi journalist hurled his shoes at
Bush’s head and denounced him on live television as a "dog" who had
delivered death and sorrow here from nearly six years of war. Muntazer
al-Zaidi jumped up as Bush held a press conference with Iraqi Prime
Minister Nuri al-Maliki, shouted "It is the farewell kiss, you dog" and
threw his footwear.
The following table features the top 20 information units that make up the topic sig-
nature for run 84 for TAC 2011 summary D1126E-A.
Table A.17: TAC 2011 summary D1126E-A: run 84 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
NNP BUSH 747.21575 NNS SHOE 196.69451
NN SHOE 386.26835 NN PRESIDENT 193.84583
NNP AL-MALIKI 317.02511 NN REPORTER 187.67958
JJ IRAQI 288.63369 NN TRIP 182.89375
VBD THROW 266.15993 VBD DUCK 168.88735
NNP IRAQ 264.28265 NN CONFERENCE 161.78289
NN AGREEMENT 236.96789 NN JOURNALIST 149.18060
NNP BAGHDAD 235.83468 NN KISS 148.40481
NNS TROOPS 211.61704 NNS JOURNALIST 142.91354
VBD HURL 207.45424 NN SIZE 142.67232
Table A.18: Comparison of results for all TAC 2011 D1126E-A Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.41604 0.18481 0.20386
ID 78 0.38168 0.10540 0.13775
ID 79 0.41604 0.18481 0.20386
ID 80 0.41604 0.18481 0.20386
ID 81 0.41604 0.18481 0.20386
ID 82 0.42105 0.13165 0.16223
ID 83 0.41604 0.18481 0.20386
ID 84 0.49082 0.19098 0.20702
Worst Performing Run for an Individual Topic
The summary with the lowest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2011 summary A comparison was produced by runs 80, 82, and
84 for topic D1117C-A.
Table A.19: TAC 2011 summary D1117C-A: worst performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 80 0.15056 0.02028 0.03395
ID 82 0.15056 0.02028 0.03395
ID 84 0.15056 0.02028 0.03395
The summary for D1117C-A runs 80, 82, and 84 is:
Becoming the first senior officer fired over the poor treatment of wounded
soldiers, Major General George Weightman "was informed this morning
that the senior army leadership had lost trust and confidence in the
commander’s leadership abilities to address needed solutions for
soldier-outpatient care at Walter Reed Army Medical Center," the army
said in a statement. But as far back as 2003, the commander of Walter
Reed, Lt. Gen. Kevin Kiley, who is now the Army’s top medical officer,
was told that soldiers who were wounded in Iraq and Afghanistan
were languishing and lost on the grounds, according to interviews.
The following tables feature the top 20 information units that make up the topic sig-
natures for runs 80, 82, and 84 for TAC 2011 summary D1117C-A.
Table A.20: TAC 2011 summary D1117C-A: run 80 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
Reed 782.82771 care 247.92184
Army 658.97329 facility 217.90137
Walter 468.65233 veterans 181.02575
soldiers 427.09281 wounded 180.00340
conditions 363.76941 Kiley 179.38107
Post 339.17713 mold 170.12644
Building 301.23550 Priest 163.38436
Gates 289.25123 Cody 157.46524
Center 286.72261 fix 153.34708
bureaucracy 277.00485 secretary 142.58853
Table A.21: TAC 2011 summary D1117C-A: run 82 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
REED 782.82771 ARMY 218.34815
POST 623.74514 WALTER 217.53469
CENTER 468.65233 SOLDIER 215.72485
SECRETARY 436.37770 FACILITY 205.86814
KILEY 326.78160 GATES 179.38107
MOLD 289.25123 CONDITION 167.66782
PRIEST 280.72443 BUREAUCRACY 163.38436
BUILDING 270.93169 OUTPATIENT 161.47328
CODY 250.89293 CARE 157.46524
VETERAN 248.84959 FIX 146.45523
Table A.22: TAC 2011 summary D1117C-A: run 84 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
NNP REED 1233.00641 VBN RELIEVE 338.54511
NNP ARMY 787.92119 NNS PROBLEM 322.73028
NNP WALTER 678.26920 NNP CENTER 280.95158
NNS SOLDIER 529.87476 NNS CONDITION 258.38487
NNP KILEY 492.97587 NN COMMANDER 241.12500
NN CARE 447.08921 NN OUTPATIENT 228.00529
NNP HARVEY 401.47876 NN COMMAND 201.51052
NNP GATES 376.40115 NNP YOUNG 198.37828
NNP WEIGHTMAN 354.53460 VB FIX 182.33912
NNP POST 349.57793 NN TREATMENT 180.97485
Table A.23: Comparison of results for all TAC 2011 D1117C-A Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.32633 0.07241 0.10706
ID 78 0.16898 0.03058 0.04146
ID 79 0.32866 0.07208 0.10519
ID 80 0.15056 0.02028 0.03395
ID 81 0.32866 0.07208 0.10519
ID 82 0.15056 0.02028 0.03395
ID 83 0.32633 0.07241 0.10706
ID 84 0.15056 0.02028 0.03395
A.1.4 TAC 2011 Summary B Experiments
Best Performing Run for an Individual Topic
The summary with the highest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2011 summary B comparison was produced by runs 79 and 81
for topic D1120D-B.
Table A.24: TAC 2011 summary D1120D-B: best performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 79 0.45609 0.16424 0.19731
ID 81 0.45609 0.16424 0.19731
The summary for D1120D-B runs 79 and 81:
Lake Mead, the vast reservoir for the Colorado River water that sustains
the fast-growing cities of Phoenix and Las Vegas, could lose water
faster than previously thought and run dry within 13 years, according
to a new study by scientists at the Scripps Institution of Oceanography.
The lake, located in Nevada and Arizona, has a 50 percent chance
of becoming unusable by 2021, the scientists say, if the demand for
water remains unchanged and if human-induced climate change
follows climate scientists’ moderate forecasts, resulting in a reduction
in average river flows.
The following tables feature the top 20 information units that make up the topic sig-
natures for runs 79 and 81 for TAC 2011 summary D1120D-B.
Table A.25: TAC 2011 summary D1120D-B: run 79 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
Colorado 294.09002 drought 79.62392
water 205.89267 Utah 71.97667
Lake 181.08896 Arizona 66.15587
Mead 149.12705 dry 59.80377
climate 116.98762 reservoir 58.41668
River 112.31465 flows 51.53275
Barnett 106.15135 Pierce 46.58239
states 93.29591 Reclamation 46.58236
Nevada 89.67553 change 46.40240
Powell 80.49528 West 45.89424
Table A.26: TAC 2011 summary D1120D-B: run 81 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
COLORADO 294.09002 RESERVOIR 79.80372
LAKE 213.34136 UTAH 71.97667
WATER 208.00949 ARIZONA 66.15587
RIVER 159.50094 CHANGE 59.62173
MEAD 149.12705 ENERGY 58.15841
CLIMATE 137.27292 DRY 55.12751
BARNETT 106.15135 RECLAMATION 46.58236
NEVADA 89.67553 ANALYSIS 44.68527
DROUGHT 83.33805 SCRIPPS 43.35238
POWELL 80.49528 CANYON 43.35238
Table A.27: Comparison of results for all TAC 2011 D1120D-B Summaries
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.36970 0.09598 0.14078
ID 78 0.28645 0.07562 0.08978
ID 79 0.45609 0.16424 0.19731
ID 80 0.33633 0.05966 0.10031
ID 81 0.45609 0.16424 0.19731
ID 82 0.33633 0.05966 0.10031
ID 83 0.28645 0.07562 0.08978
ID 84 0.32617 0.06386 0.09918
Worst Performing Run for an Individual Topic
The summary with the lowest average ROUGE-1, ROUGE-2, and ROUGE-SU4 F-
measures in the TAC 2011 summary B comparison was produced by all runs for topic
D1112C-B.
Table A.28: TAC 2011 summary D1112C-B: worst performing unit of information defi-
nition
RUN ID ROUGE-1 ROUGE-2 ROUGE-SU4
ID 77 0.17281 0.00257 0.03875
ID 78 0.17281 0.00257 0.03875
ID 79 0.17281 0.00257 0.03875
ID 80 0.17281 0.00257 0.03875
ID 81 0.17281 0.00257 0.03875
ID 82 0.17281 0.00257 0.03875
ID 83 0.17281 0.00257 0.03875
ID 84 0.17281 0.00257 0.03875
The summary for D1112C-B for all runs:
Along with Romero and McKeown, those killed were sheriff’s Deputy James Tutino,
47, of Simi Valley in Ventura County, who took the commuter train occasionally
to get to work at the Men’s Central Jail in downtown Los Angeles; Elizabeth Hill,
65, of Van Nuys; Manuel Alcala, 51, of West Hills; Julia Bennett, 44, of Simi
Valley; Alonso Caballero of Winnetka; Don Wiley, 58, of Simi Valley; William
Parent, 53, of Canoga Park; Thomas Ormiston, 58, of Northridge, who was was
nearing retirement in a railroad career that began in 1970; and Henry Kilinski,
39, of Orange in Orange County.
The following table features the top 20 information units that make up the topic sig-
nature for run 77 for TAC 2011 summary D1112C-B.
Table A.29: TAC 2011 summary D1112C-B: run 77 topic signature (top 20)
Information Unit LLR Score Information Unit LLR Score
N ALVAREZ 1175.35689 N MURDER 108.21355
N JURY 296.09226 N JUAN 102.18704
N PENALTY 173.60611 N DEATH 93.54109
N METROLINK 164.22542 N SENTENCING 92.64510
N TRAIN 152.42272 N ROMERO 87.00721
N JUROR 135.76091 N JUDGE 79.44059
N LIFE 127.12369 N MANUEL 79.33694
N POUNDERS 117.80438 N APPEAL 76.23846
N TRACK 115.08398 N PAROLE 73.62145
N DERAILMENT 110.09067 N SUPERIOR 71.68222