

Third International Conference on Language Resources and Evaluation, Las Palmas, Canary Islands, Spain, 29-31 May 2002

Columbia University

Catalogued recommended information from 5 prescriptive guidelines for A.B.E.s (Annotated Bibliography Entries)

Using the Annotated Bibliography as a Resource for Indicative Summarization

Min-Yen Kan*, Judith L. Klavans** and Kathleen R. McKeown*
{min, judith, kathy}@cs.columbia.edu

1. Extract versus Abstract
2. Informative versus Indicative
3. Generic versus Query-biased
4. Single document versus Multiple

Selected Summary Dimensions

News Summaries

Scientific Summaries

Snippets

Card Catalog Entries

Annotated Bibliography Entries

Corpus Collection & Encoding

Corpus Availability
The corpus is available for academic and not-for-profit research, by request to:

<[email protected]>

An annotation guide, explaining the tagging guidelines in more detail, is also available. Command-line and web CGI utilities are also provided to modify, insert, and extract attributes from the corpus.

* Department of Computer Science

** Center for Research on Information Access

<bibEntry id="id26" title="Analysis of covariance" url="http://www.math.yorku.ca/SCS/biblio.html" type="paper" domain="statistics" microCollection="Analysis of Covariance" offset="4">

<beforeContext>Maxwell, S. E., Delaney, H. D., &amp; O'Callaghan, M. F. (1993). Analysis of...</beforeContext>

<entry><OVERVIEW>This <MEDIATYPES>paper </MEDIATYPES>gives a brief history of ANCOVA, and then discusses ANCOVA in ... contains no matrix algebra.</DIFFICULTY></entry>

<parsedEntry>PROB 14659 -112.252 0 TOP -112.252 S -105.049 NP-A -8.12201 NPB -7.82967 DT 0 This NN 0 paper ...</parsedEntry>

</bibEntry>
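A minimal sketch of how such an entry might be consumed programmatically, assuming the corpus records are well-formed XML as shown above. The `SAMPLE` below is a simplified, hand-balanced stand-in for a real corpus record, not the actual file contents:

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for a corpus <bibEntry> record.
SAMPLE = """
<bibEntry id="id26" title="Analysis of covariance"
          url="http://www.math.yorku.ca/SCS/biblio.html"
          type="paper" domain="statistics"
          microCollection="Analysis of Covariance" offset="4">
  <beforeContext>Maxwell, S. E., Delaney, H. D., and O'Callaghan, M. F. (1993).</beforeContext>
  <entry><OVERVIEW>This <MEDIATYPES>paper</MEDIATYPES> gives a brief history
  of ANCOVA.</OVERVIEW><DIFFICULTY>It contains no matrix algebra.</DIFFICULTY></entry>
</bibEntry>
"""

def parse_entry(xml_text):
    """Return the bibEntry attributes and the semantic tags used in <entry>."""
    root = ET.fromstring(xml_text)
    entry = root.find("entry")
    # Collect the names of the semantic tags nested inside <entry>.
    tags = sorted({el.tag for el in entry.iter() if el is not entry})
    return dict(root.attrib), tags

attrs, tags = parse_entry(SAMPLE)
print(attrs["domain"], tags)  # → statistics ['DIFFICULTY', 'MEDIATYPES', 'OVERVIEW']
```

The same attribute-extraction pattern would underlie the command-line and CGI utilities mentioned above, though their actual implementation is not specified in the poster.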

Our language resource of annotated bibliography entries was designed to ease the collection of the corpus as well as to make many features available for subsequent analysis for summarization and related natural language applications.

Presently:

- 1,200 documents containing “annotated bibliography” were spidered
- of those, 64 documents were hand parsed, yielding 2,000 entries
- of those 2,000 entries, 100 of the parsed <entry> tags were further annotated with semantic tags
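The first collection step (keeping only spidered pages that mention the phrase) can be sketched roughly as follows. The directory layout and function names are hypothetical illustrations, not the project's actual spidering code:

```python
import os
import re

def looks_like_annotated_bibliography(html_text):
    """Crude phrase filter: keep pages that mention 'annotated bibliography'
    (case-insensitive; the stem also matches 'bibliographies')."""
    return re.search(r"annotated\s+bibliograph", html_text, re.IGNORECASE) is not None

def collect_candidates(directory):
    """Yield paths of spidered HTML files that pass the phrase filter."""
    for name in sorted(os.listdir(directory)):
        if not name.endswith((".html", ".htm")):
            continue
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="replace") as fh:
            if looks_like_annotated_bibliography(fh.read()):
                yield path
```

The 64 hand-parsed documents would then be selected from the pages this filter retains.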

- <beforeContext>: the text before the body of the entry
- domain: the subject or theme of the source document (coarser granularity than title)
- url: the location of the source document
- offset: the position of the entry on the page

Other fields, also optional:

- <afterContext>: text that is distinctly marked off as coming after the entry
- <macroCollection>: the division that the page represents in the set of related pages
- <microCollection>: the internal division in the page that this entry belongs to
- <entry>: the text, with the 24 semantic tags
- <parsedEntry>: Collins '96 parse of the entry

Annotated Bibliography Entries are indicative summaries. They:

- are longer than both card catalog summaries and snippets
- are organized around a theme; an ideal standard for "query-based" summaries
- have explicit comparisons of one resource versus another
- have prefacing overviews of the documents in the bibliography
- are rich in meta-information features

We study them as models for summaries by examining prescriptive guidelines and performing a corpus study.

Prescriptive Guidelines (x = recommended in one of Ree70, EBC98, Les01, AACC98, Wil02) and Corpus Study (# tags in corpus; % entries with tag)

Metadata and Other Features       Guidelines   # tags   % entries
Media Type                        x x            55       48%
Author/Editor                                    43       27%
Content Types/Special Feature     x x            41       29%
Subjective Assess/Coverage        x x x x        36       24%
Authority/Authoritativeness       x x x          26       20%
Background/Source                                21       16%
Navigation/Internal Structure     x              16       11%
Collection Size                                  13       10%
Purpose                           x x x          13       10%
Audience                          x x x x        12       12%
Contributor                                      12       12%
Cross-resource comparison         x              10        9%
Size/Length                                       9        7%
Style                                             8        6%
Query Relevance                   x x             4        3%
Readability                                       4        3%
Difficulty                                        4        4%
Edition/Publication Information                   3        3%
Language                                          2        2%
Copyright                                         2        1%
Award/Quality/Defects             x x x           2        1%

Topicality Features
Detail                                          139       47%
Overview                                         72       64%
Topic                                            34       28%
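The "# tags" and "% entries with tag" figures can be derived from per-entry semantic-tag lists; a minimal sketch, in which the toy `entries` list stands in for the 100 annotated entries:

```python
from collections import Counter

def tag_coverage(entries):
    """Given per-entry lists of semantic tags, return, for each tag,
    (total occurrences in the corpus, percent of entries containing it)."""
    totals = Counter()
    present_in = Counter()
    for tags in entries:
        totals.update(tags)           # every occurrence counts toward "# tags"
        present_in.update(set(tags))  # each entry counts at most once
    n = len(entries)
    return {t: (totals[t], round(100 * present_in[t] / n)) for t in totals}

# Toy stand-in for the 100 annotated entries.
entries = [["OVERVIEW", "DETAIL", "DETAIL"], ["OVERVIEW"],
           ["TOPIC", "DETAIL"], ["OVERVIEW"]]
print(tag_coverage(entries)["OVERVIEW"])  # → (3, 75)
```

Counting occurrences and entry coverage separately explains why a tag such as Detail can have more occurrences than Overview yet appear in a smaller share of entries.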

Card catalog entries consist of structured fields, of which a summary is an optional field. Other types of information (such as notes, book jacket text, or book reviews) are often substituted for summaries.

Snippets are short indicative descriptions given by the authors of web pages. They are often very short (e.g., Yahoo! or ODP category pages). Amitay (2000) shows strategies for locating and extracting snippets and for ranking candidates for fitness as a summary.

A number of studies have used abstracts of scientific articles as target summaries (e.g., Kupiec et al. 1995). Abstracts tend to summarize the document's topics well but make little use of metadata.

For news summaries, DUC provides a large corpus of informative summaries. Jing and McKeown (1999) use source document and target summary relations for "cut and paste" summarization.

                                Extract vs.      Indicative vs.   Generic vs.    Single vs.      Uses        Corpus vs.
                                Abstract         Informative      Query-based    Multidocument   Metadata?   Algorithm
Annotated Bibliography Entries  Abstract         Both             Both           Mostly Single   Yes         Corpus
DUC                             Both             Informative      Generic        Both            Yes         Corpus
Snippets                        Abstract         Indicative       Both           Single          Yes         Algorithm
Card Catalog Entries            Abstract         Indicative       Generic        Single          Yes         Corpus
Scientific Abstracts            Abstract         Informative     Generic         Single          No          Corpus
Ziff Davis                      Mostly Extract   Informative     Generic         Single          No          Corpus

We performed a study of 100 entries (see the corpus study table).