TOPIC 8: TEXT SUMMARIZATIONnlpcs724.weebly.com/.../1/...8-text_summarization.pdf · Text Summarization Goal: produce an reduced version of a text that contains information that is

TOPIC 8: TEXT SUMMARIZATIONNATURAL LANGUAGE PROCESSING (NLP)

CS-724

WondwossenMulugeta (PhD) email: [email protected]

1

Topics2

Topics Subtopics8: Text Summarization

• Application, • Types, • Single doc. Vs Multiple doc.

Summarization, • Sentiment Analysis

Text Summarization

Goal: produce an reduced version of a text that

contains information that is important or relevant to

understand the content.

Summarization Applications

outlines or abstracts of any document, article, etc

summaries of email threads

action items from a meeting

simplifying text by compressing sentences

News summarization

3

‘Types’ of Summary?

Indicative vs. informative...used for quick categorization vs. content processing.

Extract vs. abstract...lists fragments of text vs. re-phrases content coherently.

Generic vs. query-oriented...provides author’s view vs. reflects user’s interest.

Background vs. just-the-news...assumes reader’s prior knowledge is poor vs. up-to-date.

Single-document vs. multi-document source...based on one text vs. fuses together many texts.

4

Aspects that Describe Summaries

Input subject type: domain

genre: newspaper articles, editorials, letters, reports...

form: regular text structure; free-form

source size: single doc; multiple docs (few; many)

Purpose situation: embedded in larger system (MT, IR) or not?

audience: focused or general

usage: IR, sorting, skimming...

Output completeness: include all aspects, or focus on some?

format: paragraph, table, etc.

style: informative, indicative, aggregative, critical...

5

What to summarize?

Single-document summarization

Given a single document, produce

abstract

outline

headline

Multiple-document summarization

Given a group of documents, produce a gist of the content:

a series of news stories on the same event

a set of web pages about some topic or question

6

Query-focused Summarization

& Generic Summarization

Generic summarization:

Summarize the content of a document

Query-focused summarization:

summarize a document with respect to an

information need expressed in a user query.

a kind of complex question answering:

Query based summarization tries to answer a

question by summarizing a document that has the

information to construct the answer

7

Summarization for Q&A: Snippets

Google creates snippets summarizing a web page for a query.

Google: 156 characters (about 26 words) plus title and link

8

Extractive summarization

Extractive summarization:

create the summary from phrases or sentences in the source

document(s)

Selects some of the best describing sentences from the

whole text and present to the user

The number of sentences can be specified by number or

proportion

Eg:

take the first sentence

Take the first sentences of each paragraph

Take the sentence which contains high frequency words

9

Abstractive summarization

Abstractive summarization:

express the ideas in the source documents using (at least in

part) different words

More sophisticated summarization approach where the

system is expected to compose its own summary of the

whole text.

10

Summarization: Three Stages

1. content selection: choose sentences to extract from

the document

2. information ordering: choose an order to place them

in the summary

3. sentence realization: clean up the sentences

11

Document

Sentence

Segmentation

Sentence

Extraction

All sentences

from documents

Extracted

sentences

Information

Ordering

Sentence

Realization

Summary

Content Selection

Sentence

Simplification

Basic Summarization Algorithm

1. content selection: choose sentences to extract from

the document

2. information ordering: just use document order

3. sentence realization: keep original sentences

12

Document

Sentence

Segmentation

Sentence

Extraction

All sentences

from documents

Extracted

sentences

Information

Ordering

Sentence

Realization

Summary

Content Selection

Sentence

Simplification

Unsupervised content selection

Choose Relevant Sentences

Choose sentences that have salient or informative words

Two approaches to defining salient words

1. tf-idf: weigh each word wi in document j by tf-idf

tf: the frequency of a word in the document

idf: the inverse of the number of sentences (document)

containing the term (in most cases we use isf-inverse

sentences frequency)

We are looking for high value of tf and idf

13

weight(wi ) = tfij ´ idfi

Supervised content selection

Given:

a labeled training set of good summaries for each document

Align:

the sentences in the document with sentences in the summary

Extract features

position (first sentence?)

length of sentence

word informativeness, cue phrases

cohesion

Train

Problems:

hard to get labeled training data

alignment difficult

14

Recall Oriented Evaluation

Intrinsic metric for automatically evaluating summaries

Based on BLEU (a metric used for machine translation)

Not as good as human evaluation (“Did this answer the

user’s question?”)

But much more convenient

Given a document D, and an automatic summary X:

1. Have N humans produce a set of reference summaries of D

2. Run system, giving automatic summary X

3. What percentage of the bigrams from the reference

summaries appear in X?

15

A ROUGE:

Recall Oriented Evaluation Q:Query: “What is water spinach?”

Human 1: Water spinach is a green leafy vegetable grown in the tropics.

Human 2: Water spinach is a semi-aquatic tropical plant grown as a vegetable.

Human 3: Water spinach is a commonly eaten leaf vegetable of Asia.

System answer: Water spinach is a leaf vegetable commonly eaten in

tropical areas of Asia.

The ROUGE-2 evaluation takes the ratio of the sum of matched

bigram between the human and the system output against the sum of

bigrams in each human summary.

16

10 + 9 + 9

3 + 3 + 6= 12/28 =0.43 ROUGE-2 =

Sentiment Analysis

An Example Review

“I bought an iPhone a few days ago. It was such a nice phone.

The touch screen was really cool. The voice quality was clear

too. Although the battery life was not long, that is ok for me.

However, my mother was mad with me as I did not tell her

before I bought the phone. She also thought the phone was

too expensive, and wanted me to return it to the shop. …”

What do we see?

Opinions,

targets of opinions, and

opinion holders

18

Why sentiment analysis?

Movie: is this review positive or negative?

Products: what do people think about the new iPhone?

Public sentiment: how is consumer confidence? Is despair

increasing?

Politics: what do people think about this candidate or issue?

Prediction: predict election outcomes or market trends from

sentiment

19

Terms

Sentiment

A thought, view, or attitude, especially one based

mainly on emotion instead of reason

Sentiment Analysis

Also Known as: opinion mining, Opinion extraction,

Sentiment mining, Subjectivity analysis

use of natural language processing (NLP) and

computational techniques to automate the extraction or

classification of sentiment from typically unstructured

text

20

Sentiment Analysis

Simplest task:

Is the attitude of this text positive or negative?

More complex:

Rank the attitude of this text from 1 to 5

Advanced:

Detect the target, source, or complex attitude types

21

Problem

Which features to use?

Words (unigrams)

Phrases/n-grams

Sentences

How to interpret features for sentiment detection?

Bag of words (IR)

Annotated lexicons (WordNet, SentiWordNet)

Syntactic patterns

Paragraph structure

22

Baseline Algorithm

Tokenization

Feature Extraction

Classification using different classifiers

Naïve Bayes

MaxEnt

SVM

23

Extracting Features for Sentiment

Classification

How to handle negation

I didn’t like this movie

vs

I really like this movie

vs

I really hate this movie

Which words to use?

Only adjectives

All words

All words turns out to work better, at least on this data

24

End of Topic 8

Documents

TOPIC 8: TEXT SUMMARIZATIONnlpcs724.weebly.com/.../1/...8-text_summarization.pdf · Text Summarization Goal: produce an reduced version of a text that contains information that is