Introduction

What is Text Summarization?

A summary.
An automatically generated summary.
An automatically generated summary of a document or collection.
An automatically generated summary of a document or collection which is at least as good as a human can produce.
We do not know good ways of doing it, so what are some other fields that we can borrow from to do what we need to do?
- Information Extraction
- Information Retrieval
- Text Mining
- Text Generation
Types of Text Summarization

What types of summaries are there?
- indicative versus informative
- extract versus abstract
- generic versus query-oriented
- background versus just-the-news
- single-document versus multi-document source
Types of Text Summarization

Summarization tasks can vary in what information is considered as the source:
- summaries can look at all the information in a document (or documents), or
- only the information that is deemed relevant for a specific task
Types of Text Summarization

This can be re-stated as:
- top-down (query-driven focus), versus
- bottom-up (text-driven focus)
What do Human Summarizers Do?

Generally, they:
- delete extraneous information
- generalize concepts
- make concepts more compact
What do Human Summarizers Do?

Example:
Father was washing dishes. Mother was working on her new book. The daughter was busy painting the window frames.

After summarization:
The whole family was busy.
What do Human Summarizers Do?

Example 2:
Father was washing dishes. Mother was working on her new book. The daughter was busy painting the window frames. All of a sudden, the publisher called in and told mother that he needed the manuscript a month earlier than foreseen. Father left the dishes and finished the drawings instead. The daughter dropped the brush and rushed to do the proofreading. Supported by her family, mother managed to finish her book in time.
What do Human Summarizers Do?

- The topic of the story has shifted
- The example stresses the importance of understanding the entire story before abstracting from it
- Humans read the entire document before summarizing
- Computational approaches can look at the entire document, or only at the subpart related to the task
What do Human Summarizers Do?

Discourse cues that aid in summarization:
- knowledge of the topic domain
- syntactic cues (topic-comment structure; connectives such as but, however, because, for example)
- stylistic and rhetorical cues ("The most pressing thing to do was...", "I conclude that...")
- structural cues (narrative structure)
- context or situational cues
What do Human Summarizers Do?

General strategies:
- What to keep: facts, items relating to the topic, items that discuss purpose, items that are stated positively, items that contrast other items, items that are stressed
- What to delete: reasons, comments, examples
What do Human Summarizers Do?

Studies on consistency found that, when abstracting documents:
- single human subjects vary widely in consistency on the same article over two different periods of time
- variation among different abstractors was even more significant
- even without much consistency, all abstracts produced were adequate
Computational Approaches

How do we do Text Summarization?
- Knowledge-based
- Selection-based
Historical Approaches

First text summarization algorithm, by Luhn (1958):
1. words are input from the text;
2. common/non-substantive words are deleted through table look-up;
3. content words are stored, along with their position in the text, as well as any punctuation located immediately to the left and/or right of the word;
4. content words are sorted alphabetically
Historical Approaches

Luhn Algorithm (cont.)
5. similar spellings are consolidated into word types (a rough approximation of a stemmer)
5a. any pair of tokens with fewer than seven non-matching letters is considered to be of the same word type:

    frequently / frequent: 10 letters, 8 match, 2 non-match
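Step 5a can be sketched in code. This is one illustrative reading of the slide's criterion (the function names and the common-prefix matching rule are assumptions, not Luhn's exact procedure):

```python
def letter_non_matches(a: str, b: str) -> int:
    """Count the letters in both tokens that fall outside the common prefix."""
    matched = 0
    for x, y in zip(a, b):
        if x != y:
            break
        matched += 1
    return (len(a) - matched) + (len(b) - matched)

def same_word_type(a: str, b: str, cutoff: int = 7) -> bool:
    """Tokens with fewer than `cutoff` non-matching letters are one word type."""
    return letter_non_matches(a, b) < cutoff

# "frequently" vs "frequent": 8 letters match, 2 do not -> same word type
print(letter_non_matches("frequently", "frequent"))  # 2
print(same_word_type("frequently", "frequent"))      # True
```

This consolidates inflectional variants such as frequent/frequently while keeping unrelated words (e.g. white vs elephant, 13 non-matches) apart.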
Historical Approaches

Luhn Algorithm (cont.):
5b. the frequencies of word types are compared
5c. low-frequency word types are deleted
5d. the remaining words are considered significant

Problem: anaphora. Coreferent phrases share no surface words, so frequency counts miss the connection:
- white elephant
- those big animals
- they are big and white
Historical Approaches

Luhn Algorithm (cont.)
6. remaining word types are sorted into location order;
7. sentence representativeness is determined by dividing sentences into substrings defined by the distances between significant words:

    Better to see you with, my dear
    ->  Better to | to see | you with, my | with, my dear
Historical Approaches

    Better to see you with, my dear.
    Substring 1: Better (2) to      ->  2/2 = 1
    Substring 2: to see (4)         ->  4/2 = 2
    Substring 3: you (6) with, my   ->  6/3 = 2
    Substring 4: with, my dear (1)  ->  1/3 = 0.333
    Total value for sentence = 5.33

8. for each substring, a representativeness value is calculated by dividing the value of the substring's significant word by the total number of tokens in the substring;
9. sentences whose value exceeds a preset threshold are selected for inclusion
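The whole pipeline can be sketched as a short program. This is a hedged reconstruction, not Luhn's exact implementation: the stopword list, the frequency cutoff, and the clustering window of four words are illustrative choices, and cluster scoring uses the common (significant words)^2 / span-length formulation rather than the exact arithmetic of the worked example above:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "to", "and", "in", "is", "that", "it", "was", "by"}

def tokens(sentence):
    return [w.lower().strip(".,;:!?") for w in sentence.split()]

def significant_words(sentences, min_freq=2):
    """Steps 2-5d: drop stopwords, count frequencies, keep the frequent types."""
    freq = Counter(w for s in sentences for w in tokens(s) if w not in STOPWORDS)
    return {w for w, c in freq.items() if c >= min_freq}

def sentence_score(sentence, significant, window=4):
    """Steps 7-8: cluster nearby significant words, score each cluster, keep the best."""
    words = tokens(sentence)
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best, start, prev, count = 0.0, positions[0], positions[0], 1
    for pos in positions[1:]:
        if pos - prev < window:          # still inside the same cluster
            prev, count = pos, count + 1
        else:                            # close the cluster, start a new one
            best = max(best, count * count / (prev - start + 1))
            start, prev, count = pos, pos, 1
    return max(best, count * count / (prev - start + 1))

def summarize(text, threshold=1.0):
    """Step 9: keep sentences whose best cluster score clears the threshold."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    sig = significant_words(sentences)
    return [s for s in sentences if sentence_score(s, sig) > threshold]
```

For example, `summarize("Luhn scores sentences by significant words. Significant words cluster together. Random filler here.")` keeps the first two sentences (each scores 2.0 on the cluster "significant words") and drops the third, which contains no significant word.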
Historical Approaches

TRW (1960s) builds upon the Luhn model by:
- adding weights for words that occur in the title or subtitles of the document
- giving sentences earlier or later in a paragraph higher weights than those in the middle

However, the largest drawback at this point is that whole sentences are extracted, not rewritten.
Historical Approaches

Models Influenced by Cognitive Science
- make use of frames and scripts to simulate schemas, which are formats of knowledge representation
- examples: FRUMP, PAULINE
Historical Approaches

FRUMP:
- expectation-driven model
- knowledge base consists of sketchy scripts
- looks for instances of the knowledge base in the text to be summarized
- full parsing is not necessary for this method to work
Historical Approaches

PAULINE:
- pragmatically driven
- can generate 100 different summaries from 1 original
- initially asks the user for information to help guide its behavior
- asks the user for conversation topics
- collects information on the topic and then creates sentences
- pragmatics used include: make the listener like me, use a "highfalutin" tone of voice, persuade the listener to change their opinion
Current Approaches

Newer methods are characterized by:
- stochastic methods
- integration of corpus linguistics
- shallow parsing methods
- lexical semantic knowledge through use of WordNet
- integration of different methods in one model
- summarization from structured knowledge
- integration of information from different media
Current Approaches

Using related fields:
- IE
- DB
- Compression
- Text Generation
Current Approaches

Think Smaller! [Sentence Compression]
Current Approaches

Sentence Compression: the Noisy Channel
Current Approaches

Sentence Compression: Source -> Channel -> Decoder
Current Approaches

Sentence Compression: Focus of the Compression
Current Approaches

Sentence Compression: Sentences or Trees?
Current Approaches

Sentence Compression
Q: So, how do we do it?
A: The probability that the original sentence is an expansion of the generated sentence
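A toy decoder can make the Q&A concrete: score each candidate short string by a source model P(short) plus a channel model for the probability that the original sentence is an expansion of it, then search over word deletions for the best combined score. Everything here is illustrative: the hand-set unigram log-probabilities and the flat per-word expansion penalty stand in for the trained, syntax-based models a real noisy-channel compressor (e.g. Knight and Marcu's) would use:

```python
import itertools

# Hand-set unigram log-probs (illustrative only; a real source model is a
# trained language model over strings or trees).
LOGP = {"the": -1.0, "operations": -2.0, "of": -1.0, "three": -4.0,
        "products": -2.0, "vary": -2.0, "widely": -4.0}
DEFAULT_LOGP = -6.0
DELETION_PENALTY = 3.0   # channel cost per word the "expansion" added

def source_logp(words):
    """log P(short): how plausible the compressed string is on its own."""
    return sum(LOGP.get(w, DEFAULT_LOGP) for w in words)

def channel_logp(long_words, short_words):
    """log P(long | short): each extra word in the expansion costs a penalty."""
    return -DELETION_PENALTY * (len(long_words) - len(short_words))

def best_compression(sentence):
    """Decoder: exhaustively search word subsequences for the best combined score."""
    long_words = sentence.lower().split()
    best, best_score = long_words, source_logp(long_words)
    for k in range(1, len(long_words)):
        for keep in itertools.combinations(range(len(long_words)), k):
            short = [long_words[i] for i in keep]
            score = source_logp(short) + channel_logp(long_words, short)
            if score > best_score:
                best, best_score = short, score
    return " ".join(best)

print(best_compression("the operations of the three products vary widely"))
# -> "the operations of the products vary"
```

Note the tension the toy exposes: the source model alone prefers ever-shorter strings, so the channel penalty is what keeps informative words in; real systems resolve this with syntactic channel models over parse trees rather than flat deletion costs.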
Current Approaches

Example: candidate compressions of one sentence, with their scores

- Beyond that basic level, the operations of the three products vary widely (1514588)
- Beyond that level, the operations of the three products vary widely (1430374)
- Beyond that level, the operations of the three products vary (1249223)
- Beyond that basic level, the operations of the products vary (1181377)
- The operations of the three products vary widely (939912)
- The operations of the products vary widely (872066)
- The operations of the products vary (748761)
- The operations of products vary (809158)
- The operations vary (522402)
- Operations vary (662642)
Current Approaches

Example: candidate compressions of a second sentence

- Finally, another advantage of broadband is distance.
- Finally another advantage of broadband is distance.
- Another advantage of broadband is distance.
- Advantage of broadband is distance.
- Another advantage is distance.
- Advantage is distance.
Current Approaches

Example:
The documentation is typical of Epson quality; excellent.
-> Documentation is excellent.
Current Approaches

Example:
All of our design goals were achieved and the delivered performance matches the speed of the underlying device.
-> All design goals were achieved.
Current Approaches

Example:
Reach's E-mail product, MailMan, is a message-management system designed initially for VINES LANs that will eventually be operating system-independent.
-> MailMan will eventually be system-independent.
Current Approaches

Example:
Although the modules themselves may be physically and/or electronically incompatible, the cable-specific jacks on them provide industry-standard connections.
-> Cable-specific jacks provide industry-standard connections.
Current Approaches

Example:
Ingres/Star prices start at $2,100.
-> Ingres/Star prices start at $2,100.
Current Approaches

Example
Original: Beyond the basic level, the operations of the three products vary widely.
Baseline: Beyond the basic level, the operations of the three products vary widely.
Noisy-Channel: The operations of the three products vary widely.
Decision-based: The operations of the three products vary widely.
Humans: The operations of the three products vary widely.
Current Approaches

Example
Original: Arborscan is reliable and worked accurately in testing, but it produces very large dxf files.
Baseline: Arborscan and worked in, but very large dxf.
Noisy-Channel: Arborscan is reliable and worked accurately in testing, but it produces very large dxf files.
Decision-based: Arborscan is reliable and worked accurately in testing very large dxf files.
Humans: Arborscan produces very large dxf files.
Current Approaches

Example
Original: Many debugging features, including user-defined break points and variable-watching and message-watching windows, have been added.
Baseline: Debugging, user-defined and variable-watching and message-watching, have been.
Noisy-Channel: Many debugging features, including user-defined points and variable-watching and message-watching windows, have been added.
Decision-based: Many debugging features.
Humans: Many debugging features have been added.
Future Work

- Noisy Channel
- Knowledge-Based (CYC)
- Other
Summary

Text summarization spans several different methods and subtasks, and, like most recent developments in computational linguistics, more work remains before automatic summaries match human expectations.