
A progressive sentence selection strategy for document summarization

You Ouyang, Wenjie Li, Renxian Zhang, Sujian Li, Qin Lu

IPM 2013

Hao-Chin Chang

Department of Computer Science & Information Engineering

National Taiwan Normal University

2013/03/05


Outline

• Introduction

• Methodology

• Experiments and evaluation

• Conclusion and future work


Introduction

• Many studies acknowledge that sentence selection strategies are very important; these strategies mainly aim at reducing the redundancy among the selected sentences to enable them to cover more concepts.

• Different from the existing methods, in our study we'd like to explore the idea of directly examining the uncovered part of the sentences for saliency estimation, in order to maximize the coverage of the summary.


Introduction

• To avoid the possible saliency problem, we make use of the subsuming relationship between sentences to improve the saliency measure.
  – The idea is to use the salient general concepts, which are more significant, to help discover the salient supporting concepts.

• Once we have selected a general word "school" in a sentence of the summary, we would like to select "student" or "teacher" in subsequent sentences.

• Sentence A: The schools that have vigorous music programs tend to have higher academic performance.

• Sentence B: Among the lower-income students without music involvement, only 15.5% achieved high math scores.

• When sentence A is selected, how much do we want to include another sentence B to support the ideas in sentence A?


Identifying word relations

• 1. Linguistic relation databases such as WordNet
• 2. Frequency-based statistics such as co-occurrence or pointwise mutual information

• In our study, the target is to study the subsuming relations between the words in the input documents.
  – The association of two words is defined by two conditions:
  – P(a|b) ≥ 0.8 and P(b|a) < P(a|b)
  – Word a subsumes word b if the documents in which b occurs are a subset, or nearly a subset, of the documents in which a occurs.
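The two conditions above can be checked directly from the sets of documents each word occurs in. A minimal sketch (the function name and the set-of-document-ids representation are my own):

```python
def doc_subsumes(docs_with_a, docs_with_b, theta=0.8):
    """Check whether word a subsumes word b from their document sets:
    P(a|b) >= theta and P(b|a) < P(a|b)."""
    overlap = len(docs_with_a & docs_with_b)
    p_a_given_b = overlap / len(docs_with_b)  # share of b's documents that also contain a
    p_b_given_a = overlap / len(docs_with_a)  # share of a's documents that also contain b
    return p_a_given_b >= theta and p_b_given_a < p_a_given_b
```

Here `docs_with_a` and `docs_with_b` are the sets of document ids in which each word occurs; the subsumption holds when b's documents are (nearly) a subset of a's.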


Identifying word relations

• Sentence-level coverage
  – sometimes a document set consists of only a few documents
  – to get more usable information, we study sentence-level co-occurrence statistics instead

• Set-based coverage
  – sentence-level co-occurrence is sparser than document-level co-occurrence due to the shorter length of sentences
  – we therefore examine the coverage not only between two words, but also between a word and a word set

• Example: in the two common phrases "King Norodom" and "Prince Norodom", the word "Norodom" is almost entirely covered by the set {King, Prince}


Identifying word relations

• Transitive reduction
  – The subsuming relation between two words also reflects the recommendation status between them
  – For three words a, b, c that satisfy a > b, b > c and a > c (a > b denotes a subsuming b), the long-range relation a > c will be ignored
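The pruning described above can be sketched as follows, assuming the subsuming relations are stored as a set of (subsumer, subsumed) pairs; this simple version only removes an edge when a single intermediate word explains it:

```python
def transitive_reduction(edges):
    """Remove 'long-range' subsuming edges a > c whenever some b exists
    with a > b and b > c, keeping only the direct links."""
    edge_set = set(edges)
    reduced = set(edges)
    for (a, c) in edges:
        # drop a > c if an intermediate b gives a > b and b > c
        intermediates = {y for (x, y) in edge_set if x == a and y != c}
        if any((b, c) in edge_set for b in intermediates):
            reduced.discard((a, c))
    return reduced
```

A full transitive reduction would also consider longer chains, but for the three-word case on the slide this is exactly the behavior described.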

• Spanned sentence set
  – For a word w in a document set D, whose sentence set is denoted by SD, SPAN(w) is defined as the set of the sentences that contain w


SPAN(w) = { s | s ∈ SD, w ∈ s }


Identifying word relations

• Given an existing non-empty word set W = {w1, w2, ..., wn}

• The concept coverage of a word w over W is devised to reflect to what extent w brings new information relative to the known information provided by W

• COV(w|W) is defined as the proportion of the sentences in SPAN(w) that also appear in SPAN(W)

• The smaller the coverage, the more likely w will bring new information to W


W = {w1, w2, ..., wn}

SPAN(W) = SPAN(w1) ∪ SPAN(w2) ∪ ... ∪ SPAN(wn)

COV(w | W) = |SPAN(w) ∩ SPAN(W)| / |SPAN(w)|
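With sentences represented as word sets, SPAN and COV can be computed directly from their definitions. A small sketch (function names and the list-of-word-sets representation are mine):

```python
def span(word, sentences):
    """SPAN(w): indices of the sentences that contain w."""
    return {i for i, s in enumerate(sentences) if word in s}

def cov(word, word_set, sentences):
    """COV(w | W): proportion of SPAN(w) that falls inside SPAN(W),
    where SPAN(W) is the union of the spans of the words in W."""
    span_w = span(word, sentences)
    span_W = set().union(*(span(v, sentences) for v in word_set))
    return len(span_w & span_W) / len(span_w)
```

On the Norodom example, every sentence containing "norodom" also contains "king" or "prince", so COV is 1.0 and "norodom" brings no new information relative to {king, prince}.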


Identifying word relations

• When comparing the word w to a former word w0 that already subsumes a set of words S, a relation between w and w0 is aligned under two constraints, with thresholds θ1, θ2 ∈ (0, 1):

COV(w | w0) ≥ θ1 and COV(w | S) ≥ θ2


The definition of the subsuming relationship

• Denote the word set of s as W = {w1, ..., wl} and the word set of s' as W' = {w'1, ..., w'm}
• Connected word
  – A word wi in W is regarded as "connected" to a word w'j in W' if there exist words wl1, ..., wlk ∈ W ∪ W' such that

    wi → wl1 → wl2 → ... → wlk → w'j

    where each arrow denotes a subsuming relation between adjacent words
  – The word that directly connects wi is wl1; the weight of this edge is COV(wi | wl1)
  – The strength of the connection between wi and w'j is denoted CON(wi | w'j)


The definition of the subsuming relationship

• The conditional saliency (CS for short) of s to s' is calculated as a weighted sum of the importance of all the "connected words" in s to s':

CS(s | s') = (1 / LOG|s'|) · Σ_{wi ∈ s} MAX_{w'j ∈ s'} CON(wi | w'j) · score(wi)
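One possible reading of the formulas above, assuming CON(wi | w'j) is the product of the COV edge weights along a subsumption path (the path semantics and the +1 inside the log normalization are my assumptions, the latter to avoid log(1) = 0):

```python
import math

def con(wi, target, cov_edge, subsumers, _seen=None):
    """CON(wi | w'j): connection strength from wi up to w'j, taken here
    as the product of the COV weights of the edges on a subsumption path.
    subsumers[w] lists the words that directly subsume w."""
    if wi == target:
        return 1.0
    _seen = (_seen or set()) | {wi}
    best = 0.0
    for parent in subsumers.get(wi, ()):
        if parent in _seen:
            continue
        best = max(best, cov_edge[(wi, parent)] * con(parent, target, cov_edge, subsumers, _seen))
    return best

def conditional_saliency(s, s_prime, cov_edge, subsumers, score):
    """CS(s | s'): sum over the words of s of their strongest connection
    to any word of s', weighted by word importance and normalized by the
    log length of s'."""
    total = sum(max(con(w, w2, cov_edge, subsumers) for w2 in s_prime) * score[w]
                for w in s)
    return total / math.log(len(s_prime) + 1)
```

In the school/student example, a single edge student → school with COV 0.9 gives CON(student | school) = 0.9, so a sentence containing "student" gets saliency conditioned on the selected "school" sentence.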


Progressive sentence selection strategy

• It can be viewed as a random walk process on the DAG (directed acyclic graph) from the center to its surrounding nodes

• We introduce a virtual word besides the real words that actually appear in the input documents

• The virtual word is used as the center of the DAG (denoted ROOT-W)

• We can view it as a virtual word that spans the whole sentence set, so that it perfectly covers any actual word


Progressive sentence selection strategy

• This virtual sentence ROOT-S is regarded as being already selected at the beginning of the sentence selection process.

• The conditional saliency of a sentence to ROOT-S indicates its ability to describe the general ideas of the input documents, because the words attached to ROOT-W are the general words.


Progressive sentence selection strategy

• The sentence selection process is cast as:
  – first adding ROOT-S to the initial summary
  – then iteratively adding the sentence that best supports the existing sentence(s) (denoted Sold)

• The score of each unselected sentence is based on its conditional saliency to each selected sentence; this maximum saliency indicates how much supporting information the sentence provides

• When different sentences contain the same "connected words", they have equal scores
  – we therefore use two popular criteria, length and position, to obtain the final measure of the sentence score


Score(s | Sold) = MAX_{st ∈ Sold} CS(s | st) · (1 / len(s)) · (1 / pos(s))
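The whole strategy can be sketched as a greedy loop, assuming a `cs(s, t)` function that also accepts the virtual ROOT-S; the data layout and names are illustrative, not the authors' implementation:

```python
def progressive_select(sentences, cs, length_limit):
    """Greedy progressive selection: start from the virtual ROOT-S and
    repeatedly add the unselected sentence with the highest score
    Score(s | Sold) = max_t CS(s | s_t) / (len(s) * pos(s))."""
    selected = ["ROOT-S"]            # virtual sentence, selected up front
    summary, used = [], 0
    candidates = set(range(len(sentences)))
    while candidates:
        def score(i):
            best_cs = max(cs(sentences[i], t) for t in selected)
            return best_cs / (len(sentences[i]["words"]) * sentences[i]["pos"])
        best = max(candidates, key=score)
        if used + len(sentences[best]["words"]) > length_limit:
            break                    # summary length budget exhausted
        summary.append(best)
        selected.append(sentences[best])
        used += len(sentences[best]["words"])
        candidates.discard(best)
    return summary
```

Each iteration re-scores the remaining sentences against everything already selected, so a sentence can enter the summary either because it supports ROOT-S (a general sentence) or because it supports an earlier pick (a supporting sentence).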


Redundancy control by penalizing repetitive words

• To ensure that the selected sentence always brings new concepts, a damping factor α is applied to the word importance during the sentence selection process:

Score(wi) ← α · Score(wi)

• In the extreme case when α equals 0, an effective "connected word" is required not to appear in any selected sentence
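Applying the damping factor after each selection step might look like this (a minimal sketch; the score-table representation is an assumption):

```python
def damp_word_scores(scores, selected_sentence, alpha=0.5):
    """After a sentence is selected, multiply the score of every word it
    contains by the damping factor alpha; with alpha = 0 a word already
    in the summary can no longer contribute as a 'connected word'."""
    return {w: (s * alpha if w in selected_sentence else s)
            for w, s in scores.items()}
```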


Experiments and evaluation

• Document Understanding Conference (DUC)
  – The proposed summarization methods are first evaluated on a generic multi-document summarization data set
  – and then extended to several query-focused multi-document summarization data sets

• We use the automatic evaluation toolkit ROUGE to evaluate the system summaries.

• DUC 2004 generic multi-document summarization data set
  – 45 document sets
  – each set consists of 10 documents


Experiments and evaluation

• The resulting summary tends to include more diverse words and thus stands a better chance of sharing words with the reference summaries, which may lead to a higher ROUGE-1 score.

• The ROUGE-2 score may decrease even more, as it requires matching two consecutive words.

• The sequential system obtains the highest ROUGE-1 score with full penalty on repetitive words (α equals 0); however, its ROUGE-2 scores drop significantly.

• The best ROUGE-2 scores are obtained when α equals 0.5; we can observe that the drop rate is much lower for the progressive system.


Experiments and evaluation

• This clearly demonstrates the advantage of the progressive sentence selection strategy: it guarantees both the novelty and the saliency of the selected sentences


Experiments and evaluation

• The damping factor is used to handle the redundancy issue

• The reason is that it is more consistent with the word importance estimation method used in the systems, and is thus better at handling redundancy for the system


Experiments and evaluation

• If θ2 is too small
  – many unrelated words may be wrongly associated, which unavoidably impairs the reliability of the word relations and leads to worse performance

• If θ2 is too large
  – the discovered word relations will be very limited, which weakens the progressive system


Experiments and evaluation

• DUC 2005–2007 query-focused multi-document summarization data sets
  – the data set in each year contains about 50 topics
  – each topic consists of 25–50 documents
  – system-generated summaries are strictly limited to 250 English words in length


Experiments and evaluation

• It is also shown that incorporating the query to refine the word importance is effective for both the progressive system and the sequential system.



Conclusion and future work

• In the process, a sentence can be selected either as a general sentence or as a supporting sentence.

• The sentence relationship is used to improve the saliency of the supporting sentences.

• A single word alone is often insufficient to represent a complex concept, and the sense of a word can be ambiguous in a document set.
  – In future work, we'd like to explore concept relations


Conclusion and future work

• Due to the limitations of current natural language generation techniques, automatic summarization systems still cannot freely compose ideal sentences as humans do.

• In the future, we'd like to investigate other means to break the limitation of the original sentences, such as sentence compression or sentence fusion, which can generate additional candidate sentences to express the desired concepts more accurately.


Speech summarization


Experiments Data

• Experiment corpus lists
  – ds2_all_list.txt
  – ds2_all_list_train.txt: training corpus list (100 documents)
  – ds2_all_list_test.txt: test corpus list (105 documents)

• 20-document test list
  – test_difficult.txt

• Extra data used by the RM and WRM methods
  – 2002_News_Content.txt.seg


Experiments Data

• Background (BG) data
  – CNA0102.GT3-7.lm.wid
    N-gram, LM score (base-10 logarithm; must be converted to base e), LMWID, back-off score

• Dictionary
  – NTNULexicon2003-72K.txt
    AcousticWID, LMWID, N-gram, Chinese character, Zhuyin (phonetic) symbols, toneless syllable WID, toned syllable WID


Experiments Data

• ROUGE dictionary
  – RougeDict.txt
  – a1 a2 a3
  – a (LMWID)


Sentence modeling

• ULM (unigram language model):

  P(D | S) = Π_{w ∈ D} P(w | S)^{c(w,D)},  where P(w | S) = c(w, S) / |S|

• KL (Kullback–Leibler divergence):

  KL(D || S) = Σ_{w ∈ V} P(w | D) · log [ P(w | D) / P(w | S) ],  where P(w | D) = c(w, D) / |D|
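The KL criterion can be sketched as below; the add-one smoothing used to keep P(w|S) non-zero is my assumption and stands in for the slide's background-corpus smoothing:

```python
import math

def kl_score(doc_counts, sent_counts, vocab):
    """KL(D || S) = sum_w P(w|D) * log(P(w|D) / P(w|S)), with add-one
    smoothing on P(w|S) so the log never sees a zero. Sentences with a
    lower divergence better match the document's word distribution."""
    d_total = sum(doc_counts.values())
    s_total = sum(sent_counts.values())
    kl = 0.0
    for w in vocab:
        p_d = doc_counts.get(w, 0) / d_total
        if p_d == 0:
            continue                 # terms with P(w|D) = 0 contribute nothing
        p_s = (sent_counts.get(w, 0) + 1) / (s_total + len(vocab))
        kl += p_d * math.log(p_d / p_s)
    return kl
```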


Sentence modeling

• RM (relevance model):

  P_new(w | S) = α · P_original(w | S) + β · P_RM(w | S) + (1 − α − β) · P(w | C_BG)

  P_RM(w, S) = Σ_{m=1}^{M} P(D_m) · P(w, S | D_m) = Σ_{m=1}^{M} P(D_m) · P(w | D_m) · Π_{l=1}^{L} P(s_l | D_m)
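The relevance-model estimate on the last line can be sketched as follows, assuming per-document unigram models `p_w_d[m]` and a uniform document prior (both assumptions for illustration):

```python
import math

def relevance_model(sentence, p_w_d, p_d=None):
    """Relevance model estimate: P_RM(w, S) = sum_m P(D_m) * P(w|D_m)
    * prod_l P(w_l|D_m), normalized over the vocabulary. p_w_d[m] maps
    each word to P(w|D_m) for document m."""
    M = len(p_w_d)
    p_d = p_d or [1.0 / M] * M       # uniform document prior by default
    vocab = {w for d in p_w_d for w in d}
    def joint(w):
        return sum(p_d[m] * p_w_d[m].get(w, 0.0)
                   * math.prod(p_w_d[m].get(wl, 0.0) for wl in sentence)
                   for m in range(M))
    z = sum(joint(w) for w in vocab)
    return {w: joint(w) / z for w in vocab} if z else {}
```

The resulting distribution can then be interpolated with the original sentence model and the background model, as in the P_new formula above.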