Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering Hongyuan Zha Department of Computer Science

Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering

Hongyuan ZhaDepartment of Computer Science & Engineering

Pennsylvania State UniversityUniversity Park, PA 16802

[email protected]

SIGIR ’02

Introduction Informally, the goal of text summarization is to take a

textual document, extract content from it and present important content to the user in a condensed form and in a manner sensitive to the user’s or application’s needs.

Two basic approach to sentence extraction Supervised approaches need human-generated summary

extracts for feature extraction and parameter estimation, sentence classifiers are trained using human-generated sentence-summary pairs as training examples.

We adopt the unsupervised approach, explicitly model both keyphrase and the sentences that contain them using weighted undirected and weighted bipartite graphs.

Introduction First cluster sentences of a (set of) document(s) into

topical groups and then select the keyphrase and sentences by their saliency score within each group.

Major contributions are Proposing the use of sentence link priors resulted from the

linear order to enhance sentence clustering quality. Develop the mutual reinforcement principle for

simultaneous keyphrase and sentence saliency score computation.

The Mutual Reinforcement Principle For each document we generate two sets of objects: the set

of terms T = {t1,…,tn}, the set of sentences S = {s1,…,sm}.

Build a weighted bipartite graph: if term ti appears in sentence sj, then create an edge and specify nonnegative weight wij between them.

We can simply choose wij to be the number of times ti appears in sj, more sophisticated schemes will be discussed later.

Hence G(T, S, W) is a weighted bipartite graph where W = [wij] is an m-by-n weight matrix, and we wish to compute saliency scores u(ti) and v(sj).

The Mutual Reinforcement Principle Mutual reinforcement principle:

The saliency score of a term is determined by saliency scores of the sentence it appears in, and the saliency scores of a sentence is determined by the saliency scores of the terms it contains. Mathematically,

The Mutual Reinforcement Principle Written in matrix format,

We can rank terms and sentences in decreasing order of their saliency scores and choose top n terms or sentences.

Choose an initial value of v to be the vector of all ones, alternate between the following two steps until convergence:

1. Compute and normalize 2. Compute and normalize σ can be computed as upon convergence.

The Mutual Reinforcement Principle The above weighted bipartite graph can be extended by

adding vertex weights to the sentences and/or the terms.

For example, the weight of a sentence vertex can be increased if it contains certain bonus words.

In general, let DT and DS be two diagonal matrices the diagonal elements of which represent the weights of the term and sentence, we compute the largest singular value triplet {u,σ,v} of the scaled matrix DT W DS.

Clustering Sentences into Topical Groups The saliency score can be more effective if it is applied

within each topical group of a document.

For sentence clustering we build an undirected weighted graph with vertices representing sentences and two sentences si and sj are linked if there are terms shared by them, weight wij indicates the similarity between si and sj, and there are many different ways for their specification.

Sentences arranged in linear order, and near-by sentences tend to be about the same topic.

Topical groups are usually made of sections of consecutive sentences is a strong prior which we call sentence link prior.

Incorporating Sentence Link Priors We call si and sj are near-by if si is followed by sj. A

simple approach to take advantage of sentence link prior is to modify the weights:

We call α the sentence link prior, and use the idea of generalized cross-validation (GCV) to choose α.

Note that incorporating sentence link prior is different from text segmentation: we do allow several sections of consecutive sentences to form a single topical group.

Incorporating Sentence Link Priors For fixed α, apply the spectral clustering technique to

obtain a set of Π*(α). Define to γ(Π) be the number of consecutive sentence segments it generates, we compute a function of αas:

We then select the α that maximizes the above function as the estimated optimal α value.

Sum-of-Squares Cost Function and Spectral Relaxation

In the bipartite graph G(T, S, W) each sentence is represented by a column of W = [w1,…wn] which we call sentence vector. A partition Π can be written as:

E is a permutation matrix, and Wi is m-by-ni.

For a given partition, the sum-of-squares cost function is:

mi is the centroid of the sentence vectors in cluster i, and ni is the number of sentences in cluster i.


Traditional K-means algorithm is iterative and in each iteration the following is performed:

For each w, find mi that is closest to it, associate w with this mi.

Compute a new set of centers. Major drawback is prone to local minima giving rise to very

few data points.

An equivalent formulation can be derived as a matrix trace max. problem, it also makes K-means method easily adaptable to utilizing the sentence link priors.


Let e be a vector of appropriate dimension with all elements equal to one, thus

The sum-of-squares cost function can be written as:

And its minimum is equivalent to:

Let X be an arbitrary orthonormal matrix, we obtain a relaxed matrix trace max. problem:


An extension of the Rayleigh-Ritz characterization of eigenvalues of symmetric matrices shows that the above maximum is achieved by the first k largest eigenvectors of the Gram matrix WTW. We also have the following inequality:

This gives a lower bound for the minimum of the sum-of-squares cost function. In particular, we can replace WTW by WS(α) after incorporating the link strength.


The clustering label assignment is done by QR decomposition with pivoting:

Compute the k eigenvectors Vk = [v1,…,vk] of WS(α) corresponding to the largest k eigenvalues.

Compute the pivoted QR decomposition of VkT as

where Q is a k-by-k orthogonal matrix, R11 is a k-by-k upper triangular matrix, and P is a permutation matrix.

Compute

Then the clustering number is determined by the row index of the largest element in absolute value of the corresponding column of R hat.

Experimental Results Evaluation is a challenging task: human-generated

summarization tends to be different; another approach is to extrinsically evaluate their performance on, for example, document retrieval or text categorization.

We collect 10 documents, manually divide each one into topical groups.

Notice that the clustering is not unique, some clusters can merge into a bigger cluster and some can be split into finer structures.

Experimental Results

Experimental Results In processing the documents,

Delete stop words and applied Porter’s stemming. Construct WS = (wij): each sentence is represented by a

column of W, and wij is equal to the dot-product of si and sj. The sentence vectors are weighted with tf.idf weighting and

normalized to have Euclidean length one.

To measure the quality of clustering, we assume the manually generated section number is the true cluster label.

Here we use a greedy algorithm to compute a sub-optimal solution.

Experimental Results For a sequence of α, apply the spectral clustering

algorithm to the weight matrix WS(α) of the document dna.

We also plot the clustering accuracy against α, and contrast the clustering result with and without sentence link priors.

The clustering algorithm matches the section structure poorly when there is no near-by sentence constraints (i.e. α= 0); with too large α,sentence similarities are overwhelmed by link strength, the results are also poor.

Experimental Results

Experimental Results GCV method is quite efficient at choosing good α. In

Table 1, the estimated α may differ from optimal α but still produces accuracy matches well the best ones.

Experimental Results For the computation of keyphrase and sentence

saliency scores, apply the weight when applying the mutual reinforcement principle. For the i-th sentence apply the weight:

The idea is to mitigate the influence of long sentences by scaling by a factor proportional to the sentence length, at the same time sentences close to the beginning of the document get a small boost.

Experimental Results We use the document dna for illustration. For α= 3.5,

the clustering matches section structure well except for cluster 8.

Sentences 1 to 4 discuss issues related the common ancestor “Eve”, and section 4 with heading Defining mitochondrial ancestors is about the same topic.

Here sentence similarities win over sentence link strength.

We also applied the mutual reinforcement principle to all sentences and extracted first few keywords and sentences.

Conclusions We presented a novel method for simultaneous

keyphrase extraction and generic text summarization. Exploring the sentence link priors embedded in the linear

order of a document to enhance clustering quality. Also develop the mutual reinforcement principle to compute

keyphrases and saliency scores within each topical groups.

Many issues need further investigation: More research needs to be done for choosing optimal α. Other possible way for clustering, for example, using 2-

stage method: 1) segment sentences 2)cluster into topical groups.

Replacing the use of simple terms by noun phrases, this will impact W and WS.

Extension to translingual summarization.

Documents

Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering Hongyuan Zha Department of Computer Science