
2007 International Conference on Natural Language Processing and Knowledge Engineering, Beijing, China, August 30 - September 1, 2007



Clustering of Chinese Sentences Using the SMM Model

Tiansang DU, Xinying XU, Liang CHEN, Baobao CHANG
Institute of Computational Linguistics, Peking University, Beijing, PRC

{tsdu, xuxinying, chenlianglucky, chbb}@pku.edu.cn

Abstract

The purpose of this article is to research a clustering method based on a statistical model and then to deal with the Chinese sentence clustering problem on a bilingual lexicographical platform. In the view of co-occurrence data, we develop the Sentence Clustering Model as a multidimensional SMM and obtain its parameter estimation by the EM algorithm. Based on this model, we present three methods for sentence clustering and use the Rand index to evaluate them through experiments on corpora, with a comparison to the k-means algorithm. We mainly discuss the results with respect to word sense distinction, part-of-speech distinction, and the choice of window size.

1 Introduction

As an illustration of using corpora in lexicography, we developed a bilingual lexicographical platform to help lexicographers make dictionaries with more exactitude and efficiency (Chang Baobao, 2006). With this platform, every lexicographic decision can be made according to real corpora. Therefore mistakes of subjectivity can be avoided and, consequently, the quality of dictionaries can be guaranteed. When lexicographers write a certain word entry and hope to master all of its meanings, they query the corpus and tell the different meanings apart by checking the example sentences the platform returns. However, the scale of corpora nowadays is so large that a search result can easily reach tens of thousands of sentences, and lexicographers cannot read all of them. If we can cluster the search result into groups of sentences with the same meaning, then provide them to the lexicographer group by group, or just provide several sentences chosen from each group, the problem of too many search results can be solved. The purpose of this paper is to study how Chinese sentences can be grouped against the background of corpus-based lexicography.

Clustering is an important method for organizing information. It can be defined as a process of partition in which similar elements are placed in the same group and dissimilar elements in different groups. Basically there are hierarchical clustering and non-hierarchical clustering. Mixture models are a natural framework for clustering and fall into the field of non-hierarchical clustering. The major advantage of this approach compared to similarity-based clustering is that it does not require an external similarity measure, but relies exclusively on the occurrence statistics of the objects.

The Separable Mixture Model (SMM) is a mixture model designed especially for co-occurrence data (Hofmann, 1998). It hypothesizes that co-occurring data are generated from the same class and that, given the specific class, they are conditionally independent. Thereby the joint probability distribution of the SMM is a mixture of separable component distributions. Viewing a sentence as a series of co-occurring words, the SMM can serve as the outline of our Sentence Clustering Model.

The Expectation Maximization (EM) algorithm is a parameter estimation method which falls into the general framework of maximum-likelihood estimation (Collins, 1997) and is applied in cases where part of the data can be considered incomplete, or "hidden". It is essentially an iterative optimization algorithm which, at least under certain conditions, converges to parameter values at a local maximum of the likelihood function.



In the problem of maximum-likelihood estimation, it is sometimes difficult to obtain the parameter estimates directly from the original mixture model, whereas by treating the mixture model as a special case within the EM framework the problem can easily be solved. This special usage of the EM algorithm is exactly why we choose it to estimate the parameters in our sentence clustering method.

In this paper, we extend the SMM from two dimensions to multiple dimensions, obtain the parameter estimation by the EM algorithm, and then apply the derived model to Chinese sentence clustering. In the process of clustering, we investigate the influence of words and parts of speech separately, and we also investigate the effect of window size on the clustering result. We use the Rand index to evaluate our method and compare it with k-means on the same data with the same evaluation index.

The rest of this paper is organized as follows: Section 2 presents the Sentence Clustering Model and its EM solution. Section 3 explains our sentence clustering method. Section 4 introduces the evaluation method. Finally, in Section 5 we apply the derived algorithm to several corpus experiments and discuss the results.

2 Sentence Clustering Model

2.1 The Basic Model

To group all sentences returned by searching for a certain keyword in the corpora, we suppose that each different meaning of the keyword corresponds to a different sentence class and that every word constructing the sentence is generated from that class; moreover, given that specific class, the word at each position is generated independently. So, in the view of co-occurrence data, the model we use for sentence clustering is basically a multidimensional SMM, where the dimension denotes the number of jointly occurring words or part-of-speech tags. We create the Sentence Clustering Model by extending the SMM from two dimensions to multiple dimensions.

Given dimension $M$, which denotes the span of words we take from a sentence for clustering, the general setting is as follows. Suppose $M$ finite sets $X_1 = \{x^1_1, \dots, x^1_{N_1}\}, \dots, X_M = \{x^M_1, \dots, x^M_{N_M}\}$, each composed of the words that ever occurred at that position. We consider $M$ words which occur together as one sample, i.e. $(x^1_{i_1}, \dots, x^M_{i_M}) \in X_1 \times \dots \times X_M$. All samples are numbered and collected in a sample set $S = \{(x^1_{i_1(r)}, \dots, x^M_{i_M(r)}, r) : 1 \le r \le L\}$ with arbitrary ordering.

Following the maximum-likelihood principle, we first specify a parametric model which generates words over $X_1 \times \dots \times X_M$, and then try to identify the parameters which assign the highest probability to the observed sample. Introducing $K$ classes $C_a$, the model generates a sample as follows:

1. choose a class $C_a$ according to a distribution $\pi_a$;

2. select a word $x^j \in X_j$ from a class-specific conditional distribution $p_{i_j|a}$, for each $j \in \{1, \dots, M\}$.

For every subscript $j$, step 2 can be carried out independently; hence the series of words $x^1_{i_1}, \dots, x^M_{i_M}$ are conditionally independent given the class $C_a$. The joint probability distribution of the multidimensional SMM is a mixture of separable component distributions, which can be parameterized by

P(x^1_{i_1}, \dots, x^M_{i_M}) = \sum_{a=1}^{K} \pi_a \prod_{j=1}^{M} p_{i_j|a}.

The number of independent parameters is $\left(\sum_{k=1}^{M} N_k - M + 1\right)K - 1$. The log-likelihood is as follows:

\mathcal{L} = \sum_{r=1}^{L} \log \sum_{a=1}^{K} \pi_a \prod_{j=1}^{M} p_{i_j(r)|a}.
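
To make the model concrete, the following sketch (a minimal NumPy illustration under our own assumptions about data layout; the names pi, p and samples and the integer encoding of words are not taken from the original system) evaluates the mixture probability of every sample and the log-likelihood defined above:

    import numpy as np

    # pi:      shape (K,), class priors, sums to 1
    # p:       list of M arrays; p[j] has shape (N_j, K), and column a of p[j]
    #          is the class-conditional distribution p_{.|a} over the set X_j
    # samples: integer array of shape (L, M); samples[r, j] is the index
    #          i_j(r) of the word observed at position j in sample r

    def sample_probabilities(pi, p, samples):
        """Mixture probability of each sample under the multidimensional SMM."""
        L, M = samples.shape
        probs = np.tile(pi, (L, 1))            # shape (L, K), one row per sample
        for j in range(M):
            probs *= p[j][samples[:, j], :]    # multiply in p_{i_j(r)|a}
        return probs.sum(axis=1)               # mix over the K classes

    def log_likelihood(pi, p, samples):
        """Log-likelihood of the whole sample set S."""
        return np.log(sample_probabilities(pi, p, samples)).sum()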



2.2 The EM Solution

To optimally fit the model to the given sample set $S$, we have to maximize the log-likelihood with respect to the parameters $\theta = (\pi, p)$. To overcome the difficulty of maximizing a log of a sum, a set of unobserved variables is introduced and the corresponding EM algorithm is derived. EM is a general iterative technique for maximum-likelihood estimation, where each iteration is composed of two steps:

1. an Expectation (E) step for estimating the unobserved data or, more generally, averaging the complete-data log-likelihood;

2. a Maximization (M) step, which maximizes the expected log-likelihood computed during the E-step of that iteration.

The EM algorithm is known to increase the likelihood in every step and to converge to a (local) maximum of $\mathcal{L}$ under mild assumptions.

According to the definition of the E-step, the complete-data log-likelihood is given by

\mathcal{L}_c = \sum_{r,a} R_{ra} \left( \log \pi_a + \log p_{i_1(r)|a} + \log p_{i_2(r)|a} + \cdots + \log p_{i_M(r)|a} \right), \qquad (1)

where

R_{ra} = P(a \mid x^1_{i_1(r)}, \dots, x^M_{i_M(r)}) = \frac{\pi_a \, p_{i_1(r)|a} \cdots p_{i_M(r)|a}}{\sum_{a'} \pi_{a'} \, p_{i_1(r)|a'} \cdots p_{i_M(r)|a'}}.

The variable $R_{ra}$ represents the probability that the observation $(x^1_{i_1(r)}, \dots, x^M_{i_M(r)}, r) \in S$ was generated from class $C_a$. The set of these variables is summarized in a matrix $R \in \mathcal{R}$, where

\mathcal{R} = \{ R = (R_{ra}) : \sum_{a=1}^{K} R_{ra} = 1 \}

denotes the space of all matrices in which every single row sums to 1. $R$ effectively partitions the sample set $S$ into $K$ classes.

Now the model parameters have been enlarged to $\theta = (\pi, p, R)$, and every element of the matrix $R$ can be calculated by the following formula in each iteration:

\hat{R}^{(t)}_{ra} = \frac{\hat{\pi}^{(t)}_a \, \hat{p}^{(t)}_{i_1(r)|a} \cdots \hat{p}^{(t)}_{i_M(r)|a}}{\sum_{a'} \hat{\pi}^{(t)}_{a'} \, \hat{p}^{(t)}_{i_1(r)|a'} \cdots \hat{p}^{(t)}_{i_M(r)|a'}}. \qquad (2)

The M-step is obtained by differentiating (1), using (2) as an estimate for $R_{ra}$ and imposing the normalization constraints by the method of Lagrange multipliers. The iteration formulas are as follows:

\hat{\pi}^{(t+1)}_a = \frac{1}{L} \sum_{r} \hat{R}^{(t)}_{ra}, \qquad
\hat{p}^{(t+1)}_{i|a} = \frac{\sum_{r:\, i_j(r)=i} \hat{R}^{(t)}_{ra}}{\sum_{r} \hat{R}^{(t)}_{ra}}, \qquad
\hat{R}^{(t+1)}_{ra} = \frac{\hat{\pi}^{(t+1)}_a \, \hat{p}^{(t+1)}_{i_1(r)|a} \cdots \hat{p}^{(t+1)}_{i_M(r)|a}}{\sum_{a'} \hat{\pi}^{(t+1)}_{a'} \, \hat{p}^{(t+1)}_{i_1(r)|a'} \cdots \hat{p}^{(t+1)}_{i_M(r)|a'}}. \qquad (3)

Iterating the E-step and M-step, the parameters converge to a local maximum of the likelihood.
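
A single EM sweep for this model can be sketched as follows, reusing the encoding of the previous sketch (the function names and array shapes are again our own illustrative choices, not the authors' implementation); e_step corresponds to formula (2) and m_step to formula (3):

    def e_step(pi, p, samples):
        """Formula (2): posterior R[r, a] of class a given sample r."""
        L, M = samples.shape
        R = np.tile(pi, (L, 1))
        for j in range(M):
            R *= p[j][samples[:, j], :]
        R /= R.sum(axis=1, keepdims=True)      # every row now sums to 1
        return R

    def m_step(R, samples, sizes):
        """Formula (3): re-estimate pi and the class-conditional distributions.
        sizes[j] is N_j, the number of distinct items observed at position j."""
        L, M = samples.shape
        pi = R.sum(axis=0) / L                 # new class priors
        p = []
        for j in range(M):
            pj = np.zeros((sizes[j], R.shape[1]))
            np.add.at(pj, samples[:, j], R)    # sum R[r, a] over r with i_j(r) = i
            p.append(pj / R.sum(axis=0))       # normalise each column over X_j
        return pi, p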

3 Sentence Clustering Method

3.1 Three Methods

When focusing on a specific sentence clustering problem, we need to fix the size and the data type of a sample.

The size of a sample is the number of objects that we collect for clustering, which equals the dimension $M$. We consider that data in the neighborhood of the keyword have an equal effect on the clustering result. The window size is half the neighborhood, i.e. $M/2$, and it can be changed flexibly. Setting the window size to the length of the sentence, we cluster with the whole sentence; setting the window size relatively small, we emphasize only the influence of a near neighborhood. For a specific assignment we can change the window size to determine the most suitable one, i.e. the window size which achieves the best clustering result.

The data type of a sample can be either words or part-of-speech tags. To explore the clustering result, we develop three methods according to the different usage of the sample set:

The method that takes only words as a sample is called method <only-word>.

The method that takes only part-of-speech tags as a sample is called method <only-pos>.

The method that takes both the words and the part-of-speech tags as a sample is called method <word-pos>. The window size in this method is twice that in the former two methods.
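
How a sample is assembled for the three methods can be illustrated by the sketch below; the tokenisation format (a list of (word, pos) pairs), the padding of short sentences, and the exclusion of the keyword itself are assumptions made for the sketch, not details taken from the paper:

    PAD = ("<pad>", "<pad>")   # filler for sentences shorter than the window

    def build_sample(tokens, key_idx, window, mode):
        """tokens: list of (word, pos) pairs; key_idx: position of the keyword;
        mode: "only-word", "only-pos" or "word-pos"."""
        left = tokens[max(0, key_idx - window):key_idx]
        right = tokens[key_idx + 1:key_idx + 1 + window]
        neigh = ([PAD] * (window - len(left)) + left
                 + right + [PAD] * (window - len(right)))
        words = [w for w, _ in neigh]
        tags = [t for _, t in neigh]
        if mode == "only-word":
            return words                # sample size M = 2 * window
        if mode == "only-pos":
            return tags                 # sample size M = 2 * window
        return words + tags             # <word-pos>: twice as many positions

The positions of the resulting sample define the sets $X_1, \dots, X_M$, and each distinct item observed at a position is mapped to an integer index before being passed to the model above.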

3.2 The EM Iteration

Given a randomly initialized matrix $R$, the EM iteration is calculated according to formula (3). After each iteration we obtain a new $R$ matrix, which represents a soft clustering result. We define the stopping rule with the following two conditions:



1. the difference between the log-likelihoods of the two most recent iterations is small enough;

2. the difference between the $R$ matrices of the two most recent iterations is small enough, where the difference between two matrices is defined as the maximum difference between the elements at the same position.

The first condition controls the convergence of the log-likelihood, and the second condition controls the convergence of the parameter $R$. When the iteration is over, the largest element in each row of the $R$ matrix indicates the cluster to which the corresponding sentence belongs.

The time complexity is $O(tkns)$, where $t$ denotes the number of iterations, $k$ the number of clusters, $n$ the number of sentences, and $s$ the span of the words we collect from a sentence.
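
Putting the pieces together, the full iteration with the two stopping conditions can be sketched as follows, reusing e_step, m_step and log_likelihood from the earlier sketches (the initialisation, the tolerances and the iteration cap are arbitrary illustrative choices):

    def run_em(pi, p, samples, sizes, tol_ll=1e-6, tol_R=1e-4, max_iter=200):
        R = e_step(pi, p, samples)
        ll = log_likelihood(pi, p, samples)
        for _ in range(max_iter):
            pi, p = m_step(R, samples, sizes)
            R_new = e_step(pi, p, samples)
            ll_new = log_likelihood(pi, p, samples)
            # condition 1: change of the log-likelihood is small enough
            # condition 2: largest element-wise change of R is small enough
            converged = (abs(ll_new - ll) < tol_ll
                         and np.abs(R_new - R).max() < tol_R)
            R, ll = R_new, ll_new
            if converged:
                break
        clusters = R.argmax(axis=1)    # largest element per row gives the cluster
        return clusters, pi, p, R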

4 Evaluation Method

To evaluate the clustering result, we pick some specific keywords and get the instances by searching for each keyword in the corpus. To obtain a test set, we label the instances into classes manually by human judgment, so as to compare them with our automatic clustering result.

To calculate the agreement between our results and the correct labels, we make use of the Rand index, which provides a measure of agreement between two partitions, $P_1$ and $P_2$, of the same data set $S$ (Rand, 1971). Each partition is viewed as a collection of $n(n-1)/2$ pairwise decisions, where $n$ is the size of $S$. For each pair of samples $s_i$ and $s_j$ in $S$, a partition either assigns them to the same cluster or to different clusters. There are only four situations.

Right connections:

1. $s_i$ and $s_j$ are in the same cluster in $P_1$ and in the same cluster in $P_2$.

2. $s_i$ and $s_j$ are in different clusters in $P_1$ and in different clusters in $P_2$.

Wrong connections:

3. $s_i$ and $s_j$ are in the same cluster in $P_1$ but in different clusters in $P_2$.

4. $s_i$ and $s_j$ are in different clusters in $P_1$ but in the same cluster in $P_2$.

Let $a$ be the number of decisions where $s_i$ and $s_j$ are in the same cluster in both $P_1$ and $P_2$, and let $b$ be the number of decisions where the two samples are placed in different clusters in both partitions. Then $a + b$ is the number of right connections, and the total agreement can be calculated as

Rand(P_1, P_2) = \frac{a + b}{n(n-1)/2}.

$Rand(P_1, P_2) = 1$ if and only if our result is the same as the correct answer.
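
A direct implementation of this pairwise count is sketched below (any hashable cluster labels will do; the function name is ours):

    from itertools import combinations

    def rand_index(labels1, labels2):
        """Rand (1971): share of the n(n-1)/2 pairwise decisions on which
        the two partitions of the same n samples agree."""
        n = len(labels1)
        agree = 0
        for i, j in combinations(range(n), 2):
            same1 = labels1[i] == labels1[j]
            same2 = labels2[i] == labels2[j]
            if same1 == same2:     # covers both a (same/same) and b (different/different)
                agree += 1
        return agree / (n * (n - 1) / 2)

For example, rand_index([0, 0, 1, 1], [1, 1, 0, 2]) gives 5/6, since only the pair formed by the last two samples is judged differently by the two partitions.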

5 Experiments

All the instances used in this section are chosen from the People's Daily Corpus and the Chinese part of our parallel corpora.

The instances shown in the next two subsections have different emphases: Section 5.1 focuses on the clustering result for a polysemous word, and Section 5.2 focuses on the result for a word with multiple part-of-speech tags.

5.1 Word Sense Distinction

We choose 43 sentences retrieved with a polysemous keyword. The sentences are labeled manually into 6 clusters, meaning that the human judge distinguishes 6 senses. For our clustering algorithm, we set the window size to 5 and the total number of clusters to 6, the same as the human judgment. The clustering result in one experiment is as follows:

Method <only-word>: Rand = 0.724474
Method <only-pos>: Rand = 0.723588
Method <word-pos>: Rand = 0.733998

We use the k-means algorithm (Lloyd's) to compare with our method on the same data with the same evaluation index:

Method <k-means>: Rand = 0.509413

Compared with k-means, the Rand index of our methods is at least 20% higher.

5.2 Part-of-Speech Distinction

We choose 904 sentences retrieved with another keyword, whose occurrences carry three different part-of-speech tags (100, 779, and 25 occurrences, respectively). The sentences are labeled into 3 clusters according to the part-of-speech tags of the keyword. The window size is set to 3 and the total number of clusters to 3.



The clustering result in one experiment is as follows:

Method <only-word>: Rand = 0.415951
Method <only-pos>: Rand = 0.703025
Method <word-pos>: Rand = 0.425506
Method <k-means>: Rand = 0.738953

Among our three methods, only the Rand index of method <only-pos> is close to that of k-means; the Rand index of the other two methods is not that good on this part-of-speech distinction problem.

5.3 Window Size Analysis

In this section, we analyze how to choose the window size. The most suitable window size is the one that yields the highest Rand index, and it is found by several trials.

Figure 1 shows how the Rand index changes with the window size for the word sense distinction instance used in Section 5.1, and Figure 2 shows how the Rand index changes with the window size for the part-of-speech distinction instance used in Section 5.2.

Figure 1: Rand index versus window size (3 to 11) for the instance of Section 5.1, one curve per method (<only-word>, <only-pos>, <word-pos>).

As we can see in Figure 1, each Rand index curve has multiple peaks; the most noticeable peak is reached at window size 5, and between window sizes 4 and 10 the curves fluctuate. From this instance we can guess that, for the word sense distinction problem, words far from the keyword may still have a considerable effect on the clustering result.

Figure 2: Rand index versus window size (3 to 9) for the instance of Section 5.2, one curve per method.

In Figure 2, the three curves are basically descending, and the curves of method <only-word> and method <word-pos> coincide. Window size 3 gives the highest Rand index. This means that for the part-of-speech distinction problem the words near the keyword are enough to ascertain its cluster, while words far from the keyword contribute little.

As a general instance, we choose 148 sentences retrieved with a third keyword, whose occurrences carry four different part-of-speech tags (104, 9, 1, and 34 occurrences, respectively). This instance involves both the word sense distinction problem and the part-of-speech distinction problem. The sentences are labeled into 9 clusters, and the total number of clusters is set to 9. Figure 3 shows how the Rand index changes with the window size.

Figure 3: Rand index versus window size (3 to 11) for this instance, one curve per method.

Intuitively, in this experiment a larger window size is not always better. When the window size exceeds 7, the curves no longer change remarkably. Moreover, the most suitable window size differs between methods: it is 5 for method <only-word> and 3 for methods <only-pos> and <word-pos>.

Generally speaking, the setting of the window size should depend on what the keyword is, and the best window size should be found by several trials.

5.4 Comparison of the Three Methods

We have applied all three methods to all the instances. For the word sense distinction instance in Section 5.1, the Rand indices of the three methods are quite similar, so we cannot tell which one is better.



For the part-of-speech distinction problem in Section 5.2, the Rand index of method <only-pos> is significantly higher than that of the other two methods, the Rand index of method <word-pos> is in the middle, and the Rand index of method <only-word> is the lowest. That means that for this instance the meaning of the keyword can be distinguished quite well using the part-of-speech information, whereas using only word information the result is comparatively lower. This can be explained by the fact that the correct answer set is labeled according to the part-of-speech tags of the keyword, which are closely related to the parts of speech of the words in the neighborhood. Due to the variability of the word at each position, the degree of distinction achieved by method <only-word> is relatively small. Method <word-pos> ranks in the middle because the sample it uses is the combination of those of method <only-word> and method <only-pos>.

6 Conclusion

In this paper we study a sentence clustering method based on a statistical model and the EM algorithm. We develop the Sentence Clustering Model as a multidimensional SMM and obtain its parameter estimation by the EM algorithm. Based on this model, we present three methods for sentence clustering and evaluate them by clustering experiments on corpora. The results show that, compared with the k-means algorithm, our method performs better on the word sense distinction problem. Although there are mistakes in the automatic clustering result, on the lexicographical platform the final decisions are made by lexicographers, who can simply ignore such mistakes; therefore the sentence clustering method we developed in this paper can be applied to the lexicographical platform.

Acknowledgement

We would like to thank Mr. Zhu Danqing and Mr. Ding Weiwei for their help with data preparation. This paper has been supported by the Chinese SSFC project (#06BYY048) and the Chinese NSFC project (#60303003).

References

Chang Baobao. 2006. A Bilingual Lexicography Platform Based on Corpora. Lexicographical Studies, 2006, No. 3 (in Chinese).

Collins, M. 1997. The EM Algorithm. Technical report, Department of Computer and Information Science, University of Pennsylvania.

Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society.

Hofmann, T., and Puzicha, J. 1998. Statistical Models for Co-occurrence Data. Memorandum, Massachusetts Institute of Technology Artificial Intelligence Laboratory, Cambridge, Massachusetts.

Rand, W. M. 1971. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association.

Wagstaff, K., et al. 2001. Constrained K-means Clustering with Background Knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning.

Wu, C. F. Jeff. 1983. On the Convergence Properties of the EM Algorithm. The Annals of Statistics.