
Computer Speech and Language (1997) 11, 307–320

Probabilistic word classification based on a context-sensitive binary tree method

Jun Gao* and XiXian Chen
Information Technology Laboratory, Beijing University of Posts and Telecommunications,

P.O. Box 103, Beijing University of Posts and Telecommunications, No. 10, Xi Tu Cheng Street, Hai Dian District, Beijing, 100088, China

Abstract

A corpus-based, statistically oriented Chinese word classification can be regarded as a fundamental step for automatic or non-automatic, monolingual natural language processing systems. Word classification can alleviate the problem of data sparseness and requires far fewer parameters. So far, much related work on word classification has been done, all of it based on some similarity metric. We use average mutual information as the global similarity metric for classification. The clustering process is top-down splitting, and the binary tree grows as the splitting proceeds. In natural language, the effects of the left neighbours and right neighbours of a word are asymmetric. To utilize this directional information, we introduce a left–right binary tree and a right–left binary tree to represent this property. Probabilities are also introduced in our algorithm to merge the resulting classes from the left–right and the right–left binary trees. We also use the resulting classes in experiments on a word class-based language model. Some resulting classes and the perplexity of a word class-based language model are presented. © 1997 Academic Press Limited

1. Introduction

Word classification plays an important role in computational linguistics. Many tasks in computational linguistics, whether they use statistical or symbolic methods, reduce the complexity of the problem by dealing with classes of words rather than individual words.

We know that some words share similar sorts of linguistic properties; thus, they should belong to the same class. Some words have several functions; thus, they could belong to more than one class. The questions are: what attributes distinguish one word from another? How should we group similar words together so that the partition of word spaces is most likely to reflect the linguistic properties of language? What meaningful label or name should be given to each word group?

∗Author for correspondence. E-mail: [email protected]

0885–2308/97/040307+14 $25.00/0/la970033 1997 Academic Press Limited


These questions constitute the problem of finding a word classification. At present, no method can find the optimal word classification. However, researchers have been trying hard to find sub-optimal strategies which lead to useful classifications.

From a practical point of view, word classification addresses questions of data sparseness and generalization in statistical language models. In particular, it can be used as an alternative to grammatical part-of-speech tagging (Cutting, Kupiec, Pederson & Sibun, 1992; Brill, 1993) in statistical language modelling (Huang, Alleva, Hwang, Lee & Rosenfeld, 1993; Rosenfeld, 1994), because Chinese language models using part-of-speech information have had only very limited success (e.g. Chang & Chen, 1993; Lee, Dung, Lai & Chang Chien, 1993). The reasons why there are so many difficulties in Chinese part-of-speech tagging are described by Chang and Chen (1995) and Zhao (1995).

Much related work on word classification has been done. This work is based on some similarity metric (Bahl, Brown, DeSouza & Mercer, 1989; Brown, Pietra, DeSouza & Mercer, 1992).

Brill (1993) and Pop (1996) present transformation-based tagging. Before a part-of-speech tagger can be built, word classification is performed to help choose a set of part-of-speech tags. They use the sum of two relative entropies obtained from neighbouring words as the similarity metric to compare two words.

Schutze (1995) represents the long-distance left and right context of a word as a left vector and a right vector, each of dimension 50. He uses the cosine as the metric to measure the similarity between two words. To cope with data sparseness, he applies a singular value decomposition. Compared with the method of Brill (1993), Schutze (1995) takes 50 neighbours into account for each word.

Chang and Chen (1995) proposed a simulated annealing method, as did Jardino and Adda (1993). The perplexity, which is the inverse of the per-word probability over the whole text, is measured. The new value of the perplexity and a control parameter Cp (Metropolis algorithm) decide whether a new classification (obtained by moving only one word from its class to another, both word and class being randomly chosen) will replace the previous one. Compared with the two methods described above, this method attempts to optimize the clustering using perplexity as a global measure.

Pereira, Tishby and Lee (1993) investigate how to factor word association tendencies into associations of words to certain hidden senses, classes and associations between the classes themselves. More specifically, they model senses as probabilistic concepts or clusters C with corresponding cluster membership probabilities P(C|w) for each word w. That is, while most other class-based modelling techniques for natural language rely on "hard" Boolean classes, Pereira et al. (1993) propose a method for "soft" class clustering. They suggest a deterministic annealing procedure for clustering. But, as stated in their paper, they only consider the special case of classifying nouns according to their distribution as direct objects of verbs.

McMahon (1994) presents a structural word classification which is different from the methods described above. He uses average mutual information as the similarity metric and treats all the words in the vocabulary as a single unit at the beginning of the process. He then splits this unit into two units according to average mutual information; by repeating this procedure, the final word classification is obtained. Since the splitting procedure is restricted to trees, as opposed to arbitrary directed graphs, there is no mechanism for merging two or more nodes in the tree-growing process. As a result, McMahon considers only the contiguous bigram information, that is, bigrams of the form ⟨w_{i−1}, w_i⟩. Unlike the methods introduced by Brill (1993) and Schutze (1995), McMahon takes into account only the information provided by the left neighbour of the word.

To address these problems and utilize the advantages of the methods presented above, we put forward a new algorithm to classify words automatically.

2. Chinese word classification method

2.1. Basic idea

Like McMahon (1994), we apply a top-down binary splitting technique to all the words, using average mutual information as the similarity metric. This method has its merits: the top-down technique can explicitly represent the hierarchy information, and the position of a word in class space can be obtained without reference to the positions of other words. The bottom-up technique, by contrast, treats every word in the vocabulary as one class, merges two classes in this vocabulary according to a certain similarity metric, and then repeats the merging process until the required number of classes is obtained.

2.1.1. Theoretical basis

Brown et al. (1992) have shown that any classification system whose average class mutual information is maximized will lead to class-based language models of lower perplexities.

The concept of mutual information, taken from information theory, was proposed as a measure of word association (Jelinek, Mercer & Roukos, 1990, 1992; Dagan, Marcus & Markovitch, 1995). It reflects the strength of the relationship between words by comparing their actual co-occurrence probability with the probability that would be expected by chance. The mutual information of two events x_1 and x_2 is defined as follows:

$$I(x_1, x_2) = \log_2 \frac{P(x_1, x_2)}{P(x_1)\,P(x_2)} \qquad (1)$$

where P(x_1) and P(x_2) are the probabilities of the events, and P(x_1, x_2) is the probability of the joint event. If there is a strong association between x_1 and x_2, then P(x_1, x_2) ≫ P(x_1)P(x_2) and, as a result, I(x_1, x_2) ≫ 0. If there is a weak association between x_1 and x_2, then P(x_1, x_2) ≈ P(x_1)P(x_2) and I(x_1, x_2) ≈ 0. If P(x_1, x_2) ≪ P(x_1)P(x_2), then I(x_1, x_2) ≪ 0. We consider any negative value to be 0. That is, although the mutual information value may be negative for certain x_1 and x_2, the average mutual information must be larger than or equal to 0. If the corpora are not extremely large, all possible combinations of x_1 and x_2 may not occur; thus, the average mutual information calculated from such limited corpora may be negative. This result is in conflict with the theorem, so we set I(x_1, x_2) to 0 if P(x_1, x_2) = 0.

The average mutual information I_a between events x_1, x_2, . . . , x_N is similarly defined:

$$I_a = \sum_{i=1}^{N}\sum_{j=1}^{N} P(x_i, x_j) \log_2 \frac{P(x_i, x_j)}{P(x_i)\,P(x_j)}. \qquad (2)$$


Rather than estimate the relationship between words, we measure the mutual information between classes. Let C_i, C_j be the classes, i, j = 0, 1, 2, . . . , N, where N denotes the number of classes.

Then the average mutual information between classes C_1, C_2, . . . , C_N is

$$I_a = \sum_{i=1}^{N}\sum_{j=1}^{N} P(C_i, C_j) \log_2 \frac{P(C_i, C_j)}{P(C_i)\,P(C_j)}. \qquad (3)$$

2.1.2. The basic algorithm

The complete process is described as follows. We split the vocabulary into a binary tree, and we consider only a one-dimension neighbour.

(1) Take all the words in the vocabulary as one class and take this level in the binary tree as 0. That is, Level = 0, Branch = 0, Class(Level, Branch) = VocabularySet. Then, Level = Level + 1.

(2) Class(Level, Branch) = Class(Level − 1, Branch/2). Old_Ia = 0. Class(Level, Branch + 1) = empty. Select a word w_i ∈ Class(Level, Branch).

(3) Move this word to Class(Level, Branch + 1). Calculate Ia(w_i).

(4) Move this word back to Class(Level, Branch). If all words in Class(Level, Branch) have been selected, then go to (5); otherwise select another unselected word w_i ∈ Class(Level, Branch) for Class(Level, Branch + 1) and go to (3).

(5) Move the word having Maximum(Ia) from Class(Level, Branch) to Class(Level, Branch + 1).

(6) If Maximum(Ia) > Old_Ia, then: Old_Ia = Maximum(Ia). Select a word w_i ∈ Class(Level, Branch), go to (3).

(7) Branch = Branch + 2. If Branch < 2^Level, go to (2).

(8) Level = Level + 1, Branch = 0. If Level < the predefined classes number, go to (2); otherwise end.

A sketch of this splitting step in code is given below.
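To make the splitting step concrete, the following Python sketch greedily grows the right branch by moving, one word at a time, the word whose move most increases the average class mutual information of Equation (3). It is an illustration only, not the authors' code: it assumes plain contiguous bigram counts (a `Counter` keyed by word pairs), estimates the class marginals from those same bigram counts rather than from Equations (4)–(6), and takes `total` to be the total number of counted bigrams.

```python
# Illustrative sketch (not the authors' implementation) of one top-down split
# driven by average class mutual information, Eq. (3).
import math
from collections import Counter

def average_mutual_information(classes, bigrams, total):
    """I_a = sum_ij P(Ci,Cj) log2[ P(Ci,Cj) / (P(Ci) P(Cj)) ]; zero-count pairs add 0."""
    word2class = {w: k for k, ws in enumerate(classes) for w in ws}
    joint, left_marg, right_marg = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        if w1 in word2class and w2 in word2class:
            ci, cj = word2class[w1], word2class[w2]
            joint[ci, cj] += n
            left_marg[ci] += n
            right_marg[cj] += n
    ia = 0.0
    for (ci, cj), n in joint.items():
        p = n / total
        pi, pj = left_marg[ci] / total, right_marg[cj] / total
        ia += p * math.log2(p / (pi * pj))
    return ia

def split_class(words, bigrams, total):
    """Greedily move words from `left` to `right` while I_a keeps improving
    (steps (2)-(6) of the basic algorithm)."""
    left, right = set(words), set()
    old_ia = 0.0                      # "Old_Ia"
    while left:
        best_word, best_ia = None, float("-inf")
        for w in left:                # try every remaining word as the move
            ia = average_mutual_information([left - {w}, right | {w}], bigrams, total)
            if ia > best_ia:
                best_word, best_ia = w, ia
        if best_ia <= old_ia:         # step (6): stop when no improvement
            break
        left.discard(best_word)
        right.add(best_word)
        old_ia = best_ia
    return left, right
```

Applying `split_class` recursively to each resulting branch, level by level, yields the binary tree; the cost noted below comes from recomputing I_a for every candidate word at every move.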

From the algorithm described above, we can conclude that the computation time is of order O(hV³), for tree height h and vocabulary size V, when moving words from one class to another.

If the height of the binary tree is h, the number of all possible classes will be 2^h. During the splitting process, especially at the bottom of the binary tree, some classes may be empty because the classes above them cannot be split further.

2.2. Improvement to the basic algorithm

2.2.1. Length of neighbour dimensions

As mentioned in the Introduction, Brill (1993) and McMahon (1994) consider only a one-dimension neighbour, while Schutze (1995) considers 50 dimension neighbours. How long, then, should the neighbour dimension be? For the long-distance bigrams discussed by Huang et al. (1993) and Rosenfeld (1994), training-set perplexity is low for the conventional bigram (d = 1), and it increases significantly as they move through d = 2, 3, 4 and 5. For d = 6, . . . , 10, training-set perplexity remains at about the same level. Thus, Huang et al. (1993) conclude that some information does indeed exist in the more distant past, but it is spread thinly across the entire history. We ran the same test on Chinese and obtained similar results. Therefore, 50 dimensions is too long, and the resulting search space is computationally prohibitive, whereas one dimension is so short that much information is lost.

In this paper, we let d = 2. Then P(C_i), P(C_j) and P(C_i, C_j) can be calculated as follows:

$$P(C_i) = \frac{\sum_{w \in C_i} N_w}{N_{total}} \qquad (4)$$

$$P(C_j) = \frac{\sum_{w \in C_j} N_w}{N_{total}} \qquad (5)$$

$$P(C_i, C_j) = \frac{\sum_{w_1 \in C_i} \sum_{w_2 \in C_j} N_{w_1 w_2}}{N_{total}\, d} \qquad (6)$$

where N_total is the total number of times that words in the vocabulary occur in the corpus; d is the distance considered in the corpus; N_w is the total number of times that word w occurs in the corpus; and N_{w1 w2} is the total number of times that the word pair w_1 w_2 occurs in the corpus within the distance d.
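As an illustration of how these counts can be collected, the sketch below (hypothetical helper names, not from the paper) scans a tokenized corpus once, accumulates N_w and the within-distance-d pair counts N_{w1 w2}, and then evaluates Equations (4)–(6) for two given classes.

```python
# Sketch of estimating Eqs. (4)-(6) from a tokenized corpus with d = 2.
from collections import Counter

def count_statistics(corpus, vocabulary, d=2):
    """Return N_total, N_w, and N_{w1 w2} (co-occurrences within distance d)."""
    unigrams, pairs = Counter(), Counter()
    n_total = 0
    for sentence in corpus:                 # corpus: list of token lists
        tokens = [w for w in sentence if w in vocabulary]
        n_total += len(tokens)
        unigrams.update(tokens)
        for i, w1 in enumerate(tokens):
            for w2 in tokens[i + 1 : i + 1 + d]:
                pairs[w1, w2] += 1          # w1 occurs to the left of w2
    return n_total, unigrams, pairs

def class_probabilities(ci, cj, n_total, unigrams, pairs, d=2):
    p_ci = sum(unigrams[w] for w in ci) / n_total                           # Eq. (4)
    p_cj = sum(unigrams[w] for w in cj) / n_total                           # Eq. (5)
    p_ci_cj = sum(pairs[w1, w2] for w1 in ci for w2 in cj) / (n_total * d)  # Eq. (6)
    return p_ci, p_cj, p_ci_cj
```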

2.2.2. Context-sensitive processing

Brill (1993) used the sum of two relative entropies as the similarity metric to compare two words. The word's neighbours were treated equally, without considering the possibly different influences of the left neighbour and the right neighbour on the word. But in natural language the effects of the left neighbour and the right neighbour are asymmetric; that is, the effect is directional. For example, in the sentence "I ate an apple", the Chinese words for "I" and "apple" have different functions: we cannot say "An apple ate I". So, it is necessary to introduce a similarity metric which reflects this directional property. Applying this idea in our algorithm, we create two binary trees to represent the two directions. One binary tree is produced to represent the word-relation direction from left to right, and the other represents the word-relation direction from right to left. The former (from left to right) is the default circumstance described in Section 2.1.2.

A similar idea about the directional property is also presented by Dagan et al. (1995). They define a similarity metric between two words, based on mutual information, that reflects this directional property and is used to determine the degree of similarity between two words. But the metric is not transitive. That is, although they concluded that a similarity metric was necessary to measure the similarity of two words, it is directional, and Dagan et al. (1995) do not provide a mechanism to mix these directional characteristics when clustering words into equivalence classes. They only use this metric to relieve the data sparseness of the training corpus.

To reflect the different influences of the left neighbour and the right neighbour of a word, we introduce a probability for each word w with respect to every class. That is, for the classes produced by the binary tree that represents the word-relation direction from left to right, we distribute a probability P_lr(C_i|w) for each word w corresponding to every class C_i, so that P_lr(C_i|w) reflects the degree to which the word w belongs to class C_i. For the classes of the binary tree that represents the word-relation direction from right to left, P_rl(C_i|w) is calculated likewise.

Mutual information can be explained as the ability to dispel the uncertainty of the information source, and the entropy of the information source is defined as that uncertainty. So, the probability that the word w belongs to class C_i can be presented as follows:

$$P_{c_i} = \frac{I(C_i, \bar{C}_i)}{H(C_i)} \qquad (7)$$

where I(C_i, C̄_i) is the mutual information between the class C_i and its sibling class C̄_i in the same binary branch, and H(C_i) is the entropy of class C_i. Equation (7) is derived from the following idea: if we successfully move a word w from C̄_i to C_i, the word w can provide mutual information I(C_i, C̄_i) for the class C_i, while the total information, or entropy, of the class C_i is H(C_i). Therefore, P_{c_i} can be used to represent the probability of the word w belonging to the class C_i. All words which belong to the class C_i are assigned the same probability P_{c_i}, and this probability is used to multiply their previous values, as described below.

Thus, 1 − P_{c_i} denotes the probability that word w does not belong to the class C_i; that is, in the binary tree, 1 − P_{c_i} denotes the probability of the other branch class corresponding to C_i. The average mutual information is small and, therefore, it is possible that P_{c_i} is less than 1 − P_{c_i}. To avoid assigning the lesser probability to a word that is placed in this class and the higher probability to a word that is not, we assign the probability 1 to the words placed in the class.

Thus, for each class at a certain level of the binary tree, we multiply either 1 or 1 − P_{c_i} with its original probability, where C_i is the branch class opposite to the one the word does not belong to.

The description above considers only the word w belonging to a certain class at a certain level, without considering the effect of its upper levels. To obtain the real probability of the word w belonging to a certain class, all the probabilities belonging to its ancestors should be multiplied together.

The distribution of the probability is not optimal, but it reflects the degree to which a word belongs to a class. It should be noted that Σ_i P(C_i|w) must be normalized both for the left–right and the right–left results, and the normalized results of the left–right and the right–left binary trees must also be normalized together.
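One possible reading of this construction is sketched below in Python (our interpretation, with hypothetical names, not the authors' code): each internal node stores the split probability of Equation (7); walking from the root to every leaf, the branch containing the word contributes a factor of 1 and the opposite branch contributes 1 − P_{c_i}; the products are then normalized over the leaves to give P(C|w).

```python
class Node:
    """Binary-tree node: leaves hold the words assigned to that class,
    internal nodes hold the split probability p_ci of Eq. (7)."""
    def __init__(self, words=None, p_ci=0.0, left=None, right=None):
        self.words = set(words or [])
        self.p_ci = p_ci
        self.left, self.right = left, right

def leaf_words(node):
    """All words stored at or below `node`."""
    if node.left is None and node.right is None:
        return node.words
    return leaf_words(node.left) | leaf_words(node.right)

def soft_memberships(root, word):
    """Return {leaf: P(C|w)} for every leaf class C, normalized to sum to one."""
    scores = {}
    def walk(node, prob):
        if node.left is None and node.right is None:
            scores[node] = prob
            return
        in_left = word in leaf_words(node.left)
        # branch containing the word keeps factor 1; the other gets (1 - p_ci)
        walk(node.left, prob * (1.0 if in_left else 1.0 - node.p_ci))
        walk(node.right, prob * (1.0 if not in_left else 1.0 - node.p_ci))
    walk(root, 1.0)
    total = sum(scores.values()) or 1.0
    return {leaf: score / total for leaf, score in scores.items()}
```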

2.2.3. Probabilistic bottom-up class merging

Since there is a directional property between words, transitivity is not satisfied between the different directions. That is, if we did not introduce the probabilities P_lr(C_i|w) and P_rl(C_i|w), we could not merge the classes, because there is no transitivity between a class in which the word relation runs from left to right and a class in which the word relation runs from right to left. Because the relation is directional, if we cannot provide a mechanism to mix this hard directional characteristic, we cannot cluster words into equivalence classes. For example, suppose the words for "we" and "you" are contained in one class derived from the left–right binary tree, and the words for "you" and "apple" belong to another class derived from the right–left binary tree. This does not mean that the words for "we" and "apple" belong to one class. But when we introduce the probability, unlike the intransitive similarity metric presented by Dagan et al. (1995), the classes generated by the two binary trees can be merged, because the probabilities make the "hard" intransitivity "soft".

Although this top-down splitting method has the advantages mentioned above, it has obvious shortcomings, which Magerman (1994) describes in detail. Since the splitting procedure is restricted to trees, as opposed to arbitrary directed graphs, there is no mechanism for merging two or more nodes in the tree-growing process. That is to say, if we assign words to the wrong classes in the global sense, we are no longer able to move them back. It is therefore difficult to merge the classes obtained by the left–right binary tree and the right–left binary tree while the trees are growing. To solve this problem, we apply a bottom-up merging method to the resulting classes.

A number of different similarity measures can be used. We choose relative entropy, also known as the Kullback–Leibler distance (Brill, 1993; Pereira et al., 1993). Rather than merge two words, we merge two classes, one from the resulting classes generated by the left–right binary tree and one from those generated by the right–left binary tree, and select the merged pair that leads to the maximum value of the similarity metric. This procedure is applied recursively until the required number of classes is reached.

Let P and Q be probability distributions. The Kullback–Leibler distance from P to Q is defined as:

$$D(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}. \qquad (8)$$

The divergence of P and Q is then defined as:

$$\mathrm{Div}(P, Q) = \mathrm{Div}(Q, P) = D(P \,\|\, Q) + D(Q \,\|\, P). \qquad (9)$$

For two words w and w_1, let P_d(w, w_1) be the probability of the word w occurring to the left of w_1 within the distance d. The probabilities P_d(w_1, w), P_d(w, w_2) and P_d(w_2, w) are defined likewise. Let P_lr(C_i|w_1), P_lr(C_i|w_2), P_rl(C_j|w_1) and P_rl(C_j|w_2) be the probabilities of words w_1 and w_2 being contained in the classes C_i and C_j of the left–right and the right–left trees, respectively. R is a value in the interval (0, 0·5). Finally, let:

$$D_{lr}(w_1, w_2) = \mathrm{Div}[P_d(w, w_1), P_d(w, w_2)] \quad \forall w \in \mathrm{Vocabulary}$$

D_lr(w_1, w_2) shows the difference between every w's influence on the words w_1 and w_2 when they are to its right; that is, D_lr(w_1, w_2) is used to measure how much difference there is between w_1 and w_2 in terms of the influence coming from their left. Let:

$$D_{rl}(w_1, w_2) = \mathrm{Div}[P_d(w_1, w), P_d(w_2, w)] \quad \forall w \in \mathrm{Vocabulary}$$

D_rl(w_1, w_2) is used to measure the difference between w_1 and w_2 in terms of the influence coming from their right.

Let

$$IT_{lr} = \Big\{\max_{w_1, w_2}[D_{lr}(w_1, w_2)] - \min_{w_1, w_2}[D_{lr}(w_1, w_2)]\Big\} \times R$$

$$IT_{rl} = \Big\{\max_{w_1, w_2}[D_{rl}(w_1, w_2)] - \min_{w_1, w_2}[D_{rl}(w_1, w_2)]\Big\} \times R$$

where IT_lr represents the criterion for calculating the weight of D_lr(w_1, w_2) and IT_rl represents the criterion for calculating the weight of D_rl(w_1, w_2).

The value of R is chosen by us. It determines how much influence coming from the left and the right of the word is allowed: the bigger the value of R, the greater the influences coming from the left and the right of the word. The appropriate value of R differs with the size of the corpus. For a small corpus, the training data are sparse; to utilize these data efficiently, we allow R to become bigger, so that more influence from the left and the right is admitted. When the corpus is very large, the training data already provide sufficient information about the left and right of every word and, therefore, the value of R should be small in order to admit less influence. Thus, there is no single best weight for all corpora. For convenience, in our experiments we set R to 0·25.

$$f_{lr}(C_i, w, w_1, w_2) = \begin{cases} \min[P_d(w, w_1)P_{lr}(C_i|w_1),\ P_d(w, w_2)P_{lr}(C_i|w_2)] & \text{if } D_{lr}(w_1, w_2) \le \min_{w_1,w_2}[D_{lr}(w_1, w_2)] + IT_{lr} \\[4pt] \max[P_d(w, w_1)P_{lr}(C_i|w_1),\ P_d(w, w_2)P_{lr}(C_i|w_2)] & \text{if } D_{lr}(w_1, w_2) \ge \max_{w_1,w_2}[D_{lr}(w_1, w_2)] - IT_{lr} \\[4pt] \tfrac{1}{2}[P_d(w, w_1)P_{lr}(C_i|w_1) + P_d(w, w_2)P_{lr}(C_i|w_2)] & \text{otherwise} \end{cases}$$

and f_rl(C_j, w, w_1, w_2) can be defined similarly.

We can then define the similarity of w_1 and w_2 as:

$$S(w_1, w_2) = 1 - \frac{1}{2}\Bigg\{\sum_{w \in V} f_{lr}(C_i, w, w_1, w_2)\, D_{lr}(w_1, w_2) + \sum_{w \in V} f_{rl}(C_j, w, w_1, w_2)\, D_{rl}(w_1, w_2)\Bigg\} \qquad (10)$$

where V is the vocabulary.


The factors which influence the weight of D_lr(w_1, w_2), that is, f_lr(C_i, w, w_1, w_2), come from two sources: one is the relationship of w with w_1 and of w with w_2; the other is how well the words w_1 and w_2 are placed in C_i. When D_lr(w_1, w_2) is less than a certain value, it demonstrates that the words w_1 and w_2 are similar enough, and D_lr(w_1, w_2) should receive less weight in order to satisfy Equation (10); therefore, we choose the minimum of the products P_d(w, w_1)P_lr(C_i|w_1) and P_d(w, w_2)P_lr(C_i|w_2). Conversely, when D_lr(w_1, w_2) is larger than a certain value, it shows that w_1 and w_2 are not similar enough, and D_lr(w_1, w_2) should be given a larger weight. f_rl(C_j, w, w_1, w_2) can be explained similarly.

We choose the two classes C_i and C_j to merge which maximize the sum of the similarity values, and the merging is repeated until a certain number of steps is reached. S(w_1, w_2) ranges from 0 to 1, with S(w, w) = 1.

The computational cost of this similarity is not high, as the components of Equation (10) have already been obtained during the earlier computation.
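The following Python sketch shows how Equations (8)–(10) fit together when scoring a candidate pair of classes for merging. All names here (`Pd`, `Plr`, `Prl`, `d_bounds`) are hypothetical containers for quantities already defined in the text, and the second threshold uses Max − IT_lr as reconstructed above; treat it as an illustration of the bookkeeping rather than the authors' implementation.

```python
import math

def kl(p, q, vocab, eps=1e-12):
    """D(P||Q) over the vocabulary, with a small epsilon guarding unseen events."""
    return sum(p(w) * math.log((p(w) + eps) / (q(w) + eps)) for w in vocab if p(w) > 0)

def divergence(p, q, vocab):
    return kl(p, q, vocab) + kl(q, p, vocab)            # Eq. (9)

def weight(d_value, d_min, d_max, r, a, b):
    """f_lr / f_rl: min, max, or average of a and b, following the thresholds."""
    it = (d_max - d_min) * r
    if d_value <= d_min + it:
        return min(a, b)
    if d_value >= d_max - it:                           # reconstructed threshold (see text)
        return max(a, b)
    return 0.5 * (a + b)

def similarity(w1, w2, ci, cj, Pd, Plr, Prl, vocab, d_bounds, r=0.25):
    """S(w1, w2) of Eq. (10); Pd[(a, b)] is P_d(a left of b),
    Plr[(c, w)] and Prl[(c, w)] are the soft class memberships."""
    d_lr = divergence(lambda w: Pd.get((w, w1), 0.0), lambda w: Pd.get((w, w2), 0.0), vocab)
    d_rl = divergence(lambda w: Pd.get((w1, w), 0.0), lambda w: Pd.get((w2, w), 0.0), vocab)
    (lr_min, lr_max), (rl_min, rl_max) = d_bounds       # precomputed over all word pairs
    s = 0.0
    for w in vocab:
        f_lr = weight(d_lr, lr_min, lr_max, r,
                      Pd.get((w, w1), 0.0) * Plr.get((ci, w1), 0.0),
                      Pd.get((w, w2), 0.0) * Plr.get((ci, w2), 0.0))
        f_rl = weight(d_rl, rl_min, rl_max, r,
                      Pd.get((w1, w), 0.0) * Prl.get((cj, w1), 0.0),
                      Pd.get((w2, w), 0.0) * Prl.get((cj, w2), 0.0))
        s += f_lr * d_lr + f_rl * d_rl
    return 1.0 - 0.5 * s                                # Eq. (10)
```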

The number of all possible classes is 2^h. During the splitting process, especially at the bottom of the binary tree, some classes may be empty because the classes at higher levels cannot be split further according to the rule of maximum average mutual information. The number of resulting classes therefore cannot be controlled exactly, so we define the desired number of classes in advance. As long as the number of resulting classes is less than the predefined number, the splitting process continues. When the number of resulting classes is larger than the predefined number, we use the merging technique presented above to reduce it until it equals the predefined number. The procedure can be described as follows: after we have merged two classes, taken from the left–right and the right–left trees respectively, we use this merged class to replace the two original classes; we then repeat this process until a certain number of steps is reached. In this paper, we set the number of steps equal to the larger of the two trees' numbers of resulting classes. Finally, we merge all the resulting classes until the predefined number is reached. A sketch of this merging loop is given below.
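A rough sketch of the merging loop follows (again an illustration under stated assumptions, not the authors' code; `class_similarity` stands for any class-level aggregation of S(w_1, w_2), for example its average over word pairs drawn from the two classes).

```python
def merge_classes(lr_classes, rl_classes, class_similarity, target_number):
    """First pair classes across the two trees, then keep merging the most
    similar pair until the predefined number of classes is reached."""
    lr = [set(c) for c in lr_classes]
    rl = [set(c) for c in rl_classes]
    merged = []
    for _ in range(max(len(lr), len(rl))):          # step budget used in the paper
        if not lr or not rl:
            break
        i, j = max(((a, b) for a in range(len(lr)) for b in range(len(rl))),
                   key=lambda ab: class_similarity(lr[ab[0]], rl[ab[1]]))
        merged.append(lr.pop(i) | rl.pop(j))        # replace the pair by its union
    classes = merged + lr + rl
    while len(classes) > target_number:             # final reduction
        i, j = max(((a, b) for a in range(len(classes)) for b in range(a + 1, len(classes))),
                   key=lambda ab: class_similarity(classes[ab[0]], classes[ab[1]]))
        classes[i] |= classes.pop(j)                # j > i, so popping j is safe
    return classes
```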

This merging process guarantees that the merged probabilities are non-zero whenever the word distributions are. This is a useful advantage over agglomerative clustering techniques, which need to compare the individual objects being considered for grouping.

3. Experimental results and discussion

3.1. Word classification results

We used a Pentium 586/133 MHz machine with 32 MB of memory for the calculations. The operating system was Windows NT 4.0, and Visual C++ 4.0 was our programming environment.

We used the electronic news corpus called "Chinese one hundred kinds of newspapers—1994". Its total size is 780 million bytes. It was not feasible to run classification experiments on this original corpus, so we extracted the part of it containing the news published in April 1994.

For convenience, the sentence boundary markers {! , . ? " " ; : `} were replaced by only two markers, "!" and ".", denoting the beginning and the end of a sentence or word phrase, respectively.

The texts were segmented by the variable-distance algorithm (Gao & Chen, 1996). We selected four subcorpora containing 10 323, 17 451, 25 130 and 44 326 Chinese words. The corresponding vocabularies contain 2103, 3577, 4606 and 6472 words.


Table I. Summarization of classifying four subcorpora

Words in the corpus | Left–right resulting classes | Right–left resulting classes | Predefined number of classes | Time for left–right | Time for right–left
10 323 | 161 | 178 | 150 | 22 h | 19 h
17 451 | 347 | 323 | 300 | 2·4 d | 1·1 d
25 330 | 642 | 583 | 500 | 5·1 d | 2·7 d
44 026 | 1225 | 1118 | 600 | 9 d | 4 d

Table II. The number of empty classes

Level | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Empty classes | 0 | 0 | 1 | 3 | 7 | 21 | 63 | 123 | 351

These subcorpora are independent. The results of the classification without introducing probabilities are summarized in Table I.

The computation of the merging process is only equal to the splitting calculation at one level of the tree. Table I shows that the computation time for right–left is surprisingly much shorter than the time for left–right, but this is reasonable. In the left–right process, the left branch contains more words than the right branch; to move each word from the left branch to the right branch, we need to match this word throughout the corpus. In the right–left process, the left branch has fewer words than the right, so we only need to match a small number of words against the corpus. From this we can conclude that this matching procedure accounts for much of the time. The number of empty classes increases as the tree grows. Table II shows the number of empty classes at different levels of the left–right tree when we process the subcorpus containing 10 323 words.

Although our method computes a distributional classification, it still demonstrates strong part-of-speech behaviour.

Some typical word classes, forming part of the results for the subcorpus containing 17 451 words, are listed below (resulting classes of the left–right binary tree).

However, some of the classes show no obvious part-of-speech category. Most of these contain only a very small number of words. This may be caused by the predefined number of classes; thus, excessive or insufficient classification may be encountered. Another shortcoming is that a small number of words in almost every resulting class do not belong to the part-of-speech categories of most of the words in that class.

We use the algorithm described by McMahon (1994) as a baseline for comparison. McMahon's algorithm is in fact a subset of our algorithm; we extend its idea in several areas. The experimental results for the number of classes and the computing time of McMahon's algorithm on the four subcorpora can be obtained from Table I. We can see that the running time of our algorithm is longer than McMahon's. Some of the word classes obtained from the subcorpus containing 17 451 words are listed as follows:

Compared with our algorithm, McMahon's algorithm does not obtain well-distributed results: some classes are very large and some are very small. Thus, only the small classes clearly correspond to a part-of-speech category.

3.2. Use of word classification results in statistical language modelling

A word class-based language model is more competitive than a word-based language model: it has far fewer parameters, and thus makes better use of the training data to overcome the problem of data sparseness. We compare a word class-based N-gram language model with a typical N-gram language model using perplexity.

Perplexity (Jelinek, 1990a; McCandless, 1994) is an information-theoretic measure for evaluating how well a statistical language model predicts a particular test set. It is an excellent metric for comparing two language models because it is entirely independent of how each language model functions internally, and also because it is very simple to compute. For a given vocabulary size, a language model with lower perplexity is modelling language more accurately, which will generally correlate with lower error rates during speech recognition.

Perplexity is derived from the average log probability that the language model assigns to each word in the test set:

$$H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \ldots, w_{i-1}) \qquad (11)$$

where w_1, . . . , w_N are all the words of the test set, constructed by listing the sentences of the test set end to end, separated by a sentence boundary marker. The perplexity is then 2^H; H may be interpreted as the average number of bits of information needed to compress each word in the test set given that the language model is providing us with information.
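As a small illustration of Equation (11) (not code from the paper), the following computes H and the perplexity 2^H from whatever conditional probabilities a language model supplies for a test sequence; `cond_prob` is a hypothetical callable.

```python
import math

def perplexity(test_words, cond_prob):
    """2**H with H from Eq. (11); cond_prob(history, word) is any model's
    estimate of P(w_i | w_1, ..., w_{i-1})."""
    log_sum = 0.0
    for i, w in enumerate(test_words):
        log_sum += math.log2(cond_prob(test_words[:i], w))
    h = -log_sum / len(test_words)
    return 2.0 ** h
```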

We compare the perplexity of the N-gram language model with that of a class-based N-gram language model. The perplexities PP of the N-gram for word and for class are:

Unigram for word:

$$PP = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i)\right) \qquad (12)$$

Bigram for word:

$$PP = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_{i-1})\right) \qquad (13)$$

Bigram for class:

$$PP = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln\big[P(w_i \mid C(w_i))\,P(C(w_i) \mid C(w_{i-1}))\big]\right) \qquad (14)$$

where w_i denotes the ith word in the corpus and C(w_i) denotes the class assigned to w_i. N is the number of words in the corpus. P(C(w_i) | C(w_{i−1})) can be estimated by:

$$P(C(w_i) \mid C(w_{i-1})) = \frac{P(C(w_i), C(w_{i-1}))}{P(C(w_{i-1}))} \qquad (15)$$

where

$$P(C(w_i), C(w_{i-1})) = \sum_{i=1}^{N} P(C(w_i) \mid w_i)\, P(w_i, C(w_{i-1})) = \sum_{i=1}^{N} P(w_i)\, P(C(w_i) \mid w_i)\, P(C(w_{i-1}) \mid w_i)$$

$$P(C(w_{i-1})) = \sum_{i=2}^{N} P(w_{i-1})\, P(C(w_{i-1}) \mid w_{i-1})$$

The detailed explanation for the derivation of P(C(w_i), C(w_{i−1})) is described in Pereira et al. (1993).
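To show how the soft memberships enter the class bigram of Equations (14) and (15), here is a small sketch (hypothetical data structures, not the paper's code) that accumulates class unigram and class bigram estimates from per-word memberships P(C|w) and word probabilities, and then scores P(w_i | C(w_i)) P(C(w_i) | C(w_{i−1})) for a word pair. It uses a standard soft-count accumulation rather than the exact summation printed above.

```python
from collections import defaultdict

def class_bigram_model(corpus_words, p_word, membership):
    """Build soft class unigram/bigram estimates.
    p_word[w] is P(w); membership[w] maps class -> P(C|w)."""
    class_uni = defaultdict(float)
    class_bi = defaultdict(float)
    for prev, cur in zip(corpus_words, corpus_words[1:]):
        for c_prev, m_prev in membership[prev].items():
            class_uni[c_prev] += p_word[prev] * m_prev
            for c_cur, m_cur in membership[cur].items():
                # soft count of the class pair, in the spirit of Eq. (15)
                class_bi[c_cur, c_prev] += p_word[cur] * m_cur * m_prev
    return class_uni, class_bi

def score(prev, cur, p_word, membership, class_uni, class_bi):
    """P(w_i | C(w_i)) * P(C(w_i) | C(w_{i-1})) using hard argmax classes."""
    c_prev = max(membership[prev], key=membership[prev].get)
    c_cur = max(membership[cur], key=membership[cur].get)
    p_w_given_c = p_word[cur] * membership[cur][c_cur] / max(class_uni[c_cur], 1e-12)
    p_c_given_c = class_bi[c_cur, c_prev] / max(class_uni[c_prev], 1e-12)
    return p_w_given_c * p_c_given_c
```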


Table III. Perplexity comparison between N-gram for word and N-gram for class

Subcorpus size | 10 323 words | 17 451 words | 25 130 words | 44 326 words
Vocabulary size | 2103 | 3577 | 4606 | 6472
OOV (%) | 13·62 | 60·02 | 81·54 | 163·7
PP of unigram | 293·4 | 734·1 | 1106·3 | 1757·7
PP of bigram | 198·9 | 220·6 | 427·5 | 704·2
PP of McMahon's class bigram (left–right) | 159·5 | 171·6 | 350·1 | 573·6
PP of class bigram (hard) | 147·5 | 153·2 | 314·3 | 525·4
PP of class bigram (soft) | 119·6 | 140·7 | 243·8 | 454·7

The perplexities PP based on the different N-grams for word and for class are presented in Table III.

Note that we present both "hard" and "soft" classification results for the word class-based language model. For the "hard" case, we assign each word to the class in which it has the largest probability.

The training corpus contains more than 12 000 Chinese words. The lexicon of this training corpus contains 3034 words. We use the four randomly selected subcorpora mentioned above as test sets.

Out of vocabulary (OOV) is a standard measure of how much difference there is between the training corpus and the test sets. It is calculated by dividing the number of out-of-vocabulary words by the number of words in the vocabulary.

An arbitrary non-zero probability is given to all Chinese words and symbols that do not exist in the vocabulary: we set P(w) = 1/(2N) for words w which are not in the vocabulary, where N is the number of words in the training corpus.

From Table III we can see that the perplexity of the "hard" class-based bigram is 28·7% lower than that of the word-based bigram and 11·76% lower than that of McMahon's class bigram, while the perplexity of the "soft" class-based bigram is much lower than that of the "hard" class-based bigram; the perplexity reduction is about 43% when compared with the "hard" class-based bigram for the maximum perplexity of each circumstance.

As the OOV rate increases, the predictive ability of the training set for the test set decreases; therefore, the perplexity increases sharply with the change in OOV.

4. Conclusions

In this paper we present a new method for Chinese word classification which can also be applied to other languages. It integrates top-down and bottom-up ideas in word classification; thus, the top-down splitting technique can borrow the strong points of the bottom-up approach to offset its own obvious weakness while keeping its advantages. Unlike most classification methods, this method takes context-sensitive (directional) information into account, so that it reflects the properties of natural language more closely. Moreover, probabilities are assigned to the words to indicate how strongly a word belongs to each class. This property is very useful in word class-based language modelling for speech recognition, for it allows the system to have several strong candidates to be matched during recognition.

It is, however, important to consider the limitations of the method. The computational cost is very high: the algorithm's complexity is cubic when we move one word from one class to another. Also, the probabilities assigned to the words for each class are not globally optimal; they reflect the degree to which a word belongs to each class only approximately, and excessive or insufficient classification may occur because the number of classes is fixed artificially.

References

Bahl, L. R., Brown, P. F., DeSouza, P. V. & Mercer, R. L. (1989). A tree-based statistical language model for natural language speech recognition. IEEE Transactions on ASSP 37(7), 1001–1008.
Brill, E. (1993). A Corpus-based Approach to Language Learning. Ph.D. Thesis, University of Pennsylvania.
Brown, P., Della Pietra, V., DeSouza, P., Lai, J. & Mercer, R. (1992). Class-based n-gram models of natural language. Computational Linguistics 18, 467–479.
Chang, C.-H. & Chen, C.-D. (1993). A study on integrating Chinese word segmentation and part-of-speech tagging. Communications of COLIPS 3(1), 69–77.
Chang, C.-H. & Chen, C.-D. (1995). A study on corpus-based classification of Chinese words. Communications of COLIPS 5(1&2), 1–7.
Cutting, D., Kupiec, J., Pederson, J. & Sibun, P. (1992). A practical part-of-speech tagger. Applied Natural Language Processing, pp. 133–140. Trento, Italy.
Dagan, I., Marcus, S. & Markovitch, S. (1995). Contextual word similarity and estimation from sparse data. Computer Speech and Language 9, 123–152.
DeRose, S. (1988). Grammatical category disambiguation by statistical optimization. Computational Linguistics 14, 31–39.
Gao, J. & Chen, X. X. (1996). Automatic word segmentation of Chinese texts based on variable distance method. Communications of COLIPS 6(2), 87–96.
Huang, X.-D., Alleva, F., Hon, H.-W., Hwang, M.-Y., Lee, K.-F. & Rosenfeld, R. (1993). The SPHINX-II speech recognition system: an overview. Computer Speech and Language 2, 137–148.
Hughes, J. (1994). Automatically Acquiring a Classification of Words. Ph.D. Thesis, School of Computer Studies, University of Leeds.
Jardino, M. & Adda, G. (1993). Automatic word classification using simulated annealing. Proceedings of ICASSP-93, pp. II:41–44. Minneapolis, Minnesota, U.S.A.
Jelinek, F., Mercer, R. & Roukos, S. (1990). Classifying words for improved statistical language models. IEEE Proceedings of ICASSP-90, pp. 621–624.
Jelinek, F., Mercer, R. & Roukos, S. (1992). Principles of lexical language modeling for speech recognition. Advances in Speech Signal Processing, pp. 651–699. Marcel Dekker, Inc.
Lee, H.-J., Dung, C.-H., Lai, F.-M. & Chang Chien, C.-H. (1993). Applications of Markov language models. Proceedings of the Workshop on Advanced Information Systems. Hsinchu, Taiwan.
Magerman, D. M. (1994). Natural Language Parsing as Statistical Pattern Recognition. Ph.D. Thesis, Stanford University.
McCandless, M. K. (1994). Automatic Acquisition of Language Models for Speech Recognition. M.Sc. Thesis, Massachusetts Institute of Technology.
McMahon, J. (1994). Statistical Language Processing Based on Self-Organising Word Classification. Ph.D. Thesis, The Queen's University of Belfast.
Pereira, F., Tishby, N. & Lee, L. (1993). Distributional clustering of English words. Proceedings of the Annual Meeting of the ACL, pp. 183–190.
Pop, M. (1996). Unsupervised Part-of-Speech Tagging. The Johns Hopkins University, ftp://ftp.cs.jhu.edu/pub/brill/programs/
Rosenfeld, R. (1994). Adaptive Statistical Language Modeling: A Maximum Entropy Approach. Ph.D. Thesis, Carnegie Mellon University.
Schutze, H. (1995). Distributional part-of-speech tagging. EACL95, http://xxx.lanl.gov/archive/cmp-lg/9503009.
Zhao, G. (1995). Lexical classification of Chinese for NLP. Communications of COLIPS 5, 59–64.

(Received 20 February 1997 and accepted for publication 30 October 1997)