A Compression-Based Algorithm for Chinese Word Segmentation



W. J. Teahan* (The Robert Gordon University)
Rodger McNab† (University of Waikato)
Yingying Wen† (University of Waikato)
Ian H. Witten† (University of Waikato)

Chinese is written without using spaces or other word delimiters. Although a text may be thought of as a corresponding sequence of words, there is considerable ambiguity in the placement of boundaries. Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks: for example, full-text search, word-based compression, and keyphrase extraction. We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression. It is trained on a corpus of presegmented text, and when applied to new text, interpolates word boundaries so as to maximize the compression obtained. This simple and general method performs well with respect to specialized schemes for Chinese language segmentation.

1. Introduction

Languages such as Chinese and Japanese are written without using any spaces or other word delimiters (except for punctuation marks)--indeed, the Western notion of a word boundary is literally alien (Wu 1998). Nevertheless, words are present in these languages, and Chinese words often comprise several characters, typically two, three, or four--five-character words also exist, but they are rare. Many characters can stand alone as words in themselves, while on other occasions the same character is the first or second character of a two-character word, and on still others it participates as a component of a three- or four-character word. This phenomenon causes obvious ambiguities in word segmentation.

Readers unfamiliar with Chinese can gain an appreciation of the problem of multiple interpretations from Figure 1, which shows two alternative interpretations of the same Chinese character sequence. The text is a joke that relies on the ambiguity of phrasing. Once upon a time, the story goes, a man set out on a long journey. Before he could return home the rainy season began, and he had to take shelter at a friend's house. But he overstayed his welcome, and one day his friend wrote him a note: the first line in Figure 1. The intended interpretation is shown in the second line, which means "It is raining, the god would like the guest to stay. Although the god wants you to stay, I do not!" On seeing the note, the visitor took the hint and prepared to leave. As a joke he amended the note with the punctuation shown in the third line, which leaves three sentences whose meaning is totally different--"The rainy day, the staying day. Would you like me to stay? Sure!"

* School of Computing and Mathematical Sciences, The Robert Gordon University, Aberdeen, Scotland
† Computer Science, University of Waikato, Hamilton, New Zealand

© 2000 Association for Computational Linguistics


Figure 1
A Chinese sentence with ambiguity of phrasing (the sentence is shown with two alternative interpretations).

Figure 2
An example that can be segmented in two different ways.

Figure 3
Example of treating each character in a query as a word.

This example relies on ambiguity of phrasing, but the same kind of problem can arise with word segmentation. Figure 2 shows a more prosaic example. For the ordinary sentence of the first line, there are two different interpretations depending on the context of the sentence: "I like New Zealand flowers" and "I like fresh broccoli," respectively.

The fact that machine-readable Chinese text is invariably stored in unsegmented form causes difficulty in applications that use the word as the basic unit. For example, search engines index documents by storing a list of the words they contain, and allow the user to retrieve all documents that contain a specified combination of query terms. This presupposes that the documents are segmented into words. Failure to do so, and treating every character as a word in itself, greatly decreases the precision of retrieval since large numbers of extraneous documents are returned that contain characters, but not words, from the query.

Figure 3 illustrates what happens when each character in a query is treated as a single-character word. The intended query is "physics" or "physicist." The first character returns documents about such things as "evidence," "products," "body," "image," and "prices"; while the second returns documents about "theory," "barber," and so on. Thus many documents that are completely irrelevant to the query will be returned, causing the precision of information retrieval to decrease greatly. Similar problems occur in word-based compression, speech recognition, and so on.


It is true that most search engines allow the user to search for multiword phrases by enclosing them in quotation marks, and this facility could be used to search for multicharacter words in Chinese. This, however, runs the risk of retrieving irrelevant documents in which the same characters occur in sequence but with a different intended segmentation. More importantly, it imposes on the user an artificial requirement to perform manual segmentation on each full-text query.

Word segmentation is an important prerequisite for such applications. However, it is a difficult and ill-defined task. According to Sproat et al. (1996) and Wu and Fung (1994), experiments show that only about 75% agreement between native speakers is to be expected on the "correct" segmentation, and the figure reduces as more people become involved.

This paper describes a general scheme for segmenting text by inferring the position of word boundaries, thus supplying a necessary preprocessing step for applications like those mentioned above. Unlike other approaches, which involve a dictionary of legal words and are therefore language-specific, it works by using a corpus of already-segmented text for training and thus can easily be retargeted for any language for which a suitable corpus of segmented material is available. To infer word boundaries, a general adaptive text compression technique is used that predicts upcoming characters on the basis of their preceding context. Spaces are inserted into positions where their presence enables the text to be compressed more effectively. This approach means that we can capitalize on existing research in text compression to create good models for word segmentation. To build a segmenter for a new language, the only resource required is a corpus of segmented text to train the compression model.

The structure of this paper is as follows: The next section reviews previous work on the Chinese segmentation problem. Then we explain the operation of the adaptive text compression technique that will be used to predict word boundaries. Next we show how space insertion can be viewed as a problem of hidden Markov modeling, and how higher-order models, such as the ones used in text compression, can be employed in this way. The following section describes several experiments designed to evaluate the success of the new word segmenter. Finally we discuss the application of language segmentation in digital libraries.

Our system for segmenting Chinese text is available on the World Wide Web at http://www.nzdl.org/cgi-bin/congb. It takes GB-encoded input text, which can be cut from a Chinese document and pasted into the input window.¹ Once the segmenter has been invoked, the result is rewritten into the same window.

2. Previous Methods for Segmenting Chinese

The problem of segmenting Chinese text has been studied by researchers for many years; see Wu and Tseng (1993) for a detailed survey. Several different algorithms have been proposed, which, generally speaking, can be classified into dictionary-based and statistical-based methods, although other techniques that involve more linguistic information, such as syntactic and semantic knowledge, have been reported in the natural language processing literature.

Cheng, Young, and Wong (1999) describe a dictionary-based method. Given a dictionary of frequently used Chinese words, an input string is compared with words in the dictionary to find the one that matches the greatest number of characters of the input.

1 To enable proper viewing, and input, of GB-encoded characters, an appropriate version of Netscape Communicator or Internet Explorer must be used.


This is called the maximum forward match heuristic. An alternative is to work backwards through the text, resulting in the maximum backward match heuristic. It is easy to find situations where these fail. To use an English example, forward matching fails on the input "the red ..." (it is misinterpreted as "there d ..."), while backward matching fails on text ending "... his car" (it is misinterpreted as "... hi scar"). Analogous failures occur with Chinese text.

Dai, Khoo, and Loh (1999) use statistical methods to perform text segmentation. They concentrate on two-character words, because two characters is the most common word length in Chinese. Several different notions of frequency of characters and bigrams are explored: relative frequency, document frequency, weighted document frequency, and local frequency. They also look at both contextual and positional information. Contextual information is found to be the single most important factor that governs the probability that a bigram forms a word; incorporating the weighted document frequency can improve the model significantly. In contrast, the positional frequency is not found to be helpful in determining words.

Ponte and Croft (1996) introduce two models for word segmentation: word-based and bigram models. Both utilize probabilistic automata. In the word-based method, a suffix tree of words in the lexicon is used to initialize the model. Each node is associated with a probability, which is estimated by segmenting training text using the longest match strategy. This makes the segmenter easy to transplant to new languages. The bigram model uses the lexicon to initialize probability estimates for each bigram, and the probability with which each bigram occurs, and uses the Baum-Welch algorithm (Rabiner 1989) to update the probabilities as the training text is processed.
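As an illustration of the dictionary-based approach described above (this sketch and its toy dictionary are ours, not the method of Cheng, Young, and Wong), the maximum forward match heuristic can be written in a few lines of Python. Note how it reproduces the English failure mode mentioned earlier.

    def forward_max_match(text, dictionary, max_word_len=5):
        """Greedy left-to-right segmentation: at each position take the longest
        dictionary word that matches; fall back to a single character."""
        words, i = [], 0
        while i < len(text):
            match = text[i]                          # default: one-character word
            for length in range(min(max_word_len, len(text) - i), 1, -1):
                if text[i:i + length] in dictionary:
                    match = text[i:i + length]
                    break
            words.append(match)
            i += len(match)
        return words

    toy_dictionary = {"the", "there", "red", "car", "his", "hi", "scar"}
    print(forward_max_match("theredcar", toy_dictionary))   # ['there', 'd', 'car']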

Hockenmaier and Brew (1998) present an algorithm, based on Palmer's (1997) experiments, that applies a symbolic machine learning technique--transformation-based error-driven learning (Brill 1995)--to the problem of Chinese word segmentation. Using a set of rule templates and four distinct initial-state annotators, Palmer concludes that the learning technique works well. Hockenmaier and Brew investigate how performance is influenced by different rule templates and corpus size. They use three rule templates: simple bigram rules, trigram rules, and more elaborate rules. Their experiments indicate that training data size has the most significant influence on performance. Good performance can be acquired using simple rules only if the training corpus is large enough.

Lee, Ng, and Lu (1999) have recently introduced a new segmentation method for a Chinese spell-checking application. Using a dictionary with single-character word occurrence frequencies, this scheme first divides text into sentences, then into phrases, and finally into words, using a small number of word combinations that are conditioned on a heuristic to avoid delay during spell-checking. When compared with forward maximum matching, the new method resolves more than 10% more ambiguities, but enjoys no obvious speed advantage.

The way in which Chinese characters are used in names differs greatly from the way they are used in ordinary text, and some researchers, notably Sproat et al. (1996), have established special-purpose recognizers for Chinese names (and translated foreign names), designed to improve the accuracy of automatic segmenters by treating names specially.² Chinese names always take the form family name followed by given name. Whereas family names are limited to a small group of characters, given names can consist of any characters. They normally comprise one or two characters, but

2 In English there are significant differences between the frequency distribution of letters in names and in words--for example, compare the size of the T section of a telephone directory with the size of the T section of a dictionary--but such differences are far more pronounced in Chinese.


three-character names have arisen in recent years to ensure uniqueness when the family name is popular--such as Smith or Jones in English. Sproat et al. (1996) implement special recognizers not only for Chinese names and transliterated foreign names, but for components of morphologically obtained words as well. The approach we present is not specially tailored for name recognition, but because it is fully adaptive it is likely that it would yield good performance on names if lists of names were provided as supplementary training text. This has not yet been tested.

3. Language Modeling using PPM

Statistical language models are well developed in the field of text compression. Compression methods are usually divided into symbolwise and dictionary schemes (Bell, Cleary, and Witten 1990). Symbolwise methods, which generally make use of adaptively generated statistics, give excellent compression--in fact, they include the best known methods. Although dictionary methods such as the Ziv-Lempel schemes perform less well, they are used in practical compression utilities like Unix compress and gzip because they are fast.

In our work we use the prediction by partial matching (PPM) symbolwise compression scheme (Cleary and Witten 1984), which has become a benchmark in the compression community. It generates "predictions" for each input symbol in turn. Each prediction takes the form of a probability distribution that is provided to an encoder. The encoder is usually an arithmetic coder; the details of coding are of no relevance to this paper.

PPM is an n-gram approach that uses finite-context models of characters, where the previous few (say three) characters predict the upcoming one. The conditional probability distribution of characters, conditioned on the preceding few characters, is maintained and updated as each character of input is processed. This distribution, along with the actual value of the preceding few characters, is used to predict each upcoming symbol. Exactly the same distributions are maintained by the decoder, which updates the appropriate distribution as each character is received. This is what we call adaptive modeling: both encoder and decoder maintain the same models--not by communicating the models directly, but by updating them in precisely the same way.

Rather than using a fixed context length (three was suggested above), the PPM method chooses a maximum context length and maintains statistics for this and all shorter contexts. The maximum is five in most of the experiments below, and statistics are maintained for models of order 5, 4, 3, 2, 1, and 0. These are not stored separately; they are all kept in a single trie structure.
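To make the model-building step concrete, the following minimal Python sketch (our own illustration, using a plain dictionary where a real implementation would use the trie mentioned above) accumulates, in a single pass, the counts needed for an order 2 model and for all shorter contexts.

    from collections import defaultdict

    def train_counts(text, max_order=2):
        """For every context of length 0..max_order, count how often each
        character follows it: counts[context][char] -> frequency."""
        counts = defaultdict(lambda: defaultdict(int))
        for i, ch in enumerate(text):
            for order in range(max_order + 1):
                if i - order < 0:
                    break
                counts[text[i - order:i]][ch] += 1   # text[i-order:i] is the context
        return counts

    counts = train_counts("tobeornottobe", max_order=2)
    print(dict(counts["to"]))   # {'b': 2}   (the "to" entry of Table 1 below)
    print(dict(counts["o"]))    # {'b': 2, 'r': 1, 't': 1}
    print(dict(counts[""]))     # order 0 counts for the whole string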

PPM incorporates a simple and highly effective method to combine the predictions of the models of different order--often called the problem of "backoff." To encode the next symbol, it starts with the maximum-order model (order 5). If that model contains a prediction for the upcoming character, the character is transmitted according to the order 5 distribution. Otherwise, both encoder and decoder escape down to order 4. There are two possible situations. If the order 5 context--that is, the preceding five-character sequence--has not been encountered before, then escape to order 4 is inevitable, and both encoder and decoder can deduce that fact without requiring any communication. If not, that is, if the preceding five characters have been encountered in sequence before but not followed by the upcoming character, then only the encoder knows that an escape is necessary. In this case, therefore, it must signal this fact to the decoder by transmitting an escape event--and space must be reserved for this event in every probability distribution that the encoder and decoder maintain.


Table 1
PPM model after processing the string tobeornottobe; c = count, p = prediction probability.

Order 2 (prediction, c, p):
  be -> o    1  1/2      be -> esc  1  1/2
  eo -> r    1  1/2      eo -> esc  1  1/2
  no -> t    1  1/2      no -> esc  1  1/2
  ob -> e    2  3/4      ob -> esc  1  1/4
  or -> n    1  1/2      or -> esc  1  1/2
  ot -> t    1  1/2      ot -> esc  1  1/2
  rn -> o    1  1/2      rn -> esc  1  1/2
  to -> b    2  3/4      to -> esc  1  1/4
  tt -> o    1  1/2      tt -> esc  1  1/2

Order 1 (prediction, c, p):
  b -> e     2  3/4      b -> esc   1  1/4
  e -> o     1  1/2      e -> esc   1  1/2
  n -> o     1  1/2      n -> esc   1  1/2
  o -> b     2  3/8      o -> r     1  1/8     o -> t    1  1/8     o -> esc  3  3/8
  r -> n     1  1/2      r -> esc   1  1/2
  t -> o     2  1/2      t -> t     1  1/6     t -> esc  2  1/3

Order 0 (prediction, c, p):
  -> b  2  3/26    -> e  2  3/26    -> n  1  1/26    -> o  4  7/26    -> r  1  1/26    -> t  3  5/26    -> esc  6  3/13

Order -1 (prediction, c, p):
  -> A  1  1/|A|

Once any necessary escape event has been transmitted and received, both encoder and decoder agree that the upcoming character will be coded by the order 4 model. Of course, this may not be possible either, and further escapes may take place. Ultimately, the order 0 model may be reached; in this case the character can be transmitted if it is one that has occurred before. Otherwise, there is one further escape (to an order -1 model), and the standard ASCII representation of the character is sent.

The only remaining question is how to calculate the probabilities from the counts--a simple matter once we have resolved how much space to allocate for the escape probability. There has been much discussion of this question, and several different methods have been proposed. Our experiments calculate the escape probability in a particular context as

    d / (2n)

where n is the number of times that context has appeared and d is the number of different symbols that have directly followed it (Howard 1993). The probability of a character that has occurred c times in that context is

    (c - 1/2) / n

Since there are d such characters, and their counts sum to n, it is easy to confirm that the probabilities in the distribution (including the escape probability) sum to 1.
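As a check on these formulas, the following few lines (our own illustration, not the program used in the experiments) compute the escape and symbol probabilities for one context from its raw counts, verify that the distribution sums to 1, and reproduce the order 1 predictions for the context o in Table 1.

    from fractions import Fraction

    def ppmd_distribution(counts):
        """Escape method of Howard (1993): escape probability d/(2n);
        a symbol seen c times gets probability (c - 1/2)/n."""
        n = sum(counts.values())     # times this context has appeared
        d = len(counts)              # distinct symbols that have followed it
        probs = {sym: Fraction(2 * c - 1, 2 * n) for sym, c in counts.items()}
        probs["esc"] = Fraction(d, 2 * n)
        assert sum(probs.values()) == 1
        return probs

    # Context "o" in tobeornottobe: followed by b twice, r once, and t once.
    print(ppmd_distribution({"b": 2, "r": 1, "t": 1}))
    # {'b': Fraction(3, 8), 'r': Fraction(1, 8), 't': Fraction(1, 8), 'esc': Fraction(3, 8)}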

To illustrate the PPM modeling technique, Table 1 shows the model after the string tobeornottobe has been processed. In this illustration the maximum model order is 2 (not 5 as stated above), and each prediction has a count c and a prediction probability p. The probability is determined from the counts associated with the prediction using the formula that we discuss above. |A| is the size of the alphabet, and it is this that determines the probability for each unseen character.


The model in Table 1 is used as follows: Suppose the character following tobeornottobe is o. Since the order 2 context is be, and the upcoming symbol has already been seen once in this context, the order 2 model is used for encoding in this case, and the encoding probability is 1/2. Thus the symbol o would be encoded in 1 bit. If the next character, instead of o, were t, this has not been seen in the current order 2 context (which is still be). Consequently an order 2 escape event is coded (probability 1/2, again in the be context), and the context is truncated to e. Checking the order 1 model, the upcoming character t has not been seen in this context, so an order 1 escape event is coded (probability 1/2 in the e context) and the context is truncated to the null context, corresponding to the order 0 model. The character t is finally encoded in this model, with probability 5/26. Thus three encodings occur for this one character, with probabilities 1/2, 1/2, and 5/26 respectively, which together amount to just over 5 bits of information. If the upcoming character had been x instead of t, a final level of escape, this time out of the order 0 model, would have occurred (probability 3/13), and the x would be encoded with a probability of 1/256 (assuming that the alphabet has 256 characters) for a total of just over 10 bits.

It is clear from Table 1 that, in the context tobeornottobe, if the next character is o it will be encoded by the order 2 model. Hence if an escape occurs down to order 1, the next character cannot be o. This makes it unnecessary to reserve probability space for the occurrence of o in the order 1 (or order 0 or order -1) models. This idea, which is called exclusion, can be exploited to improve compression. A character that occurs at one level is excluded from all lower-order predictions, allowing a greater share of the probability space to be allocated to the other characters in these lower-order models (Bell, Cleary, and Witten 1990). For example, if the character b were to follow tobeornottobe it would be encoded with probabilities (1/2, 1/2, 3/26) without exclusion, leading to a coding requirement of 5.1 bits. However, if exclusion were exploited, both encoder and decoder would recognize that escape from order 1 to order 0 is inevitable, because the order 1 model adds no characters that were not already predicted by the order 2 model. Thus the coding probabilities would be (1/2, 1, 3/18) with exclusion, reducing the total code space for b to 3.6 bits. An important special case of the exclusion policy occurs at the lowest-level model: for example, the x at the end of the previous paragraph would finally be encoded with a probability of 1/250 rather than 1/256, because characters that have already occurred can never be predicted in the order -1 context.
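The arithmetic in the exclusion example can be verified directly. The following lines (illustrative only) compute the code lengths quoted above for the character b following tobeornottobe, with and without exclusion.

    import math

    def code_length(probabilities):
        """Bits needed to arithmetically code a sequence of prediction probabilities."""
        return -sum(math.log2(p) for p in probabilities)

    # Escape from order 2, escape from order 1, then code b in the order 0 model.
    print(round(code_length([1/2, 1/2, 3/26]), 1))   # 5.1 bits without exclusion
    print(round(code_length([1/2, 1, 3/18]), 1))     # 3.6 bits with exclusion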

One slight further improvement to PPM is incorporated in the experiments: deterministic scaling (Teahan 1998). Although it probably has negligible effect on our overall results, we record it here for completeness. Experiments show that in deterministic contexts, for which d = 1, the probability that the single character that has occurred before reappears is greater than the 1 - 1/(2n) implied by the above estimator. Consequently, in this case the probability is increased in an ad hoc manner to 1 - 1/(6n).

4. Using a Hidden Markov Model to Insert Spaces

Inserting spaces into text can be viewed as a hidden Markov modeling problem. Being entirely adaptive, the method works regardless of what language it is used with. For pedagogical purposes, we will explain it with English text.

Between every pair of characters lies a potential space. Figure 4(a) illustrates the model for the fragment tobeornottobe. It contains one node for each letter in the text and one for each possible intercharacter space (represented as dots in the figure). Any given assignment of word boundaries to this text fragment will correspond to


Figure 4
Hidden Markov model for space insertion.

Figure 5
Hidden Markov model for space insertion using an order 1 model.

a path through the model from beginning (at the left) to end (at the right). Of all possible paths, we seek the one that gives the best compression according to the PPM text compression method, suitably primed with English text. This path is the correct path, corresponding to the text "to be or not to be", shown in bold in Figure 4(b).

4.1 Markov Modeling with Context

Figure 4 can easily be converted into a Markov model for a given order of PPM. Suppose we use order 1: then we rewrite Figure 4(a) so that the states are bigrams, as shown in Figure 5(a). The interpretation of each state is that it corresponds to the last character of the string that labels the state. The very first state, labeled t, has no prior context--in PPM terms, that character will be transmitted by escaping down to order 0 (or -1). Again, the bold arrows in Figure 5(b) show the path corresponding to the string with spaces inserted correctly.


Figure 6
Growing a tree for order 1 modeling of tobeornottobe.

Similar models could be written for higher-order versions of PPM. For example, with an order 3 model, states would be labeled by strings of length four (except for the first few states, where the context would be truncated because they occur at the beginning of the string). And each state would have variants corresponding to all different ways of inserting space into the four-character string. For example, the states corresponding to the sixth character of tobeornottobe would include "beor" and "e or", as well as " eor", "eo r", and " o r" (a blank denotes an inserted space). It is not hard to see that the number of states corresponding to a particular character of the input string increases with model order according to the Fibonacci series: Figure 5(a) shows two states per symbol for order 1, there are three states per symbol for order 2, five for order 3, eight for order 4, thirteen for order 5, and so on.

4.2 The Space Insertion Algorithm

Given a hidden Markov model like the one in Figure 5(a), where probabilities are supplied for each edge according to an order 1 compression model, the space insertion problem is tantamount to finding the sequence of states through the model, from beginning to end, that maximizes the total probability--or, equivalently, that minimizes the number of bits required to represent the text according to that model. The following Viterbi-style algorithm can be used to solve this problem. Beginning at the initial state, the procedure traces through the model, recording at each state the highest probability of reaching that state from the beginning. Thus the two descendants of the start node--the node to and the node for t followed by a space--are assigned the probability of o and of a space, respectively, conditioned in each case on t being the prior character. As more arcs are traversed, the associated probabilities are multiplied: thus the node for a space followed by o receives the product of the probability of a space conditioned on t and of o conditioned on a space. When the node ob is reached, it is assigned the greater of the probabilities associated with the two incoming transitions, and so on throughout the model. This is the standard dynamic programming technique of storing with each state the result of the best way of reaching that state, and using this result to extend the calculation to the next state. To find the optimal state sequence is simply a matter of recording with each state which incoming transition is associated with the greatest probability, and traversing that path in the reverse direction once the final node is reached.

These models can be generated dynamically by proceeding to predict each character in turn. Figure 6(a) shows the beginning of the tree that results. First, the initial node t is expanded into its two children: t followed by a space, and to. Then, these are expanded in turn. The first has one child, a space followed by o, because a space cannot be followed by another space. The second has two: o followed by a space, and ob. Figure 6(b) shows the further expansion of the node for a space followed by o. However, the two children that are created already exist in the tree, and so the existing versions of these nodes are used instead, as in Figure 6(c). If this procedure is continued, the graph structure of Figure 5(a) will be created.


Figure 7
The space insertion procedure as implemented.

During creation, probability values can be assigned to the nodes, and back pointers inserted to record the best path to each node.
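The search just described can be sketched compactly for the order 1 case. The Python code below is a simplified illustration under our own assumptions (a Laplace-smoothed order 1 character model standing in for PPM, and a single running best path, which suffices at order 1), not the implementation used in the experiments: at each position it chooses between emitting the character directly and emitting a space before it, keeping whichever alternative has the higher total log probability.

    from collections import Counter
    import math

    def make_logprob(training_text):
        """Laplace-smoothed order 1 character model estimated from spaced training text."""
        pairs = Counter(zip(training_text, training_text[1:]))
        prevs = Counter(training_text[:-1])
        alphabet_size = len(set(training_text))
        def logprob(prev, ch):
            if prev is None:
                return 0.0        # ignore the cost of the very first character
            return math.log2((pairs[(prev, ch)] + 1) / (prevs[prev] + alphabet_size))
        return logprob

    def segment(text, logprob):
        """Insert spaces to maximize total log probability. Because the model is
        order 1, the state after each input character is that character itself,
        so keeping one best path per position is an exact Viterbi-style search."""
        score, out = 0.0, ""
        for ch in text:
            prev = out[-1] if out else None
            no_space = (score + logprob(prev, ch), out + ch)
            if prev is None:
                with_space = no_space               # the text cannot begin with a space
            else:
                with_space = (score + logprob(prev, " ") + logprob(" ", ch),
                              out + " " + ch)
            score, out = max(no_space, with_space)
        return out

    logprob = make_logprob("to be or not to be " * 50)
    print(segment("tobeornottobe", logprob))        # to be or not to be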

The illustration in Figure 6 is for an order 1 model, but exactly the same procedure applies for higher-order PPM models.

4.3 Implementation of the Space Insertion Algorithm

Our implementation uses a slight variant of the above procedure for finding the optimal place to insert spaces. At each stage, we consider the possibility of adding either the next character, or the next character followed by a space. This generates the structure shown in Figure 7. Starting with the null string, both t and t followed by a space are generated as successor states. From each of these states, either o or o followed by a space can be added, and these yield the next states shown. The procedure continues, growing the trellis structure using an incremental strategy similar to that illustrated in Figure 6, but modified to take into account the new growth strategy of adding either the next character or the next character followed by a space.

The search strategy we use is a variant of the stack algorithm for sequential decoding (Anderson and Mohan 1984). As new nodes are generated, an ordered list is maintained of the best paths generated so far. Only the best path is extended. The metric used to evaluate a path is the number of bits required for the segmentation sequence it represents, when compressed by the PPM model.

It is necessary to delete paths from the list in order to make room for newly generated ones. We remove all paths that are more than m nodes shorter than the best path so far, where m is the order of the PPM model (5 in our experiments). We reasoned that it is extremely unlikely--at least for natural language sequences--that such a path would ever grow to outperform the current best path, because it already lags behind in code length despite the fact that m further letters must be encoded.

5. Experimental Evaluation

Before describing experiments to assess the success of the new word segmentation method, we first discuss measures that are used to evaluate the accuracy of automatic segmentation. We then examine the application of the new segmentation method to English text, and show how it achieves results that significantly outperform the state of the art. Next we describe application to a manually segmented corpus of Chinese text; again, excellent results are achieved. In a further experiment where we apply a model generated from the corpus to a new, independent test file, performance deteriorates considerably--as one might expect. We then apply the method to a different corpus,


and investigate how well the model transfers from one corpus to another. We end with a discussion of how the results vary with the order of the compression model used to drive the segmenter.

5.1 Measuring the Quality of Segmentation

We use three measures to evaluate the accuracy of automatic segmentation: recall, precision, and error rate. All evaluations use hand-segmentation as the gold standard, which the automatic method strives to attain. To define them, we use the terms

    N    number of words occurring in the hand-segmentation
    e    number of words incorrectly identified by the automatic method
    c    number of words correctly identified by the automatic method
    n    number of words identified by the automatic method (n = c + e)

Recall and precision are standard information retrieval measures used to assess the quality of a retrieval system in terms of how many of the relevant documents are retrieved (recall) and how many of the retrieved documents are relevant (precision):

    recall = c / N
    precision = c / n

The overall error rate can be defined as

    error rate = e / N

This in principle can give misleading results--an extreme condition is where the automatic method only identifies a single word, leading to a very small error rate of 1/N despite the fact that all words but one are misidentified. However, in all our experiments extreme conditions do not occur because n is always close to N, and we find that the error rate is a useful overall indicator of the quality of segmentation. We also used the F-measure to compare our results with others:

    F-measure = 2 × precision × recall / (precision + recall)

If the automatic method produces the same number of words as the hand-segmentation, recall and precision both become equal to one minus the error rate. A perfect segmenter will have an error rate of zero and recall and precision of 100%.

All these measures can be calculated automatically from a machine-segmented text, along with the hand-segmented gold standard. Both texts are identical except for the points where spaces are inserted: thus we record just the start and end positions of each word in both versions. For example, "A BC AED F" in the machine-segmented version is mapped to (1,1) (2,3) (4,6) (7,7), and "A BC A ED F" in the hand-segmented version becomes (1,1) (2,3) (4,4) (5,6) (7,7). The number of correctly and incorrectly segmented words is counted by comparing these two sets of positions, indicated by matched and mismatched pairs, respectively--three correct and two incorrect, in this example.
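As an illustration of these definitions (this is our own sketch, not the evaluation software used in the experiments), the following code derives the start and end positions of each word, compares the two sets of positions, and computes recall, precision, error rate, and F-measure for the small example above.

    def word_spans(words):
        """Map a list of words to the (start, end) character positions they cover."""
        spans, start = [], 1
        for w in words:
            spans.append((start, start + len(w) - 1))
            start += len(w)
        return set(spans)

    def evaluate(machine_words, hand_words):
        machine, hand = word_spans(machine_words), word_spans(hand_words)
        c = len(machine & hand)     # words correctly identified by the automatic method
        n = len(machine)            # words identified by the automatic method
        N = len(hand)               # words occurring in the hand-segmentation
        recall, precision = c / N, c / n
        return {"recall": recall,
                "precision": precision,
                "error rate": (n - c) / N,
                "F-measure": 2 * precision * recall / (precision + recall)}

    machine = ["A", "BC", "AED", "F"]     # machine segmentation: (1,1) (2,3) (4,6) (7,7)
    hand = ["A", "BC", "A", "ED", "F"]    # hand segmentation:    (1,1) (2,3) (4,4) (5,6) (7,7)
    print(sorted(word_spans(machine) & word_spans(hand)))   # the three matched pairs
    print(evaluate(machine, hand))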


5.2 Application to English Text

It may be helpful for non-Chinese readers to briefly illustrate the success of the space insertion method by showing its application to English text. The first part of Table 2 shows the original text, with spaces in the proper places. The second shows the text with spaces removed, used as input to the segmentation procedure. The third shows the output of the PPM-based method described above, while the fourth shows, for comparison, the output of a word-based method for predicting the position of spaces, USeg (Ponte and Croft 1996).

Table 2
Segmenting words in English text.

Original text:   the unit of New York-based Loews Corp that makes Kent cigarettes stopped using crocidolite in its Micronite cigarette filters in 1956.
Without spaces:  TheunitofNewYork-basedLoewsCorpthatmakesKentcigarettesstoppedusingcrocidoliteinitsMicronitecigarettefiltersin1956.
PPM method:      the unit of New York-based LoewsCorp that makes Kent cigarettes stopped using croc idolite in its Micronite cigarette filters in 1956.
USeg method:     the unit of New York-based Loews Corp that makes Kent cigarettes stopped using c roc id o lite inits Micron it e cigarette filters in 1956.

For this experiment (first reported by Teahan et al. [1998]), PPM was trained on the million-word Brown corpus (Kucera and Francis 1967). USeg was trained on a far larger corpus containing 1 Gb of data from the Tipster collection (Broglio, Callan, and Croft 1994). Both were tested on the same 500 Kb extract from the Wall Street Journal. The recall and precision for PPM were both 99.52%, while the corresponding figures for USeg were 93.56% and 90.03%, respectively. This result is particularly noteworthy because PPM had been trained on only a small fraction of the amount of text needed for the word-based scheme.

The same example was used by Ponte and Croft (1996), and the improved performance of the character-based method is evident even in this small example. Although the word Micronite does not occur in the Brown Corpus, it was correctly segmented using PPM. Likewise, inits was correctly split into in and its. PPM makes just two mistakes. First, a space was not inserted into LoewsCorp because the single "word" requires only 54.3 bits to encode, whereas Loews Corp requires 55.0 bits. Second, an extra space was added to crocidolite because that reduced the number of bits required from 58.7 to 55.3.

5.3 Application to a Corpus of Chinese Text

Our first series of experiments used part of Guo Jin's Mandarin Chinese PH corpus, containing one million words of newspaper stories from the Xinhua news agency of PR China written between January 1990 and March 1991. It is represented in the standard GB coding scheme.

Table 3 shows the distribution of word lengths in the corpus. Single-character words are the most frequent; these and bigrams together constitute almost 94% of words. Nearly half the characters appear as constituents of two-character words. Some published figures for Chinese language statistics indicate that this corpus may overrepresent single-character words and underrepresent bigrams--for example, Liu (1987) gives figures for modern Chinese of 5%, 75%, 14%, and 6% for one-character, two-character, three-character, and longer words, respectively. However, it has been argued that considering the inherent uncertainty in Chinese word segmentation, general-purpose segmentation algorithms should segment aggressively rather than conservatively (Wu 1998); consequently this corpus seems appropriate for our use.


Table 3
Distribution of word length in the corpus.

Length    Words    Characters
1         55.6%    36.2%
2         38.2%    49.9%
3          4.2%     8.2%
4          1.6%     4.0%
5          0.2%     0.8%
Over 5     0.2%     0.9%

Table 4
Results for five 500-word segments from the Chinese corpus (manually checked figures in parentheses).

File    Error rate     Recall           Precision        F-measure
1       1.2% (1.0%)    98.4% (98.6%)    98.8% (99.0%)    98.6% (98.8%)
2       3.6% (3.0%)    96.4% (96.8%)    96.4% (97.0%)    96.4% (96.9%)
3       4.2% (4.0%)    95.0% (95.8%)    95.8% (96.0%)    95.4% (95.9%)
4       6.4% (5.2%)    91.0% (92.2%)    93.4% (94.7%)    92.2% (93.4%)
5       6.6% (5.0%)    86.2% (90.4%)    92.9% (94.8%)    89.4% (92.5%)

Table 4 shows the results for five 500-word test files from the corpus. We took part of the corpus that was not used for training, divided it into 500-word segments, removed all spaces, and randomly chose five segments as test files. The results show an error rate varying from 1.2% to 6.6%. The resulting F-measures indicate that the new algorithm performs better than the one described in Hockenmaier and Brew (1998), who report an F-measure of 87.9 using trigram rules. This is particularly significant because the two algorithms use training and test data from the same source.

The results were also verified by checking them manually. This produces slightly different results, for two reasons. Firstly, human judgment sometimes accepts a segmentation as correct even though it does not correspond exactly with the corpus version: for example, the last word of one test phrase is counted as correct even though the corpus groups its characters differently. Secondly, improper segmentations occur in the corpus itself. When the program makes the same mistakes, it counts as correct in automatic checking, but incorrect in manual checking. These two kinds of error virtually canceled each other: when checked manually, file 3, for example, has five fewer errors for the first reason and six more for the second reason, giving error counts of 21 and 20 for automatic and manual checking, respectively.

5.4 Application to Independent Chinese Text Files

In a second test, models from this corpus were evaluated on completely separate data provided by the Institute of Computational Linguistics of Peking University. This contained 39 sentences (752 characters), some of which are compound sentences. Since no presegmented version was available, all checking was manual.

This test is interesting because it includes several sentences that are easily misunderstood, three of which are shown in Figure 8. In the first, which reads "I have learned a lot from it," the second and third characters combine into 'from it' and the


Figure 8
Three examples of easily misunderstood sentences.

Table 5
Error rate (mean and sd) for 1,000-word files from PH and Rocling corpora.

Testing            Training: PH corpus    Training: Rocling corpus
PH files            42   ± 10.23           169.2 ± 19.70
Rocling files      133.4 ± 19.58            44.8 ± 10.83

fourth and fifth characters combine into 'have learned.' However, the third and fourth characters taken together mean 'middle school,' which does not occur in the meaning of the sentence. In the second and third sentences, the first three characters are the same. In the second, "physics is very hard to learn," the second and third characters should be separated by a space, so that the third character can combine with the following two characters to mean 'to learn.' However, in the third, "physics is one kind of science," the first three characters make a single word meaning 'physics.'

The error rate, recall, and precision for this test material are 10.8%, 93.4%, and 89.6%, respectively. Performance is significantly worse than that of Table 4, because of the nature of the test file. Precision is distinctly lower than recall--recall fares better because many relevant words are still retrieved, whereas precision suffers because the automatic segmenter placed too many word boundaries compared with the manual judgment.

Two aspects of the training data have a profound influence on the model's accuracy. First, some errors are obviously caused by deficiencies in the training data, such as improperly segmented common words and names. Second, some errors stem from the topics covered by the corpus. It is not surprising that the error rate increases when the training and testing text represent different topic areas--such as training on news text and testing on medical text.

5.5 Application to the Rocling Corpus

The Rocling Standard Segmentation Corpus contains about two million presegmented words, represented in the Big5 coding scheme. We converted it to GB, used one million words for training, and compared the resulting model to that generated from the PH data, also trained on one million words. Both models were tested on 10 randomly chosen 1,000-word segments from each corpus (none of this material was used in training).

The results are shown in Table 5, in terms of the mean and standard deviation (sd) of the errors. When the training and testing files come from the same corpus, results are good, with around 42 (for PH) and 45 (for Rocling) errors per thousand words. Not surprisingly, performance deteriorates significantly when the PH model is used to segment the Rocling test files or vice versa.

Several differences between the corpora influence performance. Many English words are included in Rocling, whereas in PH only a few letters are used to represent certain items.


Figure 9
Effect of the amount of training data on the performance for each test file.

Table 6
Error rate (mean and sd) for different amounts of training data.

          0.5M words      1M words        1.5M words     2M words
Error     63.3 ± 13.69    44.8 ± 10.83    38.8 ± 8.60    35.1 ± 6.74

Percentages, for example, are written with numerals in Rocling but are spelled out in characters in the PH corpus, and the quotation marks used in the two corpora also differ. In addition, as is only to be expected in any large collection of natural language, typographical errors occur in both corpora.

The overall result indicates that our algorithm is robust. It performs well so long as the training and testing data come from the same source.

5.6 Effect of the Amount of Training Data

For the Rocling corpus, we experimented with different amounts of training data. Four models were trained with successively larger amounts of data, 0.5M, 1M, 1.5M, and 2M words, each training file being an extension of the text in the preceding training file. The four models were tested on the 10 randomly chosen 1,000-word Rocling segments used before.

The results for the individual test files, in terms of error rate per thousand words, are shown in Figure 9 and summarized in Table 6. Larger training sets generally give smaller error, which is only to be expected--although the results for some individual test files flatten out and show no further improvement with larger training files, and in some cases more training data actually increases the number of errors. Overall, the error rate is reduced by about 25% for each doubling of the training data.

5.7 Models of Different Order

We have experimented with compression models of different orders on the PH corpus. Generally speaking, compression of text improves as model order increases, up to a point determined by the logarithm of the size of the training text. Typically, little compression is gained by going beyond order 5 models. For segmentation, we observe many errors when a model of order 1 is used. For order 3 models, most words are segmented with the same error rate as for order 5 models, though some words are missed when order 2 models are used.

Figure 10 shows some cases where the order 3 and order 5 models produce different results. Some order 5 errors are corrected by the order 3 model, though others


Figure 10
Results obtained when using order 3 and order 5 models (left: order 3 model result; right: order 5 model result).

appear even with the lower-order model. For example, both results in the first row are incorrect: no space should be inserted in this case, and the four characters should stand together. However, the order 3 result is to be preferred to the order 5 result because both two-character words do at least make sense individually, whereas the initial three characters in the order 5 version do not represent a word at all. In the second row, the order 5 result is incorrect because the second component does not represent a word. In the order 3 result, the first word, containing two characters, is a person's name. The second word could also be correct as it stands, though it would be equally correct if a space had been inserted between the two bigrams. On the whole, we find that the order 3 model gives the best results overall, although there is little difference between orders 3, 4, and 5.

6. Applications in a Digital Library

Word segmentation forms a valuable component of any Chinese digital library system. It improves full-text retrieval in two ways: higher-precision searching (that is, fewer false matches), and the ability to incorporate relevance ranking. This increases the effectiveness of full-text search and helps to provide users with better feedback. For example, one study concludes that the performance of an unsegmented character-based query is about 10% worse than that of the corresponding segmented query (Broglio, Callan, and Croft 1996). Many emerging digital library technologies also presuppose word segmentation--for example, text summarization, document clustering, and keyphrase extraction all rely on word frequencies. These would not work well on unsegmented text because character frequencies do not generally reflect word frequencies.

Once the source text in a digital library exceeds a few megabytes, full-text indexes are needed to process queries in a reasonable time (Witten, Moffat, and Bell 1999). Full-text indexing was developed using languages where word boundaries are notated (principally English), and the techniques that were developed rely on word-based processing. Although some techniques--for example stemming (Frakes 1992) and casefolding--are not applicable to Chinese information retrieval, many are. Examples include heuristics for relevance ranking, and query expansion using a language thesaurus.

Of course, full-text indexes can be built from individual characters rather than words. However, these will suffer from the problem of low precision--searches will return many irrelevant documents, where the same characters are used in contexts different from that of the query. To reduce false matches to a reasonable level, auxiliary indexes (for example, sentence indexes) will have to be created. These will be much larger than regular word-based indexes of paragraphs or documents, and will still not be as accurate.

Information retrieval systems often rank the results of each search, giving preference to documents that are more relevant to the query by placing them nearer the beginning of the list. Relevance metrics are based on the observation that infrequent

    390

  • 8/3/2019 A Compression-Based Algorithm for Chinese Word Segmentation

    17/20

    Teahan, W en, M cNab, and Wit ten Chinese Word Segmentat ion

    w o r d s a r e m o r e i m p o r t a n t t h a n c o m m o n o n e s a n d s h o u l d t h e re f or e r a te m o r e h ig hl y.W o r d s e g m e n t a t i o n i s e s s e n ti a l fo r th i s p u r p o s e , b e c a u s e t h e r e l a ti o n s h i p b e t w e e n t h ef r e q u e n c y o f a w o r d a n d t h e f r e q u e n c y o f t h e c h a r a c te r s t h a t a p p e a r w i t h i n i t i s o f t e nv e r y w e a k . W i t h o u t w o r d s e g m e n t a t i o n , t h e p re c i s io n o f t h e r e su l t s e t w i l l b e r e d u c e db e c a u s e r e l e v a n t d o c u m e n t s a r e l e s s li k e l y t o b e c l o s e t o t h e t o p o f t h e li st .

For example, the word 出国 ("to go abroad") is an infrequent word that appears only twenty times in the PH corpus. But its two characters occur frequently: 出 ("to go out") 13,531 times; and 国 ("country") 45,010 times. In fact 国 is the second most frequent character in the corpus, appearing in 443 separate words. Character-based ranking would place little weight on these two characters, even though they are extremely important if the query is 出国. The word 也 ("also") is another frequent character, appearing 4,553 times in the PH corpus. However, in 4,481 of those cases it appears by itself and contributes little to the meaning of the text. If a query contained both of these words, far more weight would be given to 也 than to the individual characters in 出国.
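To see how such counts translate into ranking behavior, the sketch below (in Python) computes a simple inverse collection-frequency weight, log(N/count), for the terms discussed above. It is only an illustration of the principle, not the weighting formula of any particular retrieval system; the total token count N is a hypothetical placeholder rather than the true size of the PH corpus.

    import math

    # Collection counts quoted in the text (PH corpus).
    counts = {
        "出国 (word: to go abroad)": 20,
        "出 (character: to go out)": 13_531,
        "国 (character: country)": 45_010,
        "也 (character: also)": 4_553,
    }

    N = 1_000_000  # hypothetical total token count, for illustration only

    def weight(count, total=N):
        """Inverse collection-frequency weight, log(total / count)."""
        return math.log(total / count)

    for term, count in counts.items():
        print(f"{term:30s} weight = {weight(count):5.2f}")

Under any weighting of this general shape, the rare word 出国 receives a far higher weight than either of its component characters, which is the behavior a word-based index provides and a character-based index cannot.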

Word counts also give feedback on the effectiveness of a query. They help users judge whether their query was too wide or too narrow, and provide information on which of the terms are most appropriate.

Word-based processing is essential to a number of emergent new technologies in the digital library field. Statistical approaches are enjoying a resurgence in natural language analysis (Klavans and Resnik 1997): examples include text summarization, document clustering, and keyphrase extraction. All of these statistical approaches are based on words and word frequencies. For instance, keywords and keyphrases for a document can be determined automatically based on features such as the frequency of the phrase in the document relative to its frequency in an independent corpus of like material, and its position of occurrence in the document (Frank et al. 1999). A decomposition of text into its constituent words is an essential prerequisite for the application of such techniques.
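The following sketch makes two of these features concrete for candidate phrases drawn from already-segmented text: frequency in the document relative to an independent reference corpus, and the relative position of first occurrence. It is a minimal illustration rather than the method of Frank et al. (1999), and the toy word lists and candidate phrases are invented for the example.

    def phrase_count(words, phrase):
        """Number of times the word sequence `phrase` occurs contiguously in `words`."""
        n = len(phrase)
        return sum(1 for i in range(len(words) - n + 1) if tuple(words[i:i + n]) == phrase)

    def keyphrase_features(doc_words, corpus_words, candidates):
        """Two word-based features for each candidate phrase (a tuple of words):
        document frequency relative to a reference corpus, and first-occurrence position."""
        features = {}
        for phrase in candidates:
            n = len(phrase)
            doc_freq = phrase_count(doc_words, phrase) / max(len(doc_words) - n + 1, 1)
            # Add-one smoothing so phrases unseen in the reference corpus still get a score.
            ref_freq = (phrase_count(corpus_words, phrase) + 1) / (len(corpus_words) + 1)
            first = next((i for i in range(len(doc_words) - n + 1)
                          if tuple(doc_words[i:i + n]) == phrase), len(doc_words))
            features[phrase] = {"relative_frequency": doc_freq / ref_freq,
                                "first_occurrence": first / len(doc_words)}
        return features

    # Hypothetical segmented documents: lists of words rather than raw character strings.
    doc = ["数字", "图书馆", "需要", "中文", "分词", "技术", "，", "分词", "改进", "检索"]
    reference = ["图书馆", "提供", "借阅", "服务", "，", "读者", "检索", "图书"]
    print(keyphrase_features(doc, reference, [("中文", "分词"), ("图书馆",)]))

Every quantity here is defined over words, so none of it can be computed until the text has been segmented.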

7. Conclusions

The problem of word segmentation of Chinese text is important in a variety of contexts, particularly with the burgeoning interest in digital libraries and other systems that store and process text on a massive scale. Existing techniques are either linguistically based, using a dictionary of words, or rely on hand-crafted segmentation rules, or use adaptive models that have been specifically created for the purpose of Chinese word segmentation. We have developed an alternative based on a general-purpose character-level model of text--the kind of models used in the very best text compression schemes. These models are formed adaptively from training text.

The advantage of using character-level models is that they do not rely on a dictionary and therefore do not necessarily fail on unusual words. In effect, they can fall back on general properties of language statistics to process novel text. The advantage of basing models on a corpus of training text is that particular characteristics of the text are automatically taken into account in language statistics--as exemplified by the significant differences between the models formed for the PH and Rocling corpora.

Encouraging results have been obtained using the new scheme. Our results compare very favorably with the results of Hockenmaier and Brew (1998) on the PH corpus; unfortunately no other researchers have published quantitative results on a standard corpus. Further work is needed to analyze the results on the Rocling corpus in more detail.
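To make the general approach concrete--choose the word boundaries under which a character-level model trained on presegmented text assigns the highest probability, which is equivalent to choosing the segmentation that compresses best--the sketch below segments a string using a toy character bigram model with add-one smoothing. It is a sketch of the idea only: the scheme evaluated above uses adaptive PPM models rather than a bigram model, and the miniature training sentences are invented for the example.

    import math
    from collections import defaultdict

    def train_bigram(segmented_sentences):
        """Character bigram counts over presegmented text; the space is treated as an
        ordinary symbol, so the model learns where word boundaries tend to fall."""
        counts = defaultdict(lambda: defaultdict(int))
        vocab = {" "}
        for sent in segmented_sentences:
            prev = " "
            for ch in sent:
                counts[prev][ch] += 1
                vocab.add(ch)
                prev = ch
        return counts, vocab

    def log_prob(counts, vocab, prev, ch):
        """Add-one smoothed log probability of `ch` following `prev`."""
        return math.log((counts[prev][ch] + 1) / (sum(counts[prev].values()) + len(vocab)))

    def segment(text, counts, vocab):
        """Insert spaces into `text` so as to maximize the model's probability,
        a stand-in for choosing the segmentation that compresses best."""
        best = {" ": (0.0, "")}  # previous symbol -> (log probability, output so far)
        for ch in text:
            candidates = []
            for prev, (lp, out) in best.items():
                # Option 1: continue the current word.
                candidates.append((lp + log_prob(counts, vocab, prev, ch), out + ch))
                # Option 2: insert a word boundary before this character.
                candidates.append((lp + log_prob(counts, vocab, prev, " ")
                                      + log_prob(counts, vocab, " ", ch), out + " " + ch))
            best = {ch: max(candidates)}
        return max(best.values())[1].strip()

    # Hypothetical presegmented training sentences.
    model, vocab = train_bigram(["我 是 中国 人", "中国 人 也 是 中国 人"])
    print(segment("我也是中国人", model, vocab))

The scheme evaluated above uses an adaptive PPM model in place of this toy model, but the search over positions at which to insert spaces is similar in spirit.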

The next step is to use automatically segmented text to investigate the digital library applications we have described: information retrieval, text summarization, document clustering, and keyphrase extraction.

Acknowledgments
We are grateful to Stuart Inglis, Hong Chen, and John Cleary, who provided advice and assistance. The corrected version of Guo Jin's PH corpus and the Rocling corpus were provided by Julia Hockenmaier and Chris Brew at the University of Edinburgh and the Chinese Knowledge Information Processing Group of Academia Sinica, respectively. The Institute of Computational Linguistics of Peking University also provided some test material. Bill Teahan acknowledges the generous support of the Department of Information Technology, Lund University, Sweden. Thanks also to the anonymous referees who have helped us to improve the paper significantly.

References
Anderson, John B. and Seshadri Mohan. 1984. Sequential coding algorithms: A survey and cost analysis. IEEE Transactions on Communications, 32(2):169-176.
Bell, Timothy C., John G. Cleary, and Ian H. Witten. 1990. Text Compression. Prentice Hall, Englewood Cliffs, NJ.
Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-565.
Broglio, John, Jamie P. Callan, and W. Bruce Croft. 1994. Inquery system overview. In Proceedings TIPSTER Text Program (Phase I). Morgan Kaufmann, San Francisco, CA, pages 47-67.
Broglio, John, Jamie P. Callan, and W. Bruce Croft. 1996. Technical issues in building an information system for Chinese. CIIR Technical Report IR-86, Center for Intelligent Information Retrieval, University of Massachusetts, Amherst.
Cheng, Kwok-Shing, Gilbert H. Young, and Kam-Fai Wong. 1999. A study on word-based and integral-bit Chinese text compression algorithms. Journal of the American Society for Information Science, 50(3):218-228.
Cleary, John G. and Ian H. Witten. 1984. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 32(4):396-402.
Cormen, Thomas H., Charles E. Leiserson, and Ronald L. Rivest. 1990. Introduction to Algorithms. MIT Press, Cambridge, MA.
Dai, Yubin, Christopher S. G. Khoo, and Teck Ee Loh. 1999. A new statistical formula for Chinese text segmentation incorporating contextual information. In Proceedings of ACM SIGIR99, pages 82-89.
Frakes, William B. 1992. Stemming algorithms. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, NJ, pages 131-160.
Frank, E., Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G. Nevill-Manning. 1999. Domain-specific keyphrase extraction. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 668-673, Stockholm, Sweden. Morgan Kaufmann, San Francisco, CA.
Hockenmaier, Julia and Chris Brew. 1998. Error driven segmentation of Chinese. Communications of COLIPS, 1(1):69-84.
Howard, Paul Glor. 1993. The Design and Analysis of Efficient Lossless Data Compression Systems. Ph.D. thesis, Brown University, Providence, RI.
Klavans, Judith L. and Philip Resnik, editors. 1997. The Balancing Act: Combining Symbolic and Statistical Approaches to Language. MIT Press, Cambridge, MA.
Kucera, Henry and W. Nelson Francis. 1967. Computational Analysis of Present-Day American English. Brown University Press, Providence, RI.
Liu, Yongquan. 1987. New advances in computers and natural language processing in China. Information Science, 8:64-70.
Lee, Kin Hong, Mau Kit Michael Ng, and Qin Lu. 1999. Text segmentation for Chinese spell checking. Journal of the American Society for Information Science, 50(9):751-759.
Palmer, David. 1997. A trainable rule-based algorithm for word segmentation. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL '97), Madrid, 1997.
Ponte, Jay M. and W. Bruce Croft. 1996. USeg: A retargetable word segmentation procedure for information retrieval. Presented at the Symposium on Document Analysis and Information Retrieval '96 (SDAIR). UMass Technical Report TR96-2, University of Massachusetts, Amherst, MA.
Rabiner, Lawrence R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286.
Sproat, Richard, Chilin Shih, William Gale, and Nancy Chang. 1996. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3):377-404.
Teahan, W. J. 1998. Modelling English Text. Ph.D. thesis, University of Waikato, NZ.
Teahan, W. J., S. Inglis, John G. Cleary, and G. Holmes. 1998. Correcting English text using PPM models. In J. A. Storer and J. H. Reif, editors, Proceedings Data Compression Conference, pages 289-298, Los Alamitos, CA. IEEE Computer Society Press.
Witten, Ian H., Alistair Moffat, and Timothy C. Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Second edition. Morgan Kaufmann, San Francisco, CA.
Wu, Dekai. 1998. A position statement on Chinese segmentation. Presented at the Chinese Language Processing Workshop, University of Pennsylvania, PA.
Wu, Dekai and Pascale Fung. 1994. Improving Chinese tokenization with linguistic filters on statistical lexical acquisition. In ANLP-94, Fourth Conference on Applied Natural Language Processing, pages 180-181, Stuttgart, October 1994.
Wu, Zimin and Gwyneth Tseng. 1993. Chinese text segmentation for text retrieval: achievements and problems. Journal of the American Society for Information Science, 44(9):532-542.