Rethinking Chinese Word Segmentation:
Tokenization, Character Classification, or Wordbreak Identification
黃居仁 Chu-Ren Huang, Academia Sinica
http://cwn.ling.sinica.edu.tw/huang/huang.htm
April 11, 2007, Hong Kong Polytechnic University
Citation
• Please note that this is ongoing work, which will later be presented as
Chu-Ren Huang, Petr Šimon, Shu-Kai Hsieh and Laurent Prévot. 2007. Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification. To appear in the proceedings of the 2007 ACL Annual Meeting.
Outline
• Introduction: modeling and theoretical challenges
• Previous Models
– Segmentation as Tokenization
– Character classification model
• A radical model
• Implementation and experiment
• Conclusion/Implications
Introduction: modeling and theoretical challenges
• Back to the basics: the goal of Chinese word segmentation is to identify wordbreaks
– Such that the segmented units can be used as processing units (i.e. words)
• Crucially
– Words are not identified before segmentation
– Wordbreaks in Chinese fall at character-breaks only, and at no other places
Challenge I
Segmentation is the prerequisite task for all Chinese processing applications, hence a realistic solution to segmentation must be
• Robust: perform consistently regardless of language variation
• Scalable: applicable to all variants of Chinese, requiring minimal training
• Portable: applicable to real-time processing of all kinds of texts, all the time
Challenge II
Chinese speakers perform segmentation subconsciously and without mistakes, hence a simulation of human segmentation must:
• Be Robust, Scalable, Portable
• Not assume prior lexical knowledge
• Be equally sensitive to known and unknown words
So Far
Not so good
• All existing algorithms perform reasonably well but require
– A large set of training data
– A long training time
– A comprehensive lexicon
– Repeating the training process for every new variant (topic/style/genre)
But Why?
Previous Models I Segmentation as Tokenization
The Classical Model
(Chen and Liu 1992 etc.)
• Segmentation is interpreted as the identification of tokens (e.g. words) in a text, hence involves two steps
– Dictionary lookup
– Unknown word (or OOV) resolution
Segmentation as Tokenization 2
• Find all sequences Ci, …, Ci+m such that [Ci, …, Ci+m] is a token iff
– it is an entry in the lexicon, or
– it is not a lexical entry but is predicted to be one by an unknown word resolution algorithm
• Ambiguity resolution: needed when there is a Cj such that both [x, Cj] and [Cj, z] are entries in the lexicon
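As a concrete illustration, the dictionary-lookup step is often instantiated as forward maximum matching. The sketch below (with a toy lexicon; not the implementation of Chen and Liu 1992) also shows how greedy matching runs into exactly the overlapping ambiguity described above:

```python
# A minimal sketch of the dictionary-lookup step via forward maximum
# matching. The toy lexicon and the max word length are illustrative
# assumptions, not part of the original model.

def max_match(text, lexicon, max_len=4):
    """Greedily take the longest lexicon entry at each position;
    back off to a single character (a crude OOV fallback)."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens

lexicon = {"研究", "研究生", "生命", "命", "起源"}
print(max_match("研究生命起源", lexicon))  # → ['研究生', '命', '起源']
```

The greedy pass picks 研究生 ("graduate student") and misses the intended reading 研究/生命/起源 ("study the origin of life") because 生 is shared by two lexicon entries — the overlapping-ambiguity case above.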
Segmentation as Tokenization 3
• High complexity:
– mapping tens of thousands of lexical entries to even more possible matching strings
– overlapping ambiguity estimated to affect up to 20% of text, depending on texts and lexica
• Not robust
– dependent on the lexicon (and lexica are notoriously easy to change and expensive to build)
– OOV?
Previous Models II: Character Classification
The Currently Popular Model
(Xue 2003, Gao et al. 2004)
• Segmentation is re-interpreted as classification of character positions.
– Classify and tag each character according to its position in a word (initial, final, middle, etc.)
– Learn the distribution of such classifications from a corpus
– Predict segmentation based on the positional classification of each character in a string
Character Classification 2
• Character classification:
– Each character Ci is associated with a 3-tuple Ci: <Ini_i, Mid_i, Fin_i>, where Ini_i, Mid_i, Fin_i are the probabilities for Ci to occur in initial, middle, or final position respectively.
• Ambiguity resolution:
– Multiple classifications of a character: a character does not occur exclusively as initial or final, etc.
– Conflicting classifications of neighboring characters.
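To make the model concrete, here is a minimal sketch (not the implementation of Xue 2003 or Gao et al. 2004) of how per-character positional distributions could be estimated from a toy segmented corpus. The toy corpus and the separate "Single" class for one-character words are illustrative assumptions:

```python
from collections import Counter, defaultdict

def position_distributions(segmented_sentences):
    """Estimate P(position | character) from a segmented corpus.
    Positions: Ini(tial), Mid(dle), Fin(al); single-character words
    are counted as a separate 'Single' class (one common scheme)."""
    counts = defaultdict(Counter)
    for sentence in segmented_sentences:
        for word in sentence:
            if len(word) == 1:
                counts[word]["Single"] += 1
            else:
                counts[word[0]]["Ini"] += 1
                for ch in word[1:-1]:
                    counts[ch]["Mid"] += 1
                counts[word[-1]]["Fin"] += 1
    # normalize counts per character type into probabilities
    return {ch: {pos: n / sum(c.values()) for pos, n in c.items()}
            for ch, c in counts.items()}

corpus = [["研究", "生命"], ["學生", "研究生"]]
dist = position_distributions(corpus)
print(dist["生"])  # 生 occurs word-initially once, word-finally twice
```

Note how even this toy corpus exhibits the multiple-classification ambiguity: 生 is sometimes initial and sometimes final.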
Character Classification 3
• Less complexity:
– ~6,000 characters × 3 to 10 positional classes
• Higher Performance: 97% f-score on SigHAN bakeoff (Huang and Zhao 2006)
Character Classification 4
Inherent Modeling Problems
• Segmentation becomes a second-order decision dependent on a first-order decision on character classification
– Unnecessary complexity involved
– An inherent ceiling is set (segmentation cannot outperform character classification)
• Still highly dependent on the lexicon
– Character positions must be defined with prior lexical knowledge of a word
Our New Proposal
Naïve but Radical
• Segmentation is nothing but segmentation
– Possible segmentation sites are well defined and unambiguous: they are simply the character-breaks clearly marked in any text.
– The task is simply to identify all CB's that also function as wordbreaks (WB's)
– Based on distributional information extracted from the contexts surrounding CB's (i.e. characters)
Simple Formalization
• Any Chinese text can be envisioned as a sequence of character-breaks (CB's), evenly distributed among a sequence of characters (c's):
CB0 c1 CB1 c2 ... CBi-1 ci CBi ... CBn-1 cn CBn
• NB: Psycholinguistic eye-tracking experiments show that the eyes can fixate on the edges of a character when reading Chinese. (J.L. Tsai, p.c.)
How to Model Distributional Information of blanks?
• There is no overt difference between CB's and WB's, unlike English, where the CB spaces are small but the WB spaces are BIG.
– Hence distributional information must come from the context.
• CB0 c1 CB1 c2 ... CBi-1 ci CBi ... CBn-1 cn CBn
– Overtly, CB's carry no distributional information.
– However, c's do carry information about the status of the CB's/WB's in their neighborhood (based on a tagged corpus, or human experience)
Range of Relevant Context
CBi-2 CBi-1 ci CBi+1 CBi+2
• Recall that CB’s carry no overt information, while c’s do.• Linguistically, it is attested that initial, final, second, and
penultimate positions are morphologically significant. – In other words, a linguistic element can carry explicit i
nformation about immediately adjacent CB’s as well the CB’s immediately adjacent to the above two
• 2CB-Model: Taking all the immediate ones• 4CB-Model: Taking two more
Collecting Distributional Information
CBi-2 CBi-1 ci CBi+1 CBi+2
• Adopt either the 2CBM or the 4CBM
• Collect a 2-tuple or 4-tuple for each character token from a segmented corpus
• Sum up the n-tuple values over all tokens belonging to the same character type to form a distributional vector
Character V1 V2 V3 V4
的 0.0127 0.9866 0.9917 0.0081
一 0.1008 0.8744 0.6500 0.2819
是 0.1902 0.8051 0.9708 0.0286
不 0.0683 0.9055 0.4653 0.4657
有 0.2491 0.7397 0.8408 0.1253
Table 2. Character table for 4CBM
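A table like the one above could be collected along the following lines. Here we read V1..V4 as the relative frequencies of a wordbreak at the four CBs surrounding the character (two to the left, immediately left, immediately right, two to the right) — our interpretation of the 4CBM, not a specification from the slides; the boundary handling at text edges is also a simplifying assumption:

```python
from collections import defaultdict

def char_vectors(segmented_sentences):
    """For each character type, estimate a 4-tuple <V1..V4>: the relative
    frequency of a wordbreak at the four CBs around its tokens
    (two left, immediately left, immediately right, two right).
    CBs falling off the ends of a sentence are simply skipped,
    which slightly biases edge tokens -- acceptable for a sketch."""
    sums = defaultdict(lambda: [0.0] * 4)
    counts = defaultdict(int)
    for words in segmented_sentences:
        text = "".join(words)
        # wb[j] is True iff CB_j (before character index j) is a wordbreak
        wb = [False] * (len(text) + 1)
        pos = 0
        for w in words:
            wb[pos] = True
            pos += len(w)
        wb[len(text)] = True
        for j, ch in enumerate(text):
            # the four CBs around character j: offsets j-1, j, j+1, j+2
            for k, cb in enumerate((j - 1, j, j + 1, j + 2)):
                if 0 <= cb <= len(text):
                    sums[ch][k] += wb[cb]
            counts[ch] += 1
    return {ch: [s / counts[ch] for s in v] for ch, v in sums.items()}

vectors = char_vectors([["研究", "生命"]])
print(vectors["究"])  # → [1.0, 0.0, 1.0, 0.0]
```

The single token of 究 is word-final, so its immediate left CB is never a break (V2 = 0) while its immediate right CB always is (V3 = 1), mirroring the shape of the 的 row in Table 2.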
Estimating Distributional Features of CB’s
c-2 c-1 CB c+1 c+2
• For each CB, distributional information is contributed by the 2 or 4 adjacent characters
• Each character carries the four-element vector given above; align the vector positions and then sum up
• Note that no knowledge from a lexicon is involved (whereas the character classification model makes an explicit decision about the position of each character in a word)
Aligning Vector Positions
c-2 c-1 CB c+1 c+2
c-2: < V1, V2, V3, V4 >
c-1: < V1, V2, V3, V4 >
c+1: < V1, V2, V3, V4 >
c+2: < V1, V2, V3, V4 >
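On our reading of the alignment table, a given CB receives one aligned component from each context character: V4 from c-2, V3 from c-1, V2 from c+1, and V1 from c+2 (each character's view of that same boundary). A sketch under that interpretation, with a hypothetical toy vector table:

```python
def cb_features(text, vectors, default=(0.0, 0.0, 0.0, 0.0)):
    """For each CB in the text, gather the four aligned contributions:
    c-2's V4, c-1's V3, c+1's V2, c+2's V1. The exact correspondence
    of vector slots to boundaries is our interpretation of the
    alignment table, not a specification from the slides."""
    def v(j):  # vector of the character at index j, or zeros off the ends
        return vectors.get(text[j], default) if 0 <= j < len(text) else default
    # CB_i sits immediately before the character at index i
    return [[v(i - 2)[3], v(i - 1)[2], v(i)[1], v(i + 1)[0]]
            for i in range(len(text) + 1)]

toy = {"a": [0.1, 0.2, 0.3, 0.4], "b": [0.5, 0.6, 0.7, 0.8]}
print(cb_features("ab", toy)[1])  # contributions at the CB between a and b
```

For the CB between "a" and "b", c-2 and c+2 fall off the text and contribute zeros, so the feature vector is [0.0, 0.3, 0.6, 0.0].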
Theoretical Issues in Modeling
• Do we look beyond WB's (in the 4CBM)?
– No: characters cannot contribute to boundary conditions beyond an existing boundary.
– Yes: we cannot assume lexical knowledge a priori (and the model is more elegant)
• One or two features (in the 4CBM)?
– No: positive information (that there is a WB) and negative information (that there is no WB) should be complementary
– Yes (especially when the answer to the question above is no), since there are under-specified cases
Size of Distributional Info
• The Sinica Corpus 5.0 contains 6820 types of c’s (characters, numbers, punctuation, Latin alphabet etc.)
• The 10 million word corpus is converted into 14.4 million labeled CB vectors.
• In this first study we implement a CB-only model, without any preprocessing of punctuation marks.
How to Model Decision I
• Assuming that each character represents an independent event, all relevant vectors can be summed up and evaluated by
– A simple heuristic of sum and threshold
– A decision tree trained on a segmented corpus
– Machine learning trained on a segmented corpus?
Simple Sum and Threshold Heuristic
• Means of the summed CB vectors for S and -S (mean for S = 2.90445651112, for -S = 1.89855870063)
• A one-standard-deviation difference between each CB vector and the threshold values was used as the segmentation heuristic
• 88% accuracy
• Error analysis: CB vectors are not linearly separable
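The sum-and-threshold heuristic can be sketched as follows. The threshold value here is purely illustrative; the slides derive theirs from the class means and standard deviations above:

```python
def segment_by_threshold(text, cb_feats, threshold=2.4):
    """Predict a WB wherever the summed CB feature vector exceeds the
    threshold, then split the text at the predicted breaks.
    cb_feats[i] is the 4-element vector for CB_i (one per character
    break, including both ends); threshold=2.4 is an illustrative
    value, not the one derived in the paper."""
    wb = [i for i in range(1, len(text)) if sum(cb_feats[i]) > threshold]
    cuts = [0] + wb + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

# hypothetical feature vectors for the 5 CBs of a 4-character text
feats = [[1, 1, 1, 1], [0.1, 0.2, 0.1, 0.2], [0.9, 0.8, 0.9, 0.7],
         [0.2, 0.1, 0.2, 0.1], [1, 1, 1, 1]]
print(segment_by_threshold("研究生命", feats))  # → ['研究', '生命']
```

Only the middle CB sums above the threshold (3.3 > 2.4), so the text splits into two words.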
Decision Tree
• A decision tree classifier (YaDT, Ruggieri 2004) is adopted
• Trained on a sample of 900,000 CB vectors, with 100,000 boundary vectors held out for the testing phase
• Achieves up to 97% accuracy in the inside test, including numbers, punctuation, and foreign words
Evaluation: SigHAN Bakeoff
• Note that our method is NOT designed for the SigHAN bakeoff, where resources are devoted to fine-tuning for a small extra edge in scoring
• This radical model aims to be robust in real-world situations, where it can perform reliably without extra tuning when encountering different texts
• No manual pre-processing; texts are input as seen
Evaluation
• Closed test, but without any lexical knowledge
Discussion
• The method is basically sound
• We still need to develop an effective algorithm for adaptation to new variants
• Automatic pre-processing of punctuation marks and foreign symbols should improve performance
• What role should lexical knowledge play?
• The assumption that characters are independent events may be incorrect
How to Model Decision II
• Assuming that the characters in a string are not independent events, certain combinations (as well as single characters) can contribute to the WB decision.
• One possible implementation: c's as committee members, with the decision made by vote
– Five voting blocks deciding by simple majority:
c-2c-1, c-1, c-1c+1, c+1, c+1c+2
c-2 c-1 CB c+1 c+2
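One way the voting idea could be realized (a sketch of the proposal, not an implemented system): each block looks up how often its character n-gram co-occurs with a WB at that CB in a segmented corpus, votes accordingly, and a simple majority decides. The block statistics and the 0.5 vote cutoff below are hypothetical:

```python
def majority_vote_wb(context, block_stats, cutoff=0.5):
    """context = (c-2, c-1, c+1, c+2) around a CB. The five voting
    blocks are (c-2 c-1), (c-1), (c-1 c+1), (c+1), (c+1 c+2).
    block_stats maps each block string to an estimated P(WB here);
    unseen blocks abstain (count as 0.0). All stats are illustrative."""
    cm2, cm1, cp1, cp2 = context
    blocks = [cm2 + cm1, cm1, cm1 + cp1, cp1, cp1 + cp2]
    votes = sum(block_stats.get(b, 0.0) > cutoff for b in blocks)
    return votes >= 3  # simple majority of the five blocks

# hypothetical corpus statistics for the CB in 研究生|命起
stats = {"究生": 0.1, "生": 0.6, "生命": 0.1, "命": 0.7, "命起": 0.8}
print(majority_vote_wb(("究", "生", "命", "起"), stats))  # → True
```

Unlike the summed independent vectors of Decision Model I, the bigram blocks let character combinations outvote a misleading single character.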
Conclusion I
• We propose a radical but elegant model for Chinese word segmentation
• The task is reduced to binary classification of CB's into WB's and non-WB's
• The model does not presuppose any lexical knowledge and relies only on distributional information about the characters forming the context of CB's
Conclusion II
• In principle, this model should be robust and scalable across all variants of texts
• Preliminary experimental results are promising yet leave room for improvement
• Work is still ongoing
• You are welcome to adopt this model and experiment with your favorite algorithm!