Rethinking Chinese Word Segmentation:
Tokenization, Character Classification, or Wordbreak Identification
黃居仁 Chu-Ren Huang, Academia Sinica
http://cwn.ling.sinica.edu.tw/huang/huang.htm
April 11, 2007, Hong Kong Polytechnic University
Citation
• Please note that this is ongoing work, which will later be presented as
Chu-Ren Huang, Petr Šimon, Shu-Kai Hsieh and Laurent Prévot. 2007. Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification. To appear in the proceedings of the 2007 ACL Annual Meeting.
Outline
• Introduction: modeling and theoretical challenges
• Previous Models
– Segmentation as Tokenization
– Character classification model
• A radical model
• Implementation and experiment
• Conclusion/Implications
Introduction: modeling and theoretical challenges
• Back to the basics: the goal of Chinese word segmentation is to identify wordbreaks
– Such that the segmented units can be used as processing units (i.e. words)
• Crucially
– Words are not identified before segmentation
– Wordbreaks in Chinese fall at character-breaks only, and at no other places
Challenge I
Segmentation is the prerequisite task for all Chinese processing applications, hence a realistic solution to segmentation must be
• Robust: perform consistently regardless of language variation
• Scalable: applicable to all variants of Chinese, requiring minimal training
• Portable: applicable to real-time processing of all kinds of texts, all the time
Challenge II
Chinese speakers perform segmentation subconsciously and without mistakes, hence a simulation of human segmentation must:
• Be Robust, Scalable, Portable
• Not assume prior lexical knowledge
• Be equally sensitive to known and unknown words
So Far
Not so good
• All existing algorithms perform reasonably well but require
– A large set of training data
– A long training time
– A comprehensive lexicon
– Repeating the training process for every new variant (topic/style/genre)
But Why?
Previous Models I Segmentation as Tokenization
The Classical Model
(Chen and Liu 1992 etc.)
• Segmentation is interpreted as the identification of tokens (e.g. words) in a text, hence involves two steps
– Dictionary lookup
– Unknown word (or OOV) resolution
Segmentation as Tokenization 2
• Find all sequences Ci, …, Ci+m such that [Ci, …, Ci+m] is a token iff
– it is an entry in the lexicon, or
– it is not a lexical entry but is predicted to be one by an unknown word resolution algorithm
• Ambiguity resolution: needed when there is a Cj such that both [x, Cj] and [Cj, z] are entries in the lexicon
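As a concrete illustration, the dictionary-lookup step is often instantiated as forward maximum matching. The sketch below (with a toy lexicon; not the implementation of Chen and Liu 1992) also shows how greedy matching runs into exactly the overlapping ambiguity described above:

```python
# A minimal sketch of the dictionary-lookup step via forward maximum
# matching. The toy lexicon and the max word length are illustrative
# assumptions, not part of the original model.

def max_match(text, lexicon, max_len=4):
    """Greedily take the longest lexicon entry at each position;
    back off to a single character (a crude OOV fallback)."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens

lexicon = {"研究", "研究生", "生命", "命", "起源"}
print(max_match("研究生命起源", lexicon))  # → ['研究生', '命', '起源']
```

The greedy pass picks 研究生 ("graduate student") and misses the intended reading 研究/生命/起源 ("study the origin of life") because 生 is shared by two lexicon entries — the overlapping-ambiguity case above.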
Segmentation as Tokenization 3
• High complexity:
– mapping tens of thousands of lexical entries to even more possible matching strings
– overlapping ambiguity estimated to affect up to 20% of text, depending on texts and lexica
• Not robust
– dependent on the lexicon (and lexica are notoriously easy to change and expensive to build)
– OOV?
Previous Models II: Character Classification
The Currently Popular Model
(Xue 2003, Gao et al. 2004)
• Segmentation is re-interpreted as classification of character positions.
– Classify and tag each character according to its position in a word (initial, final, middle, etc.)
– Learn the distribution of such classifications from a corpus
– Predict segmentation based on the positional classification of each character in a string
Character Classification 2
• Character classification:
– Each character Ci is associated with a 3-tuple Ci: <Ini_i, Mid_i, Fin_i>, where Ini_i, Mid_i, Fin_i are the probabilities for Ci to occur in initial, middle, or final position respectively.
• Ambiguity resolution:
– Multiple classifications of a character: a character does not occur exclusively as initial or final, etc.
– Conflicting classifications of neighboring characters.
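To make the model concrete, here is a minimal sketch (not the implementation of Xue 2003 or Gao et al. 2004) of how per-character positional distributions could be estimated from a toy segmented corpus. The toy corpus and the separate "Single" class for one-character words are illustrative assumptions:

```python
from collections import Counter, defaultdict

def position_distributions(segmented_sentences):
    """Estimate P(position | character) from a segmented corpus.
    Positions: Ini(tial), Mid(dle), Fin(al); single-character words
    are counted as a separate 'Single' class (one common scheme)."""
    counts = defaultdict(Counter)
    for sentence in segmented_sentences:
        for word in sentence:
            if len(word) == 1:
                counts[word]["Single"] += 1
            else:
                counts[word[0]]["Ini"] += 1
                for ch in word[1:-1]:
                    counts[ch]["Mid"] += 1
                counts[word[-1]]["Fin"] += 1
    # normalize counts per character type into probabilities
    return {ch: {pos: n / sum(c.values()) for pos, n in c.items()}
            for ch, c in counts.items()}

corpus = [["研究", "生命"], ["學生", "研究生"]]
dist = position_distributions(corpus)
print(dist["生"])  # 生 occurs word-initially once, word-finally twice
```

Note how even this toy corpus exhibits the multiple-classification ambiguity: 生 is sometimes initial and sometimes final.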
Character Classification 3
• Less complexity:
– ~6,000 characters × 3 to 10 positional classes
• Higher Performance: 97% f-score on SigHAN bakeoff (Huang and Zhao 2006)
Character Classification 4
Inherent Modeling Problems
• Segmentation becomes a second-order decision dependent on a first-order decision on character classification
– Unnecessary complexity involved
– An inherent ceiling is set (segmentation cannot outperform character classification)
• Still highly dependent on the lexicon
– Character positions must be defined with prior lexical knowledge of a word
Our New Proposal
Naïve but Radical
• Segmentation is nothing but segmentation
– Possible segmentation sites are well defined and unambiguous: they are simply the character-breaks clearly marked in any text.
– The task is simply to identify all CB's that also function as wordbreaks (WB's)
– Based on distributional information extracted from the contexts surrounding CB's (i.e. characters)
Simple Formalization
• Any Chinese text can be envisioned as a sequence of character-breaks (CB's), evenly distributed among a sequence of characters (c's):
CB0 c1 CB1 c2 ... CBi-1 ci CBi ... CBn-1 cn CBn
• NB: Psycholinguistic eye-tracking experiments show that the eyes can fixate on the edges of a character when reading Chinese. (J.L. Tsai, p.c.)
How to Model Distributional Information of blanks?
• There is no overt difference between CB's and WB's, unlike English, where the CB spaces are small but the WB spaces are BIG.
– Hence distributional information must come from the context.
• CB0 c1 CB1 c2 ... CBi-1 ci CBi ... CBn-1 cn CBn
– Overtly, CB's carry no distributional information.
– However, c's do carry information about the status of the CB's/WB's in their neighborhood (based on a tagged corpus, or human experience)
Range of Relevant Context
CBi-2 CBi-1 ci CBi+1 CBi+2
• Recall that CB’s carry no overt information, while c’s do.• Linguistically, it is attested that initial, final, second, and
penultimate positions are morphologically significant. – In other words, a linguistic element can carry explicit i
nformation about immediately adjacent CB’s as well the CB’s immediately adjacent to the above two
• 2CB-Model: Taking all the immediate ones• 4CB-Model: Taking two more
Collecting Distributional Information
CBi-2 CBi-1 ci CBi+1 CBi+2
• Adopt either the 2CBM or the 4CBM
• Collect a 2-tuple or 4-tuple for each character token from a segmented corpus
• Sum up the n-tuple values over all tokens belonging to the same character type to form a distributional vector
Character V1 V2 V3 V4
的 0.0127 0.9866 0.9917 0.0081
一 0.1008 0.8744 0.6500 0.2819
是 0.1902 0.8051 0.9708 0.0286
不 0.0683 0.9055 0.4653 0.4657
有 0.2491 0.7397 0.8408 0.1253
Table 2. Character table for 4CBM
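A table like the one above could be collected along the following lines. Here we read V1..V4 as the relative frequencies of a wordbreak at the four CBs surrounding the character (two to the left, immediately left, immediately right, two to the right) — our interpretation of the 4CBM, not a specification from the slides; the boundary handling at text edges is also a simplifying assumption:

```python
from collections import defaultdict

def char_vectors(segmented_sentences):
    """For each character type, estimate a 4-tuple <V1..V4>: the relative
    frequency of a wordbreak at the four CBs around its tokens
    (two left, immediately left, immediately right, two right).
    CBs falling off the ends of a sentence are simply skipped,
    which slightly biases edge tokens -- acceptable for a sketch."""
    sums = defaultdict(lambda: [0.0] * 4)
    counts = defaultdict(int)
    for words in segmented_sentences:
        text = "".join(words)
        # wb[j] is True iff CB_j (before character index j) is a wordbreak
        wb = [False] * (len(text) + 1)
        pos = 0
        for w in words:
            wb[pos] = True
            pos += len(w)
        wb[len(text)] = True
        for j, ch in enumerate(text):
            # the four CBs around character j: offsets j-1, j, j+1, j+2
            for k, cb in enumerate((j - 1, j, j + 1, j + 2)):
                if 0 <= cb <= len(text):
                    sums[ch][k] += wb[cb]
            counts[ch] += 1
    return {ch: [s / counts[ch] for s in v] for ch, v in sums.items()}

vectors = char_vectors([["研究", "生命"]])
print(vectors["究"])  # → [1.0, 0.0, 1.0, 0.0]
```

The single token of 究 is word-final, so its immediate left CB is never a break (V2 = 0) while its immediate right CB always is (V3 = 1), mirroring the shape of the 的 row in Table 2.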
Estimating Distributional Features of CB’s
c-2 c-1 CB c+1 c+2
• For each CB, distributional information is contributed by the 2 or 4 adjacent characters
• Each character carries the four-element vector given above; align the vector positions and then sum up
• Note that no knowledge from a lexicon is involved (whereas the character classification model makes an explicit decision about the position of each character in a word)
Aligning Vector Positions
c-2 c-1 CB c+1 c+2
c-2: < V1, V2, V3, V4 >
c-1: < V1, V2, V3, V4 >
c+1: < V1, V2, V3, V4 >
c+2: < V1, V2, V3, V4 >
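On our reading of the alignment table, a given CB receives one aligned component from each context character: V4 from c-2, V3 from c-1, V2 from c+1, and V1 from c+2 (each character's view of that same boundary). A sketch under that interpretation, with a hypothetical toy vector table:

```python
def cb_features(text, vectors, default=(0.0, 0.0, 0.0, 0.0)):
    """For each CB in the text, gather the four aligned contributions:
    c-2's V4, c-1's V3, c+1's V2, c+2's V1. The exact correspondence
    of vector slots to boundaries is our interpretation of the
    alignment table, not a specification from the slides."""
    def v(j):  # vector of the character at index j, or zeros off the ends
        return vectors.get(text[j], default) if 0 <= j < len(text) else default
    # CB_i sits immediately before the character at index i
    return [[v(i - 2)[3], v(i - 1)[2], v(i)[1], v(i + 1)[0]]
            for i in range(len(text) + 1)]

toy = {"a": [0.1, 0.2, 0.3, 0.4], "b": [0.5, 0.6, 0.7, 0.8]}
print(cb_features("ab", toy)[1])  # contributions at the CB between a and b
```

For the CB between "a" and "b", c-2 and c+2 fall off the text and contribute zeros, so the feature vector is [0.0, 0.3, 0.6, 0.0].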
Theoretical Issues in Modeling
• Do we look beyond WB's (in the 4CBM)?
– No: characters cannot contribute to boundary conditions beyond an existing boundary.
– Yes: we cannot assume lexical knowledge a priori (and the model is more elegant)
• One or two features (in the 4CBM)?
– No: positive information (that there is a WB) and negative information (that there is no WB) should be complementary
– Yes (especially when the answer to the question above is no), since there are under-specified cases
Size of Distributional Info
• The Sinica Corpus 5.0 contains 6820 types of c’s (characters, numbers, punctuation, Latin alphabet etc.)
• The 10 million word corpus is converted into 14.4 million labeled CB vectors.
• In this first study we implement a CB-only model, without any preprocessing of punctuation marks.
How to Model Decision I
• Assuming that each character represents an independent event, all relevant vectors can be summed up and evaluated by
– A simple heuristic of sum and threshold
– A decision tree trained on a segmented corpus
– Machine learning trained on a segmented corpus?
Simple Sum and Threshold Heuristic
• Means of the summed CB vectors for S and -S (mean for S = 2.90445651112, for -S = 1.89855870063)
• A one-standard-deviation difference between each CB vector and the threshold values was used as the segmentation heuristic
• 88% accuracy
• Error analysis: CB vectors are not linearly separable
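The sum-and-threshold heuristic can be sketched as follows. The threshold value here is purely illustrative; the slides derive theirs from the class means and standard deviations above:

```python
def segment_by_threshold(text, cb_feats, threshold=2.4):
    """Predict a WB wherever the summed CB feature vector exceeds the
    threshold, then split the text at the predicted breaks.
    cb_feats[i] is the 4-element vector for CB_i (one per character
    break, including both ends); threshold=2.4 is an illustrative
    value, not the one derived in the paper."""
    wb = [i for i in range(1, len(text)) if sum(cb_feats[i]) > threshold]
    cuts = [0] + wb + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

# hypothetical feature vectors for the 5 CBs of a 4-character text
feats = [[1, 1, 1, 1], [0.1, 0.2, 0.1, 0.2], [0.9, 0.8, 0.9, 0.7],
         [0.2, 0.1, 0.2, 0.1], [1, 1, 1, 1]]
print(segment_by_threshold("研究生命", feats))  # → ['研究', '生命']
```

Only the middle CB sums above the threshold (3.3 > 2.4), so the text splits into two words.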
Decision Tree
• A decision tree classifier (YaDT, Ruggieri 2004) is adopted
• Trained on a sample of 900,000 CB vectors, with 100,000 boundary vectors held out for the testing phase
• Achieves up to 97% accuracy in the inside test, including numbers, punctuation, and foreign words
Evaluation: SigHAN Bakeoff
• Note that our method is NOT designed for the SigHAN bakeoff, where resources are devoted to fine-tuning for a small extra edge in scoring
• This radical model aims to be robust in real-world situations, where it can perform reliably without extra tuning when encountering different texts
• No manual pre-processing; texts are input as seen
Evaluation
• Closed test, but without any lexical knowledge
Discussion
• The method is basically sound
• We still need to develop an effective algorithm for adaptation to new variants
• Automatic pre-processing of punctuation marks and foreign symbols should improve performance
• What role should lexical knowledge play?
• The assumption that characters are independent events may be incorrect
How to Model Decision II
• Assuming that the characters in a string are not independent events, certain combinations (as well as single characters) can contribute to the WB decision.
• One possible implementation: c's as committee members, with the decision made by vote
– Five voting blocks deciding by simple majority:
c-2c-1, c-1, c-1c+1, c+1, c+1c+2
c-2 c-1 CB c+1 c+2
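One way the voting idea could be realized (a sketch of the proposal, not an implemented system): each block looks up how often its character n-gram co-occurs with a WB at that CB in a segmented corpus, votes accordingly, and a simple majority decides. The block statistics and the 0.5 vote cutoff below are hypothetical:

```python
def majority_vote_wb(context, block_stats, cutoff=0.5):
    """context = (c-2, c-1, c+1, c+2) around a CB. The five voting
    blocks are (c-2 c-1), (c-1), (c-1 c+1), (c+1), (c+1 c+2).
    block_stats maps each block string to an estimated P(WB here);
    unseen blocks abstain (count as 0.0). All stats are illustrative."""
    cm2, cm1, cp1, cp2 = context
    blocks = [cm2 + cm1, cm1, cm1 + cp1, cp1, cp1 + cp2]
    votes = sum(block_stats.get(b, 0.0) > cutoff for b in blocks)
    return votes >= 3  # simple majority of the five blocks

# hypothetical corpus statistics for the CB in 研究生|命起
stats = {"究生": 0.1, "生": 0.6, "生命": 0.1, "命": 0.7, "命起": 0.8}
print(majority_vote_wb(("究", "生", "命", "起"), stats))  # → True
```

Unlike the summed independent vectors of Decision Model I, the bigram blocks let character combinations outvote a misleading single character.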
Conclusion I
• We propose a radical but elegant model for Chinese word segmentation
• The task is reduced to binary classification of CB's into WB's and non-WB's
• The model does not presuppose any lexical knowledge and relies only on distributional information about the characters forming the context of CB's
Conclusion II
• In principle, this model should be robust and scalable across all variants of texts
• Preliminary experimental results are promising yet leave room for improvement
• Work is still ongoing
• You are welcome to adopt this model and experiment with your favorite algorithm!