
Minimum tag error for discriminative training of conditional random fields



Information Sciences 179 (2009) 169–179



Ying Xiong*, Jie Zhu, Hao Huang, Haihua Xu
Department of Electronic Engineering, Shanghai Jiao Tong University, 1164# Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Min Hang, Shanghai 200240, China


Article history: Received 26 May 2008; received in revised form 17 September 2008; accepted 18 September 2008.

Keywords: Natural language processing; Machine learning; Conditional random fields; Discriminative training; Chinese word segmentation

doi:10.1016/j.ins.2008.09.018

Abstract

This paper proposes a new criterion called minimum tag error (MTE) for discriminative training of conditional random fields (CRFs). The new criterion, which is a smoothed approximation to the sentence labeling error, aims to maximize an average of transcription tagging accuracies of all possible sentences, weighted by their probabilities. Corpora from the second international Chinese word segmentation bakeoff (Bakeoff 2005) are used to test the effectiveness of this new training criterion. The experimental results demonstrate that the proposed minimum tag error criterion can reliably improve the initial performance of supervised conditional random fields. In particular, the recall rate of out-of-vocabulary words (R_oov) is significantly improved compared with that obtained using standard conditional random fields. Furthermore, the new training method has the advantage of robust segmentation across all datasets.

* Corresponding author. E-mail addresses: [email protected] (Y. Xiong), [email protected] (J. Zhu), [email protected] (H. Huang), [email protected] (H. Xu).

¹ To facilitate the use of terminology, the term "words" will be used to mean "segmentation units" in the rest of this paper.


1. Introduction

Conditional random fields (CRFs) [14] have recently become popular as models for sequence labeling tasks because they offer several advantages over traditional generative models such as hidden Markov models (HMM). Because CRFs are basically defined as conditional models of label sequences given observation sequences, they can make use of flexible overlapping features and overcome label bias problems. In recent years, CRFs have been successfully applied to many tasks, such as gene identification [11], spoken language understanding (SLU) [10], part-of-speech (POS) tagging [14], named entity recognition (NER) [3,17], and shallow parsing [23], especially Chinese word segmentation [6,21].

Unlike English and other Western languages, the Chinese language is based on characters rather than words; there are no blank spaces between words in Chinese sentences. Word segmentation is the first step in Chinese language processing tasks such as information retrieval (IR) [2], text mining (TM) [4,31], question answering (QA) [18], and machine translation (MT) [15,26]. The goal of Chinese word segmentation is to segment a Chinese sentence into a sequence of meaningful words. In early decades, methods for Chinese word segmentation mainly focused on dictionary-based approaches which matched input sentences against a given dictionary. However, the "word" in Chinese is actually not a well-defined concept, and no generally accepted lexicon exists. Furthermore, different tasks may have different granularities for defining Chinese word segmentation. In computer applications, "segmentation units" receive more attention than "words" [29]. For example, the Chinese term for "parallel computer" may be segmented as two segmentation units, "parallel" and "computer", in an information retrieval task, but may be regarded as one unit in a keyword extraction task.¹ Moreover, new words come into being all the time. Because of such factors, statistically based methods are the mainstream approach to Chinese word segmentation, especially supervised machine learning methods such as HMM, maximum entropy (ME) [30], and CRFs [14].




Unlike dictionary-based methods, machine learning methods rely on statistical models which are learned automatically from corpora, making them more adaptive and robust in processing different corpora. In the recent SIGHAN Bakeoff competitions [5,16], CRFs were widely used and outperformed other machine learning methods such as HMM, support vector machine (SVM) [7], and ME. However, little work has been done on training criteria for CRFs. In this paper, a new training criterion is presented which achieves further improvement in the performance of CRFs without adding to the commonly used unigram and bigram features.

CRF parameters were first estimated using the maximum log-likelihood (ML) criterion [14]. However, the ML criterion is prone to overfitting because CRFs are often trained with a very large number of overlapping features. The maximum a posteriori (MAP) criterion was then proposed in [23] to reduce overfitting. Large margin methods have also been applied to parameter optimization [1,25,27]. Furthermore, the minimum classification error (MCE) criterion, long a focus of the speech and pattern recognition research communities, was adapted to CRF parameter estimation [24]. Gross et al. [8] proposed a training procedure that maximizes per-label predictive accuracy; the procedure is similar to MCE except that it is based on a pointwise loss function rather than a sequential one.

These training criteria have achieved excellent performance on various tasks. For sequence labeling, a CRF model is ideally desired that provides high accuracy when labeling new sequences. However, it is difficult to find parameters which provide the best possible accuracy on training data. In particular, sequence tagging accuracy, which is measured by the number of correct labels, cannot be maximized directly with gradient-based optimization methods. Therefore, other optimization methods such as those mentioned above are used.

This paper presents a new discriminative training criterion called minimum tag error (MTE) which can be seen as being in the same spirit as MCE, but with a different objective function that is more naturally suited to the sequence labeling task. The MTE criterion is a smoothed approximation to the tag accuracy measured on the output of a sequence labeling system given the training data; it can be optimized directly by gradient-based methods, without requiring a hand-specified smoothing function as the MCE criterion does. The effectiveness of this new criterion is tested on the Chinese word segmentation task because Chinese word segmentation is a prerequisite step for Chinese information processing. The experimental results presented here show that the proposed new criterion can reliably enhance the initial results yielded by the MAP-trained model. Furthermore, the new approach recognizes out-of-vocabulary (OOV) words (i.e., the set of words in the test corpus which do not occur in the training corpus) better than the standard MAP training method.

The remainder of this paper is structured as follows: Section 2 reviews standard conditional random fields. The main focus is Section 3, which introduces the MTE training method. Section 4 describes the experiments, Section 5 presents a discussion of the results, and Section 6 states conclusions. Finally, acknowledgements are expressed.

2. Conditional random fields

Let X = ⟨X_1, X_2, ..., X_R⟩ be observation input data sequences to be labeled, and let Y = ⟨Y_1, Y_2, ..., Y_R⟩ be a set of corresponding label sequences, where R is the number of data sequences. All components of Y_i (i = 1, 2, ..., R) are assumed to range over a finite tag set T. For example, X might consist of unsegmented Chinese sentences, and Y might range over the boundary tags of these sentences, with T a set of boundary tags such as the commonly used BIO tags ("B" marks the beginning of a word, "I" a character other than the beginning of a word, and "O" a single-character word). CRF models define the conditional probability of a particular label sequence Y = ⟨y_1, y_2, ..., y_n⟩ given the observation sequence X = ⟨x_1, x_2, ..., x_n⟩ as

p(Y|X) = \frac{1}{Z_X} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) \right),    (1)

Z_X = \sum_{Y} \exp\left( \sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) \right),    (2)

where Z_X is a normalization factor over all state sequences, f_k(y_{i-1}, y_i, X, i) is an arbitrary feature function which can be defined as a transitional function for the state pair (y_{i-1}, y_i) or a state function for the state–observation pair (y_i, X), λ_k is a learned parameter associated with feature f_k, and n is the length of the sequence.

For simplicity of presentation, the expressions defined in [23] are used as follows. The global feature vector for X and Y is given as

F(Y, X) = \sum_{i} f(Y, X, i),    (3)

where i ranges over input positions. The conditional distribution of the CRF can be rewritten as

p_\lambda(Y|X) = \frac{\exp(\lambda \cdot F(Y, X))}{\sum_{Y'} \exp(\lambda \cdot F(Y', X))} = \frac{\exp(\lambda \cdot F(Y, X))}{Z_\lambda(X)}.    (4)

The parameter vector λ can be estimated by the maximum log-likelihood method. Sha and Pereira have shown in [23] that the L-BFGS algorithm converges much faster when training CRF models. To reduce overfitting, the log-likelihood function is often penalized by a Gaussian prior distribution over the parameters, which can be written as


p(\lambda) \propto \exp\left( -\frac{\|\lambda\|^2}{2\sigma^2} \right).    (5)

Therefore, the CRF objective function with the penalized term can be expressed as follows:

F^{CRF}_{\lambda} = \sum_{r=1}^{R} \log p_\lambda(Y_r|X_r) + \log p(\lambda) = \sum_{r=1}^{R} \log p_\lambda(Y_r|X_r) - \frac{\|\lambda\|^2}{2\sigma^2}.    (6)

Given a model as defined in Eq. (4), the most probable label sequence Y* for an input X can be efficiently calculated by means of the Viterbi algorithm:

Y^{*} = \arg\max_{Y} p_\lambda(Y|X) = \arg\max_{Y} \lambda \cdot F(Y, X).    (7)

The popularity of CRFs stems both from their ability to relax the strong independence assumption made in generative models such as HMM and from their ability to avoid the label bias problems exhibited by maximum entropy Markov models (MEMM). Ideally, it is desirable to obtain a CRF model that provides high accuracy when labeling new sequences. However, conventional CRF models are trained with the ML or MAP criteria, which do not include accuracy information, so the sequence selected by these models is not guaranteed to have high tagging accuracy. For this reason, sentence tagging accuracy should be incorporated directly into the training procedure. This paper therefore presents an MTE training method that explicitly integrates sentence tagging accuracy through a label accuracy function, which will be introduced in the following section.
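To make Eqs. (4) and (7) concrete, the following sketch enumerates every label sequence of a toy input and recovers Y* with a Viterbi-style dynamic program. It is a minimal illustration in Python, not the implementation used in this paper; the feature functions and weights are invented for the example.

```python
import itertools
import math

TAGS = ["B", "I", "O"]          # the 3-tag set discussed above
X = ["c1", "c2", "c3"]          # a toy 3-character "sentence" (hypothetical)

# Hypothetical weights for two kinds of features: a transition feature on
# (y_{i-1}, y_i) and a state feature on (y_i, position).
W_TRANS = {("B", "I"): 1.5, ("I", "I"): 0.7, ("O", "B"): 0.5}
W_STATE = {("B", 0): 1.0, ("I", 1): 0.8, ("O", 2): 0.3}

def local_score(y_prev, y, i):
    """Sum of active feature weights at position i: lambda . f(y_{i-1}, y_i, X, i)."""
    return W_TRANS.get((y_prev, y), 0.0) + W_STATE.get((y, i), 0.0)

def seq_score(Y):
    """lambda . F(Y, X), the global unnormalized log-score of Eqs. (3)-(4)."""
    return sum(local_score(Y[i - 1] if i > 0 else None, Y[i], i)
               for i in range(len(X)))

# Eq. (4): the posterior p(Y|X) by explicit enumeration of all |T|^n sequences.
Z = sum(math.exp(seq_score(Y)) for Y in itertools.product(TAGS, repeat=len(X)))

def posterior(Y):
    return math.exp(seq_score(Y)) / Z

# Eq. (7): Viterbi decoding -- argmax_Y lambda . F(Y, X) in O(n |T|^2) time.
def viterbi():
    best = {y: (local_score(None, y, 0), [y]) for y in TAGS}
    for i in range(1, len(X)):
        best = {y: max(((s + local_score(yp, y, i), path + [y])
                        for yp, (s, path) in best.items()),
                       key=lambda t: t[0])
                for y in TAGS}
    return max(best.values(), key=lambda t: t[0])[1]

Y_star = viterbi()
print("Y* =", Y_star, " p(Y*|X) = %.3f" % posterior(tuple(Y_star)))
```

The enumeration is exponential in the sentence length and is only viable on toy inputs; the dynamic program delivers the same maximizer in time linear in n.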

3. Minimum tag error training

3.1. Motivation

After the maximum mutual information (MMI) criterion was first successfully applied to automatic speech recognition (ASR), there has been growing interest in a class of error-minimizing discriminative training criteria. The minimum phone error (MPE) criterion [19,20] is one of the most attractive discriminative training techniques. In contrast to the traditional MMI criterion, which directly maximizes the posterior probability of the training utterances, the MPE training approach tries to optimize the parameters of an acoustic model by minimizing the expected phone error rate on the training data [13]. The objective function of MPE can be written as follows [19,20]:

F^{MPE}_{\lambda} = \sum_{r=1}^{R} \sum_{s \in S} P^{\kappa}_{\lambda}(s|o_r) A(s, s_r) = \sum_{r=1}^{R} \frac{\sum_{s \in S} p_\lambda(o_r|s)^{\kappa} p(s)^{\kappa} A(s, s_r)}{\sum_{u \in S} p_\lambda(o_r|u)^{\kappa} p(u)^{\kappa}},    (8)

where λ represents the HMM parameters, S is the set of all possible sentences, R is the number of training files, κ is a scaling factor which aims to reduce the dynamic range between the language and acoustic probabilities, P^κ_λ(s|o_r) represents the scaled posterior probability of the hypothesized sentence s, p_λ(o_r|s)^κ is the scaled acoustic score for s, p(s)^κ is the scaled language model probability of the sentence s, and A(s, s_r) is the approximate raw phone transcription accuracy of the sentence s given the reference sentence s_r, which equals the number of reference phones minus the number of errors. More details can be found in [19,20].

It is obvious that maximizing the objective function is equivalent to minimizing the expected phone error. Povey demonstrated in [19,20] that the MPE criterion could significantly outperform the MMI criterion. The success of the MPE method lies in its taking phone accuracy into consideration explicitly.

3.2. MTE objective function

Motivated by the success of the MPE training method in the speech recognition community, a new criterion called "minimum tag error (MTE)", which is a smoothed approximation to the sequence tagging accuracy, is proposed here for discriminative training of CRF. The objective function in MTE, which is to be maximized, is defined as

F^{MTE}_{\lambda} = \sum_{r=1}^{R} \sum_{Y \in H} p_\lambda(Y|X_r) A(Y, Y_r),    (9)

where the parameter vector λ represents the CRF parameters, H is the set of all possible tagging sequences given the observation sequence X_r, R is the number of data sequences, and A(Y, Y_r) is the raw sentence tagging accuracy of Y given the reference Y_r, which equals the number of correct labels in sentence Y. p_λ(Y|X_r), defined in Eq. (4), is the posterior probability of sequence Y given the rth observation sequence X_r. The MTE objective function can then be expressed as

F^{MTE}_{\lambda} = \sum_{r=1}^{R} \frac{\sum_{Y \in H} \exp(\lambda \cdot F(Y, X_r)) A(Y, Y_r)}{\sum_{U \in H} \exp(\lambda \cdot F(U, X_r))} = \sum_{r=1}^{R} \frac{\sum_{Y \in H} \exp(\lambda \cdot F(Y, X_r)) A(Y, Y_r)}{Z_\lambda(X_r)}.    (10)


The MTE criterion is an average of the raw tag accuracy over all possible label sequences H (weighted by their likelihood) for each training sequence X_r. A(Y, Y_r) ideally equals the number of correct labels in sentence Y, which can be expressed as the sum of the per-label accuracy A_i(y) over all positions in Y, where A_i(y) is defined as follows:

A_i(y) = \begin{cases} 1 & \text{if } y_i = y^r_i, \\ 0 & \text{otherwise}, \end{cases}    (11)

where i represents the ith position in the sentence.
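Continuing the toy setup from the sketch in Section 2 (TAGS and posterior are reused from there; all names are illustrative), a brute-force rendering of Eqs. (9)–(11) makes the criterion explicit: the objective is the posterior-weighted average of the per-position tag accuracy, i.e. the expected number of correctly labeled positions.

```python
import itertools

def tag_accuracy(Y, Y_ref):
    """A(Y, Y_r) of Eq. (11): the number of positions labeled correctly."""
    return sum(1 for y, y_ref in zip(Y, Y_ref) if y == y_ref)

def mte_objective(posterior, tags, Y_ref):
    """Eq. (9) for a single training sequence: sum_Y p(Y|X) A(Y, Y_r).

    `posterior` is any callable returning the CRF posterior of a full
    label sequence, e.g. the enumeration-based posterior() defined above."""
    return sum(posterior(Y) * tag_accuracy(Y, Y_ref)
               for Y in itertools.product(tags, repeat=len(Y_ref)))

# The objective lies between 0 and n and equals n only when the model
# puts all posterior mass on the reference labeling.
Y_REF = ("B", "I", "O")
print("F_MTE = %.3f (n = %d)" % (mte_objective(posterior, TAGS, Y_REF), len(Y_REF)))
```

Because the summand is a smooth function of λ (through the posterior), this objective is differentiable even though the raw tagging accuracy of the single best path is not.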

3.3. MTE objective gradient

The MTE model is trained by maximizing the objective function F^{MTE}_λ defined in Eq. (10). To perform the optimization, Eq. (10) can be differentiated with respect to λ by the quotient rule:

\frac{\partial}{\partial x}\frac{u}{t} = \frac{\partial u / \partial x}{t} - \frac{u \, \partial t / \partial x}{t^2}.    (12)

The gradient of F^{MTE}_λ can be obtained as

\nabla F^{MTE}_{\lambda} = \sum_{r=1}^{R} \left\{ \frac{\sum_{Y \in H} \exp(\lambda \cdot F(Y, X_r)) F(Y, X_r) A(Y, Y_r)}{Z_\lambda(X_r)} - \frac{\sum_{Y \in H} \exp(\lambda \cdot F(Y, X_r)) A(Y, Y_r) \sum_{U \in H} \exp(\lambda \cdot F(U, X_r)) F(U, X_r)}{Z^2_\lambda(X_r)} \right\}

= \sum_{r=1}^{R} \left\{ \sum_{Y \in H} p_\lambda(Y|X_r) F(Y, X_r) A(Y, Y_r) - \sum_{Y \in H} p_\lambda(Y|X_r) A(Y, Y_r) \sum_{U \in H} p_\lambda(U|X_r) F(U, X_r) \right\}

= \sum_{r=1}^{R} \left\{ E_{p_\lambda(H|X_r)}[F(H, X_r) A(H, Y_r)] - E_{p_\lambda(H|X_r)}[A(H, Y_r)] \, E_{p_\lambda(H|X_r)}[F(H, X_r)] \right\}

= \sum_{r=1}^{R} \mathrm{cov}_{p_\lambda(H|X_r)}[A(H, Y_r), F(H, X_r)],    (13)

where E_{p_λ(H|X_r)}[F(H, X_r) A(H, Y_r)] is the expected value of the product of A(H, Y_r) and F(H, X_r), E_{p_λ(H|X_r)}[A(H, Y_r)] denotes the expected tagging accuracy, which equals the value of the MTE objective function, E_{p_λ(H|X_r)}[F(H, X_r)] is the expectation of the feature vector F(H, X_r), and cov_{p_λ(H|X_r)}[A(H, Y_r), F(H, X_r)] is the conditional covariance of A(H, Y_r) and F(H, X_r).
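The covariance form of Eq. (13) can be sanity-checked numerically on any model small enough to enumerate. The sketch below builds an invented two-feature model (all names and values are illustrative) and compares the covariance expression against a central finite difference of the objective; the two agree to numerical precision.

```python
import itertools
import math

TAGS = ["B", "I", "O"]
N = 3                               # toy sentence length
Y_REF = ("B", "I", "O")             # reference labeling (illustrative)

def features(Y):
    """F(Y, X) of Eq. (3) for a toy 2-feature model (hypothetical):
    f1 counts B->I transitions, f2 counts O tags."""
    f1 = sum(1 for i in range(1, N) if (Y[i - 1], Y[i]) == ("B", "I"))
    f2 = sum(1 for y in Y if y == "O")
    return (f1, f2)

def accuracy(Y):
    return sum(1 for y, yr in zip(Y, Y_REF) if y == yr)

def posterior_table(lam):
    """p_lambda(Y|X) for every Y, by enumeration (Eq. (4))."""
    seqs = list(itertools.product(TAGS, repeat=N))
    w = [math.exp(sum(l * f for l, f in zip(lam, features(Y)))) for Y in seqs]
    Z = sum(w)
    return seqs, [wi / Z for wi in w]

def objective(lam):
    """F_MTE of Eq. (10) for the single toy sequence."""
    seqs, p = posterior_table(lam)
    return sum(pi * accuracy(Y) for Y, pi in zip(seqs, p))

def gradient(lam):
    """Eq. (13): cov[A(Y, Y_r), F(Y, X)] under the posterior."""
    seqs, p = posterior_table(lam)
    EA = sum(pi * accuracy(Y) for Y, pi in zip(seqs, p))
    EF = [sum(pi * features(Y)[k] for Y, pi in zip(seqs, p)) for k in (0, 1)]
    EAF = [sum(pi * accuracy(Y) * features(Y)[k] for Y, pi in zip(seqs, p))
           for k in (0, 1)]
    return [EAF[k] - EA * EF[k] for k in (0, 1)]

lam, eps = [0.4, -0.2], 1e-6
for k in (0, 1):
    lp, lm = list(lam), list(lam)
    lp[k] += eps
    lm[k] -= eps
    fd = (objective(lp) - objective(lm)) / (2 * eps)
    print("feature %d: covariance %.6f vs finite difference %.6f"
          % (k, gradient(lam)[k], fd))
```

The agreement also illustrates why MTE rewards useful features: a feature positively correlated with accuracy under the current posterior receives a positive gradient, so its weight grows.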

An algorithm will now be presented for efficient calculation of the gradient defined in Eq. (13). For each position i in the observation sequence X, define the |T| × |T| transition matrix as

M_i(X) = [M_i(y_{i-1}, y_i | X)],    (14)

M_i(y_{i-1}, y_i | X) = \exp\left( \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) + \sum_{k} \mu_k g_k(y_i, X, i) \right) = S_i(y_{i-1}, y_i | X) \cdot O_i(y_i | X),    (15)

S_i(y_{i-1}, y_i | X) = \exp\left( \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) \right), \qquad O_i(y_i | X) = \exp\left( \sum_{k} \mu_k g_k(y_i, X, i) \right),    (16)

where f_k(y_{i-1}, y_i, X, i) represents a transitional feature function for the state pair (y_{i-1}, y_i), g_k(y_i, X, i) denotes a state function for the state–observation pair (y_i, X), and S_i(y_{i-1}, y_i|X) and O_i(y_i|X) can be interpreted as the transition cost and observation cost, respectively.

For each index i, the forward and backward state costs α_i(y) and β_i(y) can be defined as follows:

\alpha_i(y) = \begin{cases} \sum_{y' \in T} \alpha_{i-1}(y') M_i(y', y | X), & 0 < i \le n, \\ 1, & i = 0, \end{cases}    (17)

\beta_i(y) = \begin{cases} \sum_{y' \in T} M_{i+1}(y, y' | X) \beta_{i+1}(y'), & 1 \le i < n, \\ 1, & i = n. \end{cases}    (18)

For each index i, let α′_i(y) and β′_i(y) denote analogous quantities used to calculate accuracies: α′_i(y) represents the average accuracy of the partial label sequences leading up to the state at the ith position (including y_i itself), and β′_i(y) represents the average accuracy of the partial label sequences following y_i, where


\alpha'_i(y) = \begin{cases} \dfrac{\sum_{y' \in T} \alpha'_{i-1}(y') \, \alpha_{i-1}(y') \, S_i(y', y | X)}{\sum_{y' \in T} \alpha_{i-1}(y') \, S_i(y', y | X)} + A_i(y), & 1 < i \le n, \\ A_i(y), & i = 1, \end{cases}    (19)

\beta'_i(y) = \begin{cases} \dfrac{\sum_{y' \in T} M_{i+1}(y, y' | X) \, \beta_{i+1}(y') \, (\beta'_{i+1}(y') + A_{i+1}(y'))}{\sum_{y' \in T} M_{i+1}(y, y' | X) \, \beta_{i+1}(y')}, & 1 \le i < n, \\ 0, & i = n. \end{cases}    (20)

Therefore, Eq. (13) can be calculated by the forward–backward algorithm. If the feature f_k is a transitional feature function for the state pair (y′, y), then

E_{p_\lambda(H|X_r)}[A(H, Y_r) f_k(y', y, X_r)] = \sum_{Y \in H} p_\lambda(Y|X_r) A(Y, Y_r) f_k(y', y, X_r) = \sum_{i} \sum_{y', y} \frac{\alpha_{i-1}(y') M_i(y', y | X_r) \beta_i(y) (\alpha'_{i-1}(y') + \beta'_i(y) + A_i(y)) f_k(y', y, X_r, i)}{Z_\lambda(X_r)},    (21)

E_{p_\lambda(H|X_r)}[f_k(y', y, X_r)] = \sum_{Y \in H} p_\lambda(Y|X_r) f_k(y', y, X_r) = \sum_{i} \sum_{y', y} \frac{\alpha_{i-1}(y') M_i(y', y | X_r) \beta_i(y) f_k(y', y, X_r, i)}{Z_\lambda(X_r)}.    (22)

If the feature g_k is a state function for the state–observation pair (y, X), then

E_{p_\lambda(H|X_r)}[A(H, Y_r) g_k(y, X_r)] = \sum_{Y \in H} p_\lambda(Y|X_r) A(Y, Y_r) g_k(y, X_r) = \sum_{i} \sum_{y} \frac{\alpha_i(y) \beta_i(y) (\alpha'_i(y) + \beta'_i(y)) g_k(y, X_r, i)}{Z_\lambda(X_r)},    (23)

E_{p_\lambda(H|X_r)}[g_k(y, X_r)] = \sum_{Y \in H} p_\lambda(Y|X_r) g_k(y, X_r) = \sum_{i} \sum_{y} \frac{\alpha_i(y) \beta_i(y) g_k(y, X_r, i)}{Z_\lambda(X_r)}.    (24)

The calculation of the expected accuracy can be expressed as

E_{p_\lambda(H|X_r)}[A(H, Y_r)] = \sum_{Y \in H} p_\lambda(Y|X_r) A(Y, Y_r) = \frac{\sum_{y \in T} \alpha'_n(y) \, \alpha_n(y)}{Z_\lambda(X_r)},    (25)

Z_\lambda(X_r) = \sum_{y \in T} \alpha_n(y).    (26)
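The recursions of Eqs. (17), (19), (25) and (26) can be verified on the same kind of toy model. The sketch below uses random positive potentials standing in for M_i (all names and values are illustrative); note that because O_i(y_i|X) depends only on y_i, it cancels in the ratio of Eq. (19), so M_i can be used there in place of S_i.

```python
import itertools
import math
import random

random.seed(0)
TAGS = ["B", "I", "O"]
N = 4
Y_REF = ["B", "I", "I", "O"]                # reference tags (illustrative)

# Random potentials standing in for Eq. (15)'s M_i(y', y | X);
# position 0 uses a dummy previous state None.
M = [{(yp, y): math.exp(random.uniform(-1, 1))
      for yp in (TAGS if i > 0 else [None]) for y in TAGS} for i in range(N)]

def A(i, y):
    return 1.0 if y == Y_REF[i] else 0.0    # Eq. (11)

# Forward pass: alpha (Eq. (17)) and accuracy-weighted alpha' (Eq. (19)).
alpha = [{y: M[0][(None, y)] for y in TAGS}]
alpha_p = [{y: A(0, y) for y in TAGS}]
for i in range(1, N):
    a, ap = {}, {}
    for y in TAGS:
        den = sum(alpha[-1][yp] * M[i][(yp, y)] for yp in TAGS)
        num = sum(alpha_p[-1][yp] * alpha[-1][yp] * M[i][(yp, y)] for yp in TAGS)
        a[y] = den
        ap[y] = num / den + A(i, y)
    alpha.append(a)
    alpha_p.append(ap)

Z = sum(alpha[-1][y] for y in TAGS)                             # Eq. (26)
exp_acc = sum(alpha_p[-1][y] * alpha[-1][y] for y in TAGS) / Z  # Eq. (25)

# Brute-force check over all |T|^N label sequences.
def weight(Y):
    w = M[0][(None, Y[0])]
    for i in range(1, N):
        w *= M[i][(Y[i - 1], Y[i])]
    return w

seqs = list(itertools.product(TAGS, repeat=N))
Z_bf = sum(weight(Y) for Y in seqs)
acc_bf = sum(weight(Y) * sum(A(i, Y[i]) for i in range(N)) for Y in seqs) / Z_bf
print("Z: forward %.6f vs brute force %.6f" % (Z, Z_bf))
print("expected accuracy: forward %.6f vs brute force %.6f" % (exp_acc, acc_bf))
```

The invariant behind the check is that α_i(y)·α′_i(y) equals the weight-times-accuracy mass of all prefixes ending in tag y, which is exactly what Eq. (25) aggregates at i = n.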

To reduce overfitting, the MTE objective function is penalized with a Gaussian weight prior, as in the CRF calculations. The objective function with the regularization term can then be rewritten in the following form:

F'^{MTE}_{\lambda} = \sum_{r=1}^{R} \sum_{Y \in H} p_\lambda(Y|X_r) A(Y, Y_r) - \sum_{k} \frac{\lambda_k^2}{2\sigma^2},    (27)

with gradient

\nabla F'^{MTE}_{\lambda} = \sum_{r=1}^{R} \left\{ E_{p_\lambda(H|X_r)}[A(H, Y_r) F(H, X_r)] - E_{p_\lambda(H|X_r)}[A(H, Y_r)] \, E_{p_\lambda(H|X_r)}[F(H, X_r)] \right\} - \frac{\lambda}{\sigma^2}.    (28)

3.4. Optimization method

The L-BFGS algorithm is a well-known low-memory quasi-Newton optimization method which has been used successfully to estimate CRF parameters, and it is also used for parameter estimation in this paper. Because the MTE objective function is not convex in λ, training methods will generally find a local optimum rather than the global optimum. Therefore, finding a good initialization for MTE learning is an important issue.

Because the MTE training criterion aims to make the more accurate hypothesized sentences more likely, using the ML or MAP parameters as the initial values for MTE training helps to find a better local maximum, because both the ML and MAP criteria maximize the posterior probability of the correctly labeled sentence. However, the ML criterion is prone to overfitting. Therefore, in the following experiments, the parameters are initialized with the values obtained from MAP-trained models.

The MTE training method proposed in this paper was implemented here by modifying the CRF++ toolkit [12].
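As a rough illustration of this setup (not the CRF++ modification itself), the regularized objective of Eq. (27) and the gradient of Eq. (28) can be handed to an off-the-shelf L-BFGS routine. The sketch reuses the toy objective() and gradient() from the gradient-check sketch above together with SciPy's minimizer, negating both because SciPy minimizes; the prior variance is a hypothetical value.

```python
import numpy as np
from scipy.optimize import minimize

SIGMA2 = 10.0   # Gaussian prior variance sigma^2 (hypothetical value)

def neg_objective_and_grad(lam):
    """-F'_MTE and its negated gradient (Eqs. (27)-(28)) for the toy model."""
    lam = list(lam)
    f = objective(lam) - sum(l * l for l in lam) / (2 * SIGMA2)
    g = np.array(gradient(lam)) - np.array(lam) / SIGMA2
    return -f, -g

# Section 3.4 recommends initializing from a MAP-trained model; for this
# toy illustration we simply start from zeros.
lam0 = np.zeros(2)
res = minimize(neg_objective_and_grad, lam0, jac=True, method="L-BFGS-B")
print("lambda* =", res.x, " expected accuracy =", objective(list(res.x)))
```

On a real corpus the enumeration inside objective() would of course be replaced by the forward–backward quantities of Section 3.3; only the outer optimization loop is shown here.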

4. Experiments

4.1. Experimental data

Corpora from the second international Chinese word segmentation bakeoff were used to verify the effectiveness of the MTE training method. Performance values are reported in terms of three major metrics [5,16]:


the F-score, given by F-score = 2PR/(P + R), where P is the word precision and R the word recall; the recall on OOV words (R_oov); and the recall on in-vocabulary words (R_iv).
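For concreteness, word precision and recall can be computed by matching predicted word spans against gold spans; the following is a minimal sketch with hypothetical helper names, using Latin letters as stand-ins for Chinese characters.

```python
def spans(words):
    """Convert a segmented sentence (a list of words) into character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(gold_words, pred_words):
    """Word precision, recall, and F-score = 2PR/(P + R)."""
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    P = correct / len(pred)
    R = correct / len(gold)
    return P, R, 2 * P * R / (P + R)

# Gold segmentation "ab | c | de" against prediction "ab | cde":
# one of two predicted words is correct (P = 0.5), one of three gold
# words is recovered (R = 1/3), so F = 0.4.
print(prf(["ab", "c", "de"], ["ab", "cde"]))
```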

First, the PKU and MSR corpora were used to perform the segmentation evaluation on a closed track. Second, the OOV detection ability was tested, because the CRF model has the advantage of being able to identify new words. From the statistics given in [5], the two corpora with the lowest and the highest OOV rates were selected: MSR and CityU. Finally, to study further the robustness of the MTE training method, cross-testing was carried out, i.e., training on one dataset and testing on another. Because the CityU corpora were written using the Big5 encoding, the CityU datasets were first converted to the CP936 encoding (Microsoft Codepage 936), i.e., the traditional Chinese characters were converted to simplified ones to make cross-testing possible. Note that this conversion could slightly influence performance due to a few conversion errors. However, it does not affect the evaluation of the MTE method, because all training and test runs were based on the same converted CityU corpora. A summary of the datasets is shown in Table 1.

Note that there are some inconsistencies between the PKU training and test corpora. In the training corpus, Arabic numbers and alphabetical letters are in SBC case form, i.e., each character is a full-width character occupying 2 bytes. In the test corpus, however, these characters are expressed in DBC case form (with each character being a half-width character occupying 1 byte). Most researchers in the SIGHAN Bakeoff competition performed a conversion before segmentation. In this paper, two situations, converted (cvt.) and unconverted (ucvt.), are considered.

Because Arabic numbers and alphabetical letters have not been transformed into SBC case form, the OOV rate for the unconverted PKU test corpus is higher than that for the converted corpus.

4.2. Feature templates and tag sets

The focus of this paper is whether or not the proposed MTE training method can achieve improved results compared with the MAP training criterion on the Chinese word segmentation task. Therefore, the commonly used unigram and bigram features are used as examples.

The feature templates used in the experiment can be expressed as follows:

• C_n (n = −1, 0, 1);
• C_nC_{n+1} (n = −1, 0);
• C_{−1}C_1,

where C_n (n = −1, 0, 1) is the unigram feature representing the previous, current, or next character. The other two feature templates are bigram features: C_nC_{n+1} (n = −1, 0) represents the previous (next) character together with the current character, and C_{−1}C_1 represents the previous character together with the next character.

The usual 3-tag set (B, I, O) [32] and the newly developed 6-tag set (B, B2, B3, M, E, S) [33] were used to test performance. The detailed meanings of the 3-tag set and the 6-tag set are described in Table 2. For example, given the seven-character Chinese word meaning "People's Republic of China", the character labeling sequence with the 3-tag set would be "B I I I I I I", while the corresponding labeling sequence with the 6-tag set would be "B B2 B3 M M M E". The 6-tag set therefore has an advantage in tagging words which contain more characters. In addition, words tagged with the 6-tag set capture more precise contextual information, which generally enables CRF character tagging to achieve better segmentation performance than with the 3-tag set [33,34].
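The templates and the two tag sets can be made concrete with a short sketch (illustrative function names and padding convention):

```python
def extract_features(chars, i):
    """Instantiate the unigram/bigram templates at character position i.
    Out-of-range positions use a boundary symbol, a common convention."""
    c = lambda n: chars[i + n] if 0 <= i + n < len(chars) else "<PAD>"
    return [
        "C-1=" + c(-1), "C0=" + c(0), "C1=" + c(1),       # Cn, n = -1, 0, 1
        "C-1C0=" + c(-1) + c(0), "C0C1=" + c(0) + c(1),   # CnCn+1, n = -1, 0
        "C-1C1=" + c(-1) + c(1),                          # C-1C1
    ]

def tag_word(length, tagset):
    """Label one word of the given length with the 3-tag or 6-tag set."""
    if tagset == "3-tag":
        return ["O"] if length == 1 else ["B"] + ["I"] * (length - 1)
    if length == 1:
        return ["S"]
    return (["B", "B2", "B3"][:length - 1] +
            ["M"] * max(0, length - 4) + ["E"])

# A 7-character word: "B I I I I I I" (3-tag) vs "B B2 B3 M M M E" (6-tag).
print(tag_word(7, "3-tag"))
print(tag_word(7, "6-tag"))
print(extract_features(list("abcde"), 0))
```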

4.3. Experimental results

Tables 3 and 4 compare the performance of the MTE training criterion with that of the MAP criterion on the PKU and MSR datasets, respectively.

From Table 3, it can be seen that MTE training improved segmentation performance with both the 3-tag and the 6-tag sets, whether or not the PKU test corpus was converted. Furthermore, the performance results on the converted corpus were uniformly better than those on the unconverted corpus because the OOV rate in the converted test corpus was lower than in the unconverted corpus.

Table 1
Corpora statistics for Bakeoff 2005

Corpus            Encoding   # Words   # Characters   OOV rate
PKU_training      CP936      1.11M     1.83M          /
PKU_test (cvt.)   CP936      104K      0.17M          0.035
PKU_test (ucvt.)  CP936      104K      0.17M          0.058
MSR_training      CP936      2.37M     4.05M          /
MSR_test          CP936      107K      0.18M          0.026
CityU_training    Big5       1.46M     2.40M          /
CityU_test        Big5       41K       68K            0.074


Table 2
Description of the 3-tag and 6-tag sets

Tag set   Tag   Meaning of the tag in a word
3-tag     B     Beginning of a word
          I     Character other than the beginning of a word
          O     Single-character word
6-tag     B     First character of a word
          B2    Second character of a word
          B3    Third character of a word
          M     Middle character of a word
          E     Last character of a word
          S     Single-character word

Table 3
Performance of MAP and MTE on PKU using different tag sets on a closed track

                            F-score          R_oov            R_iv
Tag set  Training method    cvt.    ucvt.    cvt.    ucvt.    cvt.    ucvt.
3-tag    MAP                0.947   0.924    0.788   0.518    0.951   0.937
         MTE                0.948   0.928    0.791   0.573    0.952   0.941
6-tag    MAP                0.952   0.932    0.777   0.568    0.957   0.947
         MTE                0.952   0.933    0.782   0.580    0.957   0.948

Table 4
Performance of MAP and MTE with MSR using different tag sets on a closed track

Tag set  Training method    F-score   R_oov   R_iv
3-tag    MAP                0.955     0.629   0.963
         MTE                0.957     0.640   0.964
6-tag    MAP                0.972     0.722   0.977
         MTE                0.972     0.724   0.977


Moreover, the results with the 6-tag set were always better than those with the 3-tag set, except for R_oov in the case of the converted corpus, because of two factors: the OOV rate in the converted corpus is lower, and the average length of the words in the PKU corpora is less than three. Therefore, there may be more room for improvement when using the 3-tag set than when using the 6-tag set. Fig. 1 illustrates that the improvements obtained by the MTE training method were smaller with the 6-tag set than with the 3-tag set, except for R_oov in the case of the converted corpus.

When the PKU test corpus was unconverted, the MTE training method achieved significant improvements, especially with the 3-tag set, where the F-score showed an increment of 0.4%. In addition, R_oov and R_iv were improved by 5.5% and 0.4%, respectively. Even with the 6-tag set, R_oov was improved by 1.2%.

In the case of the converted corpus, when the 3-tag set was used, an F-score of up to 94.8% was achieved using MTE. Compared with MAP, the result was improved by 0.1%. At the same time, R_oov and R_iv were increased by 0.3% and 0.1%, respectively. When the 6-tag set was used, it can be seen from the last row of Table 3 that the MTE training method achieved the best performance. Moreover, R_oov was improved by 0.5% compared with the MAP method.

From the MSR results (Table 4), the same conclusions can be drawn, namely that the MTE training method can improve the segmentation results with either the 3-tag or the 6-tag set. Because the average length of the words in the MSR corpora is greater than three, the segmentation performance with the 6-tag set is consistently better than with the 3-tag set. Fig. 2 shows that the improvement obtained using MTE was smaller with the 6-tag set than with the 3-tag set. The most important result is that MTE training remarkably enhanced R_oov: compared with MAP, R_oov using MTE improved by 1.1% and 0.2% with the 3-tag and 6-tag sets, respectively.

The results for OOV word detection are shown in Table 5. It can be seen that MTE outperformed the conventional MAP training method in OOV recognition capability. Furthermore, the higher the OOV rate, the greater the improvement obtained by using MTE.

Finally, to test the robustness of the two training methods, cross-testing was performed, i.e., training on one dataset and testing on other datasets. The results are summarized in Tables 6–8, where the rows represent the training corpora and the columns the testing corpora. The cross-testing performances for F-score, R_oov, and R_iv are presented in Tables 6–8, respectively.


Fig. 1. Improvement obtained by MTE using the PKU corpora.

Fig. 2. Improvement obtained using MTE on the MSR corpora.

Table 5
R_oov for the MSR/CityU corpora using MAP and MTE training methods and different tag sets

Corpus   OOV rate   Tag set   Training method   R_oov   Improvement (%)
MSR      0.026      3-tag     MAP               0.629   /
                              MTE               0.640   1.1
                    6-tag     MAP               0.722   /
                              MTE               0.724   0.2
CityU    0.074      3-tag     MAP               0.727   /
                              MTE               0.745   1.8
                    6-tag     MAP               0.749   /
                              MTE               0.756   0.7


In each cell of these three tables, the upper numbers correspond to the 3-tag set and the lower numbers to the 6-tag set; the separator "/" separates the results obtained from MAP and from MTE.


Table 6
Cross-testing of MAP and MTE training in terms of F-score

Training      PKU            MSR            CityU
PKU (cvt.)    -              0.848/0.850    0.867/0.867
                             0.855/0.855    0.870/0.870
PKU (ucvt.)   -              0.831/0.834    0.881/0.882
                             0.840/0.841    0.886/0.886
MSR           0.856/0.857    -              0.831/0.831
              0.860/0.861                   0.834/0.834
CityU         0.847/0.852    0.803/0.806    -
              0.853/0.855    0.812/0.813

Table 7
Cross-testing of MAP and MTE training in terms of R_oov

Training      PKU            MSR            CityU
PKU (cvt.)    -              0.236/0.239    0.343/0.348
                             0.218/0.221    0.359/0.361
PKU (ucvt.)   -              0.186/0.193    0.353/0.354
                             0.161/0.164    0.364/0.365
MSR           0.181/0.184    -              0.254/0.254
              0.183/0.188                   0.264/0.266
CityU         0.515/0.533    0.391/0.397    -
              0.523/0.528    0.389/0.395

Table 8
Cross-testing of MAP and MTE training in terms of R_iv

Training      PKU            MSR            CityU
PKU (cvt.)    -              0.866/0.868    0.907/0.909
                             0.875/0.875    0.912/0.912
PKU (ucvt.)   -              0.867/0.869    0.924/0.926
                             0.876/0.876    0.930/0.930
MSR           0.931/0.932    -              0.917/0.917
              0.935/0.936                   0.922/0.922
CityU         0.905/0.909    0.867/0.868    -
              0.918/0.919    0.876/0.876


It is not surprising that the results of cross-testing were worse than those of testing on the same source corpus, due to differences in segmentation rules. For example, in the PKU corpora, Chinese personal names are segmented as two words, the surname and the first name, while in the MSR corpora they are regarded as a single word. Nevertheless, the performance of MTE was superior to that of MAP. In particular, the OOV recall rates of MTE were higher than those of MAP in all cross-testing experiments, which further demonstrates the OOV detection ability of the MTE training method.

5. Discussions

The main contribution of this paper is to propose a new criterion which integrates sentence tagging accuracy into the training process. The new criterion is inspired by the MPE criterion which has been successfully applied in the field of speech recognition. To verify the effectiveness of the proposed MTE criterion, experiments were conducted on Chinese word segmentation tasks. Because many researchers have used the CRF model for Chinese word segmentation on the Bakeoff 2005 datasets in recent years, it is easy to compare with the results obtained using these advanced systems. Table 9 lists the experimental results obtained here together with the best ones from previous research in terms of F-score and R_oov. The separator "/" is used to separate the values of F-score and R_oov. The last three columns in Table 9 contain the results presented in [32], [35], and [6], respectively.


Table 9
Comparison of our results with the best ones in terms of F-score/R_oov

Corpus   MTE           CRF (MAP)     Bakeoff 2005-best   [32]          [35]          [6]
PKU      0.952/0.782   0.952/0.777   0.950/0.813         0.951/0.748   0.952/0.672   0.946/0.848
MSR      0.972/0.724   0.972/0.722   0.964/0.718         0.971/0.712   0.974/0.750   0.965/0.612
CityU    0.956/0.756   0.955/0.749   0.943/0.736         0.951/0.741   0.948/0.692   0.937/0.561


Table 9 illustrates that the new training criterion presented here achieves state-of-the-art performance except for the results reported by Zhao and Kit [35] for the MSR corpus and the R_oov values reported by the Bakeoff 2005-best system and [6] for the PKU corpus. The highest R_oov value in Bakeoff 2005 was achieved by Zhou et al. [36]. Because the experiments here were performed on a closed track, some information, such as Arabic and Chinese numbers, alphabetical letters, and punctuation, was not used in this paper. Furthermore, because the purpose of this work is to demonstrate the effectiveness of the new criterion rather than purely to obtain good results for Chinese word segmentation, only the commonly used unigram and bigram features were used to train the model, whereas more features and more information were used in [6,36]. As a result, their OOV recall rates are higher than those presented here for the PKU corpus. The results using a CRF trained by the classical MAP method are shown in the third column and can be considered directly comparable with those obtained in [35], because both systems used the same feature templates and tag set (the 6-tag set). It is not immediately clear why the results differ; slight differences could be caused by different parameters in the Gaussian prior distribution or by a different version of the CRF toolkit.

The advantage of the new criterion lies in its directly minimizing the tag error, rather than maximizing the posterior probability of the correct sentence as the MAP criterion does. The MTE criterion is therefore better suited to the goal of obtaining a sequence with high tagging accuracy when labeling a new sentence. For example, the Chinese phrase in the PKU test corpus meaning "Beijing The China Millennium Monument" is wrongly segmented as "Beijing/China/century/altar" by the MAP-trained model, while it is correctly segmented as "Beijing/China/Millennium Monument" by the MTE-trained model developed in this work: the word for "the Millennium Monument" is an OOV word, but the two words for "century" and "altar" are known words in the PKU training corpus. From this example, it is evident that the new criterion developed in this research has a clear advantage in recognizing OOV words.

6. Conclusions

A new criterion, minimum tag error, is proposed in this paper for discriminative training of CRF. The new criterion is a smoothed approximation to the weighted average accuracy of all possible labeling sequences in the lattice, which can be optimized directly by gradient-based methods. Unlike the MAP training method, which maximizes the posterior probability of the correct sequence on the training dataset, the new criterion tries to make the more accurate transcriptions more likely. Corpora from the SIGHAN Bakeoff 2005 were used to test whether or not the MTE criterion outperforms the classical MAP training method.

The experimental work consisted of three parts. In the first part, the effectiveness of the proposed training criterion was evaluated with two tag sets (the 3-tag and 6-tag sets) using the PKU and MSR corpora on a closed track. The experimental results showed that the MTE training method exhibited improved segmentation performance compared with the MAP criterion, using either the 3-tag or the 6-tag set. The results were especially encouraging for the recall rate of OOV words. In the second part, the OOV identification abilities of MTE and MAP were compared. The two corpora with the lowest and highest OOV rates, MSR and CityU, were used to perform the comparison. It was observed that the MTE criterion had a clear advantage in recognizing OOV words. Furthermore, the higher the OOV rate, the more improvement MTE achieved. In the final part, the robustness of the new training method was investigated through cross-testing. The results revealed that the performance of the proposed MTE method was superior to that of MAP.

On the basis of the observed effect of the proposed MTE training method on Chinese word segmentation tasks, additional experiments are planned to test whether or not the MTE training method is suitable for other natural language processing (NLP) tasks such as NER or POS tagging. In addition, because the MTE objective function is not convex, other optimization methods such as stochastic gradient descent [9,28] and QuickProp [22] should also be tried to optimize the training procedure.

Acknowledgements

Thanks to Taku Kudo for providing the CRF toolkit package, to SIGHAN Bakeoff 2005 for providing the data, and to the reviewers for useful suggestions.

References

[1] Y. Altun, T. Hofmann, Large margin methods for label sequence learning, in: Proceedings of the Eighth European Conference on Speech Communication and Technology (EuroSpeech 2003), Geneva, Switzerland, 2003, pp. 993–996.
[2] B. Chen, Y.T. Chen, Extractive spoken document summarization for information retrieval, Pattern Recognition Letters 29 (4) (2007) 217–238.
[3] W. Chen, Y. Zhang, H. Isahara, Chinese named entity recognition with conditional random fields, in: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, Australia, 2006, pp. 118–121.
[4] Y. Chen, K.-P. Chan, Using data mining techniques and rough set theory for language modeling, ACM Transactions on Asian Language Information Processing 6 (1) (2007), Article 2.
[5] T. Emerson, The second international Chinese word segmentation bakeoff, in: Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, 2005, pp. 123–133.
[6] G. Fu, C. Kit, J.J. Webster, Chinese word segmentation as morpheme-based lexical chunking, Information Sciences 178 (9) (2008) 2282–2296.
[7] C.-L. Goh, M. Asahara, Y. Matsumoto, Chinese word segmentation by classification of characters, Computational Linguistics and Chinese Language Processing 10 (3) (2005) 381–396.
[8] S.S. Gross, O. Russakovsky, C.B. Do, S. Batzoglou, Training conditional random fields for maximum labelwise accuracy, in: Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS 2006), Vancouver, Canada, 2006.
[9] A. Gunawardana, M. Mahajan, A. Acero, J.C. Platt, Hidden conditional random fields for phone classification, in: Proceedings of the Ninth European Conference on Speech Communication and Technology (InterSpeech 2005), Lisbon, Portugal, 2005, pp. 1117–1120.
[10] M. Jeong, G.G. Lee, Practical use of non-local features for statistical spoken language understanding, Computer Speech and Language 22 (2008) 148–170.
[11] R. Klinger, L.I. Furlong, C.M. Friendrich, H.T. Mevissen, J. Fluck, S. Juliane, F. Sanz, M. Hofmann-Apitius, Identifying gene specific variations in biomedical text, Journal of Bioinformatics and Computational Biology 5 (6) (2007) 1277–1296.
[12] T. Kudo, CRF++: yet another CRF toolkit, 2007. <http://crfpp.sourceforge.net/>.
[13] J.W. Kuo, S.H. Liu, H.M. Wang, B. Chen, An empirical study of word error minimization approaches for mandarin large vocabulary speech recognition, International Journal of Computational Linguistics and Chinese Language Processing 11 (3) (2006) 201–222.
[14] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in: Proceedings of the 18th International Conference on Machine Learning (ICML 2001), Williamstown, MA, USA, 2001, pp. 282–289.
[15] C.-J. Lee, J.S. Chang, J.-S.R. Jang, Extraction of transliteration pairs from parallel corpora using a statistical transliteration model, Information Sciences 176 (2006) 67–90.
[16] G.-A. Levow, The third international Chinese language processing bakeoff: word segmentation and named entity recognition, in: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, Australia, 2006, pp. 108–117.
[17] A. McCallum, W. Li, Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons, in: Proceedings of the Seventh Conference on Natural Language Learning, Edmonton, Canada, 2003, pp. 188–191.
[18] H.-J. Oh, S.H. Myaeng, M.-G. Jang, Semantic passage segmentation based on sentence topics for question answering, Information Sciences 177 (18) (2007) 3696–3717.
[19] D. Povey, P.C. Woodland, Minimum phone error and I-smoothing for improved discriminative training, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), Orlando, Florida, USA, 2002, pp. 105–108.
[20] D. Povey, Discriminative training for large vocabulary speech recognition, Ph.D. Thesis, Cambridge University, Cambridge, 2004.
[21] F. Peng, F. Feng, A. McCallum, Chinese segmentation and new word detection using conditional random fields, in: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, 2004, pp. 562–568.
[22] J.L. Roux, E. McDermott, Optimization methods for discriminative training, in: Proceedings of the Ninth European Conference on Speech Communication and Technology (InterSpeech 2005), Lisbon, Portugal, 2005, pp. 3341–3344.
[23] F. Sha, F. Pereira, Shallow parsing with conditional random fields, in: Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics, Edmonton, Canada, 2003, pp. 213–220.
[24] J. Suzuki, E. McDermott, H. Isozaki, Training conditional random fields with multivariate evaluation measures, in: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, Sydney, Australia, 2006, pp. 217–224.
[25] B. Taskar, C. Guestrin, D. Koller, Max-margin Markov networks, in: Proceedings of the 17th Annual Conference on Neural Information Processing Systems (NIPS 2003), Vancouver, Canada, 2003.
[26] J.T. Tou, An intelligent full-text Chinese–English translation system, Information Sciences 125 (1–4) (2000) 1–18.
[27] I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, Large margin methods for structured and interdependent output variables, Journal of Machine Learning Research 6 (2005) 1453–1484.
[28] S. Vishwanathan, N. Schraudolph, M. Schmidt, K. Murphy, Accelerated training of conditional random fields with stochastic meta-descent, in: Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), Pittsburgh, USA, 2006, pp. 969–976.
[29] A. Wu, Customizable segmentation of morphologically derived words in Chinese, Computational Linguistics and Chinese Language Processing 8 (1) (2003) 1–28.
[30] N. Xue, Chinese word segmentation as character tagging, Computational Linguistics and Chinese Language Processing 8 (1) (2003) 29–48.
[31] M.Y. Zhang, Z.-D. Lu, C.-Y. Zou, A Chinese word segmentation based on language situation in processing ambiguous words, Information Sciences 162 (3–4) (2004) 275–285.
[32] R. Zhang, G. Kikui, E. Sumita, Subword-based tagging by conditional random fields for Chinese word segmentation, in: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, New York, USA, 2006, pp. 193–196.
[33] H. Zhao, C.N. Huang, M. Li, An improved Chinese word segmentation system with conditional random fields, in: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, Australia, 2006, pp. 162–165.
[34] H. Zhao, C.N. Huang, M. Li, B.L. Lu, Effective tag set selection in Chinese word segmentation via conditional random field modeling, in: Proceedings of PACLIC-20, Wuhan, China, 2006, pp. 87–94.
[35] H. Zhao, C. Kit, Unsupervised segmentation helps supervised learning of character tagging for word segmentation and named entity recognition, in: Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing, Hyderabad, India, 2008, pp. 106–111.
[36] J. Zhou, X. Dai, R. Ni, J. Chen, A hybrid approach to Chinese word segmentation around CRFs, in: Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, 2005, pp. 196–199.