
A top-down information theoretic word clustering algorithm for phrase recognition


Information Sciences 275 (2014) 213–225



http://dx.doi.org/10.1016/j.ins.2014.02.033
0020-0255/© 2014 Elsevier Inc. All rights reserved.

Yu-Chieh Wu
Department of Communication and Management, Ming-Chuan University, 250 Zhong Shan N. Rd., Sec. 5, Taipei 111, Taiwan
Tel.: +886 2 2882 4564x2100; fax: +886 2 2881 8675. E-mail address: [email protected]


Article history: Received 14 March 2012; Received in revised form 23 January 2014; Accepted 6 February 2014; Available online 27 February 2014

Keywords: Large-scale word clustering; Phrase chunking; Support vector machine

Abstract: Semi-supervised machine learning methods integrate both labeled and unlabeled training data. In most structural problems, such as natural language processing and image processing, developing labeled data for a specific domain requires a considerable amount of human effort. In this paper, we present a cluster-based method that fuses labeled training data with unlabeled raw data. We design a top-down divisive clustering algorithm that ensures maximal information gain in the use of unlabeled data by clustering similar words; to implement this idea, we design a top-down iterative K-means clustering algorithm to merge word clusters. The derived term groups are then encoded as new features for the supervised learners in order to improve the coverage of lexical information. Without additional training data or external materials, this approach yields state-of-the-art performance on the shallow parsing and base-chunking benchmark datasets (94.50 and 93.12 in F(b) rates).


1. Introduction

The task of arbitrary phrase chunking has recently received increasing attention in natural language processing (NLP) research [16,33]. The aim of text chunking is to identify predefined grammatical phrase structures in text, such as noun phrases (NP) and verb phrases (VP). These phrases are non-nested and non-recursive, i.e., they cannot be embedded in each other [1]. To automatically extract phrase chunks, supervised learning methods require sufficient labeled data to achieve state-of-the-art performance [46,49].

In a supervised learning framework, a well-annotated corpus is created first, which usually requires a considerable amount of domain experts' time. A trained chunking system (chunker) can then recognize novel text online by encoding and representing context features. Features over a (non)fixed window of local context words are chiefly derived from the given annotated corpus and are directly represented as training examples. Such features may include surface words, part-of-speech (POS) tags, and predefined orthographic types.

Although supervised learning methods can rapidly produce a robust chunking system, the requirement of substantial amounts of training data is still an impediment to the quick deployment of phrase chunking in new languages or domains. It is often the case that developing sizable training data is considerably time-consuming, whereas the amount of raw data is growing rapidly. To further exploit unlabeled data, several studies have investigated how to incorporate raw data [17]. A few examples include co-training [5,32], semi-supervised structural learning [2,41,45], and


unsupervised methods [12,53]. Koo et al. [26] demonstrated excellent performance on dependency parsing using word clusters [7]. In text categorization, word clustering [34] gives more powerful results than traditional word feature selection criteria, such as mutual information and chi-square statistics [4,39]. Moreover, later work [37] significantly improved named-entity recognition by clustering words over large-scale unlabeled data.

Unlike previous studies, this paper presents a new word-clustering algorithm based on information gain. It begins by deriving an objective function that explicitly minimizes/maximizes the probability distributions within/between clusters. The derived objective function is used to evaluate the information gain of a clustering. We then design a top-down divisive clustering algorithm that optimizes this objective based on measuring the probability distributions between clusters or words. Following this, we analyze the computational time complexity of our word-clustering algorithm and compare it with other top-down, bottom-up, and divisive methods. The clustered term groups are then encoded as part of the feature set in our chunking system. In order to capture different levels of word clusters, we aggregate the clustering results obtained with different numbers of clusters. Our work differs considerably from so-called semi-supervised SVMs [21,38], which perform training and testing directly on unlabeled data; our method instead expands the feature set by deriving cluster features from the unlabeled data, whereas the traditional semi-supervised SVM adjusts the learner's parameters using both labeled and unlabeled data. We investigated the effect of attaching term clusters for three chunking tasks: CoNLL-2000 shallow parsing, large-scale shallow parsing, and base-chunking. The comparisons are reported in the following sections.

The rest of this paper is organized as follows. Section 2 presents the problem settings and related work. Section 3 describes the supervised chunking model, and Section 4 explains the proposed word clustering algorithm. Section 5 gives the evaluation and discussion of our system. Finally, Section 6 concludes and outlines future work.

2. Prior arts in chunking

Arbitrary phrase chunking is a classic task in natural language processing. It can be viewed as a sequence labeling task [35]. This formulation is employed to solve other NLP tasks as well, such as named entity recognition [2], clause identification, parsing [36,44], and Chinese word segmentation [48,53]. With the rapid development of machine learning algorithms, the use of supervised learning approaches has become mainstream in this field. For example, [25,28,46] showed excellent chunking performance using SVM classifiers. Zhang et al. [51] combined parser outputs as features to improve chunking. Multiple classifier-based approaches, such as voted perceptrons [8] and memory-based learning [43], were also proposed to enhance a single classifier by combining multiple machine learning algorithms.

The above approaches do not take the entire state sequence into account during training and testing. At test time, they predict each label by considering only local contextual information. This easily leads to the so-called label-bias problem [27], especially when the context features are limited. To address this, structural learning algorithms, such as conditional random fields (CRFs) [27] and SVM–HMM [22], were designed to optimize over the whole structure rather than over local context. Structural learners thus have the advantage of considering the entire structure instead of a limited history, and have been successfully applied to many tasks, including syntactic phrase chunking [41,42] and sequence reranking [11].

Most recent semi-supervised learning methods, which integrate supervised learning algorithms with sizable unlabeled data, show improved results over traditional machine learning algorithms. In 2005, Ando and Zhang proposed an early approach that creates multiple classifiers with unlabeled data; the prediction is determined by combining hundreds of single machine learning classifiers. Suzuki and Isozaki [41] and Turian et al. [45] presented semi-supervised structural learning algorithms by including a loss function over unlabeled data. Using word information from the unlabeled data provides an alternative route to semi-supervised learning; its main advantage is that it can be encoded as a feature for arbitrary machine learning classifiers. Koo et al. [26] exhibited excellent performance on dependency parsing using word clusters [7]. In text categorization, word clustering [34] gives more powerful results than traditional word feature selection criteria, such as mutual information and chi-square statistics [4,39]. Ratinov and Roth [37] reported that combining supervised learners and word clusters significantly improves the performance of named entity recognition.

3. Supervised phrase chunking

3.1. Phrase chunking

Ramshaw and Marcus [35] first proposed an inside/outside labeling style to represent noun phrase chunks. This scheme involves three main tags: B, I, and O. The I tag marks a token that is inside a chunk, the B tag marks the beginning of a chunk that immediately follows another chunk, and the O tag marks a token that does not belong to any chunk. This scheme is also called IOB1. Tjong Kim Sang [43] derived three alternative versions: IOB2, IOE1, and IOE2.


IOB2: Unlike IOB1, the B tag marks the beginning token of every chunk, and the other tokens inside a chunk are labeled I.
IOE1: An E tag marks the ending token of a chunk that is immediately followed by another chunk.
IOE2: The E tag is given to every token that ends a chunk.

To illustrate the four representation styles with an example, consider the incomplete sentence "In early trading in busy Hong Kong Monday." The four representation styles of the sentence are listed as follows.

An example for IOB1/2 and IOE1/2 chunk representation styles

Word      IOB1   IOB2   IOE1   IOE2
In        O      O      O      O
early     I      B      I      I
trading   I      I      I      E
in        O      O      O      O
busy      I      B      I      I
Hong      I      I      I      I
Kong      I      I      E      E
Monday    B      B      I      E

This example encodes only the noun phrase chunk type. It can be extended to mark other phrase types by appending the specific type behind the I/B/E tags. For example, B-VP represents the beginning of a verb phrase (VP) in the IOB2 style.
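As a concrete illustration of these representation styles, the short sketch below converts a sentence's IOB2 tags into IOE2 tags. It is a minimal, hypothetical helper written for this discussion, not code from the paper.

```python
def iob2_to_ioe2(tags):
    """Convert one sentence's IOB2 chunk tags (B-X / I-X / O) into IOE2 tags.

    In IOE2 the last token of every chunk receives E-X, all other
    chunk-internal tokens receive I-X, and non-chunk tokens stay O.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append("O")
            continue
        ctype = tag.split("-", 1)[1]
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        # The current chunk ends here unless the next tag continues it (I-<same type>).
        out.append(("I-" if nxt == "I-" + ctype else "E-") + ctype)
    return out

# "In early trading in busy Hong Kong Monday." with NP chunks in IOB2:
print(iob2_to_ioe2(["O", "B-NP", "I-NP", "O", "B-NP", "I-NP", "I-NP", "B-NP"]))
# -> ['O', 'I-NP', 'E-NP', 'O', 'I-NP', 'I-NP', 'E-NP', 'E-NP']  (the IOE2 column above)
```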

3.2. General chunking models

In general, contextual information is used as the basic feature type, and further features can then be derived from the surrounding words, such as the words themselves and their POS tags. The chunk tag of the current token is mainly determined by this contextual information. Similar to previous research [18,25], we employ a classification algorithm (an SVM) to classify the chunk class of each token by encoding its context features. All features are enumerated in a vector space model, and the SVM labels the vectors. Unlike traditional text categorization [50], the chunking task performs the classification on each word sequentially and incrementally. In accordance with previous studies [25,46,47], the following features are compiled to form the training and testing vectors.

- Lexical information (unigram/bigram).
- POS tag information (UniPOS/BiPOS/TriPOS).
- Affixes (2–4 letter suffixes and prefixes).
- Previous chunk information (UniChunk/BiChunk).
- Word feature type [46].
- Possible chunk classes [46].
- Word + POS bigram (current token + next token's POS tag).

In addition, the chunking direction can be reversed from left-to-right to right-to-left. The original left-to-right process classifies tokens in reading order, i.e., the class of the current token is determined after chunking all of its preceding tokens. In the reverse process, chunking begins with the last token of the sentence and ends with the first token. We call the original process forward chunking and the reverse process backward chunking.

In this paper, we employ linear-kernel SVMs trained with L2-norm dual coordinate descent optimization [19] as the classification algorithm. SVMs have shown considerable success in classification problems [18,25]. Because the SVM is a binary classifier, the traditional multiclass classification problem has to be converted into multiple binary problems; we use the one-versus-all (OVA) decomposition. As discussed in [18,46,47], the linear kernel is far more efficient than polynomial kernels, so we chose the linear kernel for reasons of efficiency.
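The sketch below illustrates the overall pipeline under simplifying assumptions: a reduced version of the feature set above is encoded as a sparse vector per token and fed to a linear SVM trained in one-versus-all fashion. The feature names, the toy sentence, and the use of scikit-learn's DictVectorizer/LinearSVC (standing in for the L2-norm dual coordinate descent solver of [19]) are illustrative choices, not the author's implementation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def token_features(words, pos, i):
    """A reduced version of the context features listed above, for token i."""
    return {
        "w0": words[i], "p0": pos[i],
        "w-1": words[i - 1] if i > 0 else "<BOS>",
        "w+1": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "p0|p+1": pos[i] + "|" + (pos[i + 1] if i + 1 < len(pos) else "<EOS>"),
        "pre2": words[i][:2], "suf2": words[i][-2:],
    }

# A single toy sentence stands in for the CoNLL-2000 training data (IOE2 tags).
words = ["In", "early", "trading", "in", "busy", "Hong", "Kong", "Monday"]
pos   = ["IN", "JJ", "NN", "IN", "JJ", "NNP", "NNP", "NNP"]
tags  = ["O", "I-NP", "E-NP", "O", "I-NP", "I-NP", "E-NP", "E-NP"]

vec = DictVectorizer()
X = vec.fit_transform([token_features(words, pos, i) for i in range(len(words))])

# LinearSVC trains one linear classifier per chunk class (one-versus-rest by default),
# mirroring the OVA decomposition; C = 0.1 follows the hyperparameter used in Section 5.1.
clf = LinearSVC(C=0.1)
clf.fit(X, tags)
print(clf.predict(vec.transform([token_features(words, pos, 6)])))
```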

4. Top-down information theoretic word clustering

In this section, we first briefly introduce the principle of information theoretic clustering. Then, we derive global objective functions to obtain optimal information gain. Based on these objective functions, a top-down divisive clustering algorithm is presented.


4.1. Information theoretic clustering

Word clustering requires a measure of similarity between words and clusters (which contain several words). Information theoretic clustering evaluates the probability distribution of each cluster over a set of events. Let C = {c1, c2, ..., cM} be the chunk class set (IOB tags). P(C|wt) is the probability distribution of word wt over the whole category set. When a cluster CLj contains two words, wt and ws, the probability distribution P(C|CLj) is the weighted average of P(C|wt) and P(C|ws). Eq. (1) defines the probability distribution of cluster CLj.

$$P(C\mid CL_j) = P(C\mid w_t \vee w_s) = \frac{P(w_t)}{P(CL_j)}\,P(C\mid w_t) + \frac{P(w_s)}{P(CL_j)}\,P(C\mid w_s) \qquad (1)$$

where P(CLj) = P(wt) + P(ws). The KL-divergence (see Eq. (2)) can be used to estimate the difference between two probability distributions, i.e., the KL-distance. Therefore, the probability distributional distance between the word wt and the cluster CLj can be computed using the KL-divergence.

$$\mathrm{KL}\bigl(P(C\mid w_t)\,\|\,P(C\mid CL_j)\bigr) = \sum_{i=1}^{M} P(c_i\mid w_t)\,\log\frac{P(c_i\mid w_t)}{P(c_i\mid CL_j)} \qquad (2)$$

However, KL-divergence is not symmetric and does not obey the triangle inequality ([13], p. 18). In contrast, the Jensen–Shannon (JS) divergence is symmetric and bounded [30]; it combines both directions of the KL-divergence. The JS-divergence is defined by Eq. (3).

$$\mathrm{JS}\bigl(P(C\mid w_t)\,\|\,P(C\mid CL_j)\bigr) = \frac{P(w_t)}{P(w_t)+P(CL_j)}\,\mathrm{KL}\bigl(P(C\mid w_t)\,\|\,P(C\mid CL_j)\bigr) + \frac{P(CL_j)}{P(w_t)+P(CL_j)}\,\mathrm{KL}\bigl(P(C\mid CL_j)\,\|\,P(C\mid w_t)\bigr) \qquad (3)$$
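As a numerical illustration of Eqs. (1)–(3), the sketch below builds a two-word cluster distribution and evaluates the KL- and JS-divergences. The probability values are invented for the example, and the additive smoothing used to keep the logarithms finite is an implementation detail not specified in the paper.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) over the chunk-class set C, as in Eq. (2)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def js(p_w, p_cl, prior_w, prior_cl):
    """Prior-weighted JS divergence between P(C|w_t) and P(C|CL_j), as in Eq. (3)."""
    total = prior_w + prior_cl
    return (prior_w / total) * kl(p_w, p_cl) + (prior_cl / total) * kl(p_cl, p_w)

# Invented distributions P(C|w_t), P(C|w_s) over three chunk classes, with priors P(w_t), P(w_s).
p_wt, p_ws = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
P_wt, P_ws = 0.02, 0.01

# Eq. (1): the cluster distribution is the prior-weighted average of its members' distributions.
P_cl = P_wt + P_ws
p_cl = (P_wt / P_cl) * np.asarray(p_wt) + (P_ws / P_cl) * np.asarray(p_ws)

print("KL:", kl(p_wt, p_cl), "JS:", js(p_wt, p_cl, P_wt, P_cl))
```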

Information theoretic clustering forms clusters so as to minimize the global information loss. This can be quantified by mutual information (MI), a measure of the amount of information one random variable contains about another. In other words, MI can be used to measure the information loss of a clustering result. The information loss function [15] of a clustering is defined by Eq. (4).

$$\text{Information Loss} = \mathrm{MI}(C;W) - \mathrm{MI}(C;CL_K) \qquad (4)$$

The first term measures the mutual information between the whole word set W and the event set, whereas the second term is the mutual information between a clustering result CL_K (K is the number of clusters) and the event set.

In 2003, Dhillon et al. [15] (Theorem 1 and Lemma 2) derived a measurement of the information loss via the JS-divergence and KL-divergence:

$$\mathrm{MI}(C;W) - \mathrm{MI}(C;CL_K) = \sum_{j=1}^{K}\;\sum_{w_t \in CL_j} P(w_t)\,\mathrm{KL}\bigl(P(C\mid w_t)\,\|\,P(C\mid CL_j)\bigr) \qquad (5)$$

Instead of the symmetric JS-divergence, Eq. (5) simplifies the measurement to the sum of KL-divergences "within" each cluster. Ideally, a good clustering should minimize the information loss, i.e., minimize Eq. (5).

In contrast to these minimization objectives, we derive objective functions that optimize the information "gain." Minimizing the information loss is equivalent to maximizing the information gain (see Eq. (6)). That is,

$$\text{Information Gain} = \mathrm{MI}(C;CL_K) - \mathrm{MI}(C;W) \qquad (6)$$

Here, we use the conventional definition of mutual information to expand the two terms of Eq. (6), as shown in Eq. (7).

$$\mathrm{MI}(C;W) = \sum_{i=1}^{M}\;\sum_{w_t \in W} P(c_i,w_t)\,\log\frac{P(c_i,w_t)}{P(c_i)P(w_t)}$$
$$\mathrm{MI}(C;CL_K) = \sum_{i=1}^{M}\sum_{j=1}^{K} P(c_i,CL_j)\,\log\frac{P(c_i,CL_j)}{P(c_i)P(CL_j)} = \sum_{i=1}^{M}\sum_{j=1}^{K} P(CL_j)\,P(c_i\mid CL_j)\,\log\frac{P(c_i\mid CL_j)}{P(c_i)} \qquad (7)$$

As Eq. (7) shows, MI(C;W) is a fixed constant, whereas MI(C;CL_K) measures the amount of information in the current clustering result. Using the KL-divergence (Eq. (2)), we can simplify the definition of MI(C;CL_K) as follows.


$$\mathrm{MI}(C;CL_K) = \sum_{j=1}^{K} P(CL_j)\,\mathrm{KL}\bigl(P(C\mid CL_j)\,\|\,P(C)\bigr) = \sum_{j=1}^{K} P(CL_j)\,\mathrm{KL}\bigl(P(C\mid CL_j)\,\|\,P(C\mid W)\bigr) \qquad (8)$$

P(C) is the probability distribution over the whole word set, which equals P(C|W). Eq. (8) states that the mutual information of a clustering is equivalent to the probability-distribution distance "between" the clusters and the overall empirical distribution, i.e., the KL-divergence "between" clusters. Recalling the information gain (Eq. (6)), maximizing Eq. (8) is equivalent to optimizing the information gain: the larger the distance between clusters, the larger the gain. This result also coincides with the objective of other clustering techniques, such as bisecting K-means [52], which seeks the clustering that optimizes the cosine-similarity measurement between clusters.

Consequently, by combining the minimization of information loss with the maximization of information gain, a hybrid objective function is produced that considers both criteria (see Eq. (9)).

$$\sum_{j=1}^{K} \frac{P(CL_j)\,\mathrm{KL}\bigl(P(C\mid CL_j)\,\|\,P(C\mid W)\bigr)}{\sum_{w_t \in CL_j} P(w_t)\,\mathrm{KL}\bigl(P(C\mid w_t)\,\|\,P(C\mid CL_j)\bigr)} \qquad (9)$$

In Eq. (9), the denominator is the information loss [15], whereas the numerator is derived from the information gain of Eq. (8). The ratio therefore measures the probability distributions "between/within" clusters simultaneously. Our clustering objective is to maximize Eq. (9), which can be viewed as a criterion for the quality of the current clusters. Clearly, Eq. (9) takes more information about the clusters into account than the original information loss function (Eqs. (5) and (6)): the information loss accumulates individual distances between each word and its cluster, whereas the information gain focuses on the distances between the word clusters. We combine both terms to obtain better clustering quality. As reported in [52], hybrid criteria often perform well in document clustering tasks.
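To make the three criteria concrete, the following sketch evaluates the information loss of Eq. (5), the gain term of Eq. (8), and the hybrid ratio of Eq. (9) for a given cluster assignment. The data layout (word priors P(w), per-word distributions P(C|w), a cluster-to-words map) and the guard against a zero denominator are assumptions made for the example.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence over the chunk-class set C (repeated from the sketch above)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def objectives(clusters, P_w, P_C_w, P_C_W):
    """clusters: {cluster id: [words]}. Returns (Eq. (5) loss, Eq. (8) gain, Eq. (9) ratio)."""
    loss = gain = ratio = 0.0
    for members in clusters.values():
        p_cl = sum(P_w[w] for w in members)                                   # P(CL_j)
        p_c_cl = sum(P_w[w] * np.asarray(P_C_w[w]) for w in members) / p_cl   # P(C|CL_j), Eq. (1)
        within = sum(P_w[w] * kl(P_C_w[w], p_c_cl) for w in members)          # Eq. (5) summand
        between = p_cl * kl(p_c_cl, P_C_W)                                    # Eq. (8) summand
        loss += within
        gain += between
        ratio += between / max(within, 1e-12)    # Eq. (9); guard against pure (zero-loss) clusters
    return loss, gain, ratio

# Toy data: three words, two chunk classes, and a tentative 2-cluster assignment.
P_w = {"bank": 0.4, "loan": 0.3, "run": 0.3}
P_C_w = {"bank": [0.9, 0.1], "loan": [0.8, 0.2], "run": [0.2, 0.8]}
P_C_W = sum(P_w[w] * np.asarray(P_C_w[w]) for w in P_w)                       # P(C|W)
print(objectives({0: ["bank", "loan"], 1: ["run"]}, P_w, P_C_w, P_C_W))
```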

Our clustering algorithm iteratively refines the clusters until the value of Eq. (9) no longer changes. In the next section, we describe the clustering algorithm and its connection to the derived objective functions.

4.2. Clustering algorithm

Our top-down information theoretic clustering is based on a divisive scheme that repeatedly creates a new cluster and incrementally applies the K-means algorithm. We use a heuristic to initialize the clusters, and the algorithm then sequentially searches for the clustering that achieves the goal of the objective functions. Note that our method carefully creates each new cluster so that it contributes optimal information gain, differing from [15], which initializes all clusters and applies K-means once. In contrast, our method deliberately selects a word as the initial centroid of each new cluster by maximizing the information gain, until the target cluster number is reached. The entire process can thus be viewed as top-down clustering, which avoids the high risk of randomly and simultaneously selecting the seed terms of all clusters. The proposed top-down clustering algorithm is presented in Fig. 1.

In the initialization step, we assign each word to its initial cluster by likelihood estimation. We then check for and remove unambiguous words to reduce the word set; because such terms occur only in a specific category, we can assume that the initialization step produces K0 clusters, where K0 < K. A very small number of clusters usually introduces considerable noise and harms the downstream applications, so we skip word clustering when the target number K is smaller than the number of categories; instead, we simply employ a large number of word groups for the clustering.

Our top-down clustering algorithm selects, on the one hand, the cluster that contributes the largest information loss (step 5), and on the other hand forms a new cluster by maximizing the information gain of the selected cluster (step 6). Steps (4)–(6) serve to optimize the objective functions. In this paper, we present three objective functions for achieving minimal information loss or maximal information gain (Eqs. (5), (8), and (9)). Although various combinations are possible, we simply use Eq. (5) to select a cluster and create the new cluster by maximizing the information gain (Eq. (8)). Forming a new cluster by exhaustively scanning the whole word set would be expensive, however, so the seed word is chosen from the selected cluster only.

A classic K-means algorithm iteratively performs clustering to optimize its objective function. By repeating steps 2 to 6, our algorithm proceeds until the desired number of clusters, K, is obtained.
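Since Fig. 1 is not reproduced here, the sketch below gives one simplified reading of steps 1–6 as described in the text, under assumptions of our own: words are seeded into K0 clusters by their most likely chunk class, the K'-means step reassigns each word to the cluster whose distribution is closest in KL distance, the cluster with the largest Eq. (5) contribution is selected for splitting, and the member whose singleton gain under Eq. (8) is largest seeds the new cluster. It is a sketch of the idea, not the author's exact procedure; kl() is the helper from the earlier sketches.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def centroid(members, P_w, P_C_w):
    """P(C|CL_j): prior-weighted average of the members' distributions (Eq. (1))."""
    p_cl = sum(P_w[w] for w in members)
    return sum(P_w[w] * np.asarray(P_C_w[w]) for w in members) / p_cl

def top_down_cluster(words, P_w, P_C_w, P_C_W, K, n_iter=10):
    # Step 1: initialise K0 clusters by each word's most likely chunk class.
    clusters = {}
    for w in words:
        clusters.setdefault(int(np.argmax(P_C_w[w])), []).append(w)

    while True:
        # Steps 2-4: K'-means-style reassignment using KL distance to the cluster centroids.
        for _ in range(n_iter):
            cents = {j: centroid(m, P_w, P_C_w) for j, m in clusters.items() if m}
            reassigned = {j: [] for j in cents}
            for w in words:
                best = min(cents, key=lambda j: kl(P_C_w[w], cents[j]))
                reassigned[best].append(w)
            clusters = {j: m for j, m in reassigned.items() if m}

        if len(clusters) >= K:
            return clusters

        # Step 5: select the splittable cluster contributing the largest information loss (Eq. (5)).
        def cluster_loss(members):
            c = centroid(members, P_w, P_C_w)
            return sum(P_w[w] * kl(P_C_w[w], c) for w in members)
        j_split = max((j for j in clusters if len(clusters[j]) > 1),
                      key=lambda j: cluster_loss(clusters[j]))

        # Step 6: seed a new cluster with the member maximising the singleton gain (Eq. (8)).
        seed = max(clusters[j_split], key=lambda w: P_w[w] * kl(P_C_w[w], P_C_W))
        clusters[j_split].remove(seed)
        clusters[max(clusters) + 1] = [seed]
```

Running such a procedure with K set to 25, 50, and 100 would yield the three cluster granularities later used as features in Section 5.2.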

The well-known bisecting K-means algorithm [52] is also a top-down clustering framework. Its main concept is to split a cluster into two groups so as to optimize the coherent similarity between or within the clusters. Leaving aside the differences between the objective functions, our method considers all the clusters instead of merely splitting one cluster into two; in other words, we apply K'-means at each step, whereas bisecting K-means runs exactly 2-means. Furthermore, our top-down information theoretic clustering


Fig. 1. Top-down information theoretic word clustering algorithm.


algorithm forms clusterings that maximize the information gain, whereas bisecting K-means addresses a cosine-based coherence similarity measurement among clusters.

Our method also differs considerably from the distributional clustering techniques that have been widely applied to text categorization, including divisive information theoretic clustering [15], agglomerative distributional clustering [3], and the agglomerative information bottleneck [39,40]. We use an initialization method similar to that of [15]. The two agglomerative clustering methods merge the two closest clusters step by step and then add a word; divisive information theoretic clustering, as discussed above, runs K-means once and uses only Eq. (5) as its objective. In contrast, our method performs divisive clustering incrementally and produces each new cluster so as to maximize the information gain.

4.3. Time complexity analysis

We now analyze the computational time complexity of our clustering algorithm. For simplicity, we consider only the case where all words are ambiguous. As shown in Fig. 1, steps 2–4 repeatedly run the K'-means algorithm, which costs O(K'RN), where R is the number of iterations until the stopping criterion is reached and N is the size of the whole word set. Moreover, the cost of creating a new cluster (steps 5 and 6) is at most N + N = 2N (scanning the whole set twice). The overall computational time complexity is therefore:

$$(NR + 2N) + (2NR + 2N) + \cdots + (KNR + 2N) = \tfrac{1}{2}K(K+1)R'N = O(K^2 R N) \qquad (10)$$

where R' = R + 2. As shown above, our method scales poorly when the target number of clusters K is set too high. In general, however, K is significantly smaller than the size of the whole word set N. The conventional K-means algorithm costs O(KNR), but it is quite sensitive to the selection of the initial centroids. One could employ the initialization method of [15]; in our case, however, K is considerably larger than the number of categories, so many clusters would still be initialized randomly.


Table 1
Data statistics of the three tasks: shallow parsing, base chunking, and large-scale shallow parsing.

Data statistics                # of examples   # of sentences   # of categories

Shallow parsing (CoNLL-2000)
  Training                     220,663         8935             11 × 2 + 1 = 23
  Testing                      49,389          2011

Base chunking
  Training                     950,028         39,832           20 × 2 + 1 = 41
  Testing                      56,684          2416

Large-scale shallow parsing
  Training                     956,696         40,063           11 × 2 + 1 = 23
  Testing                      217,070         9145


Nevertheless, the time complexity of agglomerative distributional clustering [3], which is similar to ours, is O(K^2 N). The agglomerative information bottleneck [40] is rather inefficient, costing O(L^3), where L is the number of top-ranked selected words (usually L > K). Compared with agglomerative information bottleneck clustering, our method is computationally superior.

4.4. Comparisons

The most similar work is that of Ratinov and Roth [37], who showed positive improvements on named entity recognition by deriving word clusters from a large amount of unlabeled data. Our method differs from [37] as follows: (a) the clustering algorithm – their clustering algorithm has cubic complexity; (b) we combine three cluster-size levels; and (c) the task – we focus on phrase chunking, whereas they focused on named entity recognition. Ando and Zhang [2] designed a very large and complicated semi-supervised phrase chunking system: they trained 500 single classifiers with different features and created training data from large amounts of unlabeled data, so training and testing must be performed for all 500 classifiers. In contrast, our method only needs to cluster the words once and form additional features for a single classifier. The focus of their study is to adjust the parameters of the learner rather than to mine new features from the data.

A number of previous studies have investigated supervised SVM-based phrase chunking over the past decade. Kudo and Matsumoto [25] proposed an early approach using multiple polynomial-kernel SVMs; they combined eight classifiers but achieved an improvement of only 0.05%. More recently, Wu et al. [46] presented a mask method that adjusts the SVM by deriving additional training examples with incomplete lexical information. One advantage of this method is that it is independent of external resources and achieves optimal accuracy with a single SVM classifier. Our method differs considerably from Wu's work because their method does not create any additional features from the training or unlabeled data, whereas ours uses word clusters to add features that enhance the classifier. In principle, the two methods can be combined.

5. Experimental results

In this section, two issues are investigated. We first report the chunking results on three tasks: shallow parsing, base-chunking, and large-scale shallow parsing. Second, we compare the chunking performance of our method with other chunking systems on the same benchmark corpora. The benchmark corpus for base-chunking is the Wall Street Journal (WSJ) portion of the English Treebank, of which sections 02–21 are used for training and section 23 for testing. For the shallow parsing task, WSJ sections 15–18 are used for training and section 20 for testing. Sections 0–19 are used as training data for large-scale chunking, and sections 20–24 for evaluation. The POS tags for the three tasks are generated by the Brill tagger [6]. Table 1 lists the statistics of the benchmark corpora.

The second column of Table 1 shows the number of training examples for each dataset, and column 3 lists the number of sentences. The last column indicates the number of actual categories in the task. To represent a phrase structure, two special lead tags are used, Begin (B) and Interior (I), while the "O" tag marks non-phrase words [43]. Hence, 11 phrase types result in 11 (phrase types) × 2 (lead tags) + 1 (O tag) = 23 categories.

In 2008, Wu et al. conducted experiments on the four representation styles of lead tags and on the left-to-right and right-to-left classification directions using the same benchmark corpus. As reported in [47], IOE2 with backward chunking gives the best accuracy, and we follow the settings suggested in [47].

The performance of the chunking task is measured by three rates: recall, precision, and F(b). CoNLL released a Perl-script evaluator1 that computes the three rates automatically.

1 http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt.
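For reference, the evaluator uses the standard chunk-level definitions of these rates (the paper itself does not spell them out), with β = 1 throughout the tables below:

$$\text{precision} = \frac{\#\,\text{correctly predicted chunks}}{\#\,\text{predicted chunks}}, \qquad \text{recall} = \frac{\#\,\text{correctly predicted chunks}}{\#\,\text{gold chunks}}$$
$$F_{\beta} = \frac{(\beta^{2}+1)\cdot\text{precision}\cdot\text{recall}}{\beta^{2}\cdot\text{precision}+\text{recall}}$$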


5.1. Settings

Our top-down word clustering was performed on enormous amounts of unlabeled training and testing data.2 We utilized the TREC CD-1 and CD-2 WSJ 1988–1992 sections as unlabeled data. Sentences containing less than 90% lowercase letters were ignored. After cleaning, the data contained 237,892 distinct words and 157,346 sentences. All words were tokenized and POS-tagged by the Brill tagger. Note that the category list for word clustering was not yet annotated; we therefore directly applied our trained chunking system to label the chunk class of each token, and used the predicted chunk classes to initialize the clusters and to extract the context (±1 word) POS and chunk tags that form the category list C for the subsequent clustering. For the three chunking tasks, we used three different chunkers to obtain their corresponding word clusters.

It is not a good idea to cluster punctuation and stop words, since they are not meaningful and appear frequently in text. Before word clustering, all stop words and punctuation are therefore removed, and words are converted to lowercase.

Similar to Brown clustering [26], our clustering algorithm is run with various numbers of clusters. Using coarse-to-fine cluster sizes allows us to capture different levels of word clusters; we set the cluster sizes to 25, 50, and 100.

The detailed technical settings of our learner are as follows. First, the word shape features were included (see Table 1 in [46]). The SVM hyperparameter is set to 0.1 for all experiments, following previous research [46]. Words that occur less than twice in the training data are ignored. To search for the optimal prediction sequence, we adopt the Viterbi search as in [48].

5.2. Cluster feature encoding

Using word clusters for feature encoding is rather simple: only the cluster number is used. We treat the cluster number as a cluster label and form a feature from it. For each word, we look up the corresponding cluster numbers at level 1, level 2, and level 3; the word cluster label can thus be viewed as a second kind of POS tag. Levels 1 (cluster size = 25), 2 (cluster size = 50), and 3 (cluster size = 100) correspond to the three cluster granularities. For example, the word "stations" belongs to cluster 15 at level 1 and cluster 39 at level 2. The table below illustrates the encoding of cluster features; our method thus derives three additional features (the level 1–3 cluster numbers) for each word.

The level 1–3 features give important clues to the classifier: they provide semantically or syntactically related word groups and also increase the coverage of unknown words. In our setting, we directly adopt the three levels as new features of the original supervised method. Similar to conventional lexical features, the unigrams and bigrams of the level 1–3 labels are encoded in the supervised chunking system. For example, to predict the chunk tag of "Angeles," the bigrams of the level 1–3 labels are (C17, C10) and (C10, C81).

2 We did not take the actual chunk tag of the testing data into account; instead, we only focus on word clustering.

Word       POS   Level 1   Level 2   Level 3   Chunk tag
Two        CD    O         O         O         I-NP
Los        NP    C9        C41       C48       I-NP
Angeles    NP    C17       C10       C81       I-NP
radio      NN    C1        C13       C67       I-NP
stations   NNS   C15       C39       C59       E-NP
…          …     …         …         …         …

An example of encoding cluster features.
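A minimal sketch of how the three cluster levels could be appended to a token's feature dictionary is given below. The lookup tables mirror the toy table above, while the feature names, the bigram built from the previous and current tokens' labels, and the "O" fallback for unclustered words are illustrative assumptions rather than the paper's exact encoding.

```python
# Hypothetical cluster lookups produced by clustering with K = 25, 50 and 100 (values from the table above).
level1 = {"Los": "C9",  "Angeles": "C17", "radio": "C1",  "stations": "C15"}
level2 = {"Los": "C41", "Angeles": "C10", "radio": "C13", "stations": "C39"}
level3 = {"Los": "C48", "Angeles": "C81", "radio": "C67", "stations": "C59"}

def add_cluster_features(feats, words, i):
    """Add unigram and bigram cluster-label features for token i to an existing feature dict."""
    for name, table in (("L1", level1), ("L2", level2), ("L3", level3)):
        cur = table.get(words[i], "O")                               # "O" for unclustered tokens
        prev = table.get(words[i - 1], "O") if i > 0 else "<BOS>"
        feats[name + ":" + cur] = 1                                  # unigram cluster label
        feats[name + "-bi:" + prev + "|" + cur] = 1                  # bigram with previous token's label
    return feats

words = ["Two", "Los", "Angeles", "radio", "stations"]
print(sorted(add_cluster_features({}, words, 2)))
# -> ['L1-bi:C9|C17', 'L1:C17', 'L2-bi:C41|C10', 'L2:C10', 'L3-bi:C48|C81', 'L3:C81']
```

These dictionary entries would simply be merged with the lexical and POS features of Section 3.2 before vectorization.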

5.3. Validation results

First, we conducted experiments on validation data. The validation data is derived from the training set and is independent of the testing data. We performed n-fold cross-validation over it: the training set was divided into four equal parts, and in each fold three parts formed the training data while the remaining part was used for validation. The training-to-validation ratio is thus 3:1, a ratio widely used in previous work [2,45].

Table 2 lists the empirical results of the three tasks: CoNLL-2000, base-chunking, and large-scale chunking. To demonstrate the effect of our method, we compare the accuracies of two models, (A) and (B). Model (A) is trained without the clustered word groups, whereas model (B) encodes the word clusters as an additional feature set of the chunking system. Accuracy is evaluated by the F-measure. Each row reports the F-measure on one fold, and the final row is the average F-measure over the four cross-validation folds.

As shown in Table 2, word clustering yields better accuracy than the original approach; for each fold, our method consistently improves the F-measure over model A. Among the three tasks, our method achieves the greatest improvement on


Table 2
Cross-validation results of the three tasks.

         CoNLL-2000           Base chunking         Large-scale chunking
         Model A   Model B    Model A   Model B     Model A   Model B
Part 1   93.91     93.97      92.73     92.75       95.58     95.63
Part 2   93.79     94.00      92.61     92.60       95.35     95.41
Part 3   94.39     94.55      92.53     92.53       95.29     95.39
Part 4   94.31     94.45      92.03     92.23       95.19     95.22
AVG      94.10     94.24      92.47     92.53       95.35     95.41


the CoNLL-2000 task. The proposed word cluster feature is designed mainly to cover more lexical information and the relations between unknown words and the words in the labeled training data. Compared with the other two tasks, the CoNLL-2000 training data is about a quarter of the size, so the number of unknown/unseen words in CoNLL-2000 is significantly greater than in the other two tasks. Therefore, the chunking model that adopts the word cluster features has more unseen words on which to improve.

To examine the impact of different SVM optimization methods, we compared six SVM implementations: the L1Dual solver [19], the L2Dual solver [19], L2-MFN [23], SVM-perf [20], SVM-multiclass [24], and SVM–HMM [22]. L1(-norm) denotes the L1 loss function in the SVM, while L2(-norm) denotes the squared loss; SVM-perf is essentially an L1-norm solver. Except for SVM–HMM, the other five implementations are non-structured methods. SVM-multiclass optimizes a multiclass loss function, so it and SVM–HMM can solve multiclass problems directly without any alteration, whereas the other four methods must be ported to the multiclass setting; as in [47,48], we adopt the one-versus-all scheme for all experiments.

Table 3 lists the detailed validation results of the six SVM implementations on the three tasks. In terms of accuracy, the L2Dual optimizer obtains the best F1 score. L2MFN performs very close to L2Dual, but its training time is 2–3 times longer. SVM-multiclass exhibits the worst F-measure among all the approaches, even though it has the lowest training and testing time. The L2Dual optimizer is therefore the natural choice, being both accurate and efficient.

5.4. Testing results

The overall testing results of the three tasks are summarized in Table 4. Both model A and model B were applied to the testing data. As shown in Table 4, using word clusters significantly improves the original chunking model on the CoNLL-2000 and base-chunking tasks. In particular, on base-chunking, the F-measure of model B corresponds to a 5.20% error reduction (92.61 vs. 93.09). The main reason is that the testing data of the base-chunking task contains more unseen words that cannot be covered by the original training data, whereas the word cluster features derived from large amounts of unlabeled data provide enhanced lexical information and thus better accuracy in predicting unknown words. When the testing data contains many unknown words, the cluster features supply useful information to the learner.

Table 5 lists the detailed testing results of the six SVM implementations on the three tasks. Similar to the validation results, L2Dual and L2MFN achieve the best accuracy among all the implementations, and L2Dual is much faster than L2MFN. The best time performance is obtained by SVM-multiclass, but it is not as accurate as the other five approaches. In this experiment, SVM–HMM displayed notable generalization power, outperforming L1Dual on the base-chunking task.

Table 3
Comparison of validation performance for the three tasks using different SVM implementations.

Validation data   SVM-L1Dual   SVM-L2Dual   SVM-L2MFN   SVM-perf   SVM-MultiClass   SVM–HMM

CoNLL-2000
F(b)              94.10        94.24        94.24       94.01      93.58            93.78
Tr.Time           26.69        28.14        77.78       148.59     18.61            113.56
Te.Time           3.63         3.66         3.84        4.27       3.72             4.77

Base-chunking
F(b)              92.33        92.52        92.53       92.25      90.02            92.50
Tr.Time           170.82       176.86       1078.12     1079.61    91.12            1019.83
Te.Time           16.74        16.89        17.30       19.94      16.35            18.66

Large-scale chunking
F(b)              95.30        95.41        95.40       95.22      93.87            95.06
Tr.Time           134.24       131.65       646.11      660.45     77.50            534.74
Te.Time           11.54        11.72        11.93       13.18      11.40            16.02


Table 4
Overall phrase chunking accuracy.

Testing data                  Model (A)   Model (B)
CoNLL-2000                    94.35       94.50
Base-chunking                 92.61       93.09
Large-scale shallow parsing   95.14       95.24

Table 5
Comparison of testing performance for the three tasks using different SVM implementations (bold marks the best result among the SVM classifiers).

Testing data      SVM-L1Dual   SVM-L2Dual   SVM-L2MFN   SVM-perf   SVM-MultiClass   SVM–HMM

CoNLL-2000
F(b)              94.24        94.50        94.48       94.28      93.77            94.16
Tr.Time           37.27        37.95        121.34      202.06     24.55            175.05
Te.Time           3.36         3.66         3.70        3.98       3.31             4.78

Base-chunking
F(b)              92.76        93.13        93.12       92.49      89.74            92.95
Tr.Time           245.02       247.65       1652.50     1509.03    126.72           1580.77
Te.Time           6.22         6.49         6.77        9.03       6.16             7.91

Large-scale chunking
F(b)              95.06        95.24        95.24       94.96      93.39            94.94
Tr.Time           180.34       188.65       979.77      914.90     107.20           814.42
Te.Time           11.48        11.73        11.92       13.31      11.36            16.86


Next, we report the detailed results of the three tasks and compare our method with existing state-of-the-art approaches. Table 6 lists the comparison with related approaches, and Table 7 summarizes the detailed chunking performance of our method.

As shown in Table 6, our method achieves the second-best system performance. In shallow parsing, few papers address clustering words from labeled data. Among all the chunking systems, the best accuracy [41] was achieved by training eight conditional random fields (CRFs) and hidden Markov models (HMMs) on more than 1 GB of unlabeled data. The earlier work of [2] also explored training multiple classifiers with unlabeled data. In contrast, we take a different route by clustering related words from the unlabeled data; there is no conflict in combining a similar approach to further improve performance. Wu et al. [46] presented a related idea that refines the SVM hyperparameters by generating additional training examples with incomplete lexical information (improving from 93.67 to 94.12); it is essentially a fully supervised method using only the labeled training data.

Daume and Marcu [14] designed a 2-pass sequential labeling method that integrates an averaged perceptron with a trained inference algorithm. The inference algorithm can be viewed as a component layered on top of the initial tagger, and the method is compatible with arbitrary machine learning methods such as SVMs and CRFs. Suzuki et al. [42] is an earlier version of the hybrid HMM and CRF learners; it still requires multiple chunking algorithms to be trained, so both its training and testing time costs are several times larger than ours, and its performance is still lower than that of our method. In addition, Turian et al. [45] proposed a cluster-based semi-supervised algorithm for text chunking and named entity recognition; they reported that improving the word representation (neural word embeddings) allowed the overall prediction accuracy to outperform traditional supervised methods. As shown in Table 6, besides offering better accuracy, our method is also cheaper and more efficient. Other existing studies [8,25] directly construct multiple-classifier approaches, the only difference between the two being the classification algorithms. Multiple

Table 6
Comparison of chunking performance on the shallow parsing (CoNLL-2000) task.

Approach                    All     NP      Description
Suzuki and Isozaki [41]     95.15   –       Semi-supervised structural learning
This paper                  94.50   94.95   Non-structural learning (SVM) + hybrid inference algorithm
Daume and Marcu [14]        94.40   94.47   Structural learning + searching
Ando and Zhang [2]          94.39   94.70   (Semi-supervised) non-structural learning
Suzuki et al. [42]          94.36   –       17 M words of unlabeled data
Turian et al. [45]          94.35   –       Word clusters + neural word embeddings
Wu et al. [47]              94.25   94.68   Non-structural learning (SVM) + deterministic inference
Zhang et al. [51]           94.17   94.38   With full parser output
Kudo and Matsumoto [25]     93.91   94.39   Combining 8 single SVMs
Carreras et al. [8]         93.79   –       Voted perceptrons


Table 7
Shallow parsing (CoNLL-2000) performance of our model by phrase type.

Phrase type   F(b=1)      Phrase type   F(b=1)
ADJP          78.19       NP            94.90
ADVP          82.74       PP            97.78
CONJP         60.00       PRT           79.63
INTJP         66.67       SBAR          88.72
LST           0.00        VP            94.73

All           94.50

Table 8
Comparison of chunking performance on the base-chunking and large-scale shallow parsing tasks.

                                  Recall   Precision   F(b)
Base-chunking
  Our method                      92.86    93.38       93.12
  SVM + mask method [46]          92.93    92.98       92.95
  *Maximum Entropy Parser [9]     91.81    92.73       92.27
  *Head-word driven [10]          89.68    89.70       89.69

Large-scale shallow parsing
  Our method + word clustering    95.04    95.25       95.12
  Without word clustering         95.10    94.88       94.99

Table 9
Statistical significance tests between using and not using the cluster features.

Task                            s-Test   McNemar test
Shallow parsing (CoNLL-2000)    **       **
Base-chunking                   **       **
Large-scale shallow parsing     **       **

** p-value < 0.01; * p-value < 0.05; ~ p-value = 0.05.


classification systems are far more complex than an individual classifier. As shown above, our single-classifier method yields significantly better performance. We believe it could be improved further by combining more than one classifier; this is left as future work.

As shown in Table 7, three phrase types (CONJP, INTJP, and LST) have significantly worse accuracy than the others. The same occurs in previous work [2,25,41,46,48]. The main reason is data imbalance: the training examples of these three phrase types amount to less than 0.5%, whereas NP accounts for more than 50%. In the complete Treebank data, their ratio (0.64%) is still below 1%. Addressing the imbalance problem might help, but the benefit is limited owing to the small share of these three phrase types.

In the second comparison, we report our method on the base-chunking task [29,46]. Base-chunking can be viewed as the initial step toward bottom-up phrase-based parsing [43]; it aims to find the first-level phrase structures of the parse tree. The phrase granularity of base-chunking differs distinctly from shallow parsing, but the two share the same goal of identifying phrase chunks in text. Here, we compare our method with the previous work reported in [46]. Table 8 lists the comparison with previous methods.

It is worth noting that comparing against full parsers is of limited value, because the goal of parsing is to find word attachment structures rather than phrases. However, parsers can generate the phrase structure by transforming head–modifier relations, so we optionally list the base-chunking results automatically derived from the outputs of two state-of-the-art parsers.

For large-scale testing, we simply extend the original training and testing corpora by four times to evaluate the scalability of the system. Owing to space limitations, we do not report this task in detail; the chunking performance on the large-scale task is listed at the bottom of Table 8.

In 2002, Molina and Pla [31] explored a larger training size. They used a specialized hidden Markov model (HMM) as the learner and achieved an F(b) of 93.25. Our method clearly outperforms the specialized HMM even when trained on WSJ sections 0–19. Recall that our shallow parsing result (94.44 in F(b)), obtained with only a quarter of that training size, is also better than that of the specialized HMM.

5.5. Significance tests

We now test the statistical significance of the differences between the methods, using two tests, the s-test [50] and the McNemar test, which are commonly employed to compare system outputs on standard benchmark corpora. This


evaluation is based on the binary decisions over all the word/chunk pairs. We collect the system outputs and perform the statistical tests. The alternative hypothesis (H1) is that system A beats system B with probability greater than one half, and the null hypothesis (H0) is that the two systems win or lose with equal probability. Rejecting H0 means that system A outperforms system B with statistical significance. Table 9 lists the results of the significance tests.

Under both the micro sign test (s-test) and the McNemar test, our method with the cluster features is statistically significantly different on all three tasks. The p-values are below 0.01, i.e., our method differs from the classic supervised method at the 99% confidence level.
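As an illustration of the McNemar test over the paired decisions, the sketch below computes the continuity-corrected chi-square statistic and its approximate p-value; the counts b and c (cases where only one of the two systems is correct) are invented for the example.

```python
from math import erf, sqrt

def mcnemar(b, c):
    """McNemar chi-square with continuity correction and its one-d.o.f. p-value."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # For 1 degree of freedom, chi2 = z^2, so P(X > chi2) = 2 * (1 - Phi(sqrt(chi2))).
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(chi2) / sqrt(2.0))))
    return chi2, p_value

# b: decisions only the cluster-feature model gets right; c: the reverse (toy counts).
print(mcnemar(b=180, c=120))   # chi2 ≈ 11.6, p ≈ 0.0007 < 0.01
```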

6. Conclusion

In this paper, we presented a top-down information theoretic clustering algorithm to improve phrase chunking. Along with the derivation of an objective function that explicitly maximizes/minimizes the probability distributions between/within clusters, the top-down information theoretic word clustering incrementally optimizes the information gain. The method benefits from word clusters built from unlabeled text, which are used as features to inform the classifier and thereby enhance the available lexical information. The experimental results show that the clustered word groups achieve state-of-the-art performance on three benchmark tasks: shallow parsing, large-scale shallow parsing, and base-chunking. Moreover, the significance tests show that the method differs significantly from the original supervised learning. These outcomes encourage us to apply the method to other tasks, such as named entity recognition and phrase-based full parsing.

References

[1] S. Abney, Parsing by chunks, in: Principle-Based Parsing: Computation and Psycholinguistics, 1991, pp. 257–278.
[2] R.K. Ando, T. Zhang, A high-performance semi-supervised learning method for text chunking, in: Proceedings of 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005, pp. 1–9.
[3] L.D. Baker, A.K. McCallum, Distributional clustering of words for text classification, in: Proceedings of 21st Annual International ACM SIGIR, 1998, pp. 96–103.
[4] R. Bekkerman, R. El-Yaniv, N. Tishby, Y. Winter, On feature distributional clustering for text categorization, in: Proceedings of 24th ACM SIGIR Conference on Research & Development in Information Retrieval, 2001, pp. 146–153.
[5] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Conference on Learning Theory, 1998, pp. 92–100.
[6] E. Brill, Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging, Comput. Linguist. 21 (4) (1995) 543–565.
[7] P.F. Brown, P.V. deSouza, R.L. Mercer, V.J. Della Pietra, J.C. Lai, Class-based n-gram models of natural language, Comput. Linguist. 18 (1992) 467–479.
[8] X. Carreras, L. Marquez, J. Castro, Filtering-ranking perceptron learning for partial parsing, Mach. Learn. 59 (2005) 1–31.
[9] E. Charniak, A maximum-entropy-inspired parser, in: Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2000, pp. 132–139.
[10] M. Collins, Head-Driven Statistical Models for Natural Language Processing, PhD Thesis, University of Pennsylvania, 1999.
[11] M. Collins, Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 10, 2002, pp. 1–8.
[12] M. Collins, Y. Singer, Unsupervised models for named entity classification, in: Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 1999, pp. 100–110.
[13] T.M. Cover, J.A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, 1991.
[14] H. Daumé III, D. Marcu, Learning as search optimization: approximate large margin methods for structured prediction, in: Proceedings of the International Conference on Machine Learning, 2005, pp. 169–176.
[15] I.S. Dhillon, S. Mallela, R. Kumar, A divisive information-theoretic feature clustering algorithm for text classification, J. Mach. Learn. Res. 3 (2003) 1265–1287.
[16] G. Fu, C. Kit, J.J. Webster, Chinese word segmentation as morpheme-based lexical chunking, Inf. Sci. 178 (9) (2008) 2282–2296.
[17] B. Gils, E. Proper, P. Bommel, P. Weide, On the quality of resources on the Web: an information retrieval perspective, Inf. Sci. 177 (20) (2007) 4566–4597.
[18] J. Giménez, L. Márquez, Fast and accurate part-of-speech tagging: the SVM approach revisited, in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), 2003, pp. 158–165.
[19] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S.S. Keerthi, S. Sundararajan, A dual coordinate descent method for large-scale linear SVM, in: Proceedings of the International Conference on Machine Learning (ICML), 2008, pp. 408–415.
[20] T. Joachims, Training linear SVMs in linear time, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 217–226.
[21] T. Joachims, Transductive inference for text classification using support vector machines, in: Proceedings of the International Conference on Machine Learning (ICML), 1999, pp. 200–209.
[22] T. Joachims, T. Finley, C.-N.J. Yu, Cutting-plane training of structural SVMs, Mach. Learn. 77 (1) (2009) 27–59.
[23] S.S. Keerthi, D. DeCoste, A modified finite Newton method for fast solution of large scale linear SVMs, J. Mach. Learn. Res. 6 (2005) 341–361.
[24] S.S. Keerthi, S. Sundararajan, K.-W. Chang, C.-J. Hsieh, C.-J. Lin, A sequential dual method for large scale multi-class linear SVMs, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 408–416.
[25] T. Kudo, Y. Matsumoto, Chunking with support vector machines, in: Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001, pp. 192–199.
[26] T. Koo, X. Carreras, M. Collins, Simple semi-supervised dependency parsing, in: Proceedings of 46th Annual Meeting of the Association for Computational Linguistics (ACL), 2008, pp. 595–603.
[27] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in: Proceedings of the International Conference on Machine Learning, 2001, pp. 282–289.
[28] Y.S. Lee, Y.C. Wu, A robust multilingual portable phrase chunking system, Expert Syst. Appl. 33 (3) (2007) 1–26.
[29] X. Li, D. Roth, Exploring evidence for shallow parsing, in: Proceedings of the Conference on Natural Language Learning (CoNLL), 2001, pp. 127–132.
[30] J. Lin, Divergence measures based on the Shannon entropy, IEEE Trans. Inf. Theory 37 (1) (1991) 145–151.
[31] A. Molina, F. Pla, Shallow parsing using specialized HMMs, J. Mach. Learn. Res. 2 (2002) 595–613.
[32] V. Ng, C. Cardie, Weakly supervised natural language learning without redundant views, in: Proceedings of the Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2003, pp. 173–180.
[33] H.J. Oh, S.H. Myaeng, M.G. Jang, Semantic passage segmentation based on sentence topics for question answering, Inf. Sci. 177 (18) (2007) 3696–3717.
[34] F. Pereira, N. Tishby, L. Lee, Distributional clustering of English words, in: Proceedings of 31st Annual Meeting of the Association for Computational Linguistics (ACL), 1993, pp. 183–190.
[35] L.A. Ramshaw, M.P. Marcus, Text chunking using transformation-based learning, in: Proceedings of the 3rd Workshop on Very Large Corpora, 1995, pp. 183–190.
[36] A. Ratnaparkhi, Learning to parse natural language with maximum entropy models, Mach. Learn. 34 (1999) 151–175.
[37] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL), 2009, pp. 147–155.
[38] I.S. Reddy, S. Shevade, M.N. Murty, A fast quasi-Newton method for semi-supervised SVM, Pattern Recogn. (2011) 2305–2313.
[39] N. Slonim, N. Tishby, Document clustering using word clusters via the information bottleneck method, in: Proceedings of 23rd Annual International ACM SIGIR, 2000, pp. 208–215.
[40] N. Slonim, N. Tishby, The power of word clusters for text classification, in: Proceedings of the 23rd European Colloquium on Information Retrieval Research (ECIR), 2001, pp. 191–200.
[41] J. Suzuki, H. Isozaki, Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data, in: Proceedings of 46th Annual Meeting of the Association for Computational Linguistics (ACL), 2008, pp. 665–673.
[42] J. Suzuki, A. Fujino, H. Isozaki, Semi-supervised structural output learning based on a hybrid generative and discriminative approach, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2007, pp. 791–800.
[43] E.F. Tjong Kim Sang, Transforming a chunker to a parser, in: Computational Linguistics in the Netherlands, 2000, pp. 177–188.
[44] Y. Tsuruoka, J. Tsujii, Chunk parsing revisited, in: Proceedings of the Ninth International Workshop on Parsing Technology, 2005, pp. 133–140.
[45] J. Turian, L. Ratinov, Y. Bengio, Word representations: a simple and general method for semi-supervised learning, in: Proceedings of 48th Annual Meeting of the Association for Computational Linguistics (ACL), 2010, pp. 384–394.
[46] Y.C. Wu, C.H. Chang, Y.S. Lee, A general and multi-lingual phrase chunking model based on masking method, in: Proceedings of 7th International Conference on Intelligent Text Processing and Computational Linguistics, 2006, pp. 144–155.
[47] Y.C. Wu, Y.S. Lee, J.C. Yang, Robust and efficient multiclass SVM models for phrase pattern recognition, Pattern Recogn. 41 (9) (2008) 2874–2889.
[48] Y.C. Wu, J.C. Yang, Y.S. Lee, S.J. Yen, An integrated deterministic and nondeterministic inference algorithm for sequential labeling, in: Proceedings of 6th Asia Information Retrieval Symposium (AIRS), 2010, pp. 221–230.
[49] Y. Xiong, J. Zhu, H. Huang, H. Xu, Minimum tag error for discriminative training of conditional random fields, Inf. Sci. 179 (1) (2009) 169–179.
[50] Y. Yang, X. Liu, A re-examination of text categorization methods, in: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 42–49.
[51] T. Zhang, F. Damerau, D. Johnson, Text chunking based on a generalization of Winnow, J. Mach. Learn. Res. 2 (2002) 615–637.
[52] Y. Zhao, G. Karypis, Evaluation of hierarchical clustering algorithms for document datasets, in: Proceedings of the Eleventh International Conference on Information and Knowledge Management, 2002, pp. 515–524.
[53] H. Zhao, C. Kit, Integrating unsupervised and supervised word segmentation: the role of goodness measures, Inf. Sci. 181 (1) (2011) 163–183.