
Efficient implementation of associative classifiers for document classification

Yongwook Yoon *, Gary Geunbae Lee

Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31,

Hyoja-Dong, Pohang 790-784, Republic of Korea

Received 25 May 2006; accepted 25 July 2006; available online 12 October 2006

Abstract

In practical text classification tasks, the ability to interpret the classification result is as important as the ability to classify exactly. Associative classifiers have many favorable characteristics such as rapid training, good classification accuracy, and excellent interpretation. However, associative classifiers also have some obstacles to overcome when they are applied to text classification. The target text collection generally has a very high dimension, so the training process might take a very long time. We propose a feature selection based on the mutual information between the word and class variables to reduce the space dimension of the associative classifiers. In addition, the training process of the associative classifier produces a huge number of classification rules, which makes prediction for a new document ineffective. We resolve this by introducing a new efficient method for storing and pruning classification rules. This method can also be used when predicting a test document. Experimental results using the 20-newsgroups dataset show many benefits of the associative classification in both training and predicting when applied to a real world problem.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Text classification; Associative classifier; Feature selection; Rule pruning; Subset expansion

1. Introduction

An associative classifier is a classifier that uses classification rules produced through a frequent pattern mining process from a training data collection. This process is the same one used in traditional data mining over the large transaction logs of a database. Utilizing associative classifiers for classification tasks (Agrawal & Srikant, 1994; Bekkerman, El-Yaniv, Tishby, & Winter, 2001; Yin & Han, 2003) has a relatively short history compared to other classifiers such as Naïve Bayes, k-NN, or SVM, and studies in which an associative classifier is applied to the text classification task are even harder to find.


* Corresponding author. Tel.: +82 54 279 5581; fax: +82 54 279 2299. E-mail addresses: [email protected] (Y. Yoon), [email protected] (G.G. Lee).



When performing a text classification task in a real world situation, the ability to provide an abundant interpretation of the classification result is often as important as the ability to classify new documents exactly. Classification by a concrete form of rules ("Features → Class") has many benefits, including this easy interpretability. The associative classifier is one of the rule-based classifiers. In contrast, classifiers such as SVM or neural networks cannot provide such an easy interpretation of the classification result, though they may achieve excellent classification accuracy.

We can acquire several additional advantages from using a rule-based classifier. One is that, since the rules can be expressed in a very intuitive form, humans can easily understand them and can even edit them directly after the rules are produced by some inductive learning process. A human expert could delete weak rules from the original rule set and add new, carefully handcrafted rules. This can improve the classification accuracy remarkably with little added effort. Another is that the rules can be updated incrementally by other machine learning processes later.

Another benefit of the associative classifier is that it can exploit the combined information of multiple features as well as of a single feature, while SVM or k-NN classifiers consider only the effects of each single feature. This means that in document classification tasks it is possible to use phrase occurrence information as well as word occurrence information.

To apply an associative classifier to a real world text classification problem, however, we need to remove several obstacles encountered during the training and testing phases. One of these is the high dimensional feature space. Datasets in text classification usually have a very large number of features, namely the distinct lexical words. For example, the 20-newsgroups test collection has more than one hundred thousand lexical word features. Most documents of the 20-newsgroups have more than one hundred words; they are sparsely distributed in their word feature space. In associative classification, however, we consider all subsets of those words. Therefore, the effective number of features grows exponentially, and we cannot take all of them into account due to computational intractability.

To overcome this problem we adopt a feature selection-based dimensionality reduction technique, while at the same time maintaining the necessary classification performance. Many well-known methods of dimensionality reduction exist (Sebastiani, 2002). We used the mutual information measure from information theory. From the training dataset we calculated the mutual information between the word and the class variables. We then selected the words that have high mutual information, used only those in classifying, and neglected the others.

Another obstacle in associative text classification is the large number of classification rules that are produced in the training phase. Since using all of them is both computationally inefficient and ineffective for classification, we should select the part of those rules that have high quality. This process is called pruning in associative classification. Liu, Hsu, and Ma (1998) proposed pruning by database coverage, which is a kind of validation process that uses the training set to choose the best classification rules. Li, Pei, and Han (2001) refined the concept of database coverage. In addition, they proposed two other pruning methods. One is to prune low-ranked rules in terms of the confidence and support of the rules. The other is to prune rules in which the correlation between the pattern and the class variables is weak. In this paper, we adopt the pruning methods of Li et al. and improve them to work for text classification.

A related issue to rule pruning is the prediction of a new document using the classification rules. With a large number of rules, the prediction result for a test document often shows a split decision between different classes. A method is needed to select the one correct class among many in an efficient and effective way. This is not a simple problem because, if we extract a relatively small portion of the rules to avoid many contradicting rules for a document, we might lose latent candidate classes that may be the correct answer. To handle this problem, Li et al. (2001) used the weighted chi-square method. We resolve this problem by simple and efficient voting over the different answer classes.

In Section 2 we introduce the general aspects of associative classification. In Section 3 we explain the overall architecture of our text classification system using association rules and address issues such as dimensionality reduction, rule pruning, and prediction from multiple rules. Experimental results and analyses of text classification using a large dataset are presented in Section 4, and we conclude our work in Section 5.


2. Associative classification

2.1. Association rule mining

Association rules originated from market basket analysis, in which we seek patterns of purchasing. The term mining indicates that much effort must be applied to searching the log database to acquire valuable information.

An association rule is a kind of co-occurrence information on items. Consider the transaction log database of a large modern retail market. We want to extract patterns of co-purchasing of product items from this database. Let a set of product items be I = {I_1, ..., I_n} and a transaction be t ⊆ I. Then the set of transactions is T = {t_1, ..., t_N} ⊆ 2^I. An association rule is composed of two item sets called an antecedent and a consequent. The consequent is often restricted to containing a single item (Webb, 2003). The rules are typically displayed with an arrow leading from the antecedent to the consequent:

{I_{i_1}, ..., I_{i_k}} → {I_c},   (1)

for example, {plums, lettuce, tomatoes} → {celery}. For item sets A and B, Support(A) is defined as the number of transactions t including A divided by N, and Confidence(A → B) as Support(A → B)/Support(A). A user provides thresholds on the support and confidence of a rule, denoted minsup and minconf, respectively.

Definition 2.1 (Association rule). Given an item set X and an item Y, let s be Support(X → Y) and c be Confidence(X → Y). Then the expression X → Y/(s, c) is an association rule if s ≥ minsup and c ≥ minconf.
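As a concrete illustration of Definition 2.1, the following Python sketch computes the support and confidence of a candidate rule over a toy transaction list and tests it against the two thresholds; the transactions, thresholds, and function names are illustrative assumptions, not taken from the paper.

# Toy transaction database; each transaction is a set of items.
transactions = [
    {"plums", "lettuce", "tomatoes", "celery"},
    {"plums", "lettuce", "celery"},
    {"lettuce", "tomatoes"},
    {"plums", "lettuce", "tomatoes", "celery"},
]
N = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item of `itemset`.
    return sum(itemset <= t for t in transactions) / N

def confidence(antecedent, consequent):
    # Support of the whole rule divided by the support of the antecedent.
    return support(antecedent | consequent) / support(antecedent)

minsup, minconf = 0.5, 0.6  # illustrative thresholds

A, B = {"plums", "lettuce", "tomatoes"}, {"celery"}
s, c = support(A | B), confidence(A, B)
if s >= minsup and c >= minconf:
    print(f"{A} -> {B} / ({s:.2f}, {c:.2f}) is an association rule")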

The two constraints on the support and the confidence of a rule imply that we search for "frequent" patterns at some level. In the training phase of associative classification, the main task is to extract association rules, in other words, frequent pattern mining.

Unfortunately, as the number of items grows linearly, the number of possible antecedents on the left-hand side of Eq. (1) grows exponentially. Though we can reduce the size of the subset of patterns by the two parameters minsup and minconf, the search often becomes computationally intractable when we use naïve methods. Many algorithms have been proposed to search frequent patterns more efficiently (Agrawal & Srikant, 1994; Han, Pei, & Yin, 2000). We modified the frequent pattern tree growth algorithm of Han et al. (2000) and applied it when mining frequent patterns.

2.2. Associative classifier

Consider the association rule from the viewpoint of a classification rule. Let A = {A_1, ..., A_n} be a set of attribute domains, and let a data object obj = (a_1, ..., a_n) be a sequence of attribute values, i.e. a_j ∈ A_j, 1 ≤ j ≤ n. Given a pattern P = a_{i_1} ... a_{i_k}, where a_{i_j} ∈ A_{i_j} for 1 ≤ j ≤ k and i_j ≠ i_{j'} for j ≠ j', a data object obj is said to match pattern P if and only if, for 1 ≤ j ≤ k, obj has the value a_{i_j} in attribute A_{i_j}.
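To make the matching condition concrete, here is a minimal sketch (the object, pattern, and function names are our own illustrations) of a data object as an attribute-to-value mapping and the match test of the definition above.

# A data object assigns a value to every attribute; a pattern constrains
# only a subset of the attributes.
obj = {1: "red", 2: "round", 3: "sweet", 4: "large"}
pattern = {2: "round", 3: "sweet"}

def matches(obj, pattern):
    # obj matches the pattern iff it takes the pattern's value on every
    # attribute that the pattern constrains.
    return all(obj.get(attr) == value for attr, value in pattern.items())

print(matches(obj, pattern))                   # True
print(matches(obj, {2: "round", 4: "small"}))  # False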

Definition 2.2 (Associative classifier). Let C = {c_1, ..., c_m} be a set of class labels. An associative classifier is the mapping R from the set of attribute values to the set of class labels

R : (A_1, A_2, ..., A_n) → C.   (2)

According to Eq. (2), given a test datum obj = (a_1, ..., a_n), the associative classifier returns a class label c ∈ C.

Let P be a pattern variable and c a class variable. If we rewrite the rule in the form R : P → c and have a training set T = {(P_i, c_i)}, then the learning process is to induce the rule set R whose elements satisfy Support(P → c) ≥ minsup and Confidence(P → c) ≥ minconf. The procedure of associative classification rule mining is not much different from that of general association rule mining. One difference is that, in associative classification rule mining, information on the distribution of word patterns matching each class is additionally maintained.
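The extra bookkeeping mentioned above, keeping the class distribution of each word pattern, can be sketched as follows. This naive enumeration over patterns of size one and two is purely illustrative (the data, thresholds, and names are ours); the paper's actual miner is an FP-growth variant.

from collections import Counter, defaultdict
from itertools import combinations

# Toy training set: (set of word features, class label).
training = [
    ({"sale", "offer"}, "misc.forsale"),
    ({"sale", "offer", "shipping"}, "misc.forsale"),
    ({"game", "team"}, "rec.sport.hockey"),
    ({"sale", "game"}, "misc.forsale"),
]
minsup_count, minconf = 2, 0.6  # illustrative thresholds

# For every short word pattern, count how often it co-occurs with each class.
pattern_class_counts = defaultdict(Counter)
for words, label in training:
    for k in (1, 2):
        for pattern in combinations(sorted(words), k):
            pattern_class_counts[pattern][label] += 1

# Turn sufficiently frequent and confident (pattern, class) pairs into rules.
rules = []
for pattern, counts in pattern_class_counts.items():
    pattern_count = sum(counts.values())
    for label, count in counts.items():
        conf = count / pattern_count
        if count >= minsup_count and conf >= minconf:
            rules.append((set(pattern), label, count, conf))

for antecedent, label, sup, conf in sorted(rules, key=lambda r: -r[3]):
    print(f"{antecedent} -> {label} / ({sup}, {conf:.2f})")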

Now that we have a classification system, it must decide which class to assign to a new test document. First, we search for the rules whose patterns match the document. Next, from these rules, we perform a prediction based on some predefined decision criterion. The details are explained in Section 3.


3. Text classification with associative classifier

3.1. Overall architecture

The overall system architecture for associative classification is shown in Fig. 1. The left-hand side of the figure denotes the training process and the right-hand side the testing process.

First, the raw training data is processed into a form appropriate for training. This is called pre-processing. We index every word of the training documents and test it for the quality of its contribution to classifying the given training documents exactly. Each document is converted into a word-vector format and normalized in its length.

From the pre-processed database, we mine frequent patterns, i.e. classification rules. Because the initial number of rules is very large, we select a part of them and drop the remaining rules; this process is called pruning. Finally, we construct a classification-rule database with these selected rules.

When a new document comes in to be classified, we convert it into a pattern of words and search the database for matching rules. Using the matched rules, we decide to which class the test document is assigned.

3.2. Dimensionality reduction by feature selection

Mutual information is defined between the class and word random variables; it estimates the degree to which a word contributes to classifying the documents of a given data collection. Its calculation may differ slightly according to the distribution model assumed for words in the document collection (McCallum & Nigam, 1998). In this paper we adopted as the document event model the multivariate Bernoulli model, in which we do not consider the count of word occurrences but only their presence in a document. Denote by C the random variable for the class label, and by W_t the random variable for the presence or absence of a word w_t in a document. Then the average mutual information of W_t with C is defined as (Cover & Thomas, 1991):

MI(C; W_t) = H(C) − H(C|W_t) = Σ_{c ∈ C} Σ_{f_t ∈ {0,1}} P(c, f_t) log [ P(c, f_t) / (P(c) P(f_t)) ],   (3)

where H(C|W_t) is the entropy of C given W_t, and f_t ∈ {0,1} is an indicator variable denoting the absence or presence of the word w_t. The joint probability P(c, f_t) is calculated as the number of occurrences of word w_t that appear in documents with class label c, divided by the total number of word occurrences.

Fig. 1. Associative classification – training and testing. (Training, on the left, pre-processes the training documents, e.g. 20 Newsgroups, by feature selection and conversion into word-vector format, and then mines frequent patterns and extracts classification rules into a rule database R1: p1 → c1, R2: p2 → c2, ..., Rn: pn → cn. Testing, on the right, converts a new test document into a pattern, matches it against the rules, and decides its class, e.g. alt.atheism or talk.religion.misc.)

The chi-square statistic also provides a measure of the dependency between the class variable and a word variable in the distribution of a document set. However, we adopted mutual information rather than the chi-square statistic. The reason is that the chi-square statistic judges the dependence between a word and one specific class; if we do not have the classification results, it is difficult for the statistic to provide the criterion needed to evaluate the importance of a word.

We selected the M words with the highest average mutual information with the class variable among the total of N words. In general, we select a value of the parameter M such that M ≪ N. Finally, we convert the original training documents into documents of word-vector format with dimension M. Moreover, since the length of each document varies greatly, we should normalize the document length in order to reduce biases between the assigned classes as much as possible. In this paper, we introduced a parameter L indicating the maximum length of a document (by length we mean the count of distinct words in the document). We construct a transaction record with at most L words, sorted in descending order of average mutual information.
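A minimal sketch of this selection step is given below. It assumes the training data is available as a list of (word set, class label) pairs and estimates the probabilities from document-level presence counts to match the Bernoulli event model; the paper describes its estimate in terms of word occurrences, so treat the exact estimator, the smoothing constant, and all names here as our own assumptions.

import math
from collections import Counter

def select_features(docs, M, eps=1e-12):
    # docs: list of (set_of_words, class_label).
    # Returns the M words with the highest average mutual information
    # with the class variable, Eq. (3).
    N = len(docs)
    class_count = Counter(label for _, label in docs)
    word_count = Counter()        # documents containing the word
    word_class_count = Counter()  # (word, class) document counts
    for words, label in docs:
        for w in words:
            word_count[w] += 1
            word_class_count[(w, label)] += 1

    def mi(word):
        total = 0.0
        p_w1 = word_count[word] / N
        for label, n_c in class_count.items():
            p_c = n_c / N
            p_c_w1 = word_class_count[(word, label)] / N  # joint, word present
            p_c_w0 = p_c - p_c_w1                         # joint, word absent
            for p_joint, p_w in ((p_c_w1, p_w1), (p_c_w0, 1.0 - p_w1)):
                if p_joint > 0:
                    total += p_joint * math.log(p_joint / (p_c * p_w + eps))
        return total

    return sorted(word_count, key=mi, reverse=True)[:M]

def to_transaction(words, selected, L):
    # Keep at most L of the selected words, most informative first.
    order = {w: i for i, w in enumerate(selected)}
    kept = sorted((w for w in words if w in order), key=order.get)
    return kept[:L]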

3.3. Extracting and storing classification rules

Before extracting classification rules, we construct the trees of frequent word patterns. This procedure is the same as the one used to construct the frequent pattern mining tree in previous research (Agrawal & Srikant, 1994), with one difference: at the last node of a word pattern, information about the category is added. This information includes the category name and the support and confidence values of the word pattern. A word pattern tree is depicted in Fig. 2.

In Fig. 2, the set of words is {1, 2, 3, 4} and the set of categories is {a, b, c}. A node in the tree may hold category information for more than one class. For example, the top-left node of the tree has occurrence counts of two and one for the classes 'a' and 'b', respectively, with respect to the word '1'. From this tree we perform the frequent pattern mining process (Han et al., 2000), and then we extract the classification rules that satisfy the minimum support and confidence criteria. As stated previously, these extracted rules are produced in excessively large numbers.

To overcome this problem, we introduce a new mechanism to store and retrieve these rules efficiently. In this scheme we construct another tree, called the Classification Rule Tree (CR-tree), apart from the word pattern trees. The CR-tree has a structure similar to the word pattern tree. The difference is that in the CR-tree a node does not hold any class distribution information; a single piece of class information is stored only at the last node of a word pattern path (see Fig. 3).

While executing the frequent pattern mining algorithm (Han et al., 2000), we acquire candidate rules that may be pruned, i.e. not inserted into the CR-tree, if they do not satisfy certain conditions. Besides the conventional minimum confidence and support conditions, there is an important criterion of generalized rules. We need to avoid overfitting in the training phase and thus lessen errors in the prediction phase. The condition on generalized rules is as follows: when we insert a new rule into a CR-tree, every rule in the tree that is a subset of the candidate rule must have a lower rank than the candidate. If not, the candidate rule is merely a more specialized rule and we cannot avoid overfitting. In our method, differently from the previous pruning methods, storing and pruning occur in one step, which dramatically reduces the training time.

Fig. 2. Word pattern tree with class distribution information. (Each node stores a word code with its per-class occurrence counts, e.g. [1] a:2, b:1; a header table of node-links connects all nodes for the same word.)


3.4. Pruning rules using CR-tree

It is not always helpful to have a large number of rules when we classify a new test document. There is a greater chance that more than one rule will contradict each other on the answer class. In addition, the rules may overfit the training document set. We want to have a small number of the most powerful rules. In this pruning process, duplicate rules are eliminated and rules that might produce wrong classification results are removed. We perform two types of rule pruning; the first is pruning by rule ranking and the other is by the chi-square statistic.

Before we prune the rules by rank, we must first assign a rank to each rule. The rule-ranking criterion is as follows: (1) a rule with a higher confidence has a higher rank; (2) if the confidences of two rules are the same, the one with the higher support has the higher rank; (3) if the supports of the two are also the same, then the one with fewer words on the left-hand side of the rule has the higher rank. In other words, we prefer "short" rules to long ones if the other conditions are equal. Short rules are general rules, while long rules are prone to overfitting. Therefore, we can reduce the test errors by adopting more general rules.
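This ranking criterion can be expressed as a simple sort key; the sketch below (the rule representation and names are ours) reproduces the ordering of a few of the rules that appear in Table 1.

# A rule is (antecedent_words, class_label, support, confidence).
rules = [
    ({1, 2, 3}, "a", 28, 0.81),
    ({1, 2, 3, 6}, "c", 4, 0.95),
    ({7}, "d", 3, 1.00),
    ({2, 3, 4, 6}, "a", 58, 0.80),
    ({3, 4, 5, 6, 7}, "d", 7, 0.80),
]

def rank_key(rule):
    antecedent, _, support, confidence = rule
    # Higher confidence first, then higher support, then shorter antecedent.
    return (-confidence, -support, len(antecedent))

for rank, rule in enumerate(sorted(rules, key=rank_key), start=1):
    print(rank, rule)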

Assume that the eight rules in Table 1 were found as a result of the frequent pattern mining process. The minsup and minconf of the rules were taken as 3 and 60%, respectively. Rule-8 has the highest rank since its confidence is the best. Though rule-6 and rule-7 have the same confidence, rule-6 is ranked higher due to its higher support.

By the pruning criterion of rule ranking, rule-5 will be pruned because rule-5 is more specific than rule-4 but has a lower confidence. However, rule-2 will not be pruned because it has a higher confidence than the more general rule-1. We can see that the third ranking criterion reflects generality.

In previous studies, the pruning of unnecessary rules was conducted after all the classification rules had been generated (Liu et al., 1998). In our pruning method utilizing the CR-tree, we prune useless rules at the same time as we insert the rules. Actually, we do not prune; rather, we determine whether or not to insert each rule at the moment it is newly extracted through the frequent pattern mining process.

Fig. 3. Pruning with the classification rule tree (CR-tree). (Each path from the root encodes the word pattern of a rule; the class label with the support and confidence values is stored only at the last node of the path, e.g. the node [4:c] 7, 75% under [2] encodes the rule {2,4} → c with support 7 and confidence 75%. A header table of node-links connects all nodes for the same word.)

Table 1
Classification rules and the ranks

Rule-id   Rule               Sup   Conf (%)   Rank
1         {1,2,3} → a         28      81       3
2         {1,2,3,6} → c        4      95       2
3         {8,9} → b           67      61       8
4         {2,4,6} → a        120      78       6
5         {2,4,5,6} → c      105      71       7
6         {2,3,4,6} → a       58      80       4
7         {3,4,5,6,7} → d      7      80       5
8         {7} → d              3     100       1


Differently from the pruning in Li et al. (2001), our method, fortunately, never makes a newly inserted rule prune the existing rules in a CR-tree, which makes the algorithm very simple. The reason for this simplicity is that, since the words are sorted in order of their frequency counts and are processed in that order, the frequent patterns are always generated with increasing length and thus become more specialized.

The detailed pruning process is as follows. Assume some classification rules are already stored in a CR-tree and we extract a new candidate rule "{2, 4, 7} → c (3, 85%)", whose support is 3 and confidence is 85%. We have to determine whether or not to insert the rule. To do this, we first examine whether any subset rule in the CR-tree has a higher rank than the candidate rule. This is accomplished by traversing the CR-tree following the node links of the header elements that are also elements of the candidate rule. Fig. 3 presents the situation more clearly. Inside a node of Fig. 3, the first line denotes a word code with a class name and the second line denotes the support and confidence values, respectively. Class information exists only at the last node of a classification rule in the CR-tree.

In a naïve approach to determining whether a certain subset rule has a higher rank, we would examine all subsets of the antecedent {2, 4, 7} of the candidate rule. This requires expanding the set into its power set, which takes O(2^n) time if we assume the average number of elements in a rule to be n. However, if we use the CR-tree, we need not expand the set into its power set. Only for those rules already in the CR-tree do we examine whether they are subsets of the candidate rule and have higher ranks. This job takes only O(n log n) time, because the length of a path is O(log n) and we examine n elements in the node-links.

To do this, in Fig. 3 we follow the node links of '2', '4', and '7'. First we follow the node link of '7'; then, following the path from 7 to the root, we examine the rules in the middle of the path that have all their elements also in the candidate rule. This procedure is repeated with respect to the words '4' and '2'. In Fig. 3, the rule which is a subset of the candidate rule is "{2, 4} → c (7, 75%)" and has a lower rank than the candidate rule. Therefore, the candidate rule is safely inserted into the CR-tree (in other words, it is not pruned). In Fig. 4 we summarize this pruning procedure as an algorithm.

Another type of pruning utilizes the chi-square statistic, which provides correlation information between two random variables. We want to evaluate the quality of a rule by calculating the chi-square statistic of the pattern and the class label, which are the left-hand and right-hand sides of the rule, respectively. We can easily calculate the chi-square statistic of each rule during frequent pattern mining. We denote the word pattern of a rule as P and the class label as c. Then we present the numbers of documents for the four possible cases in Table 2.

A denotes the number of all the documents, B the number of documents with class label c, D the number of documents matching pattern P, and E the number of documents labeled with class c that also match pattern P.

Algorithm Determine_Insertion($r, $c)
Input:  classification rule tree $r, candidate rule $c
Output: true or false
/* the elements of the header table and of the candidate rule are
   sorted in descending order of their frequency counts */
Begin
  For each element $i of the candidate rule $c
    /* follow the node-link of $i in the header table of $r */
    For each node $j on the node-link of $i
      /* proceed upward along the pattern path starting from $j */
      For each node $k on the path from $j to the root
        If node $k has an element that is also in $c Then
          If node $k has class information Then
            If it has a higher rank than $c Then
              Return false      // $c will be pruned
  Return true                   // $c can safely be inserted into the CR-tree
End

Fig. 4. Algorithm to determine pruning a candidate rule.
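A compact Python sketch of this one-step store-and-prune procedure follows. It assumes words are kept in a fixed global order and that node-links are simple lists; the class and method names are our own, not the paper's C++ implementation.

class CRNode:
    # One node of the CR-tree; class info only at the last node of a rule path.
    def __init__(self, word, parent):
        self.word = word
        self.parent = parent
        self.children = {}   # word -> CRNode
        self.rule = None     # (class_label, support, confidence) or None

class CRTree:
    def __init__(self):
        self.root = CRNode(None, None)
        self.node_links = {}   # word -> list of nodes carrying that word

    @staticmethod
    def _rank_key(support, confidence, length):
        # Smaller key = higher rank: confidence, then support, then shorter rule.
        return (-confidence, -support, length)

    def insert_if_not_pruned(self, words, label, support, confidence):
        # Insert the candidate rule unless some stored rule whose pattern is a
        # subset of `words` already has an equal or higher rank (cf. Fig. 4).
        words = sorted(words)            # assumed fixed global word order
        cand_key = self._rank_key(support, confidence, len(words))
        word_set = set(words)
        for w in words:
            for node in self.node_links.get(w, ()):
                if node.rule is None:
                    continue             # not the last node of any stored rule
                # collect the stored rule's pattern by climbing to the root
                path, n = [], node
                while n.word is not None:
                    path.append(n.word)
                    n = n.parent
                if all(word in word_set for word in path):
                    lbl, sup, conf = node.rule
                    if self._rank_key(sup, conf, len(path)) <= cand_key:
                        return False     # candidate is pruned
        # Not pruned: create (or reuse) the path and attach the class info.
        node = self.root
        for w in words:
            child = node.children.get(w)
            if child is None:
                child = CRNode(w, node)
                node.children[w] = child
                self.node_links.setdefault(w, []).append(child)
            node = child
        node.rule = (label, support, confidence)
        return True

tree = CRTree()
tree.insert_if_not_pruned([2, 4], "c", 7, 0.75)
print(tree.insert_if_not_pruned([2, 4, 7], "c", 3, 0.85))  # True: not pruned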


The values of all the other cells can be calculated using these four values. In addition, we need the expected numbers of documents in the four cells located at the center of the table. We can easily calculate these values as well using the ratios of the marginal column and row values. Finally, the statistic is calculated as follows:

χ² = Σ_{i ∈ fourCenterCells} (observed_i − expected_i)² / expected_i,   (4)

where i denotes the index over the four center cells of the table. Now we can perform a hypothesis test of whether the rule is important using the chi-square statistic. According to some significance level, we decide whether or not to select the rule.
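Given the four counts A, B, D, and E of Table 2, the statistic of Eq. (4) and the significance test can be sketched as follows; the example counts and the 3.841 threshold (the 0.05 critical value for one degree of freedom) are our own illustrative choices.

def chi_square(A, B, D, E):
    # A: all documents, B: documents with class c, D: documents matching P,
    # E: documents matching P that are also labeled c (Table 2).
    observed = [E, D - E, B - E, A - B - D + E]
    # Expected counts of the four center cells from the marginal ratios.
    expected = [D * B / A, D * (A - B) / A,
                (A - D) * B / A, (A - D) * (A - B) / A]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Example: 1000 documents, 100 in class c, P matches 60 documents, 30 of them in c.
stat = chi_square(A=1000, B=100, D=60, E=30)
print(stat, stat > 3.841)   # keep the rule only if the correlation is significant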

3.5. Prediction with multiple classification rules

After the training process is finished, we obtain a final set of classification rules. In general, when we predict the class of a test document, we seek the rules matching the document, and the system produces more than one rule to classify with. As in the rule extraction phase, if we utilize the CR-tree in the prediction phase, we can efficiently find the rules matching a test document.

The procedure for acquiring matching rules is very similar to the rule pruning procedure. It is no different from the task of finding those rules in the CR-tree that are subsets of the test document. First, for every word element of the test document, we follow the node link of the word in the header table of the CR-tree. Then, starting from the last node of the path, we search the path for the subset rules, climbing up to the root. We gather all the subset rules in this manner to determine the category of the test document. Fig. 5 shows the rule matching and classification algorithm.

Generally, the number of matched classification rules is very large, which may lead to a difficult situation. If all of the rules have identical class labels, the problem is simple; we assign that class to the document. But if the matched rules give many different classes, we need to decide which one is correct.

Table 2
Calculation of the chi-square statistic of a rule

                Class c     NOT class c        Total
Match P         E           D − E              D
NOT match P     B − E       A − B − D + E      A − D
Total           B           A − B              A

Algorithm Predict_Class($r, $d)
Input:  classification rule tree $r, test document $d   // all sorted as in Fig. 4
Output: array of estimated class scores $s[$c]
Begin
  For each element $i of the test document $d
    /* follow the node-link of $i in the header table of $r */
    For each node $j on the node-link of $i
      /* proceed upward along the pattern path starting from $j */
      For each node $k on the path from $j to the root
        If node $k has an element that is also in $d Then
          If node $k has class information Then
            1. Identify the class as $c
            2. Add the weight of the rule to $s[$c]
  Sort $s[] in descending order of score and return it
End

Fig. 5. Rule matching and classification algorithm.


For example, assume that from Table 1 we acquired rule-2, rule-4, and rule-6 as the matched rules of a test document. We have a split decision between class 'a' and class 'c'. According to the rule ranking criteria, we would select 'c' as the answer class. However, inspecting more deeply, though the confidence of rule-2 is slightly better than that of the other two, the support values of the other two are much higher than that of rule-2. Therefore, we cannot always reliably select rule-2 as the correct answer.

Therefore, it is dangerous to estimate a class label only by the rule ranking. Instead, we adopt a majority-voting method when deciding on the correct class of a test document d_i from multiple classification rules. Assume that we have K rules which match the test document, each of the form r_k : P_k → c_j for 1 ≤ k ≤ K, where c_j is an element of the set of possibly correct classes C_i = {c_1, c_2, ..., c_|C_i|}, |C_i| ≤ K. Let S_j be the score by which the class c_j is estimated to be the correct class. With majority voting we select the class label c such that:

c = argmax_{c_j ∈ C_i} S_j   (5)

and

S_j = Σ_{r_k ∈ R_j} w(r_k),   (6)

where R_j is the set of rules whose consequent class is c_j and w(r_k) is the weight of rule r_k. In the simplest form of majority voting, a constant 1 is used as the weight for every rule. We also tried several variations of the majority voting method, for instance adding as the weight the confidence value of the rule instead of the constant 1; this confidence can be thought of as the contribution of each classification rule to deciding the correct class. Among the many variations, the one that adds the square of the confidence of each rule showed the best classification performance.

In this scoring scheme, it is very rare for two classes to have equal scores in Eq. (6), because there are hundreds of thousands of rules and the fractional rule scores almost always sum to different totals. If a tie occurs nevertheless, we regard all the tied classes as answers.
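The weighted voting of Eqs. (5) and (6), using the squared confidence of each rule as the weight (the variant that showed the best performance), can be sketched as follows; the function and variable names are illustrative only.

from collections import defaultdict

def predict(matched_rules, weight=lambda conf: conf ** 2):
    # matched_rules: list of (class_label, confidence) for the rules whose
    # patterns are subsets of the test document.
    scores = defaultdict(float)
    for label, conf in matched_rules:
        scores[label] += weight(conf)        # S_j of Eq. (6)
    best = max(scores.values())
    winners = [label for label, s in scores.items() if s == best]
    return winners if len(winners) > 1 else winners[0]   # Eq. (5); ties -> all

# Rules 2, 4, and 6 of Table 1 matched: a split decision between 'c' and 'a'.
print(predict([("c", 0.95), ("a", 0.78), ("a", 0.80)]))  # 'a'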

4. Experiments and analyses

We performed various experiments on associative classification using the 20 Newsgroups document collection (Lang, 1995). This collection is slightly multi-labeled; 541 of the 19,997 documents are posted to more than one newsgroup.

We pre-processed the raw texts into word vectors. For this purpose, we used the BOW toolkit (McCallum, 1996). We removed general stop words, but did no stemming. During the training process, we included only the body part and the Subject line of the articles, because other parts may contain words that indicate the answer class directly. We reduced the dimension of the word feature space of the original 20 Newsgroups, originally over one hundred thousand, to three thousand. One fourth of the dataset was used for testing and the remainder was used for training.

We implemented the procedures for storing, pruning, and prediction of our associative classification rules in C++, while the frequent pattern mining code is based on Goethals (2003). We executed our code on a Linux machine with a 2.2 GHz CPU and 2 GBytes of memory. Samples of the classification rules are listed in Table 3, and the best classification results are shown in Table 4.

Table 3
Sample classification rules for 20 newsgroups

Words                           → Class                       Support   Confidence
Article, christianity, source   → talk.religion.misc              3      0.75
Apr, system                     → comp.sys.ibm.pc.hardware       32      0.1306
Don, megs, year                 → rec.sport.baseball             31      0.31
Years, car                      → sci.med                        23      0.1932
Sale, offer, doesn              → misc.forsale                    3      1.0


The overall performance of the system is a little lower than that of the current state-of-the-art research for the same 20 Newsgroups data set (Bekkerman et al., 2001; Yoon, Lee, & Lee, 2006). However, five of the twenty classes show higher accuracy than those of the state-of-the-art systems (marked with * at the end of their names in Table 4). In the last column of Table 4, we show the potential accuracy obtained by considering the second and third majority classes as answers in addition to the first one. This shows that there is still considerable room for improving the classification performance in the future.

In addition, since the rules are expressed in the intuitive form of word strings (refer to Eq. (2)), we can manually edit the rules and improve the classification accuracy with little effort. For example, we may add words, such as those listed in Table 3, that best represent the target class. This is important in a practical application of the classifier, since the real performance can vary with the characteristics of the domain and the test data.

Notice that the training time is very short; this is remarkable compared to the case of SVM or even Naïve Bayes classifiers. Let the maximum length of a document be L, the size of the selected word feature set M, and the number of all the words in the training collection N. In general, we take these parameters such that L ≪ M ≪ N. The time complexity of training using all the words is O(2^N). In the case of SVM, the complexity is O(N^2). Our training time in this paper is O(2^L), which becomes much shorter than that of SVM if we select an appropriate L that is much smaller than M and far smaller than N through feature selection.

Table 4
Classification performance for 20 newsgroups

Class label                  # Rules     P (%)   R (%)   F1 (%)   Potential top-3 F1 (%)
alt.atheism                   68,271      92.9    80.4    86.2     95.2
comp.graphics                 37,303      79.1    71.7    75.2     88.8
comp.os.ms-windows.misc       92,190      72.0    87.3    78.9     94.7
comp.sys.ibm.pc.hardware      76,357      76.3    68.4    72.1     86.8
comp.sys.mac.hardware         45,571      84.7    83.7    84.2     93.6
comp.window.x                 47,451      84.1    78.1    81.0     92.0
misc.forsale                  36,488      80.6    90.6    85.3     96.0
rec.autos                     54,116      85.2    89.8    87.5     95.2
rec.motorcycles*              33,475      92.2    94.0    93.1     97.2
rec.sport.baseball            67,202      94.2    90.8    92.5     96.8
rec.sport.hockey              64,814      90.3    97.2    93.6     98.4
sci.crypt*                    65,670      90.9    96.8    93.8     97.2
sci.electronics               32,254      85.5    75.0    79.9     92.4
sci.med                       36,580      89.6    88.8    89.2     94.4
sci.space*                    45,237      89.7    93.7    91.7     97.6
soc.religion.misc            141,804      74.3    90.3    81.5     95.1
talk.politics.guns           102,865      82.2    92.1    86.9     98.4
talk.politics.mideast*       102,492      93.3    95.0    94.1     98.4
talk.politics.misc*           80,884      97.0    64.8    77.7     90.4
talk.religion.misc            72,326      95.6    71.3    81.7     91.6

Total                      1,303,349      86.3    84.3    85.3     94.5

Table 5
Relation between document length and performance

Document length, L    Training time (min)    F1 (%)    Potential top-3 F1 (%)
10                        2                  69.9      79.3
15                        6                  79.7      90.5
20                       25                  82.0      92.5
25                      222                  84.1      94.0
30                     4198                  85.3      94.5

The relation between the maximum length L of the training documents and the classification performance is shown in Table 5. As L increases, we can acquire better classification accuracy, but we also need more time in the training phase.


Since the training time increases exponentially, we need to set a bound when pursuing better classification performance. In our experiment, a reasonable value of the length L was 25.

Fig. 6. Relation between the number of rules (×1000) and the overall accuracy (%).

Fig. 7. Relation between the document length L and the number of rules (×1000).


For larger values of L we failed to get any classification result within a reasonable time. We expect to be able to enlarge L by improving the algorithms for frequent pattern mining, classification rule pruning, and other steps.

Fig. 6 shows the relation between the number of classification rules and the overall accuracy. The more rules we have, the higher the accuracy we can achieve. However, as the number of rules increases, the classification time increases as well, and we can see overfitting begin to appear in Fig. 6. The increase in the number of rules also slows the prediction process due to the longer rule matching time.

If the document length L gets larger, we have more rules to classify with. Fig. 7 shows the number of rules in relation to the document length L. With more rules we can achieve better classification performance, but we cannot use an unlimited number of rules because training takes too much time. Hence it is very important to utilize as many effective classification rules as possible by applying efficient rule-pruning methods that diminish the number of trivial and noisy rules.

5. Conclusion

Associative classification is a new method in the area of document classification. The expression of a classification rule is simple and human-readable. Therefore, it provides an excellent interpretation of the classification result as well as considerable effectiveness. In addition, the construction of the classification framework is simple, and the training time of the classifier is very short. A critical shortcoming of associative classification is the excessive number of rules produced in the training process. We overcome this by starting with a dimensionality reduction technique using the average mutual information of the word features. The relatively short training time is achieved by applying our new method of storing and pruning the classification rules using the efficient CR-tree structure. In predicting new documents, our majority voting method, which considers the confidence values of the classification rules, is very helpful in increasing the classification accuracy. Moreover, by conducting various classification experiments on a large data collection, we showed that this associative classification framework can be applied well to many real world applications.

With these many advantages of associative classification, there are still some areas for further improvement. We plan to study the feature selection method in depth so that we can acquire more satisfactory classification accuracy. In addition, to overcome the limit on the number of word features in the training documents, more efficient frequent pattern mining and rule pruning methods are required; this will also help to improve the classification accuracy.

Acknowledgement

This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment).

References

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of very large data bases (Vol. 12, pp. 487–499). San Francisco: Morgan Kaufmann.
Bekkerman, R., El-Yaniv, R., Tishby, N., & Winter, Y. (2001). On feature distributional clustering for text categorization. In Proceedings of SIGIR 2001 (Vol. 24, pp. 146–153). New York: ACM Press.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Goethals, B. (2003). Frequent Pattern Mining implementation. <http://www.adrem.ua.ac.be/~goethals/software/>.
Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 1–12). New York: ACM Press.
Lang, K. (1995). NEWSWEEDER: learning to filter netnews. In Proceedings of the 12th international conference on machine learning (pp. 331–339). San Mateo, CA: Morgan Kaufmann Publishers Inc.
Li, W., Pei, J., & Han, J. (2001). CMAR: accurate and efficient classification based on multiple class-association rules. In Proceedings of the 2001 IEEE international conference on data mining (pp. 369–376). Washington, DC: IEEE Computer Society.
Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. In Proceedings of the fourth international conference on knowledge discovery and data mining (pp. 80–86). New York: ACM Press.



McCallum, A. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available from <http://www.cs.cmu.edu/~mccallum/bow>.
McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In Proceedings of the 10th conference on European chapter of the Association for Computational Linguistics (Vol. 1, pp. 307–314). Morristown, NJ: Association for Computational Linguistics.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Webb, G. (2003). Association rules. In N. Ye (Ed.), The handbook of data mining (pp. 25–38). Mahwah, NJ: Lawrence Erlbaum Associates.
Yin, X., & Han, J. (2003). CPAR: classification based on predictive association rules. In Proceedings of the 2003 SIAM international conference on data mining (SDM'03) (pp. 369–376). New York: SIAM Press.
Yoon, Y., Lee, C., & Lee, G. (2006). An effective procedure for constructing a hierarchical text classification system. Journal of the American Society for Information Science and Technology, 57(3), 431–442.

Yongwook Yoon earned his MS degree in Computer Engineering in 2004. He is now a Ph.D. student in the Department of Computer Science and Engineering at POSTECH. He also works as a senior engineer in the Information Technology division of KT. His research interests include document categorization, knowledge representation, and knowledge management.

Gary Geunbae Lee received his Ph.D. in Computer Science from UCLA in 1991 and is now a professor at POSTECH. He has served as a technical committee member and reviewer for several international conferences and journals. His current research interests include natural language processing, such as speech recognition, spoken language understanding, and TTS systems.
