

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 6, JUNE 2014 1415

On the Use of Side Information for Mining Text Data

Charu C. Aggarwal, Fellow, IEEE, Yuchen Zhao, and Philip S. Yu, Fellow, IEEE

Abstract—In many text mining applications, side-information is available along with the text documents. Such side-information may be of different kinds, such as document provenance information, the links in the document, user-access behavior from web logs, or other non-textual attributes which are embedded into the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side-information may be difficult to estimate, especially when some of the information is noisy. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process, or can add noise to the process. Therefore, we need a principled way to perform the mining process, so as to maximize the advantages from using this side information. In this paper, we design an algorithm which combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We then show how to extend the approach to the classification problem. We present experimental results on a number of real data sets in order to illustrate the advantages of using such an approach.

Index Terms—Data mining, clustering

1 INTRODUCTION

THE problem of text clustering arises in the context of many application domains such as the web, social networks, and other digital collections. The rapidly increasing amounts of text data in the context of these large online collections have led to an interest in creating scalable and effective mining algorithms. A tremendous amount of work has been done in recent years on the problem of clustering in text collections [5], [11], [27], [30], [37] in the database and information retrieval communities. However, this work is primarily designed for the problem of pure text clustering, in the absence of other kinds of attributes. In many application domains, a tremendous amount of side-information is also associated with the documents. This is because text documents typically occur in the context of a variety of applications in which there may be a large amount of other kinds of database attributes or meta-information which may be useful to the clustering process. Some examples of such side-information are as follows:

• In an application in which we track user access behavior of web documents, the user-access behavior may be captured in the form of web logs. For each document, the meta-information may correspond to the browsing behavior of the different users. Such logs can be used to enhance the quality of the mining process in a way which is more meaningful to the user, and also application-sensitive. This is because the logs can often pick up subtle correlations in content, which cannot be picked up by the raw text alone.

• Many text documents contain links among them, which can also be treated as attributes. Such links contain a lot of useful information for mining purposes. As in the previous case, such attributes may often provide insights about the correlations among documents in a way which may not be easily accessible from raw content.

• Many web documents have meta-data associated with them which correspond to different kinds of attributes such as the provenance or other information about the origin of the document. In other cases, data such as ownership, location, or even temporal information may be informative for mining purposes.

• In a number of network and user-sharing applications, documents may be associated with user-tags, which may also be quite informative.

• C. C. Aggarwal is with the Department of Computer Science, IBM T. J. Watson Research Center, Yorktown Heights, NY 10532 USA. E-mail: [email protected].

• Y. Zhao is with Sumo Logic Inc., Redwood City, CA 94063 USA. E-mail: [email protected].

• P. S. Yu is with the Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607 USA. E-mail: [email protected].

Manuscript received 31 Mar. 2012; revised 4 July 2012; accepted 8 July 2012. Date of publication 22 July 2012; date of current version 29 May 2014. Recommended for acceptance by J. Gehrke, B.C. Ooi, and E. Pitoura. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier 10.1109/TKDE.2012.148

While such side-information can sometimes be useful in improving the quality of the clustering process, it can be a risky approach when the side-information is noisy. In such cases, it can actually worsen the quality of the mining process. Therefore, we will use an approach which carefully ascertains the coherence of the clustering characteristics of the side information with that of the text content. This helps in magnifying the clustering effects of both kinds of data. The core of the approach is to determine a clustering in which the text attributes and side-information provide similar hints about the nature of the underlying clusters, and at the same time ignore those aspects in which conflicting hints are provided.

1041-4347 © 2012 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


In order to achieve this goal, we will combine a partitioning approach with a probabilistic estimation process, which determines the coherence of the side-attributes in the clustering process. A probabilistic model on the side information uses the partitioning information (from text attributes) for the purpose of estimating the coherence of different clusters with side attributes. This helps in abstracting out the noise in the membership behavior of different attributes. The partitioning approach is specifically designed to be very efficient for large data sets. This can be important in scenarios in which the data sets are very large. We will present experimental results on a number of real data sets, and illustrate the effectiveness and efficiency of the approach.

While our primary goal in this paper is to study the clustering problem, we note that such an approach can also be extended in principle to other data mining problems in which auxiliary information is available with text. Such scenarios are very common in a wide variety of data domains. Therefore, we will also propose a method in this paper in order to extend the approach to the problem of classification. We will show that the extension of the approach to the classification problem provides superior results because of the incorporation of side information. Our goal is to show that the advantages of using side-information extend beyond a pure clustering task, and can provide competitive advantages for a wider variety of problem scenarios.

This paper is organized as follows. The remainder of this section will present the related work on the topic. In the next section, we will formalize the problem of text clustering with side information. We will also present an algorithm for the clustering process. We will show how to extend these techniques to the classification problem in Section 3. In Section 4, we will present the experimental results. Section 5 contains the conclusions and summary.

1.1 Related Work

The problem of text-clustering has been studied widely by the database community [18], [25], [34]. The major focus of this work has been on scalable clustering of multi-dimensional data of different types [18], [19], [25], [34]. A general survey of clustering algorithms may be found in [21]. The problem of clustering has also been studied quite extensively in the context of text-data. A survey of text clustering methods may be found in [3]. One of the most well known techniques for text-clustering is the scatter-gather technique [11], which uses a combination of agglomerative and partitional clustering. Other related methods for text-clustering which use similar methods are discussed in [27], [29]. Co-clustering methods for text data are proposed in [12], [13]. An Expectation Maximization (EM) method for text clustering has been proposed in [22]. Matrix-factorization techniques for text clustering are proposed in [32]. This technique selects words from the document based on their relevance to the clustering process, and uses an iterative EM method in order to refine the clusters. A closely related area is that of topic-modeling, event tracking, and text-categorization [6], [9], [15], [16]. In this context, a method for topic-driven clustering for text data has been proposed in [35]. Methods for text clustering in the context of keyword extraction are discussed in [17]. A number of practical tools for text clustering may be found in [23]. A comparative study of different clustering methods may be found in [30].

The problem of text clustering has also been studied in the context of scalability in [5], [20], [37]. However, all of these methods are designed for the case of pure text data, and do not work for cases in which the text-data is combined with other forms of data. Some limited work has been done on clustering text in the context of network-based linkage information [1], [2], [8], [10], [24], [31], [33], [36], though this work is not applicable to the case of general side-information attributes. In this paper, we will provide a first approach to using other kinds of attributes in conjunction with text clustering. We will show the advantages of using such an approach over pure text-based clustering. Such an approach is especially useful when the auxiliary information is highly informative, and provides effective guidance in creating more coherent clusters. We will also extend the method to the problem of text classification, which has been studied extensively in the literature. Detailed surveys on text classification may be found in [4], [28].

2 CLUSTERING WITH SIDE INFORMATION

In this section, we will discuss an approach for clustering text data with side information. We assume that we have a corpus S of text documents. The total number of documents is N, and they are denoted by T1 . . . TN. It is assumed that the set of distinct words in the entire corpus S is denoted by W. Associated with each document Ti, we have a set of side attributes Xi. Each set of side attributes Xi has d dimensions, which are denoted by (xi1 . . . xid). We refer to such attributes as auxiliary attributes. For ease in notation and analysis, we assume that each side-attribute is binary, though both numerical and categorical attributes can easily be converted to this format in a fairly straightforward way. This is because the different values of a categorical attribute can be assumed to be separate binary attributes, whereas numerical data can be discretized to binary values with the use of attribute ranges. Some examples of such side-attributes are as follows:

• In a web log analysis application, we assume that xir corresponds to the 0-1 variable which indicates whether or not the ith document has been accessed by the rth user. This information can be used in order to cluster the web pages in a site in a more informative way than a technique which is based purely on the content of the documents. The number of pages in a site may be large, but the number of documents accessed by a particular user may be relatively small.

• In a network application, we assume that xir corresponds to the 0-1 variable indicating whether or not the ith document Ti has a hyperlink to the rth page Tr. If desired, it can be implicitly assumed that each page links to itself in order to maximize linkage-based connectivity effects during the clustering process. Since hyperlink graphs are large and sparse, it follows that the number of such auxiliary variables is high, but only a small fraction of them take on the value of 1.



• In a document application with associated GPS or provenance information, the possible attributes may be drawn from a large number of possibilities. Such attributes will naturally satisfy the sparsity property.

As noted in the examples above, such auxiliary attributes are quite sparse in many real applications. This can be a challenge from an efficiency perspective, unless the sparsity is carefully taken into account during the clustering process. Therefore, our techniques will be designed to account for such sparsity. However, it is possible to easily adapt our approach to non-sparse attributes, by treating the attribute values in a more symmetric way.

We note that our technique is not restricted to binary auxiliary attributes, but can also be applied to attributes of other types. When the auxiliary attributes are of other types (quantitative or categorical), they can be converted to binary attributes with the use of a simple transformation process. For example, numerical data can be discretized into binary attributes. Even in this case, the derived binary attributes are quite sparse, especially when the numerical ranges are discretized into a large number of attributes. In the case of categorical data, we can define a binary attribute for each possible categorical value. In many cases, the number of such values may be quite large. Therefore, we will design our techniques under the implicit assumption that such attributes are quite sparse. The formulation for the problem of clustering with side information is as follows:
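The conversion described above can be sketched in a few lines. The following is a minimal illustration, not part of the paper's algorithm; the function name and the attribute-naming scheme are our own assumptions:

```python
def binarize_side_info(records, numeric_bins=None):
    """Convert categorical and numerical side attributes into sparse binary
    attributes, represented as the set of attribute names that are 'on'.

    records      -- list of dicts mapping attribute name -> raw value
    numeric_bins -- dict mapping a numeric attribute name -> sorted bin edges
    """
    numeric_bins = numeric_bins or {}
    binarized = []
    for rec in records:
        active = set()
        for name, value in rec.items():
            if name in numeric_bins:
                # Discretize a numeric value: one binary attribute per range.
                edges = numeric_bins[name]
                for lo, hi in zip(edges, edges[1:]):
                    if lo <= value < hi:
                        active.add(f"{name}:[{lo},{hi})")
            else:
                # One binary attribute per distinct categorical value.
                active.add(f"{name}={value}")
        binarized.append(active)
    return binarized
```

Because each document activates only the few attributes it actually has, the resulting representation remains sparse, which is exactly the property the techniques below rely on.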

Text Clustering with Side Information: Given a corpus S of documents denoted by T1 . . . TN, and a set of auxiliary variables Xi associated with each document Ti, determine a clustering of the documents into k clusters denoted by C1 . . . Ck, based on both the text content and the auxiliary variables.

We will use the auxiliary information in order to provide additional insights, which can improve the quality of clustering. In many cases, such auxiliary information may be noisy, and may not contain useful information for the clustering process. Therefore, we will design our approach in order to magnify the coherence between the text content and the side-information, when this is detected. In cases in which the text content and side-information do not show coherent behavior for the clustering process, the effects of those portions of the side-information are marginalized.

2.1 The COATES Algorithm

In this section, we will describe our algorithm for text clustering with side-information. We refer to this algorithm as COATES throughout the paper, which corresponds to the fact that it is a COntent and Auxiliary attribute based TExt cluStering algorithm. We assume that an input to the algorithm is the number of clusters k. As in the case of all text-clustering algorithms, it is assumed that stop-words have been removed, and stemming has been performed in order to improve the discriminatory power of the attributes. The algorithm requires two phases:

• Initialization: We use a lightweight initialization phase in which a standard text clustering approach is used without any side-information. For this purpose, we use the algorithm described in [27]. This algorithm is used because it is a simple method which can quickly and efficiently provide a reasonable initial starting point. The centroids and the partitioning created by the clusters formed in the first phase provide an initial starting point for the second phase. We note that the first phase is based on text only, and does not use the auxiliary information.

• Main Phase: The main phase of the algorithm is executed after the first phase. This phase starts off with these initial groups, and iteratively reconstructs these clusters with the use of both the text content and the auxiliary information. This phase performs alternating iterations which use the text content and auxiliary attribute information in order to improve the quality of the clustering. We call these iterations content iterations and auxiliary iterations, respectively. The combination of the two iterations is referred to as a major iteration. Each major iteration thus contains two minor iterations, corresponding to the auxiliary and text-based methods respectively.

The focus of the first phase is simply to construct an initialization, which provides a good starting point for the clustering process based on text content. Since the key techniques for content and auxiliary information integration are in the second phase, we will focus most of our subsequent discussion on the second phase of the algorithm. The first phase is simply a direct application of the text clustering algorithm proposed in [27]. The overall approach uses alternating minor iterations of content-based and auxiliary attribute-based clustering. These phases are referred to as content-based and auxiliary attribute-based iterations respectively. The algorithm maintains a set of seed centroids, which are subsequently refined in the different iterations. In each content-based phase, we assign a document to its closest seed centroid based on a text similarity function. The centroids for the k clusters created during this phase are denoted by L1 . . . Lk. Specifically, the cosine similarity function is used for assignment purposes. In each auxiliary phase, we create a probabilistic model, which relates the attribute probabilities to the cluster-membership probabilities, based on the clusters which have already been created in the most recent text-based phase. The goal of this modeling is to examine the coherence of the text clustering with the side-information attributes. Before discussing the auxiliary iteration in more detail, we will first introduce some notations and definitions which help in explaining the clustering model for combining auxiliary and text variables.
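The content-based minor iteration described above amounts to one round of k-means-style reassignment under cosine similarity. A minimal sketch using sparse term-frequency dicts (function names are ours; the paper's actual implementation details may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def content_iteration(docs, centroids):
    """One content-based minor iteration: assign each document to its
    closest seed centroid under cosine similarity, then recompute each
    centroid as the mean term-frequency vector of its cluster."""
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for doc in docs:
        j = max(range(k), key=lambda c: cosine(doc, centroids[c]))
        clusters[j].append(doc)
    new_centroids = []
    for members in clusters:
        agg = {}
        for doc in members:
            for t, w in doc.items():
                agg[t] = agg.get(t, 0.0) + w
        n = max(len(members), 1)  # keep an empty cluster's centroid at zero
        new_centroids.append({t: w / n for t, w in agg.items()})
    return clusters, new_centroids
```

In COATES, each such content iteration alternates with an auxiliary iteration that re-weights the assignments using the probabilistic model developed next.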

We assume that the k clusters associated with the data are denoted by C1 . . . Ck. In order to construct a probabilistic model of membership of the data points to clusters, we assume that each auxiliary iteration has a prior probability of assignment of documents to clusters (based on the execution of the algorithm so far), and a posterior probability of assignment of documents to clusters with the use of auxiliary variables in that iteration. We denote the prior probability that the document Ti belongs to the cluster Cj by P(Ti ∈ Cj). Once the pure-text clustering phase has been executed, the a-priori cluster membership probabilities of the auxiliary attributes are generated with the use of the last content-based iteration from this phase. The a-priori value of P(Ti ∈ Cj) is simply the fraction of documents which have been assigned to the cluster Cj. In order to compute the posterior probabilities P(Ti ∈ Cj|Xi) of membership of a record at the end of the auxiliary iteration, we use the auxiliary attributes Xi which are associated with Ti. Therefore, we would like to compute the conditional probability P(Ti ∈ Cj|Xi). We will make the approximation of considering only those auxiliary attributes (for a particular document) which take on the value of 1. Since we are focusing on sparse binary data, the value of 1 for an attribute is a much more informative event than the default value of 0. Therefore, it suffices to condition only on the case of attribute values taking on the value of 1. For example, let us consider an application in which the auxiliary information corresponds to users who are browsing specific web pages. In such a case, the clustering behavior is influenced much more significantly by the case in which a user does browse a particular page, rather than one in which the user does not, because most pages will typically not be browsed by a particular user. This is generally the case across many sparse data domains, such as attributes corresponding to links, discretized numeric data, or categorical data which is quite often of very high cardinality (such as zip codes).
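As a concrete illustration of the quantities just defined, the a-priori cluster probabilities and the per-document sets of 1-valued attributes can be computed as follows (a minimal sketch; function and variable names are ours, not from the paper):

```python
def priors_and_active_sets(assignments, side_attrs, k):
    """assignments[i]: cluster index of document T_i from the last
    content-based iteration; side_attrs[i]: dict r -> 0/1 for document T_i.
    Returns the a-priori cluster probabilities P_a(T_i in C_j) and, for
    each document, the set of attribute indices that take on the value 1."""
    n = len(assignments)
    counts = [0] * k
    for j in assignments:
        counts[j] += 1
    # A-priori probability of each cluster: its fraction of the documents.
    priors = [c / n for c in counts]
    # Condition only on 1-valued attributes: in sparse data, a 1 is far
    # more informative than the default value of 0.
    active = [{r for r, v in x.items() if v == 1} for x in side_attrs]
    return priors, active
```

Restricting attention to the 1-valued attributes keeps the per-document work proportional to the (small) number of active attributes rather than the full dimensionality d.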

Furthermore, in order to ensure the robustness of the approach, we need to eliminate the noisy attributes. This is especially important when the number of auxiliary attributes is quite large. Therefore, at the beginning of each auxiliary iteration, we compute the gini-index of each attribute based on the clusters created by the last content-based iteration. This gini-index provides a quantification of the discriminatory power of each attribute with respect to the clustering process. The gini-index is computed as follows. Let frj be the fraction of the records in the cluster Cj (created in the last content-based iteration) for which the attribute r takes on the value of 1. Then, we compute the relative presence prj of the attribute r in cluster j as follows:

$$p_{rj} = \frac{f_{rj}}{\sum_{m=1}^{k} f_{rm}}. \qquad (1)$$

The values of prj are defined so that they sum to 1 over a particular attribute r across the different clusters j. We note that when all values of prj take on a similar value of 1/k, then the attribute values are evenly distributed across the different clusters. Such an attribute is not very discriminative with respect to the clustering process, and it should not be used for clustering. While the auxiliary attributes may have a different clustering behavior than the textual attributes, it is also expected that informative auxiliary attributes are at least somewhat related to the clustering behavior of the textual attributes. This is generally true of many applications, such as those in which auxiliary attributes are defined either by linkage-based patterns or by user behavior. On the other hand, completely noisy attributes are unlikely to have any relationship to the text content, and will not be very effective for mining purposes. Therefore, we would like the values of prj to vary across the different clusters. We refer to this variation as skew. The level of skew can be quantified with the use of the gini-index. The gini-index of attribute r is denoted by Gr, and is defined as follows:

$$G_r = \sum_{j=1}^{k} p_{rj}^{2}. \qquad (2)$$

The value of Gr lies between 1/k and 1. The more discriminative the attribute, the higher the value of Gr. In each iteration, we use only the auxiliary attributes for which the gini-index is above a particular threshold γ. The value of γ is picked to be 1.5 standard deviations below the mean value of the gini-index in that particular iteration. We note that since the clusters may change from one iteration to the next, and the gini-index is defined with respect to the current clusters, the values of the gini-index will also change over the different iterations. Therefore, different auxiliary attributes may be used over different iterations in the clustering process, as the quality of the clusters becomes more refined, and the corresponding discriminative power of auxiliary attributes can also be computed more effectively.
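The attribute-selection rule above follows directly from Equations 1 and 2: compute the relative presences p_rj, the gini-index G_r, and keep the attributes above the threshold γ = mean − 1.5 · stdev. A minimal sketch (function and variable names are ours):

```python
import statistics

def gini_select(f, sd_below=1.5):
    """f[r]: list of length k, where f[r][j] is the fraction of documents in
    cluster j whose attribute r takes on the value 1.
    Returns (gini, selected): the gini-index G_r of each attribute, and the
    attributes whose G_r exceeds gamma = mean(G) - sd_below * stdev(G)."""
    gini = {}
    for r, row in f.items():
        total = sum(row)
        if total == 0:
            continue  # attribute absent from every cluster; skip it
        p = [x / total for x in row]          # relative presence, Equation 1
        gini[r] = sum(q * q for q in p)       # gini-index, Equation 2
    mean = statistics.mean(gini.values())
    sd = statistics.pstdev(gini.values())
    gamma = mean - sd_below * sd
    selected = {r for r, g in gini.items() if g > gamma}
    return gini, selected
```

Because the clusters change between iterations, this selection is recomputed at the start of every auxiliary iteration, so the set of admitted attributes can change as the clustering is refined.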

Let Ri be a set containing the indices of the attributes in Xi which are considered discriminative for the clustering process, and for which the value of the corresponding attribute is 1. For example, let us consider an application in which we have 1000 different auxiliary attributes. If the dimension indices of the attributes in the vector Xi which take on the value of 1 are 7, 120, 311, and 902 respectively, then we have Ri = {7, 120, 311, 902}. Therefore, instead of computing P(Ti ∈ Cj|Xi), we will compute the conditional probability of membership based on a particular value of the set Ri. We define this quantity as the attribute-subset based conditional probability of cluster membership.

Definition 1. The attribute-subset based conditional probability of cluster membership of the document Ti to the cluster Cj is defined as the conditional probability of membership of document Ti to cluster Cj based only on the set of attributes Ri from Xi which take on the value of 1.

The attribute-subset based conditional probability of document Ti to cluster Cj is denoted by Ps(Ti ∈ Cj|Ri). We note that even if Ri contains only a modest number of attributes, it is hard to directly approximate Ps(Ti ∈ Cj|Ri) in a data-driven manner. We note that these are the posterior probabilities of cluster membership after the auxiliary-attribute based iteration, given that particular sets of auxiliary attribute values are presented in the different documents. We will use these posterior probabilities in order to re-adjust the cluster centroids during the auxiliary attribute-based iterations. The a-priori probabilities of membership are assigned based on the current content-based iteration of the clustering, and are denoted by Pa(Ti ∈ Cj). The a-priori and a-posteriori probabilities are related as follows:

$$P_s(T_i \in C_j \mid R_i) = \frac{P_a(R_i \mid T_i \in C_j) \cdot P_a(T_i \in C_j)}{P_a(R_i)}. \qquad (3)$$

The above equation follows from a simple expansion of the expression for the conditional probabilities (Bayes' rule). We will also see that each expression on the right-hand side can be estimated in a data-driven manner with the use of the conditional estimates from the last iteration. Specifically, the value of Pa(Ti ∈ Cj) is simply the fraction of the documents assigned to the cluster Cj in the last content-based (minor) iteration.


In order to evaluate posterior probabilities fromEquation 3, we also need to estimate Pa(Ri) and Pa(Ri|Ti ∈Cj). Since Ri may contain many attributes, this is essen-tially a joint probability, which is hard to estimate exactlywith a limited amount of data. Therefore, we make a naiveBayes approximation in order to estimate the values of Pa(Ri)

and Pa(Ri|Ti ∈ Cj). This approximation has been shown toachieve high quality results in a wide variety of practicalscenarios [14]. In order to make this approximation, weassume that the different attributes of Ri are independentof one another. Then, we can approximate P(Ri) as follows:

Pa(Ri) =∏

r∈Ri

Pa(xir = 1). (4)

The value of Pa(xir = 1) is simply the fraction of the documents in which the value of the attribute xir is one. This can be easily estimated from the underlying document collection. Similarly, we use the independence assumption in order to estimate the value of Pa(Ri|Ti ∈ Cj):

Pa(Ri|Ti ∈ Cj) = ∏_{r∈Ri} Pa(xir = 1|Ti ∈ Cj). (5)

The value of Pa(xir = 1|Ti ∈ Cj) is the fraction of the documents in cluster Cj in which the value of attribute xir is one. This value can be estimated from the last set of clusters obtained from the content-based clustering phase. For each cluster Cj, we determine the fraction of records for which the value of the rth auxiliary attribute is 1. We note that the only difference between the second definition and the first is that in the second case, we are computing the fraction of the documents in cluster Cj for which the rth attribute is 1. Then, we can substitute the results of Equations 4 and 5 in Equation 3 in order to obtain the following:

Ps(Ti ∈ Cj|Ri) = Pa(Ti ∈ Cj) · ∏_{r∈Ri} Pa(xir = 1|Ti ∈ Cj)/Pa(xir = 1). (6)
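The fraction-based estimates that enter this expression can be computed directly from a binary document-attribute matrix and the cluster assignments of the last content-based iteration. The following is a minimal sketch; the function name and the array-based representation are our own, not from the paper:

```python
import numpy as np

def attribute_probabilities(X, assignments, k):
    """Estimate the quantities of Equations 4 and 5 by simple counting.

    X           : (N, d) binary matrix, X[i, r] = x_ir.
    assignments : length-N array of cluster indices from the last
                  content-based (minor) iteration.
    k           : number of clusters.

    Returns (p_global, p_cluster), where p_global[r] estimates
    Pa(x_ir = 1) and p_cluster[j, r] estimates Pa(x_ir = 1 | T_i in C_j).
    """
    # Pa(x_ir = 1): fraction of all documents in which attribute r is one.
    p_global = X.mean(axis=0)
    # Pa(x_ir = 1 | T_i in C_j): the same fraction restricted to cluster j.
    p_cluster = np.zeros((k, X.shape[1]))
    for j in range(k):
        members = X[assignments == j]
        if len(members) > 0:
            p_cluster[j] = members.mean(axis=0)
    return p_global, p_cluster
```

Both estimates are simple column means, which is why they can be maintained cheaply across iterations.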

In order to simplify this expression further and make it more intuitive, we compute the interest ratio I(r, j) as the expression Pa(xir = 1|Ti ∈ Cj)/Pa(xir = 1). This is the interest ratio for the rth auxiliary attribute in relation to the jth cluster Cj. We formally define the interest ratio as follows:

Definition 2. The interest ratio I(r, j) defines the importance of attribute r with respect to the jth cluster Cj. It is defined as the ratio of the probability of a document belonging to cluster Cj, when the rth auxiliary attribute of the document is set to 1, to the same probability unconditionally. Intuitively, the ratio indicates the proportional increase in the likelihood of a document belonging to a particular cluster (from the current set of clusters) because of a particular auxiliary attribute value. In other words, we have:

I(r, j) = Pa(xir = 1|Ti ∈ Cj)/Pa(xir = 1). (7)

A value of I(r, j) which is significantly greater than 1 indicates that the auxiliary attribute r is highly related to the current cluster Cj. The break-even value of I(r, j) is one, and it indicates that the auxiliary attribute r is not specially related to the cluster Cj. We can further simplify the expression for the conditional probability in Equation 6 as follows:

Ps(Ti ∈ Cj|Ri) = Pa(Ti ∈ Cj) · ∏_{r∈Ri} I(r, j). (8)

We note that the values of Ps(Ti ∈ Cj|Ri) should sum to 1 over the different values of the cluster index j. However, this may not be the case in practice, because of the use of the independence approximation while computing the probabilities for the different attributes. Therefore, we normalize the posterior probabilities to Pn(Ti ∈ Cj|Ri), so that they add up to 1, as follows:

Pn(Ti ∈ Cj|Ri) = Ps(Ti ∈ Cj|Ri) / ∑_{m=1}^{k} Ps(Ti ∈ Cm|Ri). (9)

These normalized posterior probabilities are then used in order to re-adjust the centroids L1 . . . Lk. Specifically, each document Ti is assigned to the corresponding centroid Lj with a probability proportional to Pn(Ti ∈ Cj|Ri), and the text content of the document Ti is added to the cluster centroid Lj of Cj with a scaling factor proportional to Pn(Ti ∈ Cj|Ri). We note that such an iteration supervises the text content of the centroids on the basis of the auxiliary information, which in turn affects the assignment of documents to centroids in the next content-based iteration. Thus, this approach iteratively tries to build a consensus between the text content-based and auxiliary attribute-based assignments of documents to clusters.
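The chain of computations in Equations 6 through 9, together with the centroid re-adjustment, can be sketched as follows. This is a simplified, array-based sketch: the function name is ours, a single smoothing constant `eps` stands in for the per-attribute smoothing parameter discussed later in the paper, and the probabilistic assignment is realized here as a soft, weight-proportional centroid update:

```python
import numpy as np

def auxiliary_iteration(X, priors, p_global, p_cluster, doc_vectors, eps=1e-3):
    """One auxiliary minor iteration, combining Equations 6-9.

    X           : (N, d) binary auxiliary-attribute matrix.
    priors      : (k,) values of Pa(T_i in C_j) from the last
                  content-based iteration.
    p_global    : (d,) estimates of Pa(x_ir = 1).
    p_cluster   : (k, d) estimates of Pa(x_ir = 1 | T_i in C_j).
    doc_vectors : (N, dt) term-frequency vectors of the documents.
    eps         : illustrative smoothing constant.
    """
    N, k = X.shape[0], len(priors)
    # Smoothed interest ratios I(r, j) for every cluster/attribute pair.
    ratios = (p_cluster + eps) / (p_global + eps)              # shape (k, d)
    posteriors = np.empty((N, k))
    for i in range(N):
        present = np.flatnonzero(X[i])                         # attributes in R_i
        # Equation 8: prior times the product of interest ratios over R_i.
        posteriors[i] = priors * ratios[:, present].prod(axis=1)
    # Equation 9: normalize so that the values sum to one per document.
    posteriors /= posteriors.sum(axis=1, keepdims=True)
    # Centroid re-adjustment: each document contributes its text content
    # to centroid L_j with a weight proportional to Pn(T_i in C_j | R_i).
    centroids = posteriors.T @ doc_vectors
    return posteriors, centroids
```

A document with no auxiliary attributes set to 1 simply retains its prior probabilities, since the empty product evaluates to one.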

In the next content-based iteration, we assign the documents to the modified cluster centroids based on the cosine similarity of the documents to the cluster centroids [26]. Each document is assigned to its closest cluster centroid based on the cosine similarity. The assigned documents are then aggregated in order to create a new centroid meta-document which aggregates the frequency of the words in the documents for that cluster. The least frequent words in this cluster are then pruned away, so as to use a vector of only the most frequent words in the cluster centroid, with their corresponding frequencies. This new assignment of documents to clusters is again used for defining the a-priori probabilities for the auxiliary attributes in the next iteration. A key issue for the algorithm is its convergence towards a uniform solution. In order to check for convergence, we assume that we have an identifier associated with each cluster in the data. This identifier does not change from one iteration to the next for a particular centroid. Within the tth major iteration, we compute the following quantities for each document for the two different minor iterations:

• We compute the cluster identifier to which the document Ti was assigned in the content-based step of the tth major iteration. This is denoted by qc(i, t).

• We compute the cluster identifier to which the document Ti had the highest probability of assignment in the auxiliary-attribute step of the tth major iteration. This is denoted by qa(i, t).

In order to determine when the iterative process should terminate, we would like the documents to have assignments to similar clusters in the (t − 1)th and tth steps at the end of both the auxiliary-attribute and content-based steps. In other words, we would like qc(i, t − 1), qa(i, t − 1), qc(i, t) and qa(i, t) to be as similar as possible. Therefore, we compute the number of documents for which all four of these quantities are the same. As the algorithm progresses, the number of such records will initially increase rapidly, and then slowly level off. The algorithm is assumed to have terminated when the number of such records does not increase by more than 1% from one iteration to the next. At this point, the algorithm is assumed to have reached a certain level of stability in terms of its assignment behavior, and it can therefore be terminated. An important point to remember is that the output of the algorithm consists of both the vectors qc(·, t) and qa(·, t). While the clustering process is inherently designed to converge to clusters which use both content and auxiliary information, some of the documents cannot be made to agree in their clustering behavior under the different criteria. The common assignments in qc(·, t) and qa(·, t) correspond to the cases in which the content-based and auxiliary-attribute assignments can be made to agree with a well-designed clustering process, and the differences in the two vectors correspond to those documents which show different clustering behavior for the auxiliary and content-based information. Such documents are interesting, because they provide an intensional understanding of how some of the content-based information may differ from the auxiliary information in the clustering process. The overall clustering algorithm is illustrated in Fig. 1.

Fig. 1. COATES algorithm.
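The content-based minor iteration and the termination test described above can be sketched as follows. The `keep_words` pruning budget and the exact form of the agreement counter are illustrative assumptions of ours, not values prescribed by the paper:

```python
import numpy as np

def content_iteration(doc_vectors, centroids, keep_words=200):
    """One content-based minor iteration: cosine assignment, followed by
    centroid re-aggregation with low-frequency word pruning."""
    # Cosine similarity of every document to every centroid.
    dn = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    qc = (dn @ cn.T).argmax(axis=1)          # closest centroid per document
    new_centroids = np.zeros_like(centroids)
    for j in range(len(centroids)):
        members = doc_vectors[qc == j]
        if len(members):
            agg = members.sum(axis=0)        # aggregate word frequencies
            # Prune all but the most frequent words in the centroid.
            cutoff = np.sort(agg)[::-1][min(keep_words, len(agg)) - 1]
            agg[agg < cutoff] = 0.0
            new_centroids[j] = agg
    return qc, new_centroids

def converged(qc_prev, qa_prev, qc_cur, qa_cur, last_count, tol=0.01):
    """Termination test: count documents whose four assignments
    qc(i, t-1), qa(i, t-1), qc(i, t), qa(i, t) all coincide, and stop
    when this count grows by no more than 1% between major iterations."""
    agree = np.sum((qc_prev == qa_prev) & (qc_prev == qc_cur)
                   & (qc_prev == qa_cur))
    return agree <= last_count * (1 + tol), agree
```

The caller would alternate `content_iteration` with the auxiliary minor iteration, feeding the agreement count of one major iteration into the next call of `converged`.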

2.2 Smoothing Issues

An important smoothing issue arises in the process of evaluating the right-hand side of Equation 6. Specifically, the expression ∏_{r∈Ri} Pa(xir = 1|Ti ∈ Cj)/Pa(xir = 1) may contain zero values of Pa(xir = 1|Ti ∈ Cj). Even a single such zero value could set the whole expression to zero, which would result in an ineffective clustering process. In order to avoid this problem, we use a smoothing parameter εr for the rth auxiliary attribute in the expression of Equation 6. The corresponding expression is therefore evaluated as ∏_{r∈Ri} (Pa(xir = 1|Ti ∈ Cj) + εr)/(Pa(xir = 1) + εr). The value of εr is fixed to within a small fraction of Pa(xir = 1).

2.3 Time Complexity

The time complexity of the approach is dominated by the time required for the content-based and auxiliary minor iterations. We will determine this running time in terms of the number of clusters k, the number of words in the text lexicon dt, the number of auxiliary attributes d, and the total number of documents N. In each iteration of the approach, O(k) cosine distance computations are performed per document. For N documents, we require O(N · k) cosine computations. Since each cosine computation may require O(dt) time, this running time is given by O(N · k · dt). In addition, each iteration requires us to compute the similarity with the auxiliary attributes. The major difference from the content-based computation is that this requires O(d) time. Therefore, the total running time required for each iteration is O(N · k · (d + dt)). The overall running time may be obtained by multiplying this value with the total number of iterations. In practice, a small number of iterations (between 3 and 8) is sufficient to reach convergence in most scenarios. Therefore, in practice, the total running time is given by O(N · k · (d + dt)).

3 EXTENSION TO CLASSIFICATION

In this section, we will discuss how to extend the approach to classification. We will extend our earlier clustering approach in order to incorporate supervision, and create a model which summarizes the class distribution in the data in terms of the clusters. Then, we will show how to use the summarized model for effective classification.

First, we will introduce some notations and definitions which are specific to the classification problem. As before, we assume that we have a text corpus S of documents. The total number of documents in the training data is N, and they are denoted by T1 . . . TN. Associated with each


text document Ti, we also have a training label l′i, which is drawn from {1 . . . k}. As before, we assume that the side information associated with the ith training document is denoted by Xi. It is assumed that a total of N′ test documents are available, which are denoted by T′1 . . . T′N′. It is assumed that the side information associated with the ith test document is denoted by X′i.

We refer to our algorithm as the COLT algorithm throughout the paper, which refers to the fact that it is a COntent and auxiLiary attribute-based Text classification algorithm. The algorithm uses a supervised clustering approach in order to partition the data into k different clusters. This partitioning is then used for the purposes of classification. The steps used in the training algorithm are as follows:

• Feature Selection: In the first step, we use feature selection to remove those attributes which are not related to the class label. This is performed both for the text attributes and the auxiliary attributes.

• Initialization: In this step, we use a supervised k-means approach in order to perform the initialization, with the use of purely text content. The main difference between a supervised k-means initialization and an unsupervised initialization is that the class memberships of the records in each cluster are pure in the supervised case. Thus, the k-means clustering algorithm is modified so that each cluster only contains records of a particular class.

• Cluster-Training Model Construction: In this phase, a combination of the text and side-information is used for the purposes of creating a cluster-based model. As in the case of initialization, the purity of the clusters is maintained during this phase.

Once the set of supervised clusters has been constructed, these clusters are used for the purposes of classification. We will discuss each of these steps in some detail below.

Next, we will describe the training process for the COLT algorithm. The first step in the training process is to create a set of supervised clusters, which are then leveraged for the classification.

The first step in the supervised clustering process is to perform the feature selection, in which only the discriminative attributes are retained. In this feature selection process, we compute the gini-index of each attribute in the data with respect to the class label. If the gini index of an attribute is γ standard deviations (or more) below the average gini index of all attributes, then the attribute is pruned globally, and is never used further in the clustering process. With some abuse of notation, we can assume that the documents Ti and auxiliary attributes Xi refer to these pruned representations. We note that this gini index computation is different from the gini-index computation with respect to the auxiliary attributes. The latter is performed during the main phase of the algorithm.
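The global pruning rule can be sketched as follows. Note that the paper does not spell out the exact form of the gini index here; we assume the class-skew form gini(r) = Σ_c f_c(r)², where f_c(r) is the fraction of documents containing attribute r that carry class label c (higher values mean a more class-skewed, hence more discriminative, attribute), so this is an illustrative reading rather than the definitive formula:

```python
import numpy as np

def select_features(X, labels, gamma=1.5):
    """Global feature selection: prune every attribute whose gini index
    falls gamma standard deviations (or more) below the average gini
    index over all attributes.  Returns a boolean keep-mask.

    X      : (N, d) binary document-attribute matrix.
    labels : length-N array of class labels.
    """
    classes = np.unique(labels)
    gini = np.zeros(X.shape[1])
    for r in range(X.shape[1]):
        docs = labels[X[:, r] == 1]          # labels of documents with attribute r
        if len(docs) == 0:
            continue
        fracs = np.array([(docs == c).mean() for c in classes])
        gini[r] = (fracs ** 2).sum()         # assumed class-skew gini index
    return gini >= gini.mean() - gamma * gini.std()
```

An attribute that appears in only one class attains the maximum gini of 1, while an attribute spread evenly over the classes scores lowest and is the first to be pruned.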

Once the features have been selected, the initialization of the training procedure is performed only with the content attributes. This is achieved by applying a k-means type algorithm, as discussed in [27], except that class label constraints are used in the process of assigning data points to clusters. Each cluster is associated with a particular class, and all the records in the cluster belong to that class. This goal is achieved by first creating unsupervised cluster centroids, and then adding supervision to the process. In order to achieve this goal, the first two iterations of the k-means type algorithm are run in exactly the same way as in [27], where the clusters are allowed to contain different class labels. After the second iteration, each cluster centroid is strictly associated with a class label, which is identified as the majority class in that cluster at that point. In subsequent iterations, the records are constrained to be assigned only to clusters with the associated class label. Therefore, in each iteration, the distance of a given document is computed only to clusters which have the same label as the document. The document is then assigned to the closest such cluster. This approach is continued to convergence.
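The constrained assignment step of this supervised initialization can be sketched as below; the function name and the array-based representation are our own:

```python
import numpy as np

def supervised_assign(doc_vectors, labels, centroids, centroid_labels):
    """Constrained assignment: each document may only be assigned to the
    closest centroid (by cosine similarity) that carries its own class
    label, as in the supervised initialization of COLT.

    doc_vectors     : (N, dt) term-frequency vectors.
    labels          : length-N class labels of the documents.
    centroids       : (k, dt) centroid vectors.
    centroid_labels : length-k class labels of the centroids.
    """
    dn = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = dn @ cn.T
    assignments = np.empty(len(doc_vectors), dtype=int)
    for i, lbl in enumerate(labels):
        allowed = np.flatnonzero(centroid_labels == lbl)
        # Among the centroids of the document's own class, pick the
        # one with the largest cosine similarity.
        assignments[i] = allowed[sims[i, allowed].argmax()]
    return assignments
```

A document whose class owns only one centroid is therefore always assigned to that centroid, which is exactly how the class constraint preserves cluster purity.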

Once the initialization has been performed, the main process of creating supervised clusters with the use of a combination of content and auxiliary attributes is started. As in the previous case, we use two minor iterations within a major iteration. One minor iteration corresponds to content-based assignment, whereas the other minor iteration corresponds to an auxiliary attribute-based assignment. The main difference is that class-based supervision is used in the assignment process. For the case of content-based assignment, we only assign a document to the closest cluster centroid which belongs to the same label. For the case of the auxiliary minor iteration, we compute the prior probability Pa(Ti ∈ Cj) and the posterior probability Ps(Ti ∈ Cj|Ri), as in the previous case, except that this is done only for cluster indices which belong to the same class label. The document is assigned to the cluster index with the largest posterior probability. Thus, the assignment is always performed to a cluster with the same label, and each cluster maintains homogeneity of class distribution. As in the previous case, this approach is applied to convergence. The overall training algorithm is illustrated in Fig. 2.

Once the supervised clusters have been created, they can be used for the purpose of classification. The supervised clusters provide an effective summary of the data which can be used for classification purposes. Both the content and auxiliary information are used in this classification process. In order to perform the classification, we separately compute the r closest clusters to the test instance T′i with the use of both content and auxiliary attributes. Specifically, for the case of content attributes, we use the similarity based on the content attributes, whereas, for the case of the auxiliary attributes, we determine the probability Ps(T′i ∈ Cj|R′i) for the test instance. In the former case, the r clusters with the largest similarity are determined, whereas in the latter case, the r clusters with the largest posterior probability are determined. We note that each of these 2 · r clusters is associated with a class label. The number of occurrences of the different class labels is computed from these 2 · r possibilities. The label with the largest presence among these 2 · r clusters is reported as the relevant class label for the test instance T′i. The overall procedure for classification is illustrated in Fig. 3.
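The voting step can be sketched as follows, assuming the content similarities and auxiliary posteriors of the test instance against all k clusters have already been computed; the function name and tie-breaking by first occurrence are our own choices:

```python
import numpy as np
from collections import Counter

def classify(sim_content, post_aux, cluster_labels, r=3):
    """Classify a test instance with the supervised clusters: take the
    r clusters closest in content similarity and the r clusters with
    the largest auxiliary posterior probability, then report the
    majority class label among the resulting 2*r clusters.

    sim_content    : (k,) content similarities of the test instance.
    post_aux       : (k,) auxiliary posterior probabilities.
    cluster_labels : (k,) class label of each supervised cluster.
    """
    top_content = np.argsort(sim_content)[::-1][:r]
    top_aux = np.argsort(post_aux)[::-1][:r]
    votes = Counter(cluster_labels[j]
                    for j in np.concatenate([top_content, top_aux]))
    return votes.most_common(1)[0][0]
```

Since a cluster may appear in both top-r lists, a cluster that is close under both criteria effectively casts two votes for its label.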

3.1 Time Complexity

The time complexity of the COLT algorithm is very similar to that of the COATES algorithm, because the COLT algorithm


Fig. 2. COLT training process.

is essentially implemented as a supervised version of the COATES algorithm. One major advantage of the COLT algorithm over the COATES algorithm is that the clusters are class-specific, and therefore a training document only needs to be compared to clusters of the same class. However, from a worst-case perspective, this could still require a comparison to O(k) clusters. Therefore, the

Fig. 3. COLT classification process with the use of the supervised clusters.

worst-case time complexity of the COLT algorithm is still the same, which is given by O(N · k · (d + dt)).

4 EXPERIMENTAL RESULTS

In this section, we compare our clustering and classification methods against a number of baseline techniques on real and synthetic data sets. We refer to our clustering approach as COntent and Auxiliary attribute based TExt cluStering (COATES). As baselines, we used two different methods: (1) An efficient projection-based clustering approach [27], which adapts the k-means approach to text. This approach is widely known to provide excellent clustering results in a very efficient way. We refer to this algorithm as SchutzeSilverstein [text only] in all figure legends in the experimental section. (2) An adaptation of the k-means approach which uses both text and side information directly. We refer to this baseline as K-Means [text+side] in all figure legends.

For the case of the classification problem, we tested the COLT method against the following baseline methods: (1) We tested against a Naive Bayes classifier which uses only text. (2) We tested against an SVM classifier which uses only text. (3) We tested against a supervised clustering method which uses both text and side information.

Thus, we compare our algorithms with baselines which are chosen in such a way that we can evaluate the advantage of our approach over both a pure text-mining method and a natural alternative which uses both text and side-information. In order to adapt the k-means approach to the case where both text and side-information are used, the auxiliary attributes were simply used as text content in the form of "pseudo-words" in the collection. This makes it relatively simple to modify the k-means algorithm to that case. We will show that our approach has significant advantages for both the clustering and classification problems.
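The pseudo-word adaptation for the k-means [text+side] baseline can be sketched in a few lines. The `aux:` prefix is our own convention for keeping the pseudo-words disjoint from the text lexicon; the paper does not specify an encoding:

```python
def add_pseudo_words(text_tokens, side_attributes):
    """Inject every auxiliary attribute value into the document as a
    pseudo-word, so that an unmodified text clustering algorithm can
    consume text and side information together.

    text_tokens     : list of word tokens of the document.
    side_attributes : list of (attribute_name, value) pairs.
    """
    return text_tokens + ["aux:%s=%s" % (name, value)
                          for name, value in side_attributes]
```

For example, a Cora paper would gain one pseudo-word per author and per citation, which the baseline then treats exactly like any other term.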

4.1 Data Sets

We used three real data sets in order to test our approach. The data sets used were as follows:


(1) Cora Data Set: The Cora data set1 contains 19,396 scientific publications in the computer science domain. Each research paper in the Cora data set is classified into a topic hierarchy. On the leaf level, there are 73 classes in total. We used the second level labels in the topic hierarchy, of which there are 10: Information Retrieval, Databases, Artificial Intelligence, Encryption and Compression, Operating Systems, Networking, Hardware and Architecture, Data Structures Algorithms and Theory, Programming and Human Computer Interaction. We further obtained two types of side information from the data set: citation and authorship. These were used as separate attributes in order to assist in the clustering process. There are 75,021 citations and 24,961 authors. Each paper has 2.58 authors on average, and there are 50,080 paper-author pairs in total.

(2) DBLP-Four-Area Data Set: The DBLP-Four-Area data set [31] is a subset extracted from DBLP that contains four data mining related research areas, which are database, data mining, information retrieval and machine learning. This data set contains 28,702 authors, and the texts are the important terms associated with the papers that were published by these authors. In addition, the data set contains information about the conferences in which each author published. There are 20 conferences in these four areas and 44,748 author-conference pairs. Besides the author-conference attribute, we also used co-authorship as another type of side information, and there were 66,832 coauthor pairs in total.

(3) IMDB Data Set: The Internet Movie DataBase (IMDB) is an online collection2 of movie information. We obtained ten years of movie data, from 1996 to 2005, from IMDB in order to perform text clustering. We used the plot of each movie as the text for pure text clustering. The genre of each movie is regarded as its class label. We extracted movies from the top four genres in IMDB, which were labeled Short, Drama, Comedy, and Documentary. We removed the movies which belong to more than two of the above genres. There were 9,793 movies in total, comprising 1,718 movies from the Short genre, 3,359 movies from the Drama genre, 2,324 movies from the Comedy genre and 2,392 movies from the Documentary genre. The names of the directors, actors, actresses, and producers were used as categorical attributes corresponding to side information. The IMDB data set contains 14,374 movie-director pairs, 154,340 movie-actor pairs, 86,465 movie-actress pairs and 36,925 movie-producer pairs.

4.2 Evaluation Metrics

The aim is to show that our approach is superior to natural clustering alternatives with the use of either pure text or with the use of both text and side information. In each data set, the class labels were given, but they were not used in the clustering process. For each cluster, we computed the cluster purity, which is defined as the fraction of documents in the cluster which correspond to its dominant class. The average cluster purity over all clusters (weighted by cluster size) was reported as a surrogate for

1. http://www.cs.umass.edu/~mccallum/code-data.html
2. http://www.imdb.com

the quality of the clustering process. Let the number of data points in the k clusters be denoted by n1 . . . nk. We denote the dominant input cluster label in the k clusters by l1 . . . lk. Let the number of data points with input cluster label li be denoted by ci. Then, the overall cluster purity P is defined as the fraction of data points in the clustering which occur as a dominant input cluster label. Therefore, we have:

P = ∑_{i=1}^{k} ci / ∑_{i=1}^{k} ni. (10)

The cluster purity always lies between 0 and 1. Clearly, a perfect clustering will provide a cluster purity of almost 1, whereas a poor clustering will provide very low values of the cluster purity. For efficiency, we tested the execution time of our method with respect to the baselines over the three real data sets.
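The weighted purity of Equation 10 can be computed in a few lines; the function name is ours:

```python
import numpy as np

def cluster_purity(assignments, true_labels):
    """Weighted cluster purity (Equation 10): for each cluster, count
    the data points belonging to its dominant class, and divide the
    total by the overall number of data points."""
    assignments = np.asarray(assignments)
    true_labels = np.asarray(true_labels)
    dominant = 0
    for j in np.unique(assignments):
        members = true_labels[assignments == j]
        # c_i: size of the dominant class within cluster j.
        dominant += np.bincount(members).max()
    return dominant / len(true_labels)
```

Because the dominant-class counts are summed before dividing by the total number of points, larger clusters contribute proportionally more, which matches the size-weighted average described above.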

Since the quality of content-based clustering varies based on random initialization, we repeated each test ten times with different random seeds and reported the average as the final score. Unless otherwise mentioned, the default values of the number of input clusters used for the Cora, DBLP-Four-Area and IMDB data sets were 16, 8 and 6, respectively. These default sizes were chosen based on the size of the underlying data set. All results were tested on a Debian GNU/Linux server. The system used two dual-core 2.4 GHz Opteron processors with 4GB RAM.

4.3 Effectiveness Results

In this section, we will present the effectiveness results for all three data sets. The effectiveness results for the two baseline algorithms and the COATES algorithm with an increasing number of clusters for the CORA, DBLP and IMDB data sets are illustrated in Fig. 4(a), (c) and (e) respectively. The number of clusters is illustrated on the X-axis, whereas the cluster purity with respect to the ground truth is illustrated on the Y-axis. Since the data sets are quite noisy, the purity numbers are somewhat low both for the COATES algorithm and the baselines. We notice that the purity increases slightly when the number of clusters increases on all three data sets. This is quite natural, since a larger number of clusters results in a finer granularity of partitioning. We further notice that COATES outperforms the baselines on all three data sets by a wide margin in terms of purity. In the Cora and IMDB data sets, there was not much difference between the two baselines (which differ in terms of their use of side information), because the k-means baseline was unable to leverage the low frequency side information effectively, and the effects of the text content dominated the clustering results. This shows that the side-information is not only important in the clustering process, but that it is also important to leverage it properly in order to obtain the maximum gains during the clustering. The improvement in quality over both baselines remains competitive throughout different settings of the number of clusters. This suggests that the COATES algorithm is consistently superior to the different baselines, irrespective of the value of this parameter, and that the COATES scheme is able to use the side information


Fig. 4. Effectiveness results on all data sets. (a) CORA (Effectiveness vs. number of clusters). (b) CORA (Effectiveness vs. data size). (c) DBLP (Effectiveness vs. number of clusters). (d) DBLP (Effectiveness vs. data size). (e) IMDB (Effectiveness vs. number of clusters). (f) IMDB (Effectiveness vs. data size).

in a robust and consistent way over different numbers of clusters.

We also tested the effectiveness of the method with increasing data size. This is done by sampling a portion of the data, and reporting the results for different sample sizes. The effectiveness results with increasing data size for the Cora, DBLP, and IMDB data sets are illustrated in Fig. 4(b), (d) and (f) respectively. In all figures, the X-axis illustrates the size of the data and the Y-axis represents the cluster purity. We find that the improvement in quality with the use of side information is more pronounced for the case of larger data sets. In particular, we can see that at the lower end of the X-axis (corresponding to data samples which are only about 20% of the full size), the effectiveness of adding side information is insignificant in terms of improvement. This is because the sparse side information associated with the data sets becomes less effective when the data set itself is very small. In such cases, most side information attributes appear very infrequently, and thus it is difficult to leverage them in order to obtain better clustering assignments. When the data size increases to around 40% of the whole data, there is enough side information to improve the quality of clustering. As the size of the data set increases, the improvement in quality becomes even more significant. We can also infer that COATES works especially well when the data set is fairly large, because in such cases there are many repetitions of the same attribute value of a particular kind of side information, especially when it is sparse. The repetition of the same attribute value of the side information is critical in being able to leverage it for improving clustering quality. In addition, the COATES algorithm was consistently superior to both the baselines in terms of clustering quality.

4.4 Efficiency Results

In this section, we present the running times for COATES and the two different baselines. We note that all approaches use the same text preprocessing methods, such as stop-word removal, stemming and term frequency computation, and the time required for side-information pre-processing is negligible. Therefore, the pre-processing time is virtually the same for all methods, and we present the running times only for the clustering portions in order to sharpen the comparisons and make them more meaningful. We further note that since the COATES algorithm is expected to incorporate side-information into the clustering process in a much more significant way than both the baselines, its running times are correspondingly expected to be somewhat higher. The goal is to show that the overheads associated with the better qualitative results of the COATES algorithm are somewhat low. We first tested the efficiency of the different methods with respect to the number of clusters, and the results for the CORA, DBLP and IMDB data sets are reported in Fig. 5(a), (c) and (e) respectively. The number of clusters is illustrated on the X-axis, and the running time is illustrated on the Y-axis. From the figures, it is evident that the COATES algorithm consumes more running time compared with the baselines, though it is only slightly slower than both. The reason is that the COATES method processes the side information in a much more focused way, which slightly increases its running time. For example, in the Cora data set, the side information comprises 50,080 paper-author pairs and 75,021 citing paper-cited paper pairs. It is important to understand that the COATES clustering process needs to create a separate iteration in order to effectively process the side information. Therefore, the slight overhead of 10% to 30% on the running time of the COATES


Fig. 5. Efficiency results on all data sets. (a) CORA (Efficiency vs. number of clusters). (b) CORA (Efficiency vs. data size). (c) DBLP (Efficiency vs. number of clusters). (d) DBLP (Efficiency vs. data size). (e) IMDB (Efficiency vs. number of clusters). (f) IMDB (Efficiency vs. data size).

algorithm is quite acceptable. From the figures, it is evident that the COATES algorithm scales linearly with an increasing number of clusters for all data sets. This is because the distance function computations and posterior probability computations scale linearly with an increasing number of clusters.

It is also valuable to test the scalability of the proposed approach with increasing data size. The results for the Cora, DBLP, and IMDB data sets are illustrated in Fig. 5(b), (d) and (f) respectively. In all figures, the X-axis illustrates the size of the data, and the Y-axis illustrates the running time. As in the previous case, the COATES algorithm consumes a little more time than the baseline algorithms. These results also show that the running time scales linearly with increasing data size. This is because the number of distance function computations and posterior probability computations scales linearly with the number of documents in each iteration. The linear scalability implies that the technique can be used very effectively for large text data sets.

4.5 Sensitivity Analysis

We also tested the sensitivity of the COATES algorithm with respect to two important parameters. We will present the sensitivity results on the Cora and DBLP-Four-Area data sets. As mentioned in the algorithm in Fig. 1, we used the threshold γ to select discriminative auxiliary attributes. While the default value of the parameter was chosen to be 1.5, we also present the effects of varying this parameter. Fig. 6(a) and (b) show the effects on cluster purity of varying the value of the threshold γ from 0 to 2.5 on the Cora and DBLP-Four-Area data sets. In both figures, the X-axis illustrates the value of the threshold γ, and the Y-axis presents the purity. The results are constant for both baseline methods because they do not use this parameter. It is evident from both figures that setting the threshold γ too low results in purity degradation,

since the algorithm will prune the auxiliary attributes too aggressively in this case. On both data sets, the COATES algorithm achieves good purity results when γ is set to 1.5. Further increasing the value of γ reduces the purity slightly, because setting γ too high will result in also including noisy attributes. Typically, by picking γ in the range of (1.5, 2.5), the best results were observed. Therefore, the algorithm shows good performance for a fairly large range

Fig. 6. Sensitivity analysis with threshold γ. (a) Cora data set. (b) DBLP-four-area data set.



Fig. 7. Sensitivity analysis with smoothing parameter ε. (a) Cora data set. (b) DBLP-four-area data set.

of values of γ. This suggests that the approach is quite robust.
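The selection rule can be sketched as follows. This is an illustrative sketch only: the noise scores and the exact criterion below are hypothetical stand-ins, with γ interpreted (as described later for the classification experiments) as a threshold expressed in normalized standard deviations.

```python
import statistics

def prune_noisy_attributes(noise, gamma=1.5):
    """Hypothetical sketch of the gamma-based attribute filter.

    `noise` maps each auxiliary attribute to an (illustrative) noise
    score measuring how non-discriminative it is across clusters.
    An attribute is pruned when its score lies more than `gamma`
    standard deviations above the mean score, so a small gamma prunes
    aggressively while a large gamma lets noisy attributes through.
    """
    mean = statistics.mean(noise.values())
    std = statistics.pstdev(noise.values())
    return {a for a, s in noise.items() if s < mean + gamma * std}

# Illustrative scores: two clearly noisy attributes and one borderline one.
noise = {"a0": 0.10, "a1": 0.12, "a2": 0.15, "a3": 0.18, "a4": 0.20,
         "a5": 0.22, "a6": 0.25, "a7": 0.30, "mid": 0.50,
         "n1": 0.90, "n2": 0.95}
```

Under this reading, a small γ discards even mildly atypical auxiliary attributes, while a large γ admits the noisy ones, which is consistent with the purity behavior observed in Fig. 6.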

Fig. 7(a) and (b) illustrate the impact of the smoothing parameter ε on the effectiveness of the clustering process. The value of the smoothing parameter ε is illustrated on a logarithmic scale on the X-axis, whereas the purity is illustrated on the Y-axis. Although the X-axis is on a logarithmic scale, the purity is quite stable across all ranges. However, we note that ε should not be set to too large a value, because this reduces the positive effects of the additional knowledge added by the auxiliary variables. Nevertheless, the robustness of the approach over large ranges of values of ε

shows that the method can be used quite safely for different values of the parameter ε.
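The role of a smoothing parameter of this kind can be illustrated with a standard additive-smoothing estimate. This is a generic sketch, not the paper's exact estimator, and the function and its arguments are hypothetical:

```python
def smoothed_probability(count, total, vocab_size, eps=1e-3):
    """Additive (Laplace-style) smoothing with parameter eps.

    Generic sketch: eps keeps unseen attribute values from receiving
    zero probability, but an overly large eps flattens the estimate
    toward the uniform distribution 1/vocab_size, washing out the
    signal contributed by the auxiliary attributes.
    """
    return (count + eps) / (total + eps * vocab_size)
```

As ε grows, every estimate approaches 1/vocab_size regardless of the data, which is why very large values of ε reduce the positive effect of the side-information, as noted above.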

4.6 Extension to Classification

We also tested the classification accuracy of the COLTClassify method, which uses both text and side-information. As baselines, the following algorithms were tested: (a) a naive Bayes classifier3, (b) an SVM classifier4, and (c) a supervised k-means method which is based on both text and side information. In the last case, the classification is performed by using the nearest cluster based on both text and side information.
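The nearest-cluster rule used by the supervised k-means baseline can be sketched as follows; the plain Euclidean distance and the function names here are illustrative stand-ins for the combined text-plus-side-information similarity actually used.

```python
import math

def nearest_cluster_label(x, centroids, labels):
    """Sketch of the supervised k-means baseline: a test point is
    assigned the label of its nearest cluster centroid.  Euclidean
    distance is used here for simplicity; the baseline in the
    experiments combines text and side-information similarities."""
    best = min(range(len(centroids)),
               key=lambda i: math.dist(x, centroids[i]))
    return labels[best]
```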

For each of the data sets, we used 10-fold cross-validation to evaluate the classification model. Clearly, the accuracy of such a model would depend upon the underlying model parameters. For example, the number of clusters in the model may affect the accuracy of the clustering process. In Fig. 8, we have illustrated the variation in accuracy

3. http://www.cs.waikato.ac.nz/ml/weka/
4. http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Fig. 8. Classification accuracy with the number of clusters. (a) Cora data set. (b) DBLP-four-area data set. (c) IMDB data set.

with the number of clusters for each of the three data sets. The results for the Cora, DBLP, and IMDB data sets are illustrated in Fig. 8(a), (b), and (c), respectively. The values of r, γ, and ε were set to 2, 1.5, and 0.001, respectively. The number of clusters is illustrated on the X-axis, and the classification accuracy is illustrated on the Y-axis. We note that the SVM and Bayes classifiers are not dependent on clustering, and therefore their classification accuracy is a horizontal line. In each case, it can be clearly seen that the accuracy of the COLTClassify method was significantly higher than that of all the other methods. There were some variations in the classification accuracy across the different methods for different data sets. However, the COLTClassify method retained the highest accuracy over all data sets, and was quite robust to the number of clusters. This suggests that the method is quite robust both to the choice of the data set and to the model parameters.
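The 10-fold cross-validation protocol used in these experiments can be sketched generically as follows, with `train` and `evaluate` standing in for any of the classifiers compared above:

```python
def cross_validate(data, train, evaluate, folds=10):
    """Generic k-fold cross-validation: each fold serves once as the
    held-out test set while the remaining folds form the training
    set; the reported figure is the mean accuracy over all folds."""
    n = len(data)
    scores = []
    for f in range(folds):
        lo, hi = f * n // folds, (f + 1) * n // folds
        test = data[lo:hi]
        training = data[:lo] + data[hi:]
        model = train(training)
        scores.append(evaluate(model, test))
    return sum(scores) / folds
```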

Next, we test the efficiency of our classification scheme. Since our scheme uses both text and side information, and therefore has a much more complex model, it is reasonable to assume that it would require more time than



Fig. 9. Classification efficiency with the number of clusters. (a) Cora data set. (b) DBLP-four-area data set. (c) IMDB data set.

(the simpler) text-only models. However, we will show that our scheme continues to retain practical running times, in light of its substantial effectiveness. We present the running times of the classification scheme and the baselines in Fig. 9(a)–(c). The number of clusters is illustrated on the X-axis, and the running time in seconds is shown on the Y-axis. As with the results on classification accuracy, the SVM and Bayes classifiers appear as horizontal lines in the figures because they are not dependent on the number of clusters. The running time of the clustering-based methods (supervised k-means and COLTClassify) increases with a larger number of clusters. It is clear that the Bayes classifier achieves the best efficiency due to the simplicity of its model, followed by the SVM classifier. We note that both the supervised k-means method and COLTClassify consume more time. The reason is that both methods process more data, including text as well as side information, and both are iterative approaches extended from clustering methods. In addition, COLTClassify is a more complex model than the supervised k-means algorithm, and therefore consumes more time. However, considering the effectiveness gained

Fig. 10. Variation of classification accuracy with the threshold γ. (a) Cora data set. (b) DBLP-four-area data set.

by COLTClassify, the overhead in running time is quite acceptable.

We also tested the sensitivity of the classification scheme to the parameters γ and ε. The sensitivity of the scheme with respect to the parameter γ for the Cora and DBLP data sets is presented in Fig. 10(a) and (b), respectively. The threshold γ is illustrated on the X-axis, and the classification accuracy is illustrated on the Y-axis. The baselines are also illustrated in the same figure. It is evident that for most of the range of the parameter γ, the scheme continues to perform much better than the baselines. The only case in which it does not do as well is when the feature selection threshold is chosen to be too small. This is because the feature selection tends to be too aggressive in those cases, and this leads to a loss of accuracy. Further increasing the value of γ beyond 1.5 reduces the accuracy slightly, because setting γ too high results in the inclusion of noisy attributes. However, for most of the range, the COLTClassify technique tends to retain its effectiveness with respect to the other methods. In general, since the value of γ is expressed in terms of normalized standard deviations, it is expected not to vary much with the data set. In our experience, the approach worked quite well for γ in the range (1.5, 2.5). This suggests a high level of robustness of the scheme to a wide range of choices of the threshold parameter γ.

The sensitivity of the scheme with respect to the smoothing parameter ε for the Cora and DBLP data sets is presented in Fig. 11(a) and (b), respectively. The smoothing parameter ε is illustrated on the X-axis, and the accuracy is illustrated on the Y-axis. We note that the smoothing parameter is not required by the other schemes, and therefore their accuracy is presented as a horizontal line in



Fig. 11. Variation of classification accuracy with the smoothing factor. (a) Cora data set. (b) DBLP-four-area data set.

those cases. For the entire range of values of the smoothing parameter ε, the COLTClassify method performs much more effectively than the other schemes. In fact, the classification accuracy did not change much across the entire range of the smoothing parameter. Therefore, our technique is very robust to the choice of the smoothing parameter ε as well.

5 CONCLUSION AND SUMMARY

In this paper, we presented methods for mining text data with the use of side-information. Many forms of text databases contain a large amount of side-information or meta-information, which may be used in order to improve the clustering process. In order to design the clustering method, we combined an iterative partitioning technique with a probability estimation process which computes the importance of different kinds of side-information. This general approach is used in order to design both clustering and classification algorithms. We presented results on a number of real data sets illustrating the effectiveness of our approach. The results show that the use of side-information can greatly enhance the quality of text clustering and classification, while maintaining a high level of efficiency.

ACKNOWLEDGMENTS

The research of the first author was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement W911NF-09-2-0053. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the US Government. The US Government is authorized to reproduce and distribute

reprints for Government purposes notwithstanding any copyright notice hereon. The research of Y. Zhao and P. S. Yu was supported in part by the US National Science Foundation through grants IIS-0905215, IIS-0914934, CNS-1115234, DBI-0960443, and OISE-1129076, the US Department of the Army through grant W911NF-12-1-0066, the Google Mobile 2014 Program, and a KAU grant. This paper is an extended version of the IEEE ICDE Conference paper [7] for the "Best of ICDE" Special Issue.

REFERENCES

[1] C. C. Aggarwal and H. Wang, Managing and Mining Graph Data. New York, NY, USA: Springer, 2010.

[2] C. C. Aggarwal, Social Network Data Analytics. New York, NY, USA: Springer, 2011.

[3] C. C. Aggarwal and C.-X. Zhai, Mining Text Data. New York, NY, USA: Springer, 2012.

[4] C. C. Aggarwal and C.-X. Zhai, "A survey of text classification algorithms," in Mining Text Data. New York, NY, USA: Springer, 2012.

[5] C. C. Aggarwal and P. S. Yu, "A framework for clustering massive text and categorical data streams," in Proc. SIAM Conf. Data Mining, 2006, pp. 477–481.

[6] C. C. Aggarwal, S. C. Gates, and P. S. Yu, "On using partial supervision for text categorization," IEEE Trans. Knowl. Data Eng., vol. 16, no. 2, pp. 245–255, Feb. 2004.

[7] C. C. Aggarwal and P. S. Yu, "On text clustering with side information," in Proc. IEEE ICDE Conf., Washington, DC, USA, 2012.

[8] R. Angelova and S. Siersdorfer, "A neighborhood-based approach for clustering of linked document collections," in Proc. CIKM Conf., New York, NY, USA, 2006, pp. 778–779.

[9] A. Banerjee and S. Basu, "Topic models over text streams: A study of batch and online unsupervised learning," in Proc. SDM Conf., 2007, pp. 437–442.

[10] J. Chang and D. Blei, "Relational topic models for document networks," in Proc. AISTATS, Clearwater, FL, USA, 2009, pp. 81–88.

[11] D. Cutting, D. Karger, J. Pedersen, and J. Tukey, "Scatter/Gather: A cluster-based approach to browsing large document collections," in Proc. ACM SIGIR Conf., New York, NY, USA, 1992, pp. 318–329.

[12] I. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. ACM KDD Conf., New York, NY, USA, 2001, pp. 269–274.

[13] I. Dhillon, S. Mallela, and D. Modha, "Information-theoretic co-clustering," in Proc. ACM KDD Conf., New York, NY, USA, 2003, pp. 89–98.

[14] P. Domingos and M. J. Pazzani, "On the optimality of the simple Bayesian classifier under zero-one loss," Mach. Learn., vol. 29, no. 2–3, pp. 103–130, 1997.

[15] M. Franz, T. Ward, J. S. McCarley, and W. J. Zhu, "Unsupervised and supervised clustering for topic tracking," in Proc. ACM SIGIR Conf., New York, NY, USA, 2001, pp. 310–317.

[16] G. P. C. Fung, J. X. Yu, and H. Lu, "Classifying text streams in the presence of concept drifts," in Proc. PAKDD Conf., Sydney, NSW, Australia, 2004, pp. 373–383.

[17] H. Frigui and O. Nasraoui, "Simultaneous clustering and dynamic keyword weighting for text documents," in Survey of Text Mining, M. Berry, Ed. New York, NY, USA: Springer, 2004, pp. 45–70.

[18] S. Guha, R. Rastogi, and K. Shim, "CURE: An efficient clustering algorithm for large databases," in Proc. ACM SIGMOD Conf., New York, NY, USA, 1998, pp. 73–84.

[19] S. Guha, R. Rastogi, and K. Shim, "ROCK: A robust clustering algorithm for categorical attributes," Inf. Syst., vol. 25, no. 5, pp. 345–366, 2000.

[20] Q. He, K. Chang, E.-P. Lim, and J. Zhang, "Bursty feature representation for clustering text streams," in Proc. SDM Conf., 2007, pp. 491–496.

[21] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ, USA: Prentice-Hall, Inc., 1988.

[22] T. Liu, S. Liu, Z. Chen, and W.-Y. Ma, "An evaluation of feature selection for text clustering," in Proc. ICML Conf., Washington, DC, USA, 2003, pp. 488–495.



[23] A. McCallum. (1996). Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering [Online]. Available: http://www.cs.cmu.edu/~mccallum/bow

[24] Q. Mei, D. Cai, D. Zhang, and C.-X. Zhai, "Topic modeling with network regularization," in Proc. WWW Conf., New York, NY, USA, 2008, pp. 101–110.

[25] R. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proc. VLDB Conf., San Francisco, CA, USA, 1994, pp. 144–155.

[26] G. Salton, An Introduction to Modern Information Retrieval. London, U.K.: McGraw Hill, 1983.

[27] H. Schutze and C. Silverstein, "Projections for efficient document clustering," in Proc. ACM SIGIR Conf., New York, NY, USA, 1997, pp. 74–81.

[28] F. Sebastiani, "Machine learning for automated text categorization," ACM CSUR, vol. 34, no. 1, pp. 1–47, 2002.

[29] C. Silverstein and J. Pedersen, "Almost-constant time clustering of arbitrary corpus sets," in Proc. ACM SIGIR Conf., New York, NY, USA, 1997, pp. 60–66.

[30] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proc. Text Mining Workshop KDD, 2000, pp. 109–110.

[31] Y. Sun, J. Han, J. Gao, and Y. Yu, "iTopicModel: Information network integrated topic modeling," in Proc. ICDM Conf., Miami, FL, USA, 2009, pp. 493–502.

[32] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," in Proc. ACM SIGIR Conf., New York, NY, USA, 2003, pp. 267–273.

[33] T. Yang, R. Jin, Y. Chi, and S. Zhu, "Combining link and content for community detection: A discriminative approach," in Proc. ACM KDD Conf., New York, NY, USA, 2009, pp. 927–936.

[34] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An efficient data clustering method for very large databases," in Proc. ACM SIGMOD Conf., New York, NY, USA, 1996, pp. 103–114.

[35] Y. Zhao and G. Karypis, "Topic-driven clustering for document datasets," in Proc. SIAM Conf. Data Mining, 2005, pp. 358–369.

[36] Y. Zhou, H. Cheng, and J. X. Yu, "Graph clustering based on structural/attribute similarities," PVLDB, vol. 2, no. 1, pp. 718–729, 2009.

[37] S. Zhong, "Efficient streaming text clustering," Neural Netw., vol. 18, no. 5–6, pp. 790–798, 2005.

Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, NY, USA. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996. His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin. He has since worked in the fields of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has filed for, or been granted, over 80 patents. Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) and an IBM Outstanding Technical Achievement Award (2009) for his scientific contributions to data stream research. He has served on the program committees of most major database/data mining conferences, and served as program vice-chair of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference, 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM TKDD, an action editor of Data Mining and Knowledge Discovery, an associate editor of ACM SIGKDD Explorations, and an associate editor of Knowledge and Information Systems. He is a fellow of the IEEE and the ACM for "contributions to knowledge discovery and data mining techniques."

Yuchen Zhao received his bachelor's degree in computer software from Tsinghua University, China, in 2007, and the Ph.D. degree in Computer Science from the University of Illinois at Chicago, Chicago, IL, USA, in 2013. He is currently a data scientist at Sumo Logic Inc, Redwood City, CA, USA. He has been working in the field of data mining, especially graph mining, text mining, and algorithms on stream data.

Philip S. Yu received the B.S. degree in E.E. from National Taiwan University, the M.S. and Ph.D. degrees in E.E. from Stanford University, and the M.B.A. degree from New York University. He is a Distinguished Professor in the Department of Computer Science at the University of Illinois at Chicago, Chicago, IL, USA, and also holds the Wexler Chair in Information Technology. He spent most of his career at the IBM Thomas J. Watson Research Center and was manager of the Software Tools and Techniques group. His research interests include data mining, privacy-preserving data publishing, data streams, Internet applications and technologies, and database systems. He has published more than 800 papers in refereed journals and conferences. He holds or has applied for more than 300 U.S. patents. He is a fellow of the ACM and the IEEE. He is the Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data. He is on the steering committees of the IEEE Conference on Data Mining and the ACM Conference on Information and Knowledge Management, and was a member of the IEEE Data Engineering steering committee. He was the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (2001-2004). He has also served as an associate editor of ACM Transactions on Internet Technology (2000-2010) and Knowledge and Information Systems (1998-2004). In addition to serving as a program committee member for various conferences, he was the program chair or co-chair of the 2009 IEEE Intl. Conf. on Service-Oriented Computing and Applications, the IEEE Workshop on Scalable Stream Processing Systems (SSPS'07), the IEEE Workshop on Mining Evolving and Streaming Data (2006), the 2006 joint conferences of the 8th IEEE Conference on E-Commerce Technology (CEC '06) and the 3rd IEEE Conference on Enterprise Computing, E-Commerce and E-Services (EEE '06), the 11th IEEE Intl. Conference on Data Engineering, the 6th Pacific Area Conference on Knowledge Discovery and Data Mining, the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, the 2nd IEEE Intl. Workshop on Research Issues on Data Engineering: Transaction and Query Processing, the PAKDD Workshop on Knowledge Discovery from Advanced Databases, and the 2nd IEEE Intl. Workshop on Advanced Issues of E-Commerce and Web-based Information Systems. He served as the general chair or co-chair of the 2009 IEEE Intl. Conf. on Data Mining, the 2009 IEEE Intl. Conf. on Data Engineering, the 2006 ACM Conference on Information and Knowledge Management, the 1998 IEEE Intl. Conference on Data Engineering, and the 2nd IEEE Intl. Conference on Data Mining. He has received several IBM honors, including 2 IBM Outstanding Innovation Awards, an Outstanding Technical Achievement Award, 2 Research Division Awards, and the 94th plateau of Invention Achievement Awards. He was an IBM Master Inventor. He received a Research Contributions Award from the IEEE Intl. Conference on Data Mining in 2003 and also an IEEE Region 1 Award for "promoting and perpetuating numerous new electrical engineering concepts" in 1999.
