
This may be the author’s version of a work that was submitted/accepted for publication in the following source:

Sutanto, Taufik Edy & Nayak, Richi (2018) Fine-grained document clustering via ranking and its application to social media analytics. Social Network Analysis and Mining, 8, Article number: 29, 1-19.

This file was downloaded from: https://eprints.qut.edu.au/120480/

© Springer-Verlag GmbH Austria, part of Springer Nature 2018

This work is covered by copyright. Unless the document is being made available under a Creative Commons Licence, you must assume that re-use is limited to personal use and that permission from the copyright owner must be obtained for all other uses. If the document is available under a Creative Commons License (or other specified license) then refer to the Licence for details of permitted re-use. It is a condition of access that users recognise and abide by the legal requirements associated with these rights. If you believe that this work infringes copyright please provide details by email to [email protected]

Notice: Please note that this document may not be the Version of Record (i.e. published version) of the work. Author manuscript versions (as Submitted for peer review or as Accepted for publication after peer review) can be identified by an absence of publisher branding and/or typeset appearance. If there is any doubt, please refer to the published source.

https://doi.org/10.1007/s13278-018-0508-z


Fine-Grained Document Clustering via Ranking and Its Application to Social Media Analytics

Taufik Sutanto¹ and Richi Nayak²

¹Syarif Hidayatullah State Islamic University Jakarta
²Queensland University of Technology (QUT)

[email protected], [email protected]

Abstract. Extracting valuable insights from a large volume of unstructured data such as text through clustering analysis is paramount to many big data applications. However, document clustering is challenged by the computational complexity of the underlying methods and the high dimensionality of the data, especially when the number of required clusters is large. A fine-grained clustering solution is required to understand a dataset that represents heterogeneous topics, such as social media data. This paper presents the Fine-Grained document Clustering via Ranking (FGCR) approach, which leverages the capability of search engines to handle big data efficiently. Ranking scores from a search engine are used to calculate dynamic cluster representations, called loci, in an unsupervised learning setting. Clustering decisions are efficiently made based on an optimal selection from a small subset of loci instead of the entire cluster set, as in conventional centroid-based clustering. A comprehensive empirical study on several social media datasets shows that FGCR is able to produce insightful and accurate fine-grained solutions. Moreover, it is orders of magnitude faster and requires fewer computational resources than other state-of-the-art document clustering approaches.

Keywords: Clustering, Fine-grained, Loci, Ranking, Social Media Analytics.

1 Introduction

Social media data is fundamental to many big data applications [6]. Efficiently extracting insights from this large volume of data, beyond the high investment in analytics infrastructure, is a significant challenge in big data research [25]. Unsupervised learning techniques based on clustering analysis have proved successful in achieving this objective [22]. For example, Yin et al. [65] used clustering analysis on social media data to build an awareness system for disasters and crisis events: social media posts were incrementally clustered by topic and presented on a map with users' geo-locations to highlight events of interest (e.g. natural disasters). Similarly, the authors in [16] applied clustering to social media data generated from a hospitality application to identify customers' interests.


Clustering a large volume of social media text data is challenging. Most existing text clustering methods suffer from problems of accuracy and computational complexity due to the dimensionality of the data [12,48]. Due to the data velocity and volume, finding an acceptable number of clusters is also a significant issue [65]. Moreover, social media data exhibit unique characteristics such as short text, noise, and coverage of multiple, heterogeneous topics [20].

When social media data is extracted from a specific trending topic over an interval of time, there may not be many topic clusters involved in the data. However, there are occasions when the data is crawled using more general keywords or topics, such as social events, and contains a large number of clusters [40,43]. Another situation where social media data can have a significant number of clusters is when the data in a database is collected over time, as in a big data system. To find important information in these types of data, a fine-grained clustering solution is normally preferable for producing meaningful insights from the data [9,65].

However, the majority of existing clustering algorithms face the problem of increasing computational complexity as the number of clusters increases [8,34]. In centroid-based clustering, this is due to the increasing number of pairwise similarity comparisons between documents and clusters. Fig. 1 illustrates the magnitude of the fine-grained clustering problem: the running time of the well-known k-means++ clustering algorithm [2] grows rapidly with the increase in data instances (N) and cluster numbers (k) on a synthetic dataset.

Researchers have attempted to solve the scalability problem via several approaches, some at the cost of accuracy. The most common techniques employ dimensionality reduction via projection or random subspace sampling [48,61]. In fine-grained clustering, a subspace approach is challenged by the enormous number of total subspaces available [27]. A projection-based approach may also not work due to the difficulty of finding an appropriate projection function [26].

There have been several attempts to implement conventional clustering methods on large infrastructure. Several notable challenges arise when big data clustering is applied in a parallel and distributed setting. The accuracy of the clustering solution is usually lower than that of the serial counterparts, since the individually optimal clustering outcomes from each node will not necessarily merge into an optimal combined solution [48]. Performance also remains an issue when conventional clustering is applied on dedicated infrastructure: for a web-scale dataset, it is reported that thousands of hours are spent for a conventional clustering to finish, even on massively parallel machines [5]. This emphasizes the need to develop an efficient and scalable fine-grained clustering method by improving the underlying algorithmic complexity rather than relying solely on high-end infrastructure.

In this paper, we propose a fast and efficient Fine-Grained document Clustering via Ranking (FGCR) method based on the concepts of ranking and loci for unsupervised learning. Fast here refers to short processing time, and efficient refers to small computational memory usage.


Fig. 1: k-means++ performance as k and N increase.

Fig. 2: FGCR system architecture.

Loci are dynamic cluster representations calculated from document ranking scores produced by a search engine. Since ranking scores can be efficiently calculated using a search engine, FGCR gains fast analytic value from a large volume of text data and yields a fine-grained solution, which is valuable in social media analytics.

Several datasets exhibiting a large number of groups were used in the experiments. Two datasets are generated from a real-world social media application, namely the 2013 and 2014 Flickr social event detection problems [40,43]. Other social media data crawled from Twitter, Instagram [13], Google+, and Facebook are also used in the study. Empirical analysis reveals that FGCR is not only scalable to data with a large number of instances, attributes, and clusters, but also produces an insightful and accurate clustering solution with very efficient memory usage.

FGCR brings several contributions to the area of big data research. Firstly, it generalizes the concept of loci from semi-supervised clustering [55] to unsupervised document clustering by eliminating the need for label information. This is important because the a priori information in the form of labelled documents needed in semi-supervised clustering is either costly to generate or seldom available in most big data applications. Secondly, FGCR uses the innovative approach of harvesting the current scalable technology of search engines for data analytics and is considerably easier to implement. Unlike conventional large-data clustering methods that require a significant investment in big data infrastructure, FGCR acts as an add-on to an existing search engine (as shown in Fig. 2) and delivers analytic value using only a standard machine (e.g. a personal computer). FGCR has been implemented by the authors as part of a publicly available open-source package for document clustering¹.

The speci�c contributions of this paper are as follows:

¹ https://github.com/taufikedys/FGCR


• We describe a novel approach to generate clustering via ranking that can efficiently handle millions of documents on a standard PC. Only the most relevant documents found by a ranking function are used in making clustering decisions; consequently, the need to scan the entire document set is avoided.

• We generalize the concepts of the dynamic cluster representation loci and relevant clusters to a fully unsupervised learning setting. An incremental approach with refinement is used to efficiently generate clustering solutions without the need for any label information.

• We show that the proposed approach is able to produce a fine-grained clustering solution using limited computational resources. A new scalable large-data system infrastructure is introduced where the data is distributed in a search engine and a relatively small amount of clustering computation is done on a standard machine instead of high-performance computers.

The rest of the paper is organized as follows. Section 2 reviews work related to clustering via ranking and social media data clustering. The proposed approach, its complexity, and its scalable implementation are elaborated in Section 3. A comprehensive empirical study and benchmarks on several public datasets against well-known clustering algorithms are provided in Section 4. An example of the FGCR application in analysing social media data is given in Section 5, and concluding remarks are presented in Section 6.

2 Related Work

Conventionally, organisations perform big data analytics by investing in specific infrastructure such as a cluster computer. However, recent works suggest an alternative approach: unlocking the potential of search engines for advanced analytics [5,14,54,55]. Search engines have at least two appealing properties. Firstly, they are established as a fast, cheap, robust, highly scalable and mature technology [19]. Secondly, big data analytics users usually already have their own information retrieval infrastructure, hence embedding advanced analytic modules into this infrastructure has the potential for significant savings. In the following sections, we explain how this search engine technology has been used in ranking; how various types of cluster representation are useful in dealing with the high dimensionality and volume of the data; and the state of the art in social media clustering.

2.1 Clustering and Ranking

Ranking and clustering, two closely related concepts, have been well studied in the field of information retrieval [28] as part of the Clustering for Ranking (CfR) methods. One of the early conjectures is the cluster hypothesis [23,59], which states that "similar documents are likely to be relevant to an information request (i.e. query)". Several studies have been conducted to evaluate the validity of


this conjecture [41,51,60] and to assess the use of clustering in improving retrieval performance [28,42].

The concept of Ranking for Clustering (RfC) has emerged in the last few years. RfC methods utilize a ranking function to improve the performance of a clustering process [54,55], while CfR methods use clustering to improve the quality of the retrieved documents [17,29,42,66]. In RfC methods, each document is used to generate a query, and the similar documents returned by the query help to derive and understand the relationships amongst the documents in clusters. In contrast, in an information retrieval setting, a user-specific query is used to initiate the search for the most relevant documents.

There is a clear distinction between the two approaches in terms of how the documents are clustered. An RfC method utilises the ranking scores in the clustering process, for example using patches [53] or centroid indexing [5]. CfR methods, on the other hand, use more conventional clustering methods.

Only a handful of RfC methods link the concept of ranking with clustering analysis [5,14,54,55]. Under the assumption of the reversed cluster hypothesis, Fuhr et al. [14] introduced the new evaluation metrics of pairwise recall and precision using a probabilistic ranking function, and showed the possibility of generating an optimum clustering solution given these measures and a paired set of documents and queries.

The main contribution of their work was to introduce a new framework of document clustering via ranking [14]; however, they did not propose an RfC method under this framework. Recently, we proposed two semi-supervised document clustering algorithms, CICR [54] and LSDC [55], that use ranked search results to choose a small subset of the most relevant clusters. These methods are able to identify fine-grained structures in documents in a semi-supervised setting where some of the data are labelled [54,55].

In this paper we outline a novel approach suitable for settings where labelled data is unavailable. Unlike CICR or LSDC, which use prior information present in labels, the proposed FGCR method is a fully unsupervised learning approach. Several challenges arise in applying the concept of ranking in an unsupervised fashion. Firstly, since no prior knowledge is available, the number of clusters in the data needs to be determined. Secondly, in the absence of any labels, a new formulation for clusters' loci needs to be defined; in a semi-supervised setting, a cluster's locus is defined via the label information. FGCR uses an incremental process with refinement to identify the group structures in the data, which is explained in Section 3.4.

Recently, a variant of k-means was proposed that explores the concept of ranking to improve the computational efficiency of text clustering [5]. There are several significant differences between this work and FGCR. Firstly, FGCR does not need the input parameter k (the number of clusters), a bottleneck in partitional methods due to the difficulty of estimating its value in large data. Secondly, FGCR uses an incremental approach, while k-means by ranking uses double iterative steps (WAND k-means), which is known to be computationally expensive [5]. It is also important to note that k-means by ranking is not robust


with respect to centroid initialization, and k-means++ [2] is needed for this purpose. As explained in Section 1, k-means++ [2] is not suitable for fine-grained clustering problems. Finally, as it uses conventional centroids, k-means by ranking will tend to form only spherically shaped clusters, while FGCR is able to form more arbitrary structures. This is because clusters' loci are not necessarily at the centre of clusters (as shown in Fig. 3d).

2.2 Cluster Representations

Conventionally, a cluster representation is formed by calculating central tendency measures, such as means or medoids, based on the vector representations of all documents in each cluster. In partitional clustering, clustering decisions are then made using pairwise comparisons between documents and all of the centroids (Fig. 3a). The efficacy of text clustering methods is challenged by the well-known "curse of dimensionality": it has been shown that the difference between near and far objects diminishes with the growing number of dimensions [3]. Consequently, the outcome of a pairwise comparison based on a centroid might not lead to a meaningful outcome.

Several attempts have been made to overcome this problem. For instance, CICR [54] improves efficiency by reducing the need to calculate all of the pairwise distances: clustering decisions are made by comparing distances between documents and a small number of the most relevant cluster centroids (Fig. 3b). A few recent studies have shown that the cluster representation plays an important role in clustering high-dimensional data [18,56,57]. For instance, hubs-based clustering has been reported to outperform clustering that uses conventional cluster representations such as centroid means or medoids [18,57]. A hub is identified by calculating the neighbourhood of documents within a cluster. Just like centroid-based clustering, a clustering decision in hubs-based clustering is made by pairwise comparisons of distances between documents and all of the cluster hubs (Fig. 3c).

The hubness score H_k of a document d depends on the distance metric and the k-nearest neighbours (k-NN) of data point d′, and is defined as H_k(d) = |dist(d, d′)| [18]. The computational complexity of one of the fastest k-NN algorithms for N data points is known to be Θ(pN^t), where t ∈ (1, 2) and p is the dimension of the data [7]. Notwithstanding the success of the hubs approaches in terms of their accuracy, the complexity of the hubness score calculation, in addition to the clustering complexity, currently makes the hubs-based clustering approach unfeasible for large data [57].

Intuitively, a cluster locus in the proposed FGCR method is a low-dimensional projection of a cluster hub's neighbourhood. Unlike cluster hubs (or centroids), which act as fixed cluster representations, loci are cluster representations for a specific targeted document only (i.e. d1 or d2 in Fig. 3d). Using this cluster representation, pairwise comparison calculations between a document and all of the clusters are not needed, hence efficiency is improved.

The projection in FGCR is in line with the Johnson-Lindenstrauss lemma [24], which states that a small set of points in a high-dimensional space can be


Fig. 3: Different kinds of cluster representations: (a) centroid means, (b) relevant clusters, (c) hubs, (d) loci.

embedded into a lower-dimensional space in such a way that distances between the points are nearly preserved. Fig. 3 depicts the different kinds of cluster representations. Further details on clusters' loci are given in Section 3.4.

2.3 Social Media Clustering

The proposed clustering method is applicable to any high-dimensional text data and any application that requires the generation of fine-grained clusters. Due to the wide applicability and popularity of social media data, this paper uses it as a showcase and for evaluation. In this section, we briefly present the data analytic methods used in this domain. Social media data can be analysed using a supervised learning approach such as sentiment analysis [35]. When no labelled information is present, which is usually the case, unsupervised approaches such as community detection [38] or text clustering [20] are used. Community detection is employed


to understand how the users are grouped based on their network or linkage information, such as via follower identification [38]. However, the network information may not always be available due to technical issues or complex privacy, legal, or ethical issues [58]. In cases where the required network information is not present, or when the objective of the analysis is to understand the content of the posts, text clustering can be used.

Several attempts have been made to cluster social media text data using various methods. TweetMotif [37] performs CfR on tweet data by grouping user tweets searched by topics. Another approach utilises geographical location information to detect different lexical meanings of words [10], and Rosa et al. [45] conducted clustering and classification on predefined, limited topics in social media data.

Analysing social media text data through clustering is difficult not only due to the common big data challenges of volume, velocity, and high dimensionality, but also because of its distinct challenges. These include the poor quality of content arising from very short texts with loose expressions, time sensitivity, and the large number of words that carry extrinsic information [20].

FGCR focuses on achieving efficiency and accuracy in clustering large and heterogeneous social media datasets that may contain numerous topics. FGCR generates a fine-grained clustering solution in pursuit of the meaningful discovery of useful information [9]. FGCR is a hard clustering algorithm that creates disjoint partitions of the data; in other words, a media document is placed into a single cluster. This reflects the fact that social media posts are normally short documents that contain a single topic per post. For longer documents, soft clustering or topic-based document clustering is normally preferable [4].

3 Fine-grained Document Clustering via Ranking

This section details the specific processes employed in FGCR: the document representation, the querying process, the ranking function, the FGCR algorithm, its computational complexity, and its scalable implementation.

3.1 Document and Cluster Representation

Let D = {d_1, d_2, d_3, ..., d_N} be a collection of N documents. Each document d is represented as a set of n distinct terms, notated as {t_1, t_2, ..., t_n}. The Vector Space Model (VSM) of D is denoted as V = {v_1, v_2, ..., v_N}, where a document d is represented as v(d) = v_d. FGCR supports any term weighting scheme to build the VSM for a document corpus. In this paper, we use document-length-normalized tf-idf (term frequency-inverse document frequency) weighting [63] in all of the experiments. This weighting enables the clustering algorithm to utilise the number of term occurrences in a document, the document's length,


and the term's importance in the corpus. For each term t ∈ d, the term weight is calculated as follows:

    v_t^d = [ (log(t^d) + 1) / Σ_{t∈d}(log(t^d) + 1) ] × [ n / (1 + c·n) ] × log( (N − t_D) / t_D ),    (1)

where t^d = |t ∈ d| is the number of occurrences of term t in d (i.e., term frequency). The first factor of Equation (1) is the normalized and logarithmically scaled term frequency in a document. The second factor is the document length normalization proposed in [50]; with this formulation, shorter documents tend to have higher weights than longer ones. The value of the constant c is set to the value suggested in [63], that is c = 0.00115. Finally, the last factor of Equation (1) is the inverse document frequency factor, where t_D = |{d ∈ D : t ∈ d}| is the number of documents in D that contain the term t (i.e., document frequency).

A cluster is represented as a set of triplets C^φ_{d,k} = (d, k, φ), where k is the cluster label and φ is the optimal distance between d and the closest cluster representation, as follows:

    C = {C^φ_{d,k} : d ∈ D, k ∈ ℕ, φ ∈ ℝ}.    (2)

The set of all documents and the set of all cluster labels in C can be expressed as C^−_{•,−} and C^−_{−,•} respectively. The cluster label of a particular document d is C^−_{d,∗}, and the set of cluster labels from a set of documents A ⊂ D is C^−_{A,∗}.

3.2 Document to Query Representation

Each document that needs to be clustered is represented as a query and is posed to a search engine to identify relevant documents. A document query is represented as a set of important terms (or phrases) extracted from the document. This process is akin to document summarization such as TextRank [36] but differs in its objective. In FGCR, the term extraction is not intended as a way to summarize the document's content, but to choose a small set of representative terms that will determine the relevant documents and clusters.

Selection of terms, phrases, sentences, or even paragraphs from a document for query representation can be done using the document's term weighting vector, topic models, and/or external information [14]. Additionally, known structures within the text entities, such as the excerpt, summary, title, abstract or keywords, can be utilised to formulate a query representation. The focus of this paper is to use a document query to identify a selected set of clusters; therefore, query variations and expansions are beyond the scope of this paper. FGCR utilizes a search engine to obtain various statistics on term distributions, such as t_D, that will be used in forming a query representation. Combining t_D with other local or known parameters such as t^d, N, and |d|, term weight values such as idf or tf-idf can be calculated efficiently.

Formally, let T = {(t_i, w_{t_i}) : t_i ∈ d} be the set of distinct terms and their weights generated from a document d, and let the list of paired entities in T be ordered


by term weight (i.e. w_{t_1} ≥ w_{t_2} ≥ ... ≥ w_{t_s} ≥ ... ≥ w_{t_n}). In most cases the cardinality of T is too large to form an efficient query, therefore a filtering process is conducted by choosing the s highest-weighted terms to represent a query from a document d, q_d = {t_1, t_2, ..., t_s}. The set of all queries in D is denoted as Q = {q_1, q_2, ..., q_N}. It has been reported that longer queries do not necessarily contribute to the retrieval performance [31]. Our empirical study in Section 4.6 shows similar results, where considerably short queries are sufficient to produce accurate clustering results.
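
A short sketch of this filtering step, reusing the hypothetical tfidf_weights helper from Section 3.1; s = 10 mirrors the default setting used later in Section 4.3.

    def document_to_query(doc, docs, s=10):
        T = tfidf_weights(doc, docs)                 # {term: weight} pairs
        ranked = sorted(T, key=T.get, reverse=True)  # w_t1 >= w_t2 >= ... >= w_tn
        return ranked[:s]                            # q_d = {t_1, ..., t_s}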

3.3 Document Weighting Scheme

Using a search engine and a query q, a set of at most m relevant documents and their ranking score vectors r is generated. A search engine has the ability to efficiently calculate the ranking score of a document with respect to the query using its inverted index [33]. We denote a general ranking function R as:

    R : q → S = {(d_i, r_i) : i = 1, 2, ..., m′},    (3)

where 0 ≤ m′ ≤ m. If m′ = 0, there is no relevant document found in D by the search engine for the given query q. According to our empirical study detailed in Section 4.6, the optimal size of m is between 20 and 30. This result is supported by several works in IR showing that, in real-life applications, users normally examine only around the first 20 search results [31,52].
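
For instance, with an Elasticsearch backend (the engine used in Section 4.3), the ranking function R of Equation (3) can be approximated as below. The index name "docs" and the field name "text" are illustrative assumptions, not part of the paper.

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # assumes a search engine running locally

    def rank(qd, m=30):
        # Pose q_d as a free-text query and return the top-m (id, score) pairs.
        res = es.search(index="docs", size=m,
                        body={"query": {"match": {"text": " ".join(qd)}}})
        return [(hit["_id"], hit["_score"]) for hit in res["hits"]["hits"]]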

In this paper the BM25 weighting scheme is utilized since it is widely available in most modern search engines and well known for its exceptional performance [44]. The BM25 ranking score of a document d with respect to a query q = {t_1, t_2, ..., t_s} is calculated as:

    R_BM25(q, d) = Σ_{i=1}^{s} [ t_i^d (k_1 + 1) / ( t_i^d + k_1(1 − b + b·|d|/µ_d) ) ] × log( (N − t_{D_i}) / t_{D_i} ),    (4)

where µ_d = (1/N) Σ_{d∈D} |d| is the average document length in the collection, and b and k_1 are constants. The known optimal value for b is 0.75 and k_1 ∈ [1.2, 2.0] [33].
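
A self-contained sketch of Equation (4) follows; tf and df are hypothetical callables giving term and document frequencies, since in practice the search engine computes this score internally from its inverted index.

    import math

    def bm25(query, doc, tf, df, N, mu_d, k1=1.2, b=0.75):
        # query: list of terms t_1..t_s; doc: list of tokens; mu_d: average doc length
        score = 0.0
        for t in query:
            f, n_t = tf(t, doc), df(t)
            if f == 0 or n_t == 0:
                continue
            tf_part = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / mu_d))
            idf_part = math.log((N - n_t) / n_t)
            score += tf_part * idf_part
        return score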

3.4 The FGCR Algorithm

FGCR assumes the reversed form of the cluster hypothesis [14]: the relevant documents returned in response to a query tend to be similar to one another. FGCR uses a combination of the loci and relevant-clusters concepts to efficiently form clusters. The use of loci makes the computation of cluster representations efficient, since it uses only a small set of documents instead of all documents in a cluster. By utilizing the relevant-cluster concept, FGCR avoids pairwise similarity comparisons between a document and all of the clusters. Together, these approaches allow FGCR to generate a fine-grained clustering solution efficiently. Algorithm 1 details the overall process.


Algorithm 1: FGCR Algorithm.

    input : Indexed documents D, documents' vector representation V, a set of
            documents' queries Q, and query return size m.
    output: Disjoint K partitions of D.
    Initialize: C ← ∅, D_refine ← ∅ and K ← 0
    for d ∈ D do
        S ← R(Q_d) = {d_i : i = 1, 2, ..., m, d_i ≠ d}         // Search results
        S^ℓ ← S ∩ C^−_{(•,−)}                                  // Clustered relevant documents
        if |S^ℓ| ≠ |S| then
            D_refine ← D_refine ∪ {d}                          // d included in the refinement
        end
        if S^ℓ = ∅ then
            K ← K + 1                                          // Increment the number of clusters
            C ← C ∪ C^∞_{(d,K)}                                // d forms a new cluster
        else
            R^c ← {k : k ∈ C^−_{(S^ℓ,∗)}}                      // Relevant clusters
            ℓ^d_k ← {p : p ∈ S^ℓ, C^−_{(p,∗)} = k}, k ∈ R^c    // Loci sets of d
            ℓ̄^d_k ← { Σ_{p∈ℓ^d_k} v_p / |ℓ^d_k| : k ∈ R^c }    // Loci's means
            φ* ← min_k{φ(d, ℓ̄^d_k) : k ∈ R^c}
            C ← C ∪ C^{φ*}_{(d,k*)}
        end
    end
    for d ∈ D_refine do
        S ← R(Q_d) = {d_i : i = 1, 2, ..., m, d_i ≠ d}
        if S ≠ ∅ then
            Calculate R^c, ℓ^d_k, ℓ̄^d_k, and φ* as in the previous step.
            if φ* < C^*_{(d,−)} then
                C ← C ∪ C^{φ*}_{(d,k*)}                        // Updating the clustering decision
            end
        end
    end
    return C
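
A condensed Python rendering of Algorithm 1 is given below. It is a sketch, not the authors' implementation: queries, vectors, and rank stand in for Q, V, and the search-engine ranking function R, and documents are addressed by id.

    import numpy as np

    def fgcr(doc_ids, queries, vectors, rank, m=30):
        labels, phi = {}, {}            # C: per-document cluster label and distance
        refine, K = [], 0
        for d in doc_ids:               # incremental phase
            S = [i for i, _ in rank(queries[d], m) if i != d]
            S_l = [i for i in S if i in labels]        # clustered relevant documents
            if len(S_l) != len(S):
                refine.append(d)                       # revisit d later
            if not S_l:
                K += 1                                 # d forms a new singleton cluster
                labels[d], phi[d] = K, np.inf
            else:
                labels[d], phi[d] = closest_locus(d, S_l, labels, vectors)
        for d in refine:                # refinement phase: now S^l = S
            S = [i for i, _ in rank(queries[d], m) if i != d]
            if S:
                k_star, phi_star = closest_locus(d, S, labels, vectors)
                if phi_star < phi[d]:                  # move d only if it improves
                    labels[d], phi[d] = k_star, phi_star
        return labels

    def closest_locus(d, S_l, labels, vectors):
        # Group labelled relevant documents by cluster (the loci of d), then
        # choose the cluster whose locus mean is nearest by cosine distance.
        loci = {}
        for p in S_l:
            loci.setdefault(labels[p], []).append(vectors[p])
        v = vectors[d]
        best_k, best_phi = None, np.inf
        for k, vs in loci.items():
            mean = np.mean(vs, axis=0)
            dist = 1 - v @ mean / (np.linalg.norm(v) * np.linalg.norm(mean))
            if dist < best_phi:
                best_k, best_phi = k, dist
        return best_k, best_phi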

Finding a new cluster: Given a set of indexed documents D, the clustering process starts by identifying a set of relevant documents S for a document query q_d using a search engine. Documents in the set S are labelled using the existing cluster information (if any exists) as S^ℓ = S ∩ C^−_{(•,−)}. If S^ℓ = {}, a new singleton cluster is formed with d. This indicates that the search engine has not found any relevant documents in the dataset with regard to query q_d,

` = {},a new singleton cluster is formed with d. This indicates that the search enginehas not found any relevant documents in the dataset with regards to query qd


or that no documents relevant to q_d have yet been clustered by FGCR.

Finding relevant clusters: If the query q_d returns labelled documents (i.e. S^ℓ ≠ {}), the task is to identify the closest cluster to which the document d should belong amongst the set of relevant clusters. A set of relevant clusters is defined as follows:

Definition 1. (Relevant Clusters) A set of relevant clusters R^c to a document d is the set of clusters that can be identified within the set of top-m most relevant documents S to q_d. Formally, it is defined as:

    R^c = {k : k ∈ C^−_{S^ℓ,∗}},    (5)

where S^ℓ is a non-empty subset of labelled documents and C^−_{S^ℓ,∗} is the set of associated cluster labels.

Finding cluster loci: A cluster locus for a document d is defined using the set of documents relevant to d that are present inside the cluster. In genetics, a locus (plural loci) is the particular position in a chromosome of a gene that controls a certain trait in a living organism [15] (illustrated in Fig. 4). The trait in the locus can correspond to the height of a plant or the smoothness of the plant's seed [15].

Fig. 4: The concept of locus in a chromosome.

Inspired by this natural phenomenon, a cluster's locus in FGCR is defined as the subset of its documents that is most relevant to a targeted document. Formally, a cluster locus is defined as follows.

Definition 2. (Cluster Locus) The loci of a document d are the distinct sets of documents present in each of its relevant clusters R^c. The locus of d in a relevant cluster k is formally defined as:

    ℓ^d_k = {d : d ∈ C^−_{S^ℓ,k}}, k ∈ R^c.    (6)

The trait in the FGCR cluster's locus is characterized by the keywords used to represent the document as a query (q_d). The location of the locus is then


represented by the labelled documents that are relevant to q_d (as shown in Fig. 3d). It is to be noted that each document in the dataset has its own cluster loci, analogous to how different traits have different locus locations in a chromosome. Two documents in the same cluster can have different loci. For this reason we refer to loci as dynamic cluster representations, since the cluster loci of a document d are dynamically calculated depending on the label information of the documents most relevant to q_d.

Assigning d to a cluster: Cluster assignment in FGCR is driven by the cluster hypothesis, which states that "closely associated documents tend to be relevant to the same requests" [23,59], and its reversed form, which states that "documents relevant to the same queries should occur in the same cluster" [14]. This leads us to infer that documents in S can be considered similar to q_d (or to d). The document ranking scores in S can be treated as a similarity measure between the query document d and the returned documents in S. The relevant clusters and loci formed from the documents in S can be utilised to assign a document d to a cluster.

Given a set of relevant clusters k ∈ R^c and their corresponding loci ℓ^d_k, each locus mean can be calculated as:

    ℓ̄^d_k = ( Σ_{p∈ℓ^d_k} v_p ) / |ℓ^d_k|, k ∈ R^c,    (7)

where |ℓ^d_k| is the size of locus k of d. The clustering decision is made as the solution of the following simple optimization:

    min_k { φ(d, ℓ̄^d_k) : k ∈ R^c },    (8)

where φ is a dissimilarity function. A document d is assigned to a cluster according to the label information of the closest relevant cluster. We emphasize that |R^c| is relatively small (as shown in Fig. 3d) compared to the total number of clusters in the solution; hence the optimization in (8) can be solved efficiently without the need for a sophisticated optimization solver. We use the most commonly used cosine function as φ between two vectors u, v ∈ ℝ^n: φ_cos(u, v) = 1 − (u·v) / (‖u‖‖v‖).

Refinement stage: Considering each document as a query, all documents are assigned to clusters in an incremental fashion. Therefore, the number of labelled documents in S for a document query q_d may vary according to the input order. There are three possibilities for the number of labelled documents in S, as illustrated in Fig. 5. (1) S may contain only unlabelled documents (Fig. 5a). This may occur at the very beginning of the FGCR process, when documents in the collection have not yet been assigned cluster labels. As explained before, in this situation d forms a new singleton cluster. However, this clustering decision is temporary and will be revisited at the end of the incremental process once all documents in D have label information. (2) S contains some labelled documents (i.e. S^ℓ ⊂ S) (Fig. 5b). In this case, the clustering decision will be revisited to ensure d has been correctly assigned to a cluster. (3) S contains only labelled documents (i.e. S^ℓ = S, illustrated in Fig. 5c). The refinement process is not applied to documents that fall into this scenario.

Fig. 5: Three possibilities of cluster information in the set of relevant documents S: (a) S^ℓ = {}; (b) S^ℓ ⊂ S; (c) S^ℓ = S.

Documents that fall into the first two conditions (i.e. S^ℓ ≠ S) are included in the refinement phase for cluster reassignment. Once the incremental phase is completed, the refinement phase is applied to these documents in D. The refinement process in FGCR is done in a similar fashion to the incremental step. At the end of the incremental phase, each document has a cluster label assigned; therefore each document query in the refinement process will have S^ℓ = S. Additionally, the previous optimal distance of d to its closest locus (C^{φ*}_{d,−}) is available from the incremental process. This value is used in the refinement process to decide whether a document d should be moved to another cluster or remain in its previous cluster.

3.5 Complexity and Scalable Implementation

The FGCR algorithm is an incremental clustering algorithm that sequentially assigns each document to a cluster. The main procedures in FGCR are obtaining the relevant document set S for a document query, the formation of cluster loci (Equation (6)), and the optimization process (Equation (8)). The latter two processes mainly depend on m (i.e. the size of S). Therefore FGCR's asymptotic complexity is O(tmN), where t is the querying time to form S. As shown in the next section, a relatively small m is sufficient for FGCR. The optimization in Equation (8) incurs only a small computational cost, as the cardinality of R^c is less than or equal to m. Since both t and m are small constants, the complexity of FGCR is linear in the number of documents (N). Notably, FGCR's complexity is not affected by the total number of clusters in the data (k): FGCR only uses cluster information from R^c to produce clustering decisions. For this reason, FGCR is suitable for clustering datasets with a large number of clusters.

FGCR's performance partly depends on the underlying search engine's performance in returning relevant documents for a query document to generate S. Search engine technology is well established, and research has shown exceptional computational performance: using a distributed index, some search engines are capable of processing up to millions of queries per second [1]. Although a discussion of search engine performance is beyond the scope of this paper, we present some of the measures that were employed to enhance FGCR's performance. (1) Most search engines support batch queries; using a batch query process, the processing time to get relevant documents was significantly reduced. (2) By storing the query results executed by the search engine prior to the clustering process (i.e. cached in memory, stored on disk, or in a database), the search process becomes a simple atomic (id) look-up instead of a full-text search; a hash-table look-up normally has O(1) complexity. (3) The communication cost between the clustering machine and the search engine can also be minimized. FGCR only needs the document identities (ids) returned by a search engine in response to a query, assuming that the top-m returned documents are sorted by relevance score; it does not explicitly need the document ranking scores. By explicitly excluding unnecessary meta-information (e.g. attribute values or ranking scores), the network communication cost was reduced.

FGCR can efficiently run on a standard machine (e.g. a PC) connected to a search engine instead of on conventional large-data machines such as cluster computers or HPCs (as illustrated in Fig. 2). In each document clustering decision, the clustering machine (PC) only processes a small subset of the most relevant documents fetched from the search engine. This way the PC does not need a lot of computational resources to execute FGCR, even for a large document collection. The search engine, running effectively on another machine, provides the relevant document information to the PC efficiently. The combination of these approaches results in a scalable, efficient, and cost-effective document clustering system for knowledge discovery.
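
The cached look-up optimization in point (2) above amounts to a dictionary in front of the search engine, as in this sketch (rank is the hypothetical search wrapper from Section 3.3):

    query_cache = {}

    def cached_rank(d, qd, m=30):
        # First call hits the search engine; later calls are O(1) id look-ups.
        if d not in query_cache:
            query_cache[d] = [i for i, _ in rank(qd, m)]  # ids only, no metadata
        return query_cache[d]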

4 Empirical Analysis

This section presents the evaluation results of FGCR applied on different datasets, benchmarks, and settings. FGCR is first tested to evaluate its scalability and its ability to discover knowledge in a fast manner. The FGCR results are then evaluated to test the validity of the produced clusters by measuring internal and external cluster validity indices. These results are compared with other well-known clustering approaches. Experiments to evaluate the fine-grained capability follow. Finally, a sensitivity analysis was conducted by carrying out experiments with different FGCR settings.


4.1 Datasets and pre-processing

Six datasets from diverse social media outlets, namely Flickr, Twitter, Instagram, Google+ and Facebook, were used in the experiments. The Flickr data were taken from the 2013 and 2014 social event detection (SED) tasks [40,43]. The Twitter and Google+ data were obtained by capturing posts related to data science, big data, and related technologies in September and October 2016. Non-empty Instagram tag data were used from anonymized media IDs in January and February 2014 [13]. The Facebook data were obtained from several of 2016's most popular public pages.

Table 1: Summary of datasets.

    Dataset     N          L        k        µ|d|    σ|d|
    SED13       432,164    64,053   20,165   21.82   27.91
    SED14       358,372    52,495   16,948   22.57   30.00
    twitter     2,686,877  367,388  54,897   13.62   5.55
    facebook    1,615,685  261,734  53,756   22.50   106.12
    Instagram   764,521    126,250  100,863  10.34   9.00
    Google+     634,344    390,665  23,845   158.83  828.66

N: total number of posts; L: data dimension; k: total number of (ground-truth) clusters; µ|d| and σ|d|: average and standard deviation of document length in the dataset.

Standard text pre-processing was applied to all datasets. English stop-words and non-alphanumeric characters were filtered and replaced by a space. Single-occurrence words and words that occurred very frequently in the documents were filtered. Finally, all text was converted to lower case.
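
A sketch of this pre-processing pipeline is shown below; the cut-off for "frequently occurring" words (here, appearing in more than half of the documents) is an assumption, as the paper does not state a threshold.

    import re
    from collections import Counter

    def preprocess(texts, stopwords, max_df_ratio=0.5):
        # Replace non-alphanumeric characters with spaces and lower-case the text.
        tokenized = [re.sub(r"[^0-9a-zA-Z]+", " ", t).lower().split() for t in texts]
        df = Counter()
        for tokens in tokenized:
            df.update(set(tokens))
        # Keep words that are not stop-words, appear in more than one document,
        # and are not too frequent across the corpus.
        keep = {t for t, f in df.items()
                if f > 1 and f <= max_df_ratio * len(texts) and t not in stopwords}
        return [[t for t in tokens if t in keep] for tokens in tokenized]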

Table 1 summarizes the pre-processed data used in the experiments. The number of clusters (k) in Table 1 is taken from the ground-truth labels for the SED13 and SED14 datasets, while for the other datasets it is the number of clusters produced by FGCR. The large number of ground-truth labels in SED13 and SED14 (and of clusters in the other datasets) ascertains the fine-grained nature of social media data. The large number of clusters results from users posting a variety of contents (topics) to their social media accounts; a large number of clusters is also reported to form due to the excess noise present in social media data [65]. The number of clusters in the Instagram data is notably high. This is due to the nature of tag data, which contain joined English terms or names (e.g. #BigData or #DataScience). We did not process these tags with natural language processing because our objective was to evaluate the performance of FGCR; the Instagram data was included to test FGCR's ability to create a fine-grained clustering solution.


4.2 Evaluation Measures

We propose Dispersion (DS) as an internal evaluation score for fine-grained clustering problems. DS measures how cluster sizes are distributed in the clustering solution. An optimal fine-grained clustering solution will have a minimal DS value, indicating that cluster sizes are evenly spread. Clustering solutions with a few large clusters and diverse cluster sizes will have higher DS values than clustering solutions with similarly sized clusters.

DS is calculated as the logarithm of the maximum average distance between documents and their centroid, multiplied by the coefficient of variation of cluster sizes. Formally, DS is calculated as:

    DS = log( max{ (1/|C_i|) Σ_{d∈C_i} |v_d − C̄_i| : i = 1, 2, ..., k } × √(Σ_{i=1}^{k} (|C_i| − µ_C)²) / (µ_C √k) ),    (9)

where k is the total number of clusters, |C_i| is the cardinality of cluster i, C̄_i is the centroid of cluster i, and µ_C = (Σ_{i=1}^{k} |C_i|) / k is the average cluster size.

Since DS is expected to be minimal for a fine-grained clustering solution, the first term of Equation (9) is a penalty factor that gives a higher value to clustering results with large cluster(s). The second term of Equation (9) measures the relative variability of cluster sizes: its value is minimal if cluster sizes are homogeneous and higher if cluster sizes are varied.
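
A NumPy sketch of Equation (9); clusters is assumed to map each cluster id to the matrix of its documents' vectors (an in-memory layout chosen for illustration).

    import numpy as np

    def dispersion(clusters):
        sizes = np.array([len(V) for V in clusters.values()], dtype=float)
        k, mu_c = len(sizes), sizes.mean()
        # Penalty factor: largest average document-to-centroid distance.
        penalty = max(np.linalg.norm(V - V.mean(axis=0), axis=1).mean()
                      for V in clusters.values())
        # Coefficient of variation of cluster sizes.
        cv = np.sqrt(((sizes - mu_c) ** 2).sum()) / (mu_c * np.sqrt(k))
        return np.log(penalty * cv)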

Because DS favours a fine-grained clustering solution, when ground-truth values were not available we used the total number of clusters k generated by FGCR as the input parameter for all other benchmarked clustering algorithms that require k. By doing this, all clustering algorithms are expected to produce approximately the same number of clusters (k), so the dispersion evaluation can be made objectively.

The external evaluation measures used are the F1-score and normalized mutual information (NMI) [32]. Let TP be the number of similar pairs of documents assigned to the same cluster, TN the number of dissimilar pairs assigned to different clusters, FP the number of pairs of dissimilar documents in the same cluster, and FN the number of pairs of similar documents assigned to different clusters. The F1-score is then calculated as:

    F1 = 2PR / (P + R), where P = TP / (TP + FP) and R = TP / (TP + FN).    (10)

NMI is used to measure the clustering quality and is calculated as the ratio between the mutual information (I) and the entropies of the given labels (H(C¹)) and the clustering results (H(C²)):

    NMI = 2I / (H(C¹) + H(C²)).    (11)

The mutual information (I) is calculated using the maximum likelihood estimates of the probabilities of a document belonging to a labelled group or cluster:

    I = Σ_k Σ_j (|c_k¹ ∩ c_j²| / N) log( N·|c_k¹ ∩ c_j²| / (|c_k¹||c_j²|) ).    (12)

H is the entropy, calculated as:

    H(C^j) = −Σ_k (|c_k^j| / N) log(|c_k^j| / N), for j = 1, 2.    (13)

4.3 Benchmarks and Experimental Setup

In order to examine the distinctive clustering results between FGCR and other clustering methods, we compared FGCR with various types of well-established clustering algorithms, as follows:

• centroid-based partitional clustering algorithms such as standard k-means (km) [30], k-means++ (km++) [2], and web-scale or mini-batch k-means (Mbkm) [46],

• density-based clustering such as DBSCAN [11],

• spectral clustering [47],

• Ward hierarchical clustering (Ward) [62], and

• text clustering methods such as topic-based clustering using LDA [4] (followed by k-means++) and non-negative matrix factorization (NMF) [64].

A comparison with k-means by ranking [5] was not conducted because it requires k-means++ for its initialization; hence its overall computational performance is clearly lower than that of k-means++ alone. Similarly, since our study focused on scalability, a comparison with hubs-based clustering, which requires hubness-score calculations followed by a k-means-like algorithm [56], was also not performed.

Since FGCR does not use the number of clusters (k) as an input parameter, as previously mentioned we used k from the FGCR output as the input parameter for the benchmarked methods that require it, except for DBSCAN, which derives the total number of clusters naturally. FGCR was set to its default settings, i.e., idf query weighting, maximum query result size m = 30, and maximum query length s = 10. We present the sensitivity analysis in Section 4.6.

Experiments were done using Python 3.5 on a 2.66 GHz 64-bit processor. The Elasticsearch engine² was used to find relevant documents for a document query. Clustering was done in a serial (single-processor) setting; no parallel processing was utilized in any of the clustering algorithms. In all of the experiments we set a threshold on the maximum CPU time of 3.6 × 10⁶ seconds and on memory usage of 128 GB. The benchmarked methods were implemented using the well-known standard Scikit-Learn module [39]. Visualizations were generated using Matplotlib [21] and Voyant Tools [49].

2 https://www.elastic.co


The main objective of the experiments in the following subsection is to examine the computational performance of our proposed clustering algorithm and the benchmarked algorithms. The validity of the clustering results via external metrics, and their characteristics through internal metrics and visualizations, follow. Finally, a sensitivity analysis of the proposed algorithm is given.

4.4 Computational Performance and Evaluations

As elaborated in Section 3.5, one of the main FGCR features is its linear complexity, which is not affected by the number of clusters (k) in the data. To demonstrate FGCR's scalability (measured as processor running time and peak memory usage), we used incremental sampling (of 10,000 records) on the largest dataset in Table 1 (i.e. the twitter data).

Fig. 6 exhibits FGCR's running time on the incremental data. It shows that FGCR's running time grows linearly with the data size N and is unaffected (i.e. constant) with regard to the number of clusters (k) in the data. This result is in line with the complexity analysis given in Section 3.5.

Fig. 6: FGCR computational time on the incremental twitter data.

Fig. 7 shows that, with regard to computational time and memory consumption, FGCR outperforms all of the benchmarked algorithms as the data size increases. The k-means family of algorithms is faster for smaller data sizes (and cluster counts), but becomes significantly slower after around N = 120,000. Similarly, DBSCAN is significantly faster initially, but becomes slower than FGCR as the data grow. Furthermore, the memory usage of DBSCAN grows rapidly and surpasses the memory limit before the data reach N = 150,000. Although memory usage depends on the software implementation, the tendency of the memory usage to grow rapidly is likely to be similar. Moreover,


FGCR's memory usage grows slowly with the number of records N and the number of clusters k (as shown in Fig. 7b).

Fig. 7: Performance comparison between FGCR and the other benchmarked algorithms: (a) CPU time comparison; (b) peak memory usage. Due to the fast-growing disparity between the values, both figures are plotted on a logarithmic scale. Crosses on the chart mark processes aborted due to reaching the limits set out in Section 4.3.

The memory consumption of FGCR is significantly lower than that of Mbkm. Despite the use of mini-batches to improve performance, Mbkm still needs to store, recalculate, and update all of the centroids' information, while FGCR only temporarily stores information for a small set of the most relevant clusters (R^c). As the data size and the number of clusters grow, the computation needed by Mbkm increases rapidly. In contrast, FGCR portrays linear complexity with regard to N and is not affected by the number of clusters k.

The previous results show that Mbkm has the closest computational performance to FGCR. To elaborate further, these two methods were compared on the full-size datasets. For the datasets with ground-truth labels, the comparison is summarized in Table 2, while Table 3 summarizes the comparison for the datasets without ground truth.

Table 2: Clustering results on full-size datasets with ground-truth values.

    Dataset  N        k       Method  DS    F1     NMI    Time(s)  Mem(MB)
    SED13    432,164  13,886  FGCR    3.78  0.929  0.708  656      986
                              Mbkm    6.26  0.846  0.664  5,914    8,141
    SED14    358,372  9,532   FGCR    3.62  0.929  0.710  522      845
                              Mbkm    6.27  0.821  0.653  5,914    5,625


The results in Table 2 show that FGCR is not only faster and uses less computational memory, but also produces higher clustering accuracy (higher F1-score) and better clustering quality (NMI). FGCR also shows better cluster dispersion, indicating that documents are spread evenly across all clusters instead of forming a few big clusters and many very small ones. Similar results are shown on the datasets without ground-truth values (Table 3). It is apparent from these results that the difference in computational efficiency between FGCR and Mbkm grows as the data size N and the number of clusters k increase. Although Mbkm is able to cluster the largest dataset (twitter), it cannot cluster the Instagram data within the given time limit; the Instagram data is around one third the size of the twitter data, but has twice as many clusters. This result indicates the capability of FGCR in generating fine-grained clustering solutions.

Table 3: Clustering results on full-size datasets without ground-truth values.

    Dataset    N          k        Method  DS    Time(s)      Mem(MB)
    twitter    2,686,877  54,897   FGCR    5.67  70,987       4,238
                                   Mbkm    6.13  1,866,613    16,260
    Instagram  764,521    100,863  FGCR    5.96  8,274        2,314
                                   Mbkm    -     > 3,600,000  18,898
    Google+    634,344    23,845   FGCR    4.90  12,430       2,588
                                   Mbkm    5.73  1,063,852    14,182
    facebook   1,615,685  53,756   FGCR    6.17  30,499       3,675
                                   Mbkm    6.04  2,332,020    16,415

4.5 Internal Evaluations

In order to allow all of the benchmarked methods to produce evaluations, the experiments in this section were done using randomly selected samples of 25,000 documents (10-fold cross-validated) from all of the datasets detailed in Table 1. These datasets are denoted with * in their names.

As shown by the average dispersion (DS) values in Fig. 8, FGCR distinctively creates relatively uniform and fine-grained clusters. The NMF dispersion for the *Instagram data is smaller than that of FGCR, but higher for the other datasets. The results show that, in general, FGCR, DBSCAN, NMF, and LDA create clusters with relatively homogeneous sizes, while the other methods produce some large clusters and many small ones.
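The exact DS formula is defined earlier in the paper; as a rough stand-in for readers who want to probe cluster-size uniformity, the sketch below uses the coefficient of variation of cluster sizes (an assumption, not the paper's measure). Lower values indicate more evenly sized clusters:

```python
import statistics
from collections import Counter

def size_dispersion(labels):
    """Coefficient of variation of cluster sizes: a simple stand-in
    for the paper's DS measure (lower = more uniform sizes)."""
    sizes = list(Counter(labels).values())
    return statistics.pstdev(sizes) / statistics.mean(sizes)

print(size_dispersion([0, 0, 1, 1, 2, 2]))  # 0.0: perfectly uniform
print(size_dispersion([0] * 9 + [1]))       # 0.8: one dominant cluster
```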

Fig. 8: Dispersion values of all benchmarked algorithms on the datasets sampled from Table 1.

To better understand this phenomenon, Fig. 9 depicts the cluster bubble visualization of the *SED14 dataset. The visualization concentrates the cluster bubble plots in a dense area; the size of each bubble corresponds to the relative size of a cluster in the clustering solution, and no topological information is used. Fig. 9 shows that FGCR is able to find most of the latent fine-grained structure within the *SED14 dataset. Fig. 9 complements the information provided in Table 2, where FGCR shows high F1-score and NMI values. Other than FGCR, the figure shows that only topic-based clustering (LDA) is able to capture the fine-grained structure; the other methods create one or two very large clusters and many small ones or singletons.
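A visualization of this kind is easy to reproduce with matplotlib [21]. The sketch below is one possible rendering, with a golden-angle spiral as an assumed stand-in for the paper's packing strategy; it draws one bubble per cluster with area proportional to cluster size:

```python
import math
from collections import Counter
import matplotlib.pyplot as plt

def bubble_plot(labels):
    """One bubble per cluster, area proportional to cluster size.
    Bubbles are placed on a simple spiral to concentrate them in a
    dense area; no topological information is encoded."""
    sizes = sorted(Counter(labels).values(), reverse=True)
    xs, ys = [], []
    for i in range(len(sizes)):
        r, theta = 0.5 * math.sqrt(i), 2.4 * i  # golden-angle spiral
        xs.append(r * math.cos(theta))
        ys.append(r * math.sin(theta))
    plt.scatter(xs, ys, s=[40 * s for s in sizes], alpha=0.5)
    plt.axis("off")
    plt.show()

bubble_plot([0] * 50 + [1] * 20 + [2] * 20 + list(range(3, 30)))
```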

Fig. 9: Cluster size visualization of the clustering results from the *SED14 dataset.

A large collection of social media data includes discussions on numerous topics and is heterogeneous in nature. FGCR is most suitable for understanding this type of data, where the total number of groupings is unknown but the groupings can be assumed to be fine-grained. It can be noted that most text clustering algorithms naturally tend to create large clusters. The reason is that as a cluster grows it is represented by more terms; hence there will be more term intersections between a targeted document and the representation of the larger cluster. A larger cluster therefore yields higher similarity and keeps getting bigger. FGCR does not have this tendency since clustering decisions are not based on all of a cluster's information: a cluster locus in FGCR is the smaller part of the cluster that is found to be most relevant to the targeted document. This makes FGCR unaffected by cluster size and, in return, cluster sizes do not tend to keep growing.
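This bias is easy to demonstrate numerically. In the hypothetical simulation below, documents drawn from a shared vocabulary are assigned to whichever cluster's accumulated term set overlaps them most; the largest cluster wins the comparison ever more often as its term set grows:

```python
import random

random.seed(0)
vocab = list(range(1000))
clusters = [set(random.sample(vocab, 20)) for _ in range(5)]  # term sets

for _ in range(500):
    doc = set(random.sample(vocab, 20))
    # Assign to the cluster whose accumulated term set overlaps most:
    # larger term sets systematically win this comparison.
    best = max(clusters, key=lambda c: len(c & doc))
    best |= doc

# The size distribution becomes highly skewed: one cluster's term set
# absorbs most of the vocabulary while the others barely grow.
print(sorted(len(c) for c in clusters))
```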

4.6 Sensitivity Analysis

In all of the experiments presented in the previous sections, the FGCR parameters were set to their defaults. This sub-section describes the testing and analysis of parameters such as the query result size (m), the query representation scheme, and the query length (s). The first set of experiments was done by varying the size of the results (m) returned by the search engine in response to a query document.

Results shown in Fig. 10a indicate that the total number of clusters produced by FGCR decreases as the query size m increases. Nevertheless, it reaches a plateau once m reaches around 30. It is important to note that all datasets showed a similar trend. A smaller m gives smaller DS values, implying more evenly distributed clusters (Fig. 10b). However, the smaller DS values are due to the large number of (small-sized) clusters, as shown in Fig. 10a. Furthermore, the running time grows rapidly as m becomes larger (Fig. 10c). This is expected, as a larger m creates a larger S and consequently involves more computation to evaluate more relevant clusters. Based on these analyses, a default m of 30 can be set for future reference.

(a) Number of clusters. (b) Cluster disparity. (c) CPU time.

Fig. 10: The effect of different query sizes (m) on FGCR.

The next set of experiments considered several query representations for Q. Various term weighting schemes were used, such as tf, idf, tf-idf, and BM25. We also made a comparison with a more sophisticated document summarization technique, TextRank [36]. Results are reported in Table 4 based on the average of the metrics generated with the sampled *SED13 and *SED14 datasets. TextRank produced more clusters than the other term weightings. When evaluated using the external metrics and the dispersion measure (DS), the idf term weight for the query generator gives the best results. Hence, in general, we recommend using the rare words of a document as its query representative. It should also be noted that the complexity of TextRank is determined not only by the number of records but also by the number of terms in every document [36]. Hence, TextRank term weighting is hardly scalable to large data.
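A query generator of this kind is simple to realize. The sketch below (illustrative names, assuming the plain idf formulation from standard IR texts [32]) represents each document by its s rarest terms:

```python
import math
from collections import Counter

def idf_query(docs, s=10):
    """Represent each document by its s terms with the highest idf,
    i.e. its rarest words, for use as the search-engine query."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df_t) for t, df_t in df.items()}
    return [sorted(set(doc), key=lambda t: -idf[t])[:s] for doc in docs]

docs = [["flood", "storm", "city"], ["storm", "rain", "city"],
        ["concert", "city", "band"]]
print(idf_query(docs, s=2))
```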

Table 4: Performance of various query representations on the sampled datasets.

Weighting      k     DS     F1     NMI
tf           838   4.073  0.335  0.828
idf          999   3.983  0.401  0.849
tf-idf       938   4.010  0.381  0.844
BM25         930   4.018  0.381  0.844
TextRank   1,643   4.463  0.385  0.844

Finally, we examined the effect of the query length (s) on clustering solution quality. Fig. 11b indicates that a smaller s gives a better DS value, but this is influenced by the larger k value shown in Fig. 11a. From Fig. 11a and Fig. 11b it is apparent that the curves reach a plateau at s around 10. Hence, we recommend this value as a default for future reference.

(a) Effect on the number of clusters (k). (b) Effect on DS values.

Fig. 11: The effect of different query lengths (s) on FGCR.

5 Usability of Clustering: an Example

With the growing influence of social media on everyday life, many individuals, applications, and organizations are keen to know the popular topics and their relatedness. In this paper, we used social media content (the text of the posts) to conduct a clustering analysis and extract this useful information. We show how the fine-grained clustering results obtained from FGCR can help users understand the dataset and use it in decision making. Fig. 12 illustrates the whole process.

A user first specifies a topic of interest to the system. Using FGCR, a set of clusters most relevant and specific to the user query is found. Cluster summarization or visualization can then be used to present the valuable information in these clusters. This helps users understand in what context this particular concept has been discussed on social media.

Fig. 12: Extracting information from the FGCR fine-grained clustering solution.

In this example, we used the FGCR clustering results on the twitter dataset. Suppose a user is interested in information related to the topic "Big Data". Using this keyword and a search engine indexed with the twitter dataset, most (but not all) of the clusters relevant to this keyword can be obtained from the clustering solution. A cluster summarization method such as TextRank [36] can then be applied to label these clusters. Fig. 13a shows multiple ways the clustering outcomes can be presented. Fig. 13b is an example of the word-links visualization [49] that might help the user understand the insights within a cluster or a set of clusters. Comparing clusters within the small set of most relevant clusters can also be done using a radar plot of the cluster keywords, as demonstrated in Fig. 13c. Finally, a general overview of one or more clusters can be visualized using a word cloud, as in Fig. 13d. This varied information can be used in multiple ways. For example, a consultant in a health company may be interested in the "wellness" clusters. Alternatively, a user may want to investigate the contents of some specific clusters. Utilizing the fine-grained FGCR solution, a user can harvest this information easily.
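A minimal sketch of this retrieval step is shown below, assuming TF-IDF cluster profiles and cosine similarity via scikit-learn [39]. The paper's actual pipeline ranks clusters through a full search-engine index, so the names and the profile construction here are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevant_clusters(cluster_texts, keyword, top=5):
    """Rank clusters against a user keyword by cosine similarity
    between TF-IDF profiles; each cluster is represented by the
    concatenated text of its documents."""
    vec = TfidfVectorizer()
    profiles = vec.fit_transform(cluster_texts)
    scores = cosine_similarity(vec.transform([keyword]), profiles)[0]
    return sorted(enumerate(scores), key=lambda x: -x[1])[:top]

clusters = ["big data hadoop analytics cloud",
            "wellness fitness health running",
            "big data privacy breach security"]
print(relevant_clusters(clusters, "big data"))
```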


(a) Most relevant clusters' topics. (b) Word links of the most relevant clusters.

(c) Radar plot of cluster characterizations. (d) Word cloud visualization.

Fig. 13: Example of analytic insights generated from FGCR clustering.

Compared to other clustering approaches, the FGCR clustering result is unique in several respects. First, a fine-grained clustering is produced; hence finding the topics of a cluster via methods such as TextRank [36] or LDA [4] can be done more efficiently, especially on large data. Performing such topic modelling on large data without using a clustering result would require far more computational resources. Second, it facilitates the search for clusters relevant to a given topic, making it easier for the user to find the most relevant clusters instead of scanning all of the clusters in the clustering result (as demonstrated in this section). This is possible due to the ranking concept used in FGCR. Finally, since FGCR is efficient with regard to both computation and memory usage, a user can obtain fast clustering results on a standard machine.

6 Conclusion and Future Work

We have introduced FGCR, a novel text clustering via ranking algorithm. FGCR creates a fine-grained clustering solution that is suitable for partitioning a large text collection with numerous topics, such as social media data. The key benefit of FGCR is that its complexity is not affected by the total number of clusters in the data. Using loci and relevant clusters, FGCR does not need to scan all of the data in order to make clustering decisions. An extensive empirical study of FGCR on several social media datasets, comparing it with other clustering algorithms, showed that the proposed method achieves significantly higher clustering quality and requires minimal computational resources.

The discussion in this paper focused on hard clustering and was limited to social media data. Nevertheless, FGCR can also be used for general text datasets. Generalizing FGCR to soft document clustering via a ranking approach that works on general text data is left for future work. Applying the loci and relevant-clusters concept to recommendation problems and incremental classification systems is also a potential subject of future study.

References

1. A. Aksyonoff. Introduction to Search with Sphinx: From installation to relevance tuning. O'Reilly, 2011.

2. D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035, 2007.

3. K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In Database theory – ICDT '99, pages 217–235. Springer, 1999.

4. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

5. A. Broder, L. Garcia-Pueyo, V. Josifovski, S. Vassilvitskii, and S. Venkatesan. Scalable k-means by ranked retrieval. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM '14, pages 233–242, New York, NY, USA, 2014.

6. H. Chen, R. H. Chiang, and V. C. Storey. Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4):1165–1188, 2012.

7. J. Chen, H.-r. Fang, and Y. Saad. Fast approximate kNN graph construction for high dimensional data via recursive Lanczos bisection. The Journal of Machine Learning Research, 10:1989–2012, 2009.

8. C. M. De Vries, L. De Vine, S. Geva, and R. Nayak. Parallel streaming signature EM-tree: A clustering algorithm for web scale applications. In Proceedings of the 24th International Conference on World Wide Web, pages 216–226. International World Wide Web Conferences Steering Committee, 2015.

9. B. Dorow. A graph model for words and their meanings. PhD thesis, Institut für Maschinelle Sprachverarbeitung der Universität Stuttgart, 2006.

10. J. Eisenstein, B. O'Connor, N. A. Smith, and E. P. Xing. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1277–1287. Association for Computational Linguistics, 2010.

11. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.


12. A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3):267–279, 2014.

13. E. Ferrara, R. Interdonato, and A. Tagarelli. Online popularity and topical interests through the lens of Instagram. In Proceedings of the 25th ACM Conference on Hypertext and Social Media, pages 24–34. ACM, 2014.

14. N. Fuhr, M. Lechtenfeld, B. Stein, and T. Gollub. The optimum clustering framework: Implementing the cluster hypothesis. Inf. Retr., 15(2):93–115, Apr. 2012.

15. M. Gellman and J. R. Turner. Encyclopedia of Behavioral Medicine. Springer, 2013.

16. W. He, S. Zha, and L. Li. Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33(3):464–472, 2013.

17. M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, pages 76–84, New York, NY, USA, 1996.

18. J. Hou and R. Nayak. The heterogeneous cluster ensemble method using hubness for clustering text documents. In WISE 2013, pages 102–110. Springer Berlin Heidelberg, 2013.

19. H. Hu, Y. Wen, T.-S. Chua, and X. Li. Toward scalable systems for big data analytics: A technology tutorial. IEEE Access, 2:652–687, 2014.

20. X. Hu and H. Liu. Text Analytics in Social Media, pages 385–414. Springer US, Boston, MA, 2012.

21. J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

22. A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

23. N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7(5):217–240, 1971.

24. W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

25. A. Katal, M. Wazid, and R. Goudar. Big data: Issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on, pages 404–409. IEEE, 2013.

26. F. Klawonn, F. Höppner, and B. Jayaram. What are clusters in high dimensions and are they difficult to find? In International Workshop on Clustering High-Dimensional Data, pages 14–33. Springer, 2012.

27. H.-P. Kriegel, P. Kröger, and A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1):1, 2009.

28. O. Kurland. The cluster hypothesis in information retrieval. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 1126–1126, New York, NY, USA, 2013.

29. A. Leuski. Evaluating document clustering for interactive information retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management, pages 33–40. ACM, 2001.

30. S. P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.


31. R. M. Losee and L. A. H. Paris. Measuring search-engine quality and query difficulty: Ranking with target and freestyle. Journal of the Association for Information Science and Technology, 50(10):882, 1999.

32. C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

33. C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.

34. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178. ACM, 2000.

35. W. Medhat, A. Hassan, and H. Korashy. Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4):1093–1113, 2014.

36. R. Mihalcea and P. Tarau. TextRank: Bringing order into texts. In Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 2004.

37. B. O'Connor, M. Krieger, and D. Ahn. TweetMotif: Exploratory search and topic summarization for Twitter. In ICWSM, 2010.

38. S. Papadopoulos, Y. Kompatsiaris, A. Vakali, and P. Spyridonos. Community detection in social media. Data Mining and Knowledge Discovery, 24(3):515–554, 2012.

39. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

40. G. Petkos, S. Papadopoulos, V. Mezaris, and Y. Kompatsiaris. Social event detection at MediaEval 2014: Challenges, datasets, and evaluation. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, 2014.

41. F. Raiber and O. Kurland. Exploring the cluster hypothesis, and cluster-based retrieval, over the web. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 2507–2510. ACM, 2012.

42. F. Raiber and O. Kurland. Ranking document clusters using Markov random fields. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 333–342, 2013.

43. T. Reuter, S. Papadopoulos, G. Petkos, V. Mezaris, Y. Kompatsiaris, P. Cimiano, C. de Vries, and S. Geva. Social event detection at MediaEval 2013: Challenges, datasets, and evaluation. In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, 2013.

44. S. E. Robertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146, 1976.

45. K. D. Rosa, R. Shah, B. Lin, A. Gershman, and R. Frederking. Topical clustering of tweets. In Proceedings of the ACM SIGIR: SWSM, 2011.

46. D. Sculley. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, pages 1177–1178. ACM, 2010.

47. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

48. A. Shirkhorshidi, S. Aghabozorgi, T. Wah, and T. Herawan. Big data clustering: A review. In B. Murgante, S. Misra, A. Rocha, C. Torre, J. Rocha, M. Falcão, D. Taniar, B. Apduhan, and O. Gervasi, editors, Computational Science and Its Applications – ICCSA 2014, volume 8583 of Lecture Notes in Computer Science, pages 707–720. Springer International Publishing, 2014.


49. S. Sinclair, G. Rockwell, and the Voyant Tools Team. Voyant Tools (web application). http://voyant-tools.org/, 2012.

50. A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, pages 21–29, New York, NY, USA, 1996.

51. M. D. Smucker and J. Allan. A new measure of the cluster hypothesis. In Conference on the Theory of Information Retrieval, pages 281–288. Springer, 2009.

52. A. Spink, D. Wolfram, M. B. Jansen, and T. Saracevic. Searching the web: The public and their queries. Journal of the Association for Information Science and Technology, 52(3):226–234, 2001.

53. T. Sutanto and R. Nayak. Ranking based clustering for social event detection. In Working Notes Proceedings of the MediaEval 2014 Workshop, volume 1263, pages 1–2. CEUR Workshop Proceedings, 2014.

54. T. Sutanto and R. Nayak. The ranking based constrained document clustering method and its application to social event detection. In Database Systems for Advanced Applications, pages 47–60. Springer, 2014.

55. T. Sutanto and R. Nayak. Semi-supervised document clustering via loci. In J. Wang, W. Cellary, D. Wang, H. Wang, S.-C. Chen, T. Li, and Y. Zhang, editors, Web Information Systems Engineering – WISE 2015, volume 9419 of Lecture Notes in Computer Science, pages 208–215. Springer International Publishing, 2015.

56. N. Tomašev, M. Radovanović, D. Mladenić, and M. Ivanović. The role of hubness in clustering high-dimensional data. In Advances in Knowledge Discovery and Data Mining, pages 183–195. Springer, 2011.

57. N. Tomašev, M. Radovanović, D. Mladenić, and M. Ivanović. Hubness-based fuzzy measures for high-dimensional k-nearest neighbor classification. International Journal of Machine Learning and Cybernetics, 5(3):445–458, 2014.

58. S. Trepte and L. Reinecke. Privacy Online: Perspectives on Privacy and Self-Disclosure in the Social Web. Springer Science & Business Media, 2011.

59. C. van Rijsbergen. Information Retrieval. London: Butterworths, 2nd edition, 1979.

60. E. M. Voorhees. The cluster hypothesis revisited. In Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 188–196, 1985.

61. C. Wang, S. S. M. Chow, Q. Wang, K. Ren, and W. Lou. Privacy-preserving public auditing for secure cloud storage. IEEE Transactions on Computers, 62(2):362–375, Feb 2013.

62. J. H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.

63. M. Widenius and D. Axmark. MySQL Reference Manual: Documentation from the Source. O'Reilly Media, Inc., 2002.

64. W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273. ACM, 2003.

65. J. Yin, S. Karimi, A. Lampert, M. Cameron, B. Robinson, and R. Power. Using social media to enhance emergency situation awareness. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pages 4234–4238. AAAI Press, 2015.


66. O. Zamir and O. Etzioni. Grouper: A dynamic clustering interface to web search results. Computer Networks, 31(11):1361–1374, 1999.