Identifying Overlapping and Hierarchical Thematic

Identifying Overlapping and Hierarchical Thematic Structures

in Networks of Scholarly Papers

A Comparison of Three Approaches

Frank Havemann1lowast Jochen Glaser2 Michael Heinz1 Alexander Struck1

November 3 2021

1 Institut fur Bibliotheks- undInformationswissenschaft Humboldt-Universitatzu Berlin Berlin Germany2 Zentrum Technik und Gesellschaft TechnischeUniversitat Berlin Berlin Germanylowast E-mail Frank (dot) Havemann (at)ibihu-berlinde

Abstract

We implemented three recently proposed ap-proaches to the identification of overlapping andhierarchical substructures in graphs and appliedthe corresponding algorithms to a network of 492information-science papers coupled via their citedsources The thematic substructures obtained andoverlaps produced by the three hierarchical clusteralgorithms were compared to a content-based cate-gorisation which we based on the interpretation oftitles and keywords We defined sets of papers deal-ing with three topics located on different levels ofaggregation h-index webometrics and bibliomet-rics We identified these topics with branches inthe dendrograms produced by the three cluster al-gorithms and compared the overlapping topics theydetected with one another and with the three pre-defined paper sets We discuss the advantages anddrawbacks of applying the three approaches to pa-per networks in research fields

1 Introduction

The delineation of scientific fields is a pertinentproblem of science studies in general and bibliomet-rics in particular (cf eg van Raan 2004 [1 p

39]) Bibliometric research has shown that clustersin networks of papers do not have natural bound-aries (cf Zitt et al 2005 [2]) This is why fieldsmust be delineated by applying thresholds for pa-rameters which are chosen arbitrarily in terms oflsquogood structuresrsquo for the purposes of the analysis athand (cf eg references [3 4])

However the problem of delineation might be aconsequence of the overlap of thematic structuresThe overlap of themes in publications is well knownto science studies Sullivan et al (1977) [5 p 235]observed that in the literature of the field of weakinteraction half of the references were articles out-side the specialty Amsterdamska and Leydesdorff(1989) [6 p 461] provide an example of an arti-cle that targeted two different specialties at onceIf disjoint clusters of co-cited sources (Marshakova1973 [7] Small 1973 [8]) are projected forward totheir citing papers the clusters of citing papersinevitably overlapmdasha phenomenon that has neverbeen explored by bibliometrics Taken togetherthese observations suggest that the sciences con-sist of numerous fields of different sizes that par-tially or totally overlap ie feature hierarchiesas well as mutually overlapping lsquoneighboursrsquo withfuzzy boundaries

If thematic structures have boundaries that arehidden by their overlaps delineation is not impos-sible in principle but rather depends on tools thatenable the identification of overlapping fields andtopics

So far only one such tool namely co-citationanalysis has been applied to the delineation taskHowever it assumes disjoint source clusters and lo-cates thematic overlaps only in citing papers Thisunrealistic assumption makes it unsuitable to de-

tect overlapping topics An attempt to obtain over-lapping thematic structures by singular value de-composition (SVD) of paper-source matrices failedbecause SVD producesmdashat least if arbitrary thresh-olds are avoidedmdashas many thematic substructuresas there are papers [9] which again is an unrealisticassumption

The aim of this paper is to introduce and pre-liminarily assess three algorithms for the identifi-cation of overlapping thematic structures in net-works of papers We derived these algorithms fromthree recently proposed approaches to the detec-tion of overlapping and hierarchical substructuresin networksmdashwhich in network analysis are calledcommunities For a concise description of the cur-rent state of finding communities in networks seethe introduction of reference [10] Our selectionand specification of the general approaches is basedon the assumption that the thematic substructuresboth overlap and build hierarchies

We further had to take into account the infor-mation utilised by the different approaches The-matic structures can be determined top-down us-ing global information or bottom-up using eitherglobal and local or only local information Thiscorresponds to different ways in which scientific per-spectives are used in the construction of thematicstructures Since the production of contributionsto scientific knowledge is based on the interpreta-tion of that knowledge by individual producers [11]thematic structures in paper sets are always con-structed from the individual perspectives of the au-thors A bottom-up approach using only local in-formation enables the reconstruction of thematicstructures from the perspective of those contribut-ing knowledge to these themes The use of global in-formation in the top-down or bottom-up construc-tion of thematic structures eg by spectral andmodularity-based methods [12 p 41 p 27] isakin to including the perspective of lsquooutsidersrsquo ieof authorspapers not contributing to the specifictopic Such a lsquodemocraticrsquo procedure can be justi-fied as well but is likely to lead to different results(for an attempt to justify the global perspective seeKlavans and Boyack 2011 [4])

These considerations made us select three ap-proaches that enable the identification of overlap-ping and hierarchical structures in networks onthe basis of local information A first approach

starts from hard clusters obtained by any cluster-ing method and fractionally assigns the nodes at theborders between clusters to these clusters (cf egWang et al 2009 [13]) Another approach is basedon a hard clustering of links between nodes into dis-joint modules which makes nodes members of allmodules (or communities) that their links belongto (cf eg Ahn et al 2010 [14]) The third ap-proach constructs natural communities of all nodeswhich can overlap with each other by applying agreedy algorithm that maximises local fitness (cfeg Lancichinetti et al 2009 [15])

We introduce our implementations of the threeapproaches and discuss their basic features us-ing a small benchmark graph (the karate clubZachary 1977 [16]) as an example The compar-ative analysis applies the algorithms to a networkof 492 bibliographically coupled papers published2008 in six information-science journals The useof information-science papers enabled the construc-tion of paper sets of selected topics by manually as-signing papers to the topics h-index webometricsand bibliometrics on the basis of titles abstractsand keywords The clustering solutions and theoverlap of modules were then assessed by comparingthem to the paper sets On the basis of this com-parison we discuss advantages and disadvantages ofthe three algorithms

2 Communities in Networks

In network analysis communities are understood ascohesive subgroups of nodes separated from the restof the graph Thus communities can only be foundin networks if there are groups of densely intercon-nected nodes that are only loosely connected to eachother Most community definitions are based onthese two aspects ie cohesion and separation [12pp 83ndash87] In order to apply algorithms for thedetection of communities cohesion and separationmust be defined [17] Owing to the continuous na-ture of the two properties communities cannot bedetected unequivocally Instead structures of vary-ing lsquocommunitynessrsquo can be identified [18]

In the case of thematic structures in networksof papers the communities to be detected do notonly overlap each other but are also hierarchicallyordered For hierarchies of communities both co-hesion and separation can be measured directly in

the dendrogram The simplest measure of separa-tion of a community is the similarity level su atwhich its branch in the dendrogram unites with an-other branch The simplest measure of cohesion ofa community is the similarity level sd at which itsbranch in the dendrogram decays into two branchesbut here low level means high cohesion Thus goodcommunities are those with high level su and lowlevel sd This corresponds nicely to the usual se-lection of long branches as important ones Theyhave large differences su minus sd ie are stable overrelatively large similarity intervals Using this dif-ference we can order branches with respect to theirquality as communities ie combined cohesion andseparation

In our experiments the stability is negatively cor-related with community size Many small branchesare very stable and many larger branches are veryunstable In order to find lsquointerestingrsquo communitieswe plot branch length su minus sd over community sizeand identify communities that are unusually stablefor their size ie are represented by branches farfrom the axes of the plot

An alternative approach to community delin-eation associates cohesion with high internal andseparation with low external degrees of communitymembers The internal degree κin(C Vi) of a nodeVi is defined as the sum of weights of edges linkingthis node with nodes in community C its exter-nal degree κout = κ minus κin where κ is the nodersquostotal degree Radicchi et al (2004) [19] define acommunity in the strong sense as a set of nodes allof which have higher internal than external degreeFor a community in the weak sense they only de-mand that the sum of internal degrees exceeds thesum of external degrees These sums are usuallyreferred to as the internal and external degrees ofcommunity C

kinout(C) =sumViisinC

κinout(C Vi) (1)

These internal and external degrees of a communitycan be used to define its fitness (see the approacheslsquonatural communitiesrsquo and lsquofuzzificationrsquo for appli-cations) By combining cohesion and separationthe fitness measure evaluates the quality of a com-munity in a similar way as su minus sd does based on acommunityrsquos branch in a dendrogram

When applied to overlapping communities themeasures used in the delineation of weak communi-

ties must take the nature of overlaps into accountFollowing Steve Gregory (2011) [20] we distinguishbetween crisp and fuzzy overlapping communitiesIf a network has crisp overlapping communitiesnodes either belong or donrsquot belong to a commu-nity Communities are fuzzy if individualsrsquo gradesof membership vary This type of structure is ap-propriate for the relationship between papers andtopics because most papers cover several topics invarying intensities which led us to the applicationof fuzzy set theory

Fuzzy set theory operates with membershipgrades that are real numbers between zero and onebut does not assume that a nodersquos grades of mem-bership in different sets sum up to unity A nodecould also be a full member in more than one com-munity

To determine whether a fuzzy community C is acommunity in the weak sense we have to redefine itsinternal and external degree kinout(C) by weightingthe degrees with node membership grades Withmicroi(C) = 1 if Vi isin C and microi(C) = 0 otherwisewe can rewrite the definitions given above for crispcommunities as

kin(C) =

nsumij=1

microi(C)aijmicroj(C) (2)

kout(C) =

nsumij=1

microi(C)aij [1minus microj(C)] (3)

where aij is the weight of edge (i j) and n thegraph size These formulae can also be used fora fuzzy community C if microi(C) is identified withnodersquos Vi membership grade in C Then 1minus microi(C)is its membership grade in Crsquos fuzzy complementFuzzy set C is a community in the weak sense ifkin(C) gt kout(C)

3 Three Approaches

We introduce the three approaches and explaintheir basic mechanisms with a simple examplenamely the social network of 34 members of a karateclub analysed by Zachary (1977) [16]1 Members of

1using the unweighted graph httpnetworkxlanl

govexamplesgraphkarate_clubhtml

the karate club were asked about friendship tiesThe network turned out to have two central actorswho after the split of the original club founded sep-arate new clubs Authors who implemented algo-rithms based on the three approaches applied themto the network described by Zachary

31 Natural Communities

A natural community of a node can be constructedby any lsquogreedyrsquo algorithm which evaluates the in-clusion of neighbouring nodes into the communityusing an appropriate metric or fitness function Ifa community with a neighbour node is fitter thanwithout it the neighbour will be included Theessence of this local approach is that independentlyconstructed natural communities of nodes can over-lap Figure 1 shows two overlapping communities ofkarate club members On the left-hand side the rednodersquos community has all yellow and green nodesas its members On the right-hand side the violetnodersquos community has all blue and green nodes asmembers Thus we have five (green) nodes in theoverlap of both natural communities

The idea to identify overlapping communities assub-graphs which are locally optimal with respectto some given metric was first published by Baumeset al (2005) [21] It can be implemented in severalways In the same year Baumes et al [22] testeda combination of two greedy algorithms which bothuse the same metric The initialisation produces

red gets yellow and green

violet gets blue and green

overlap green

overlapping natural communities of nodes

data Lancichinetti Fortunato and Kertesz (New J Phys 11 2009 033015)

Figure 1 Natural communities Karate clubgraph with overlapping communities of twonodes (red and violet)

disjoint seed clusters the metric of which is thenimproved by an iterative procedure leading to over-lapping communities

Lancichinetti et al (2009) [15] combined the con-cept of locally optimal sub-graphs with the idea ofvariable resolution to enable their algorithm to re-veal hierarchical community structures They in-troduced a resolution parameter into their fitnessfunction Higher resolution results in smaller lowerin larger natural communities The fitness func-tion includes only local information It is definedas the ratio of the sum of internal degrees kin(C) tothe sum of all degrees k(C) = kin(C) + kout(C) ofnodes in a community C The denominator is takento the power of α the resolution parameter

f(Cα) =kin(C)

k(C)α (4)

Figure 1 displays a cover of the karate-club net-work obtained by Lancichinetti et al with astochastic version of their algorithm for the reso-lution interval 076 lt α lt 084 Their LFM (localf itness maximisation) algorithm has to be repeatedfor all resolution levels of interest

The construction of a scientific paperrsquos naturalcommunity in a similarity network of papers can beinterpreted as the construction of its thematic envi-ronment from its own lsquoscientific perspectiversquo Thisidea is attractive from a conceptual point of viewbecause it mimics the way in which scientists ap-ply their individual perspectives when constructingtheir fields This is why locality is a realistic as-sumption for topic extraction in paper networks

At the same time the strictly local approach en-ables the local exploration of networks which aretoo big for global analysis like the Web or the com-plete citation network of scientific papers A nodersquosnatural community is a local structure that canbe constructed without knowing the whole graphThe idea to find local community structures with-out knowing the whole graph by using a greedy lo-cal cluster algorithm goes back to Clauset (2005)[23] His procedure can also be used to constructoverlapping graph modules [24] In contrast tothe resolution-depending fitness function of Lanci-chinetti et al (2009) [15] Clauset evaluated mod-ules with a function that does not depend on reso-lution

32 Link Clustering

If links instead of nodes are clustered nodes withmore than one link can be fractional members ofclusters as figure 2 shows for the karate club Forexample vertex 1 (violet point) has four edges be-longing to one and twelve edges belonging to an-other hard cluster of links Thus it has member-ship grades 416 and 1216 respectively in the twoclusters

For clustering links we need a measure of linksimilarity Ahn et al (2010 [14 eq 2 p 5]) chosethe Jaccard index of neighbourhoods of nodes at-tached to two links (a node itself is included intoits neighbourhood)

In a different approach to link clustering Evansand Lambiotte (2009) [25] used the line graph of anundirected graph To get a graphrsquos line graph firsta bipartite graph of the graphrsquos nodes and edges isconstructed by putting an edge node on each edgeThe bipartite graph can then be projected onto theline graph a graph where nodes and edges haveinterchanged their roles

Recently Ball et al (2011) [26] successfully testedan algorithm which finds overlapping node commu-nities with a generative stochastic model of hardlink clusters Kim and Jeong (2011) [27] appliedthe fast Infomap [28] algorithm to link clustering

The clustering of citation links instead of papersis of high interest to bibliometrics because a citationis probably the conceptually most homogenous bib-

cluster of red linksred and green

cluster of blue linksblue and violet

cluster of black links black green and violet nodes

green nodes in overlap

violet node in overlap

link clustering

data Evans and Lambiotte Phys Rev E 80 (2009) 016105

Figure 2 Hard clusters of links Karate clubgraph with overlapping node communities inducedby three hard link clusters

liometric unit Since many references are referred toonly once in a paper it can be assumed that theselinks between the citing and the cited publicationcan be assigned to one theme Even though thereare many cases in which a paper cites a source forseveral different reasons a citation link can be as-sumed to have a higher thematic homogeneity thana publication Based on this assumption of homo-geneity citation links can be hard-clustered whichleads to overlapping clusters of papers The mem-bership grade of a paper to a module correspondsto the part of outgoing citation links of this paperwithin this link cluster

We applied the hierachical link clustering (HLC)method suggested by Ahn et al (2010) [14] to clus-ter citation links in the bipartite network of papersand their cited sources Ghosh et al (2011) [29]have generalised HLC to tripartite graphs

We did not consider the line-graph approach be-cause it is not local (due to its use of modularity)

33 Fuzzification of Hard Clusters

The approach assumes that hard-cluster algorithmsvalidly identify disjoint community cores which justneed to be lsquosoftenedrsquo at the borders If this is thecase modifying a hard cluster by evaluating the in-clusion of its nodes and neighbouring nodes with re-gard to some metric or fitness is a plausible methodfor constructing overlapping communities The fit-ness balance of a node with respect to a cluster can

cluster of red and green nodes

cluster of blue and green nodes

overlap green border nodes

fuzzification of hard clusters

data Wang Jiao and Wu Physica A 388 (2009) 5045minus5056

Figure 3 Fuzzification of hard clustersKarate club graph with overlapping communitiesfrom two hard clusters

then be used to decide about its membership and tocalculate its membership grade Thus we constructfuzzy overlapping communities

Figure 3 shows the karate-club result Wang et al(2009) [13] obtained with their implementation ofthe fuzzification approach which they applied totwo hard clusters

However the simplest approach to making hardclusters overlapping and fuzzy is to redefine themembership grades of all nodes with links cross-ing borders If some of a nodersquos links end withincluster C and some outside C then its membershipin C can be defined as the ratio κin(C Vi)κ(Vi)We use this definition to calculate fractional gradesafter we constructed overlapping communities byfitness improvement Such an approach can be crit-icised as being inconsistent because fractional mem-berships are calculated using non-fractional (zero orfull) memberships of nodes as obtained by fitnessevaluation

4 Data

41 The Paper Network

We apply the three algorithms to a network of pa-pers in the 2008 volume of six information-sciencejournals with a high proportion of bibliometrics pa-pers (for details of data see reference [30] papersdownloaded from Web of Science)

We start from the affiliation matrix M of the bi-partite network of papers and their cited sourcesHere we neglect that a few cited sources are alsociting papers in the 2008 volume but this minor sim-plification can be avoided in future analyses Linkclustering is done with M itself the other two al-gorithms analyse a bibliographic-coupling networkconstructed from M as follows In the network ofpapers two nodes (papers) are linked (bibliograph-ically coupled) if they both have at least one citedsource in common To account for different lengthsof reference lists we normalise the paper vectors ofM to an Euclidean length of one With this normal-isation the element aij of matrix A = MMT equalsSaltons cosine index of bibliographic coupling be-tween paper i and j

The symmetric adjacency matrix A describes aweighted undirected network of bibliographicallycoupled papers The main component of the net-

Hirsch indexbibliometricswebometricsbibminusweb overlap

Figure 4 Information science 2008 Threetopics and their overlaps in a network of 492bibliographically coupled papers Topics assignedmanually to papers by inspection of theirkeywords titles and abstracts The nodesrsquo colourscorrespond to four sets (i) green to h-index (ii)blue to bibliometrics without h-index and withoutwebometrics (iii) red to webometrics withoutbibliometrics and (iv) violet to the overlap ofwebometrics and bibliometrics Transparent nodesare papers dealing with other information-sciencetopics mainly with information retrieval andinformation behaviour

work of 533 information-science papers 2008 (528articles and five letters) contains 492 papers Threesmall components and 34 isolated papers are of nointerest for our cluster experiments

42 Three Topics

For the evaluation of the three algorithms we usedkeywords titles and abstracts of papers to identifythose that belong to three topics namely h-indexbibliometrics and webometrics The h-index is anindicator for the evaluation of a researcherrsquos perfor-mance which has been proposed by the physicistJ E Hirsch in 2005 Since then the use of theh-index for evaluating individual researchers pro-

posals for h-index derivatives and for h-indices ofjournals or other aggregates of papers have beendiscussed in the literature 46 of the 492 paperscite the 2005 paper by Hirsch which is the mostcited source in our sample The h-index is clearlyan invention in the field of bibliometrics About200 other papers are also addressing bibliometricthemes For the purposes of this evaluation weexcluded analyses of patents from bibliometricsIn a smaller webometrics set internet activities of(mainly academic) institutions and individuals areanalysed

We first assigned papers to the three topics on thebasis of their keywords and subsequently checkedthe classification by inspecting titles and abstractsThis led to 42 papers assigned to the h-index andits derivatives further 182 bibliometric papers notmentioning the h-index in title or abstract 24 we-bometric papers and eight papers in the overlapbetween webometrics and bibliometrics In figure4 we display the graph of the sample of 492 biblio-graphically coupled papers using the force-directedplacement algorithm by Fruchterman and Reingold(as implemented in the R-package igraph)2

5 Fuzzy Natural Communities

51 MONC Algorithm

MONC [30] uses ideas from Lancichinetti et al(2009) [15] but replaces their numerical approachby a faster and more precise parameter-free analyt-ical solution

Lancichinetti et al proposed an algorithm whichrests on a greedy expansion of natural communi-ties of nodes by local fitness maximisation (LFMalgorithm) Communities of different nodes canoverlap each other The size of a natural com-munity of a node depends on resolution α LFMhas to be repeated for each relevant resolutionlevel to reveal the hierarchical structure of the net-work Our parameter-free MONC algorithm ex-actly calculates resolution levels at which commu-nities change by including a node that improvestheir fitness To save further computing timeMONC merges overlapping natural communitieswhen they become identical during the iterationprocess [30]

2cf httpwwwr-projectorg

Similar to Lee et al (2010) [31]mdashwho tested avariant of LFMmdashwe found that using cliques asseeds gives better results than starting from singlenodes While Lee et al use maximal cliques (iecliques which are not sub-graphs of other cliques)we optimise clique size by excluding nodes that areonly weakly integrated [30 p 6]

From MONC results we construct fuzzy naturalcommunities of nodes ie fuzzy sets in which eachgraph node has a definite membership grade Eachfuzzy natural community represents its seed nodersquosperspective on the whole network Since the em-phasis on local perspectives lets MONC constructmany natural communities that are very similarthe fuzzy natural communities are hard-clusteredhierarchically using the fuzzy-set Jaccard index asa similarity measure for eg single-linkage cluster-ing Branches in dendrograms derived from MONCresults do not represent disjoint sets of nodes butoverlapping fuzzy communities

52 MONC Post-Processing

Greedy algorithms which locally maximise a reso-lution depending fitness (or density) function canreveal hierarchies of overlapping modules Lanci-chinetti et al (2009) [15 pp 7ndash9] have successfullytested their LFM algorithm on a simple benchmarkgraph with two hierarchical levels

MONC (like LFM) needs some postprocessing toreveal a graphrsquos hierarchy We successfully testedthe following procedure for detecting a graphrsquos hi-erarchy from MONC results We define the grade anode is a member in a community from the resolu-tion level at which it becomes a member (cf nextsubsection) With this definition communities be-come fuzzy sets over the universe of all nodes calledfuzzy natural communities Two communities aresimilar if their fuzzy intersection is large As a rel-ative measure we use the fuzzy Jaccard index todefine the similarity of two natural communitiesThen communities can be clustered by any hard-cluster algorithm to reveal the graphrsquos hierarchy

Here we should add a comment We constructa nodersquos perspective on the whole graph ie itsnatural community as a fuzzy set over the universeof all nodes We hierarchically cluster the fuzzysets which is equivalent to node clustering basedon a variant of the concept of structural equiva-lence [12 p 86] Nodes are structurally equivalent

if their neighbourhoods are equal they are struc-turally similar if their neighbourhoods are similarin some sense We operationalise structural simi-larity of two nodes as the fuzzy Jaccard index oftheir fuzzy natural communities representing theirperspectives on the whole graph Despite the equiv-alence of our method to the concept of structuralsimilarity of nodes we insist on the definitions givenabove we do not cluster nodes but their fuzzy nat-ural communities

We have also to comment on our earlier claimMONC would reveal the graphrsquos hierarchy automat-ically ie without postprocessing [30] MONCrsquosmerging of overlapping natural communities canbe visualised in a dendrogram Two communitiesmerge at a resolution level where they become iden-tical sets of nodes In the case of Zacharyrsquos karateclub we discussed graph covers on different resolu-tion levels by going through this dendrogram start-ing from its root This inspection of a dendrogramof merging communities seemed to confirm our ex-pectation that MONCrsquos community merging is de-termined by the graphrsquos actual hierarchy [30 pp11ndash13] Tests with the information-science networkconvinced us that this assumption is not true Iftwo communities merge at a low level of inverseresolution γ = 1α they are very similar and aretherefore located close to each other in the graphrsquoshierarchy But merging at low γ is not a necessarycondition for communities to be similar There aremany very similar communities which merge at highγ because of very small differences in membershipup to this level This phenomenon leads to manynear-duplicates when we determine modules at oneselected resolution level (as we have already foundwhen testing MONC on non-hierarchical bench-mark graphs [30 p 16])

53 Grades of Membership

MONCrsquos greedy expansion of seeds can be discussedin terms of lsquohosts inviting guestsrsquo to their commu-nities Each node i of the (connected) graph is lsquoin-vitedrsquo to each community j at some level of inverseresolution γij To construct fuzzy communities withvarious grades of node membership we propose todefine the membership grade of node i in the com-munity of node j as

microij = exp(minusγ2ij) (5)

0 100 200 300 400 500

MONC communities

size (nr of full members)

hminusindexbibliometrics

webometrics

Figure 5 Stability over size of all MONCbranch communities Stable communitiescorresponding to our three topics ininformation-science papers 2008 are markedbibliometrics (blue) webometrics (red) h-index(green)

Using the decreasing exponential function ofsquared γij (as in the density function of the normaldistribution) ensures that (1) the host is full mem-ber in its own community (microjj = 1) (2) lsquolate guestsrsquoget lower grades and (3) the lsquofirst guestsrsquo get mem-bership grades near one (the function starts fromone its derivation from zero)

We assume that the dendrogram of fuzzy naturalcommunities reflects the graphrsquos hierarchical struc-ture For each branch we define a community as thefuzzy union of all fuzzy sets of the branchrsquos nodesThis means that all host nodes of the branch arefull members of the branch community This defi-nition ensures the hierarchical order of branches iftwo branches unite then their communities are fuzzysubsets of their fuzzy union Thus each branch ofthe dendrogram of fuzzy natural communities ieeach vertical line represents a fuzzy community

Figure 6 MONC communities of the threetopics in information-science papers 2008 (a)h-index (b) bibliometrics (c) webometricsSaturation of points correlates with membershipgrade Colours of circles denote manuallydetermined topics (cf fig 4)

0 100 200 300 400 500

hminusindex branch

number of nodes

0 100 200 300 400 500

bibliometrics branch

number of nodes

0 100 200 300 400 500

webometrics branch

number of nodes

Figure 7 Scree plots of MONCcommunities of the three topics ininformation-science papers 2008 (a) h-index (b)bibliometrics (c) webometrics Red lines markpossible thresholds

54 MONC Communities of Topics

For each branch community we plot its stability ieits branchrsquos length su minus sd over community sizewhich is estimated by the number of full members(figure 5 cf also above p 2) Three of the out-liers correspond to our predefined topics and willbe evaluated in comparison with results of the otheralgorithms (see below section Comparison of Algo-rithms) The two stable communities with about400 full members unite bibliometrics webometricsinformation retrieval and some other smaller topicsbut do not include a set of less central graph nodes

Figure 6 shows the relationship between MONCcommunities corresponding to the three topics andthe manually determined paper sets of topics Allgrades below a threshold are set to zero We de-rive the thresholds from scree plots of membershipgrades (figure 7) These plots show that at somecritical membership grade the node sets of eachbranch inflate to nearly the whole graph We arguethat this inflation marks the border of a communityof a branch We set the gradersquos threshold on a valuethat cuts the scree at the last steepest gradient be-fore inflation (microthr = 229 355 1 for h-index bib-liometrics and webometrics respectively)

6 Hierarchical Clusteringof Citation Links

61 HLC Algorithmon Bipartite Graphs

We consider the bipartite network of papers andcited sources Citation links between the two typesof nodes can be hard-clustered which leads to in-duced overlapping communities of papers (and alsoto communities of sources which however are notanalysed here) The membership grade of a paperto a thematic community equals the fraction of itscitation links belonging to the corresponding linkcluster

Links can be seen as similar if the neighbourhoodsof their nodes overlap to a high degree Thus theJaccard index of these neighbourhoods can be usedas a similarity measure (cf Ahn et al 2010 [14eq 2 p 5]) We discuss the definition of similaritybetween links in a bipartite graph in terms of papersand cited sources The neighbourhood of a paper

pi is the set of its references Ri the neighbourhoodof a cited source si is the set of papers Ci citing itThe neighbourhood Ni of citation link i is then theunion of these disjoint3 sets Ni = RicupCi The sizeof the intersect of two link neighbourhoods is givenby

|Ni capNj | = |Ci cap Cj |+ |Ri capRj | (6)

and the size of their union by

|Ni cupNj | = |Ci cup Cj |+ |Ri cupRj | (7)

The distance metrics used for clustering is then

dij = 1minus |Ci cap Cj |+ |Ri capRj ||Ci cup Cj |+ |Ri cupRj |

Ahn et al calculate similarities only for link pairswhich have a node in common because they ldquoex-pectrdquo disconnected link pairs to be less similar thenpairs connected over a node [14 p 5] Since coun-terexamples disproving this assumption can be con-structed we decided to calculate similarities for allpairs of nodes Such a procedure uses more infor-mation but is also more time-consuming

Hard clustering of links can be done with any hi-erarchical clustering method We tested four stan-dard methods The dendrograms of Ward and av-erage clustering of citation links seem to reflect thegraphrsquos hierarchy more adequately than those ofsingle-linkage and complete-linkage clustering Thelatter two methods impose too low or too high re-strictions respectively on finding clusters

62 HLC Communities of Topics

For all pairs of citation links from the 492 citingpapers to all sources we determine link similaritiesPairs of citation links to sources cited only oncehave zero distance within their reference list (cfequation 8) They are clustered at zero-distancelevel with one another Next these zero-distanceclusters are joined with links to the least citedsource in their reference list This allows us to re-strict clustering to all m = 5005 citation links tosources which are cited more then once

We applied the average-clustering method to thisset The corresponding dendrogram does not givea clear picture of the graphrsquos hierarchy unless we

3We neglect that a few cited sources are also citing paperscf above

0 100 200 300 400 500

HLC communities

size (nr of members fractionally counted)

hminusindexwebometrics

bibliometrics

Figure 8 Stability over size of all 5004 HLCbranch communities Stable communitiescorresponding to our three topics ininformation-science papers 2008 are markedbibliometrics as blue webometrics as red andh-index as green point respectively

re-parametrise the distance axis We choose d rarrdlog2m to de-skew distances d Using these rescaleddata we plot branch length over community sizeto find relatively stable and large communities (fig-ure 8) We measure community size by the sumof fractional membership grades of papers attachedto the clustered citation links Like in the MONCcase we find our three topics as exceptional pointsin the plot

Figure 9 shows the graph of 492 papers colouredproportional to their membership grade in the threetopics respectively

7 Fuzzification ofHard Clusters

71 Fuzzification Algorithm

The fuzzification approach assumes that hard clus-ter algorithms validly identify disjoint communitycores which just need to be lsquosoftenedrsquo at the bor-

Figure 9 HLC communities of the threetopics in information-science papers 2008 (a)h-index (b) bibliometrics (c) webometricsSaturation of points correlates with membershipgrade Colours of circles denote manuallydetermined topics (cf fig 4)

0 100 200 300 400 500

Ward hard clusters for fitness evaluation

bibliometrics

hminusindex

webometrics

Figure 10 Stability over size of branchcommunities Stable Ward clusterscorresponding to our three topics ininformation-science papers 2008 are markedbibliometrics as blue webometrics as red andh-index as green point respectively

ders We implemented an algorithm that evaluatesborder nodes of each hard cluster with regard totheir connectiveness with it Border nodes haveedges crossing the clusterrsquos border and can be lo-cated inside or outside the cluster

The algorithm uses an evaluation metric thatis based on the fitness function defined by Lanci-chinetti et al (2009) [1315] (see above equation 4page 4) For each border node of a cluster we cal-culate the clustersrsquos fitness with and without thisnode The fitness balance of a node with respectto a cluster determines its membership Negativebalance means exclusion from the cluster We eval-uate all border nodes of a cluster without changingit during the evaluation (in contrast to the greedyLFM algorithm which updates the community afterdeciding about a nodersquos membership) The fitness-inherent resolution parameter controls the extent ofthe overlap where lower values cause a wider areato be considered for the inclusion into the formerhard cluster While MONC uses resolution levelsto calculate membership grades the fitness-inherent

Figure 11 Fuzzy communities of the threetopics in information-science papers 2008 (a)h-index (b) bibliometrics (c) webometricsSaturation of points correlates with membershipgrade Colours of circles denote manuallydetermined topics (cf fig 4)

parameter is arbitrary here Thus we didnrsquot applyit in our comparison and set α = 1 In a second stepthe crisp overlapping communities are made fuzzyThe fractional membership grade of a node could bedefined using its fitness balance as input but this didnot lead to fuzzy communities that match the threepredefined topics Hence we used a definition thatignores the value of (positive) fitness balances Themembership grade microi(C) of vertex Vi in communityC is

microi(C) = κin(C Vi)κ(Vi) (9)

where κin(C Vi) is the sum of weights of edges be-tween vertex Vi and vertices in C and κ(Vi) the sumof all its edge weights

72 Fuzzy Communities of Topics

We applied standard Ward and average clusteringon the network of n = 492 bibliographically coupledinformation-science papers Complete and singlelinkage failed to provide acceptable results as canbe already deduced from the dendrograms Aver-age clustering also results in a dendrogram whichis not easy to interpret The Ward dendrogramshows a very stable and clear h-index cluster whichis united with the rest of the graph in the last merg-ing step (cf figure 10) Fitness-based optimisa-tion with resolution α = 1 enlarges this cluster ex-tremely and lowers precision without gain in recallwith respect to the set of manually selected h-indexpapers Thus fitness maximisation is not a suc-cessful strategy for this topic that has been wellmatched (by eg Ward) clustering already and ishighly connected to its network environment

If we omit fitness maximisation and only cal-culate fractional membership grades according toequation 9 the result is not better Many externalborder nodes become partial members of the fuzzyh-index community

On the other hand the hard bibliometrics clus-ter is much smaller than expected and needs fit-ness maximisation or at least fractional membershipgrades to match the topic Figure 11 shows how thefitness-optimised and fuzzyfied Ward clusters of thethree topics fit the topics as paper sets

8 Comparison of Algorithms

81 Comparison of theIdentified Communities

To compare the results we calculate fuzzy Saltonrsquoscosine of manually defined topics with fuzzy com-munities identified by the three algorithms consid-ered In addition table 1 gives values of fuzzy kinand kout the geometric mean of which equals thecosine Table 2 shows how fuzzy communities con-structed by the algorithms overlap each other Intable 3 we list the fuzzy internal and external de-grees kin(C) and kout (cf equations 2 and 3 p 3)together with their ratio kin(C)kout(C) for eachfuzzy topic community constructed by the three al-gorithms All fuzzy communities are communitiesin the weak sense The ratio can be interpreted asa measure of lsquocommunitynessrsquo

The assumption that hard clusters can be im-proved by fitness-based optimisation and fuzzifica-tion could not be validated with Ward clusters asinput While the optimised and fuzzified bibliomet-rics cluster gained slightly better similarity resultsthan the other two algorithms the clearly identifiedh-index hard-cluster did not improve because bothoptimisation and fuzzification included too manynodes which were related but were not assigned tothe topic The fitness-inherent resolution parametercould improve similarity values but would have tobe choosen differently for different clustersmdasha pro-cedure which cannot be applied when target topicsare not known in advance The fact that fuzzifica-tion results in an h-index community with best ratiokin(C)kout(C) should be interpreted with care Itonly means that the algorithm finds a big clusterwhich is relatively separated from the rest of thegraph It (partly) includes many papers which donot refer to the h-index

Hierarchical clustering of citation links gave bet-ter results than MONC Link clustering classifies h-index as a bibliometric topic whereas MONC onlyincludes some h-index papers into bibliometricsFuzzy cosines of HLC communities and manuallyselected topics are always better than the corre-sponding MONC values (s table 1)

82 Assumptions Used

All three algorithms implemented by us are basedon the assumptions that a graphrsquos communities(1) are best determined locally (2) overlap eachother (3) are best described by fractional mem-bership grades and (4) form a hierarchy We usedthese four assumptions as criteria in our selection ofapproaches to community detection Nonethelesswith respect to all four criteria there are differencesbetween the selected algorithms Another criterion

Table 1 Topic matches by algorithms

topic MONC HLC fuzzyh-index 71 93 59

precision 56 91 35recall 89 95 100

bibliometrics 79 82 83precision 72 83 87

recall 86 81 80webometrics 58 60 46

bib-web overlap 46 29 30precision 34 24 14

recall 64 36 65

Fuzzy cosine indices precision and recall of papersets and fuzzy communities (and ofbibliometrics-webometrics overlap) found by thethree algorithms

Table 2 Community matching betweenalgorithms

MONC HLC fuzzytopic HLC fuzzy MONC

h-index 73 60 62bibliometrics 76 84 78webometrics 63 46 55

bib-web overlap 51 41 43

Fuzzy cosine indices of fuzzy communities (and ofbibliometrics-webometrics overlap) found by thethree algorithms

was that results should notmdashat least not stronglymdashdepend on arbitrary parameters Furthermore eachof the algorithms is based on specific assumptionswhich we already mentioned in the respective sec-tions of this paper

The fuzzification procedure based on hardclusters whose fitness is improved assumes that thehierarchical cluster algorithm delivers essentiallyvalid but improvable hard clusters Our results donot confirm the improvement using standard fitnessmeasures For the membership grades we have notfound a local consistent and realistic definitionWith respect to its input data this proceduremdashlikeMONC but unlike HLCmdashassumes that a networkof scholarly papers weighted with paper similarity(based on references andor text) can be used toidentify hierarchical thematic structures

Hierarchical link clustering Paper networksare projections of bipartite graphs and thus do notuse the full information content of the raw data Hi-erarchical link clustering (HLC) rests on a broaderinformation basis when applied to links in bipar-tite networks of papers and their cited sources orin tripartite networks of papers cited sources andterms used in papers and sources HLC only as-sumes that a source is cited for only one reason orfor only very few similar reasons in one paper Inthe case of terms the assumption is that authorsuse one term in one paper with only one meaningThese assumptions are both not only very plausi-ble but could even be tested in case studies A

Table 3 Fuzzy kinkout of communities

C variable MONC HLC fuzzyh-index kinkout 597 741 970

kin 24465 24566 35221kout 4098 3317 3631

biblio- kinkout 341 1903 1537metrics kin 31423 46697 45697

kout 9203 2454 2974webo- kinkout 143 121 132

metrics kin 2104 1074 4501kout 1471 885 3419

The ratio kin(C)kout(C) kin(C) and kout(C) offuzzy communities found by the three algorithms

further advantage of link clustering is that it al-lows to combine citation and textual information intripartite graphsmdasha very lsquonaturalrsquo solution of thislongstanding problem (cf eg the introduction ofreference [32] and sources cited there)

For MONC there are no further assumptions be-yond the four mentioned above and the one aboutpaper-similarity networks However we found thatMONC needs some post-processing to reveal the hi-erarchy of a graph Thus it is assumed that hierar-chical clustering of the nodesrsquo perspectives resultsin a realistic hierarchy of topics

83 Methodological Aspects

Our implementations of the three approaches tooverlapping communities all involve a hard clus-tering procedure The fuzzification algorithm useshard clusters of nodes as input ie clustering hasto be done as pre-processing Hard clustering offuzzy natural communities is part of MONCrsquos post-processing In the case of HLC the agorithm itselfis a hard clustering procedure For HLC we onlyneed to calculate link similarities as some kind ofpre-processing

We have presented results obtained with only onestandard hard-cluster algorithm per approach buttested also other ones Fuzzification and link clus-tering also worked with Louvain algorithm [33] butwe abandoned this fast modularity-driven methoddue to its use of global information (and the poor hi-erarchical structure obtained) Fuzzification couldperform better with average-linkage clustering butits dendrogram showed only very small stable com-munities In the case of average link clustering(HLC) we succeeded in finding relatively stablecommunities of some size after re-parametrising thedendrogramrsquos similarity axis

When it comes to defining grades of a nodersquosmemberships in different communities link cluster-ing implies a very plausible and consistent defini-tion MONC membership grades could also be de-fined alternatively to the ansatz used here (equa-tion 5) We see this methodological ambiguity as adisadvantage (arbitrary parameters are only a spe-cial case of such an ambiguity) In our fuzzifica-tion experiments we calculated fractional member-ship grades using non-fractional (zero or full) mem-bership of nodes as input (equation 9) This incon-sistent definition could possibly be avoided by an

iterative algorithm that in each step uses the frac-tional grades as input to calculate new ones We didnot yet test such an iteration procedure and hencedo not know whether it would converge or not Analternative would be to use the fitness balances of anode as input for a membership definition Our at-tempts to define grades this way led to communitieswith only very few full members which could be adesired feature for topic extraction that cannot beachieved by HLC membership grades

MONC membership grades fit into the frameworkof fuzzy set theory because a nodersquos grades in gen-eral do not sum up to unity Link clustering leads tonode grades which are normalised Thus an HLCgrade is more adequately interpreted as a probabil-ity

9 Discussion

We implemented three local approaches to the iden-tification of overlapping and hierarchically orderedcommunities in networks as algorithms and testedtheir ability to extract manually defined thematicsubstructures from a network of information-sciencepapers and their cited sources

Hierarchical clustering of citation links proved tobe the most satisfactory approachmdashwith regard tothe test results to its methodological simplicity toits ability to work with the broadest informationbasis (the bipartite graph of papers and sources)and to its potential for a simple inclusion of textinformation in addition to citation datamdashan issueon top of our agenda

Clustering citation links does not need to be re-stricted to a small period of time but can also beapplied to a longer time period This might makeit possible to solve the problem of tracing the de-velopment of topics over time The only limitationHLC encounters is the limited coverage of publica-tion databases ie the existence of citation links topublications that are not included in the database

MONC was found to be useful for overcoming thelongstanding problem of field delineation by greed-ily expanding the paper set downloaded from a ci-tation database [30 p 19] Instead of delineatingresearch fields by journal sets they can be identifiedwith a large enough natural communitymdashobtainedwith low enough resolutionmdashof an appropriate seednode

Hierarchical clustering of citation links can be ap-plied to this problem too We only have to clus-ter the environment of a seed set and then to omitall branches in the dendrogram which are not sub-branches of the most stable community The nextiteration starts from this reduced paper set As inthe MONC case we would alternate between expan-sion (inclusion of neighbours) and partial reductionof the sample (exclusion of neighbours which do notimprove community fitness ormdashin the HLC casemdashstability)

While the fuzzification algorithm only sometimescreates good clusters in terms of target topics iter-ating fitness-based optimisation may lead to moreconsistent clusters by removing loosely connectednodes If the iteration is done node by node thisleads to a version of the LFM algorithm [15] ap-plied to hard clusters instead of single nodes orcliques an approach already proposed by Baumeset al (2005) [22]

Acknowledgements

This work is part of a project in which we developmethods for measuring the diversity of researchThe project is funded by the German Ministry forEducation and Research (BMBF) We would like tothank all developers of R4

Author Contributions

Conceived and designed the experiments all au-thors Performed the experiments AS (fuzzifica-tion) MH (link clustering) FH (MONC) Analysedthe data AS (fuzzification) MH (link clusteringcomparison) FH (MONC) Wrote the paper FHJG Discussed the text all authors

References

1 Van Raan A (2004) Measuring science InMoed HF Glanzel W Schmoch U editorsHandbook of quantitative science and tech-nology research The use of publication andpatent statistics in studies of SampT systemsDordrecht etc Kluwer chapter 11 pp 19ndash50

4httpwwwr-projectorg

2 Zitt M Ramanana-Rahary S BassecoulardE (2005) Relativity of citation performanceand excellence measures From cross-field tocross-scale effects of field-normalisation Sci-entometrics 63 373ndash401

3 Janssens F Glanzel W De Moor B (2008) Ahybrid mapping of information science Sci-entometrics 75 607ndash631

4 Klavans R Boyack K (2011) Using globalmapping to create more accurate document-level maps of research fields Journal ofthe American Society for Information Scienceand Technology 62 1ndash18

5 Sullivan D White D Barboni E (1977) Co-citation analyses of science An evaluationSocial Studies of Science 7 223ndash240

6 Amsterdamska O Leydesdorff L (1989) Ci-tations Indicators of significance Sciento-metrics 15 449ndash471

7 Marshakova IV (1973) Sistema svyazeymezhdu dokumentami postroyennaya na os-nove ssylok (po ukazatelyu ldquoScience CitationIndexldquo) Nauchno-Tekhnicheskaya Informat-siya Seriya 2 ndash Informatsionnye Protsessy iSistemy 6 3ndash8

8 Small H (1973) Co-citation in the ScientificLiterature A New Measure of the Rela-tionship Between Two Documents Journalof the American Society for Information Sci-ence 24 265ndash269

9 Mitesser O Heinz M Havemann F Gla-ser J (2008) Measuring Diversity of Re-search by Extracting Latent Themes fromBipartite Networks of Papers and Refer-ences In Kretschmer H Havemann Feditors Proceedings of WIS 2008 FourthInternational Conference on WebometricsInformetrics and Scientometrics amp NinthCOLLNET Meeting Humboldt-Universitatzu Berlin Berlin Gesellschaft fur Wis-senschaftsforschung ISBN 978-3-934682-45-0 httpwwwcollnetdeBerlin-2008

MitesserWIS2008mdrpdf

10 Lancichinetti A Radicchi F Ramasco JJFortunato S (2011) Finding statistically sig-nificant communities in networks PLoS ONE6 e18961

11 Glaser J (2006) Wissenschaftliche Produk-tionsgemeinschaften Die soziale Ordnungder Forschung Campus Verlag

12 Fortunato S (2010) Community detection ingraphs Physics Reports 486 75ndash174

13 Wang X Jiao L Wu J (2009) Adjusting fromdisjoint to overlapping community detectionof complex networks Physica A StatisticalMechanics and its Applications 388 5045ndash5056

14 Ahn Y Bagrow J Lehmann S (2010) Linkcommunities reveal multiscale complexity innetworks Nature 466 761ndash764

15 Lancichinetti A Fortunato S Kertesz J(2009) Detecting the overlapping and hierar-chical community structure in complex net-works New Journal of Physics 11 033015

16 Zachary W (1977) An information flow modelfor conflict and fission in small groups Jour-nal of Anthropological Research 33 452ndash473

17 Tibely G (2011) Criterions for locally densesubgraphs Arxiv preprint arXiv11033397

18 Friggeri A Chelius G Fleury E (2011)Ego-munities exploring socially cohesiveperson-based communities Arxiv preprintarXiv11022623

19 Radicchi F Castellano C Cecconi F LoretoV Parisi D (2004) Defining and identifyingcommunities in networks Proceedings of theNational Academy of Sciences of the UnitedStates of America 101 2658ndash2663

20 Gregory S (2011) Fuzzy overlapping com-munities in networks Journal of Statisti-cal Mechanics Theory and Experiment 2011P02017

21 Baumes J Goldberg M Krishnamoorthy MMagdon-Ismail M Preston N (2005) Finding

communities by clustering a graph into over-lapping subgraphs In International Confer-ence on Applied Computing (IADIS 2005)

22 Baumes J Goldberg M Magdon-Ismail M(2005) Efficient identification of overlappingcommunities Intelligence and Security Infor-matics 3495 27ndash36

23 Clauset A (2005) Finding local communitystructure in networks Physical Review E 7226132

24 Lee C Reid F McDaid A Hurley N (2011)Seeding for pervasively overlapping commu-nities Arxiv preprint arXiv11045578

25 Evans T Lambiotte R (2009) Line graphslink partitions and overlapping communitiesPhysical Review E 80 16105

26 Ball B Karrer B Newman M (2011) An ef-ficient and principled method for detectingcommunities in networks ArXiv preprintarXiv11043590

27 Kim Y Jeong H (2011) The map equa-tion for link community Arxiv preprintarXiv11050257

28 Rosvall M Bergstrom C (2008) Maps of ran-dom walks on complex networks reveal com-munity structure Proceedings of the Na-tional Academy of Sciences 105 1118

29 Ghosh S Kane P Ganguly N (2011) Iden-tifying overlapping communities in folk-sonomies or tripartite hypergraphs WWW2011 Poster March 28April 1 2011 Hyder-abad India 39ndash40

30 Havemann F Heinz M Struck A Glaser J(2011) Identification of Overlapping Com-munities by Locally Calculating Community-Changing Resolution Levels Journal of Sta-tistical Mechanics Theory and Experiment2011 P01023

31 Lee C Reid F McDaid A Hurley N(2010) Detecting highly overlapping com-munity structure by greedy clique expan-sion In Proceedings of the 4th SNA-KDDWorkshop ArXiv httparxivorgabs10021827

32 Janssens F Zhang L Moor B GlanzelW (2009) Hybrid clustering for validationand improvement of subject-classificationschemes Information Processing amp Manage-ment 45 683ndash702

33 Blondel V Guillaume J Lambiotte R Lefeb-vre E (2008) Fast unfolding of communi-ties in large networks Journal of Statisti-cal Mechanics Theory and Experiment 2008P10008

1 Introduction

3 Three Approaches


32 Link Clustering


4 Data


42 Three Topics


51 MONC Algorithm




6 Hierarchical Clustering of Citation Links

61 HLC Algorithm on Bipartite Graphs






81 Comparison of the Identified Communities

82 Assumptions Used


9 Discussion