

Eur. Phys. J. B (2012) 85: 190
DOI: 10.1140/epjb/e2012-30015-5

Regular Article

THE EUROPEAN PHYSICAL JOURNAL B

Identifying overlapping communities in social networks using multi-scale local information expansion

H.J. Li 1,2, J. Zhang 1,2,3, Z.P. Liu 2,4, L. Chen 2,4,5,a, and X.S. Zhang 1,2,b

1 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, P.R. China
2 National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 100190, P.R. China
3 Key Laboratory of Random Complex Structures and Data Science, Chinese Academy of Sciences, Beijing 100190, P.R. China
4 Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200233, P.R. China
5 Collaborative Research Center for Innovative Mathematical Modelling, Institute of Industrial Science, University of Tokyo, 153-8505 Tokyo, Japan

Received 6 January 2012 / Received in final form 1st April 2012
Published online 6 June 2012 – © EDP Sciences, Società Italiana di Fisica, Springer-Verlag 2012

Abstract. Most existing approaches for community detection require complete information of the graph in a specific scale, which is impractical for many social networks. We propose a novel algorithm that does not embrace the universal approach but instead focuses on local social ties and models the multiple scales of social interactions occurring in those networks. Our method, for the first time, optimizes the topological entropy of a network and uncovers communities through a novel dynamic system that converges to a local minimum by simply updating the membership vector, with very low computational complexity. It naturally supports overlapping communities by associating each node with a membership vector that describes the node’s involvement in each community. Furthermore, different multi-scale partitions can be obtained by tuning the characteristic size of modules from the optimal partition. Because of its high efficiency and accuracy, the algorithm is feasible for the accurate detection of community structures in real networks.

1 Introduction

Since the publication of the seminal work of Barabási and Albert [1], many real complex systems have been examined from the viewpoint of complex networks. Having been observed to arise naturally in a vast range of physical phenomena, complex networks can describe complex systems containing massive numbers of units (or subsystems), with nodes representing the component units and edges standing for the interactions among them. As a typical kind of network, the social network is closely related to our life and has received a great deal of attention in recent years; examples include the World Wide Web [2], traffic networks [3], sexual networks [4], and article citation networks [5].

Among other issues, community structure detection in social networks has become very important in recent years. Nodes that belong to a tight-knit community are more likely to have particular properties in common. Taking the WWW network as an example, groups of web pages are more likely to link to web pages on related topics. These sets of web pages might correspond to some kinds of communities. Therefore, search engines may increase the precision and recall of search results by focusing on narrow but topically related subsets of the web.

a e-mail: [email protected]
b e-mail: [email protected]

In fact, several quality functions have been proposed for community structure detection [6–8]. Among them, modularity Q has proved to be the most popular [8–10] and has been pursued by many researchers [11–13]. However, most of those approaches require knowledge of the entire graph structure to identify global communities based on global information. This requirement is impractical for large complex networks, because it is a challenge to know the whole network completely. Moreover, statistical methods can only detect the most significant community connectivity patterns and ignore their multi-scale topology. Such identifications do not provide a coarse-grained representation of the system, and therefore cannot sketch its organization or identify the sets of nodes that are likely to have hidden functions or properties in common.

Because of these limitations, we present in this paper a novel algorithm for community detection focused on social networks. The algorithm does not embrace the universal approach but instead focuses on social networks, using local information and modeling the multi-scale social interaction patterns occurring in those networks. Our method optimizes, for the first time, the topological entropy that represents the statistical significance of a network. Because the topological entropy function is not convex and it is unrealistic to expect a standard optimization algorithm to find the global minimum, we develop a novel dynamic system which converges to a local minimum by simply updating the membership vector, with low computational complexity. The number of communities does not need to be specified in advance. The algorithm naturally supports overlapping communities by associating each node with a membership vector describing the node’s involvement in each community. Theoretical analysis and experiments show that the algorithm can uncover communities quickly and accurately.

The outline of this paper is as follows. Section 2 introduces the problem of community detection in social networks and the motivation behind the algorithm. In Section 3, we present our algorithm in three steps and explain each of them. In Section 4, we analyze some important properties of our algorithm, and in Section 5 experiments are reported for the benchmark network and some well-known social networks. Finally, we conclude this paper in Section 6.

2 Motivation

Given a network G = (V, E) containing n nodes, suppose we can divide the nodes into a groups. For each group, suppose we can select a “leader”. The leaders should have two properties: they are well connected to the members of their group, and they are able to communicate with other leaders when necessary. If the distributed algorithm is carried out in each group separately and the leaders communicate on a higher level, the nodes can enjoy a faster convergence rate.

It is natural to relate social networks to hierarchical structure. In such a hierarchy there are leader nodes that are more important than other nodes and are hence located on a higher level of the hierarchy. Taking the DNS network in the WWW [2] as an example, the route server is a natural leader and is located at the highest level of the hierarchy (see Fig. 1) when searching for an IP address. Since the hierarchies are a consequence of the spreading of its correlation, and so are the communities, we believe that the identification of these hierarchies in a network will result in a natural community detection. The area on which a leader has the most influence should define its community. So, community detection is performed by finding all natural leaders and all nodes on which they have influence. Partitions obtained in this way can be naturally explained. Another intuitive property that a community should possess is also satisfied in this way, namely that shortest paths exist between nodes of the same community.

Given a graph, individual nodes only have local knowledge about its structure, which includes information about their neighboring nodes. If a node wants to improve its own performance, it needs to know more about the global picture of the network. This information can be used by the node to refine its choice of neighbors in order to improve its performance. However, this incurs a high computational cost. The most complete measure of global graph structure is the adjacency matrix. Since each node has limited memory, energy, and computational capacity, it is difficult to use the adjacency matrix directly. Our goal is to devise a scheme that provides each node with a small vector containing compact global information on how the node is located with respect to the other nodes. It is desirable that the scheme can be disseminated in an implementable distributed manner.

Fig. 1. (Color online) Hierarchical structure of the DNS network with IP “www.abc.co.com”. The most influential node, the Route server, is located at the highest level of the hierarchical tree. The servers that include “www”, “abc”, “co” and “com” are located at lower levels. To obtain the IP address, users need to make an inquiry from the highest-level Route server down to the lowest www server. Node size depicts the level in the hierarchy, with bigger nodes located at higher levels.

Moreover, a powerful method for uncovering the modules in social networks should work in a multi-scale way [14,15]. Such an identification has the advantage of providing a coarse-grained representation of the system, thereby allowing one to sketch its organization and to identify sets of nodes that are likely to have hidden functions or properties in common. Most community detection methods find a partition of the nodes into communities in which each node is assigned to one and only one community, i.e., partitions are not compatible with overlapping communities [16,17]. At the heart of most partitioning methods there is a mathematical definition of what is thought to be a good partition. Once this quality function has been defined, different types of heuristics can be used to find, approximately, its optimal partition, i.e., the partition having the highest value of the quality function.

3 The algorithm

For a network G = (V, E) with n nodes, we develop a distributed algorithm that categorizes each node as a “leader” or a “regular” node using local information. Further, the method assigns each regular node a membership vector in a multi-scale way, indicating that several leaders can have influence on it. This enables the nodes to reflect some global picture of the network. The iteration includes three steps, described as follows.

3.1 Leadership of nodes

First, we calculate the leadership f(i) of every node i in the network. The leadership f(i) represents how important the opinion of node i is in the network. Let the node leadership function be defined as:

$$f(i) = \sum_{j=1;\; d_{ij} \le \lfloor 3\delta/\sqrt{2} \rfloor}^{n} e^{-d_{ij}/\delta} \qquad (1)$$

where $d_{ij}$ is the shortest distance from vertex i to vertex j, and δ ∈ (0, +∞) is the influence factor, which controls the range of mutual action between nodes. According to the properties of the exponential function $e^{-d_{ij}/\delta}$, for a given value of δ the influence range of every node on other nodes is approximately $\lfloor 3\delta/\sqrt{2} \rfloor$. When $d_{ij}$ is larger than $\lfloor 3\delta/\sqrt{2} \rfloor$, the value of the exponential function rapidly decreases to 0, so we can use δ to control the influence range of a node and calculate f(i) only within the range $d_{ij} \le \lfloor 3\delta/\sqrt{2} \rfloor$. In the dense regions of a network, nodes have higher leadership. The nodes with the largest leadership have the largest number of links with other nodes and can be viewed as candidate leader nodes. Therefore, we can use node leadership to represent the importance of a node in the network.
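As an illustration of Eq. (1), the following minimal Python sketch computes f(i) with a breadth-first search truncated at the influence range. The function name `leadership`, the adjacency-dictionary input and the inclusion of the self term (j = i, at distance 0, contributing 1) are assumptions made for this sketch; the graph is taken to be unweighted and undirected.

```python
import math
from collections import deque


def leadership(adj, delta):
    """Return a dict node -> f(node) following Eq. (1).

    adj   : dict mapping node -> list of neighbours (undirected, unweighted)
    delta : influence factor, delta > 0
    """
    # Influence range: only nodes within this many hops contribute.
    cutoff = math.floor(3 * delta / math.sqrt(2))
    f = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        total = 0.0
        while queue:
            u = queue.popleft()
            # Each reached node j adds exp(-d_ij / delta); the source adds 1.
            total += math.exp(-dist[u] / delta)
            if dist[u] == cutoff:
                continue  # do not expand beyond the influence range
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        f[source] = total
    return f
```

For a three-node path 0-1-2 with δ = 1 the cutoff is 2, so the middle node receives 1 + 2e^{-1} and each end node receives 1 + e^{-1} + e^{-2}. Truncating the search keeps the work per node proportional to the size of its neighbourhood within the influence range.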

3.2 Identifying the leader nodes

Identifying the leader nodes of the communities is very important for analyzing the properties of complex networks. There are many ways to define a “key node”, such as the nodes with the largest degree or betweenness centrality. Here, we use node leadership to search for leader nodes. According to the notion of community structure, the density of links inside a community is larger than that of links to the rest of the network. Each community represents a local region with relatively high correlation, and the leader node of the community has the highest leadership and is tightly linked to the other nodes. Moreover, different communities are separated by the nodes with locally lowest leadership, the boundary nodes.

Note that in the rare cases where two or more leaders are also the most influential neighbors of each other, they are grouped together and become the leaders of one group. For example, in a fully connected network all of the nodes are leaders of one community, whereas in a ring network each node is the leader of its own community. Specifically, if the shortest distance between two highest-leadership nodes is less than $\lfloor 3\delta/\sqrt{2} \rfloor$, we group them together and consider them to be in the same group. Finding leader nodes only needs a simple breadth-first search; once a leader is found, we choose a random node and restart this process until convergence. The computational complexity is O(m), where m is the number of edges in the network.
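A minimal sketch of the candidate-selection part of this step, assuming the leadership values f(i) are already available. Here a node is treated as a candidate leader when no neighbour has strictly larger leadership. This local-maximum test, the function name `candidate_leaders` and the handling of ties are assumptions of the sketch, not the authors' exact procedure, and the grouping of nearby leaders described above is not implemented.

```python
def candidate_leaders(adj, f):
    """Return the nodes whose leadership is not exceeded by any neighbour.

    adj : dict mapping node -> list of neighbours (undirected graph)
    f   : dict mapping node -> leadership value
    """
    leaders = []
    for node, neighbours in adj.items():
        # Ties count as local maxima, so in a ring with equal leadership
        # every node is a candidate, in line with the ring example.
        if all(f[node] >= f[nb] for nb in neighbours):
            leaders.append(node)
    return leaders
```

Each edge is inspected at most twice, which is in line with the O(m) cost quoted for this step.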

3.3 Determining the membership using random walk

At this step, our goal is to devise a scheme that provides each node with a small vector containing compact global information on how the node is located with respect to the other nodes. We provide a definition of the membership vector based on the properties of random walk dynamics on graphs. Consider a graph with a leaders $l_1, l_2, \ldots, l_a$ and n − a regular nodes. Given the leaders and the arbitrary order assigned to them, we describe the algorithm that determines the membership vector of each regular node. We denote the membership vector of node i by $x_i = (x_i^1, x_i^2, \ldots, x_i^a) \in \mathbb{R}^a$. By $x_i^k(t)$ we mean the kth entry of the membership vector of node i evaluated at time t.

The procedure operates as follows. The membership vector of leader $l_k$ is first assigned to be the kth unit vector. These a vectors do not vary. For a regular node i, $x_i^k$ is initialized randomly, distributed uniformly on [0, 1] (k = 1, 2, ..., a). Then we normalize each $x_i$ so that the sum of $x_i^k$ over all leaders k is 1. At each iteration time t, the membership vector of each regular node i is updated entry-wise (k = 1, 2, ..., a) using the following rule:

$$x_i^k(t+1) = \frac{1}{\sum_j a_{ij} + 1} \left[ x_i^k(t) + \sum_j a_{ij}\, x_j^k(t) \right] \qquad (2)$$

where $A = \{a_{ij}\}$ is the adjacency matrix, in which $a_{ij} = 1$ if nodes i and j are connected and $a_{ij} = 0$ otherwise.

We notice that $\sum_k x_i^k(t) = 1$ for all times t. Equation (2) is equivalent to $X(t+1) = PX(t) = (I+D)^{-1}(A+I)X(t)$, where D is the diagonal matrix of node degrees $\sum_j a_{ij}$ and $P = (I+D)^{-1}(A+I)$ is a stochastic walk matrix. Actually, the influence of leader node $l_k$ (k = 1, 2, ..., a) on any regular node i, $x_i^k$, is the probability that a random walker starting from i hits $l_k$ before it hits any other leader node [18]. If the underlying graph is connected, the iteration converges as t → ∞ to a unique set of vectors $x_i$, and these vectors can naturally be interpreted as the probabilities that a regular node belongs to the community of a given leader node. As a result, although the leadership of a node only contains local information, we can use random walk dynamics to obtain memberships that contain a global view of the whole graph.
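The update rule of Eq. (2) can be sketched in a few lines of Python. The function name `membership_vectors`, the fixed iteration count and the `seed` parameter are assumptions of this sketch; the text itself only specifies the random initialization, the normalization and the update rule, with leader vectors held fixed.

```python
import random


def membership_vectors(adj, leaders, iterations=200, seed=0):
    """Iterate Eq. (2) and return a dict node -> membership vector.

    adj     : dict mapping node -> list of neighbours (undirected graph)
    leaders : list of leader nodes; its order fixes the vector entries
    """
    a = len(leaders)
    leader_index = {leader: k for k, leader in enumerate(leaders)}
    rng = random.Random(seed)
    x = {}
    for node in adj:
        if node in leader_index:
            # Leader l_k keeps the k-th unit vector throughout.
            x[node] = [1.0 if k == leader_index[node] else 0.0 for k in range(a)]
        else:
            # Regular nodes start from a random vector normalized to sum 1.
            raw = [rng.random() for _ in range(a)]
            total = sum(raw)
            x[node] = [value / total for value in raw]
    for _ in range(iterations):
        updated = {}
        for i, neighbours in adj.items():
            if i in leader_index:
                updated[i] = x[i]
                continue
            degree = len(neighbours)
            # Average of the node's own vector and its neighbours' vectors.
            updated[i] = [
                (x[i][k] + sum(x[j][k] for j in neighbours)) / (degree + 1)
                for k in range(a)
            ]
        x = updated
    return x
```

On a five-node path 0-1-2-3-4 with leaders 0 and 4, the iteration settles at memberships 0.75, 0.5 and 0.25 in the first community for nodes 1, 2 and 3, which matches the hitting probabilities of a random walk on the path. A convergence test on successive iterates could replace the fixed iteration count.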

4 Some descriptions of the algorithm

In this section we describe several important properties of the algorithm, including computing the influence factor δ to recognize multi-scale communities, identifying the leaders using local information, determining the overlapping nodes and estimating the complexity of the algorithm.

4.1 Determining the influence factor δ to recognize multi-scale communities

According to the definition of leadership, the algorithm is controlled by only one parameter, the influence factor δ. We can naturally use δ to control the scale of the community structure detected by our method. Here, we introduce the topological entropy H [19,20], which represents the statistical significance of a network, to choose a suitable δ: for a network G = (V, E) with $V = \{v_1, v_2, \ldots, v_n\}$ whose nodes have leaderships f(1), f(2), ..., f(n), the topological entropy is defined as:

$$H = -\sum_{i=1}^{n} \frac{f(i)}{\sum_{j=1}^{n} f(j)} \log\left[ \frac{f(i)}{\sum_{j=1}^{n} f(j)} \right]. \qquad (3)$$

A small H means a stable and suitable partition. As a simple example, we consider the network with 11 nodes shown in Figure 2a and calculate the topological entropy for different values of δ. As shown in Figure 2b, when δ increases from 0, the corresponding entropy begins to decrease and reaches its minimum, 2.2805, at a specific value of δ (δ = 1.26). When δ moves away from this optimal value, the entropy increases with δ and finally reaches its maximal value.

Fig. 2. (Color online) (a) A simple network with eleven nodes. (b) Plot of topological entropy H versus influence factor δ.
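The entropy of Eq. (3) and a search for a small H(δ) can be sketched as follows. The logarithm base is not stated in the text, so base 2 is an assumption here, as are the function names and the `leadership_fn` callable (any function returning the leadership values for a given δ). The plain scan over candidate values is a stand-in chosen for brevity, not one of the search algorithms named in the paper.

```python
import math


def topological_entropy(f_values, base=2.0):
    """Entropy of the normalized leadership values, following Eq. (3)."""
    total = sum(f_values)
    h = 0.0
    for f in f_values:
        p = f / total
        if p > 0:
            h -= p * math.log(p, base)
    return h


def scan_delta(leadership_fn, deltas, base=2.0):
    """Return (delta, H) with the smallest entropy among the candidates.

    leadership_fn : hypothetical callable, delta -> iterable of f(i) values
    deltas        : iterable of candidate influence factors
    """
    scored = [
        (topological_entropy(list(leadership_fn(d)), base), d) for d in deltas
    ]
    h, d = min(scored)
    return d, h
```

Equal leadership values give the largest possible entropy, the logarithm of n, which matches the behaviour described above for very small and very large δ.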

Therefore, finding an optimal δ is equivalent to minimizing the single-parameter nonlinear function H(δ), for which many algorithms can be used, for example the random search algorithm and the simulated annealing algorithm. However, a δ corresponding to a small but not minimal value of H is also meaningful. Specifically, according to the property of leadership, the influence range of a node is approximately $\lfloor 3\delta/\sqrt{2} \rfloor$. When $0 < \delta < \sqrt{2}/3$, there is no interaction between any two nodes. Because no interaction exists, every node belongs to the community that contains only itself, and the number of communities is n. Similarly, when $\sqrt{2}/3 < \delta < 2\sqrt{2}/3$, a node only interacts with its neighborhood. As the value of δ grows, nodes can influence more and more nodes and thus the number of leaders and communities decreases. Finally, when $\delta \ge \sqrt{2}D/3$, where D is the diameter of the network, every pair of nodes can influence each other no matter how far apart they are.

Fig. 3. (Color online) RB125, the hierarchical scale-free network. (a) and (b), corresponding to 25 and 5 modules, are the most reasonable partitions in terms of resolution, with H = 3.107 and 3.352, respectively. (c) Plot of the number of communities versus the value of δ.
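The thresholds on δ follow directly from the cutoff used in Eq. (1). As a short worked step, writing r(δ) for the influence range (a symbol introduced here only for illustration) and using the fact that the floor of a number is at least an integer k exactly when the number itself is at least k:

```latex
r(\delta) = \left\lfloor \frac{3\delta}{\sqrt{2}} \right\rfloor,
\qquad
r(\delta) \ge k
\iff \frac{3\delta}{\sqrt{2}} \ge k
\iff \delta \ge \frac{k\sqrt{2}}{3},
\qquad k = 1, 2, \ldots
```

With k = 1 this gives δ ≥ √2/3 ≈ 0.471 for interaction with direct neighbors, with k = 2 it gives δ ≥ 2√2/3 ≈ 0.943, and reaching every node of a network with diameter D requires k = D.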

To show that our method can discover multi-scale community structure as δ varies, we have tested it on the multi-scale modular structure of a classical hierarchical scale-free network with 125 nodes, RB125, proposed by Ravasz and Barabási [21]. In Figures 3a and 3b we plot the modular structure found with the minimal entropy H = 3.107 and with another small value, H = 3.352, which show two different scales that deserve discussion. The value of H versus different δ is plotted in Figure 3c. We clearly observe persistent structures of 25 and 5 communities, respectively, which account for the most significant subdivisions in the process and show two hierarchical levels of the structure. The partitions into 25 modules and into 5 modules are highlighted on the original network.

4.2 Determining the leaders using local information

In the algorithm, the leadership of a node can be determined using only local information. By detecting the leader of a community we gain very useful information about the most influential node in that community. If the leader is removed, the community can be expected to suffer serious consequences, such as splitting into several smaller communities. The leader’s hierarchy, or the leader’s community, is the area where the leader’s opinion is the most influential opinion. For example, this can be used for immunization against epidemic spreading. Thus, the algorithm can naturally determine the number of leaders, which is also the number of communities. One interesting feature of the algorithm is that although it automatically detects the best leaders, one can manually specify particular nodes as leaders and build community structures around them.

4.3 Determining the overlapping nodes

It is worthwhile to point out that the vast majority of community detection methods assume that the communities of complex networks are disjoint, placing each node in only one non-overlapping cluster. Generally, we call these methods “hard-partition” algorithms. However, in many real networks communities often overlap to some extent. An important property of our algorithm is the computation of a membership vector for each node. Instead of having one number denoting its membership in a single community, each node has a percentage for each community. As a result, we can easily identify nodes that naturally belong to more than one community, known as overlapping nodes [17,22,23]. So our method is a “soft-partition” algorithm. Additionally, we can find nodes that are good followers of their leader, and also nodes that have no distinguished leader and serve as proxies between several communities.
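One way to read overlapping nodes off the membership vectors is sketched below. The text does not state a numerical criterion, so the rule used here (the two largest membership entries differ by less than a tolerance) and the names `overlapping_nodes` and `tol` are assumptions made purely for illustration.

```python
def overlapping_nodes(membership, tol=0.1):
    """Return the nodes whose two largest memberships are nearly equal.

    membership : dict mapping node -> list of membership values (sum to 1)
    tol        : hypothetical tolerance on the gap between the top two values
    """
    result = []
    for node, vector in membership.items():
        if len(vector) < 2:
            continue  # a single community cannot overlap with another
        top = sorted(vector, reverse=True)[:2]
        if top[0] - top[1] < tol:
            result.append(node)
    return result
```

With this rule a node with memberships (0.52, 0.48) is reported as overlapping, while a node with (0.9, 0.1) is a clear follower of its leader. A smaller tolerance reports fewer overlapping nodes.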

4.4 Computational complexity

The overall complexity of the algorithm is determined by the highest complexity among its three parts. In the following we analyze each of them in turn.

The first step is calculating each node’s leadership f(i). We need to calculate the exponential function for pairs of nodes whose shortest path length satisfies $d_{ij} \le \lfloor 3\delta/\sqrt{2} \rfloor$, and the complexity of this procedure is at least O(m), where m is the number of links. In the worst case, for a dense graph, the computational complexity is $O(n^2)$. The next step, determining the leader nodes of the communities, proceeds by searching for all nodes with locally highest leadership. This can be done by a simple breadth-first search, and the complexity is O(m). The last operation is very similar to the linear consensus process, whose complexity is O(n), similar to the random walk process.

To conclude this section, the step with the highest computational complexity is the first one, i.e., calculating the leadership of nodes. Its complexity depends on the degree of connectivity: a very densely connected graph requires more computation. Consequently, the overall complexity of the algorithm is O(m) at best and $O(n^2)$ at worst.

5 Experiments

In this section, we apply the algorithm to simulated benchmark networks (LFR networks) [24] and to some real social networks: the karate club network of Zachary [25], the scientific collaboration network [26] and finally a large-scale semantic network [27]. The results show that the algorithm can discover multi-scale communities efficiently and accurately.

5.1 The benchmark network

We empirically demonstrate the effectiveness of the algorithm through comparison with five other well-known algorithms on artificial benchmark networks. These algorithms are: Newman’s fast algorithm [6], the method of Danon et al. [9], the Louvain method [28], Infomap [29] and the clique percolation method [16]. We utilize the LFR benchmark proposed by Lancichinetti and Fortunato in [24]. This benchmark provides networks with scale-free distributions of node degrees and community sizes, and thus it is a much more severe test of community detection algorithms than standard benchmarks. Several parameters control the generated networks in this benchmark: the number of nodes N, the average node degree ⟨k⟩, the maximum node degree maxk, the mixing ratio μ (each vertex shares a fraction μ of its edges with vertices in other communities), the minimum community size minc and the maximum community size maxc. The value of μ varies within [0, 1] and determines the level of fuzziness of the communities in the network. The larger μ is, the fuzzier the communities. In our test, we use the default parameter configuration, where N = 1000, ⟨k⟩ = 15, maxk = 50, minc = 20 and maxc = 50.

To evaluate a community detection algorithm, the normalized mutual information (NMI) measure [30] is calculated for the partition found by each algorithm. The test focuses on whether the intrinsic scale can be correctly uncovered. The experimental results are displayed in Figure 4, where the y-axis represents the value of NMI, and each point on the curves is obtained by averaging the values obtained on 50 synthetic networks sampled from the above model. As we can see, all algorithms work very well, with NMI larger than 0.85, when μ is less than 0.3. Compared with the other five algorithms, our algorithm performs quite well, and its accuracy is only slightly worse than that of clique percolation in the range 0.35 ≤ μ ≤ 0.5. However, clique percolation is nearly the same as Breadth First Search (BFS) and is very time consuming. The complexity of clique percolation is almost $O(n^3)$, much larger than that of our method.

Fig. 4. (Color online) Comparison of the NMI of six algorithms. [Plot: NMI (y-axis, 0.2 to 1) versus mixing ratio μ (x-axis, 0 to 0.6); curves: Fast Newman, Danon, Louvain, Infomap, This Method, Clique Percolation.]
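For readers who want to reproduce this kind of comparison, the sketch below computes a normalized mutual information between two hard partitions given as label lists. It uses one common normalization, twice the mutual information divided by the sum of the two entropies; whether this is exactly the variant of [30] is an assumption, and the function name `nmi` is chosen here for illustration. Overlapping covers, such as those produced by clique percolation, need a different variant that is not shown.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(A;B) / (H(A) + H(B)).

    labels_a, labels_b : sequences of community labels, one per node,
                         listed in the same node order
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the confusion counts of the two partitions.
    mutual = 0.0
    for (a, b), n_ab in joint.items():
        mutual += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a + h_b == 0.0:
        return 1.0  # both partitions put every node in a single community
    return 2.0 * mutual / (h_a + h_b)
```

The measure is invariant to relabeling: two partitions that group the nodes identically score 1 even if the community labels differ, and two statistically independent partitions score 0.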

As real networks may have topological properties that differ from those of synthetic ones, in the following we consider several widely used real-world networks to further evaluate the performance of our method.

5.2 The karate club network of Zachary

Over the course of two years in the early 1970s, Wayne Zachary observed social interactions between the members of a karate club at an American university [25]. He constructed networks of ties between members of the club based on their social interactions both within the club and away from it. By chance, a dispute arose during the course of his study between the club’s administrator and its principal karate teacher over whether to raise club fees, and as a result the club eventually split in two, forming two smaller clubs centered around the administrator and the teacher.

We minimize the function H(δ) and obtain the optimal value δ = 1.85 with H = 3.914. As shown in Figure 5a, the partition found by our algorithm not only matches the original partition but also identifies the exact leaders. Nodes 1 and 33 have the locally highest leadership and represent the administrator and the teacher, respectively. In this instance node 3 is detected as an overlapping node because its memberships in the two communities are nearly equal. Indeed, node 3 is on the border between the communities, so it is understandable that it is an ambiguous case.

Compared with the optimal situation, when δ is decreased to 1.41 the entropy H = 4.139 is also very small. The community structure detected in this situation is shown in Figure 5b, which reveals another scale of relationships among the members of the karate club. Node 28 becomes another node with locally highest leadership, and the four most unstable nodes, nodes 3, 10, 20 and 28, are marked by a dashed curve. Such members have good friendships with more than one club at the same time, so they are overlapping nodes in this situation. The number of communities detected in the karate network is now three. Furthermore, when δ is decreased to 0.93, H becomes 4.436 and we get a partition with 4 communities, shown in Figure 5c. This partition is identical to the one described by Newman [31]. Six overlapping nodes are detected, which constitute the fuzzy boundaries of the communities. Thus, partitions using different scales of δ are able to reflect the multi-scale property of real networks.

Fig. 5. (Color online) The community structure of the karate club network detected when δ (a) equals the optimal value 1.85, (b) is decreased from the optimum to 1.41, and (c) is further decreased to 0.933. In subgraphs (a)–(c), communities are represented by different shapes and overlapping nodes are enclosed in dashed curves.

Fig. 6. (Color online) The community structure of the scientific collaboration network obtained when δ (a) equals the optimal value 1.493 and (b) is increased from the optimum to 1.749. In both subgraphs (a) and (b), overlapping nodes are enclosed in dashed curves.

5.3 The scientific collaboration network

The scientific collaboration network was collected by Girvan and Newman [26] and has been examined in references [17,32]. This network consists of 118 nodes (scientists or authors), and edges between them indicate co-authorship of one or more papers appearing in the archive. The collaborative ties represented in the figure are not limited to papers on topics concerning networks: the primary interest is in whether people know one another, and collaboration on any topic is a reasonable indicator of acquaintance.

The present method detects eight communities with optimal δ = 1.493 and minimal entropy H = 5.447. Figure 6a shows the community structure detected at the optimal scale, which is exactly the same as that in references [17,26]. This confirms that our partition is a good one. However, we believe our method can also produce a meaningful “coarse-grained” partition that is visually reasonable. The value of δ is therefore increased from the optimum to 1.749, with corresponding entropy H = 6.483. Because increasing δ amplifies the influence range of the nodes, the number of communities decreases. In Figure 6b, some “uninfluential” communities, such as the light blue and yellow ones, are absorbed by the more powerful red and dark green communities, respectively. We finally get six communities, which can be interpreted readily by the human eye. These multi-scale partitions help in understanding the large-scale structure of such network data. Furthermore, the overlapping nodes enclosed in dashed curves in Figure 6 are detected according to their membership vectors. These nodes are generally located on the borders of two or more communities and represent authors with multiple research interests or a cross-disciplinary background. Such nodes may play a similar role in bridging two or more communities in complex networks of other types. The ability to find overlapping nodes is a distinguishing feature of our method and is useful for revealing a natural characteristic of many social networks.
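The idea of reading overlapping nodes off the membership vectors can be illustrated with a minimal Python sketch. The rule used here is an assumption made for illustration only: a node is flagged when two or more of its memberships lie within a tolerance of its largest membership. The function name `overlapping_nodes`, the parameter `tol` and the toy membership values are hypothetical and are not taken from the algorithm described in this paper.

```python
from typing import Dict, List, Sequence


def overlapping_nodes(
    memberships: Dict[str, Sequence[float]],
    tol: float = 0.1,
) -> Dict[str, List[int]]:
    """Return the nodes whose strongest memberships are nearly equal.

    `memberships` maps a node to its membership vector (one entry per
    community). `tol` is a hypothetical threshold: a community counts as
    "nearly as strong" when its weight is within `tol` of the node's maximum.
    """
    result: Dict[str, List[int]] = {}
    for node, vec in memberships.items():
        top = max(vec)
        # Indices of all communities whose weight is close to the maximum.
        close = [k for k, w in enumerate(vec) if top - w <= tol]
        if len(close) > 1:
            result[node] = close
    return result


# Toy membership vectors (made up for illustration).
example = {
    "a": [0.9, 0.1],
    "b": [0.52, 0.48],
    "c": [0.2, 0.8],
}
print(overlapping_nodes(example))  # {'b': [0, 1]}
```

With a smaller `tol` fewer nodes are reported as overlapping; the appropriate threshold depends on how the membership vectors are normalized.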

5.4 A large scale semantic network

The semantic network from reference [27] contains 7207 phrases and 31 784 edges. The weights of the edges are calculated from phrase co-occurrences. For visualization purposes, our algorithm outputs a transformed adjacency matrix (in which the vertices within the same community have been arranged together) with a hierarchical community structure. The output matrix is shown in Figure 7a. The cumulative distribution of community sizes in a power plot is shown in Figure 7b. In total, 569 communities are detected at the optimal δ = 2.931 with minimal entropy H = 5.952. The maximum community size is 139, the minimum size is 2, and the average size is 12.57. One can see an approximate power-law behavior, that is, most communities are small and only a few are large. Among them, we have selected four interesting communities, listed as follows:

Community 1 = {Scientist, Inventor, Genius, Gifted, Brilliant, Intelligent, Smart, Science, Intelligence, Musician};

Community 2 = {Violin, Instrument, Cello, Band, Tuba, Clarinet, Orchestra, Trumpet, Trombone, Oboe, Woodwind, Symphony, Flute, Bass, Viola, Fiddle};

Community 3 = {Ovation, Sitting, Low, Descent, Up, Step, Ascend, Elevator, Ascent, Staircase, Stairwell, Climb, Steps, Ladder, Stairs, Wake, Stairway, Rise, Escalator, Stair, Down, Standing, Resting, Using};

Community 4 = {Nails, Hammer, Carpenter, Screw, Screwdriver, Tool, Pliers, Wrench, Sickle, Mechanic, Phillips}.

These four communities are all reasonable modules listed in reference [27], and the elements of each community have closely related meanings. Among these elements, {Musician, Intelligence} are uncovered as overlapping nodes between communities 1 and 2, and {Using, Tool, Mechanic} are the overlapping nodes between communities 3 and 4. We can easily recognize that these overlapping phrases have fuzzy meanings and high phrase co-occurrence values.

Fig. 7. (a) The transformed adjacency matrix output by our algorithm. (b) The cumulative distribution of community sizes in a power plot.
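A cumulative distribution of community sizes, such as the one reported for this network, can be computed from a list of sizes as in the following Python sketch. It assumes the convention P(size ≥ s), which is not stated explicitly in the text; the function name and the toy sizes are hypothetical and do not reproduce the data of the semantic network.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cumulative_size_distribution(sizes: Iterable[int]) -> List[Tuple[int, float]]:
    """Return (s, P(size >= s)) for each distinct community size s, ascending."""
    counts = Counter(sizes)
    total = sum(counts.values())
    dist: List[Tuple[int, float]] = []
    remaining = total  # number of communities with size >= current s
    for s in sorted(counts):
        dist.append((s, remaining / total))
        remaining -= counts[s]
    return dist


# Toy community sizes (made up for illustration).
sizes = [2, 2, 2, 3, 3, 5, 12]
for s, p in cumulative_size_distribution(sizes):
    print(s, round(p, 3))
```

Plotting these pairs on doubly logarithmic axes gives an approximately straight line when the sizes follow a power law.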

As the inherent community structure of this large semantic network is unknown, it is worthwhile to use a measure to evaluate the performance of our method quantitatively. Here the popular modularity Q [7,8] is adopted as a reference; it was proposed by Newman and Girvan and has been heavily used for community detection in recent years. Q is defined as:

$$
Q = \sum_{i=1}^{c} \left[ \frac{l_i^{\mathrm{in}}}{L} - \left( \frac{d_i}{2L} \right)^{2} \right], \tag{4}
$$

Here, c is the number of communities, L is the total number of edges in the network, and $l_i^{\mathrm{in}}$ and $d_i = 2 l_i^{\mathrm{in}} + l_i^{\mathrm{out}}$ are the number of edges within the ith community and the sum of the vertex degrees in the ith community, respectively, with $l_i^{\mathrm{out}}$ counting the edges that connect the ith community to the rest of the network. Figure 8 compares the modularity Q with the topological entropy H across multiple scales of δ. The main trend is that the lower the value of H, the larger the value of Q. When δ reaches its optimal value, where H = 5.952, the modularity Q also reaches exactly its maximum, Q = 0.521. The result shows that the community structure of the network corresponding to this δ is strong and robust. In conclusion, our algorithm can effectively uncover the most suitable community scale in real-world networks.

Fig. 8. The comparison of modularity Q with topological entropy H across multi-scale δ.
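Equation (4) can be evaluated directly from an edge list and a community assignment. The Python sketch below is a minimal illustration under stated assumptions: the graph is unweighted and undirected, and every node belongs to exactly one community, so overlapping nodes would first have to be assigned to a single community, for example the one with the largest membership. The function name `modularity` and the toy graph are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple


def modularity(
    edges: Iterable[Tuple[Hashable, Hashable]],
    community: Dict[Hashable, int],
) -> float:
    """Modularity Q of equation (4) for an unweighted, undirected graph.

    `community` maps each node to a single community label
    (assumption: a non-overlapping assignment).
    """
    edge_list = list(edges)
    L = len(edge_list)                 # total number of edges
    l_in: Dict[int, int] = {}          # edges inside each community
    d: Dict[int, int] = {}             # sum of vertex degrees per community
    for u, v in edge_list:
        cu, cv = community[u], community[v]
        d[cu] = d.get(cu, 0) + 1       # each endpoint adds one to its
        d[cv] = d.get(cv, 0) + 1       # community's degree sum
        if cu == cv:
            l_in[cu] = l_in.get(cu, 0) + 1
    return sum(l_in.get(c, 0) / L - (d[c] / (2 * L)) ** 2 for c in d)


# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(round(modularity(edges, community), 4))  # 0.3571
```

For the toy graph, L = 7 and each triangle has three internal edges and a degree sum of 7, so Q = 2(3/7 − 1/4) = 5/14 ≈ 0.3571. Extending the sketch to the weighted semantic network would require replacing the edge counts with sums of edge weights.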

6 Conclusion

In summary, we have presented a novel community detection method based on local information in social networks. The algorithm does not embrace the universal approach but instead focuses on local social ties and models the multiple scales of social interactions that occur in those networks. It identifies leaders and then detects the communities located around them using random walk dynamics. Our method not only supports the detection of overlapping communities, using a membership vector to denote each node's involvement in each community, but can also describe clusters at different resolutions, allowing “coarse-grained” modules to be discovered alongside the optimal partition. Applying our algorithm to several typical real-world networks with well-defined community structures, we obtained reasonable results, so the method is feasible for the accurate detection of community structures in complex networks. The computational results on real social networks show that the new method can not only detect accurate communities but also extract the hierarchical structures of the networks.

We are grateful to the anonymous reviewers for their valuable suggestions, which were very helpful for improving the manuscript. The authors are separately supported by NSFC Grants 11131009, 60970091, 61171007, 91029301, 61072149, 31100949, 61134013 and Grants kjcx-yw-s7 and KSCX2-EW-R-01 from CAS. This research is also partially supported by the Shanghai Pujiang Program and the Aihara Project of the FIRST program from JSPS initiated by CSTP.

References

1. A.L. Barabasi, R. Albert, Science 286, 509 (1999)
2. R. Albert, A.L. Barabasi, H. Jeong, Nature 401, 130 (1999)
3. X.G. Li, Z.Y. Gao, K.P. Li, X.M. Zhao, Phys. Rev. E 76, 016110 (2007)
4. F. Liljeros, C.R. Edling, L.A.N. Amaral, H.E. Stanley, Y. Aberg, Nature 411, 907 (2001)
5. A. Sumiyoshi, S. Norikazu, Phys. Rev. E 74, 026113 (2006)
6. M.E.J. Newman, Phys. Rev. E 69, 066133 (2004)
7. M.E.J. Newman, M. Girvan, Phys. Rev. E 69, 026113 (2004)
8. M.E.J. Newman, Proc. Natl. Acad. Sci. 103, 8577 (2006)
9. L. Danon, J. Duch, D. Guilera, A. Arenas, J. Stat. Mech. 29, P09008 (2005)
10. X.S. Zhang, R.S. Wang, Y. Wang, J. Wang, Y. Qiu, L. Wang, L. Chen, Europhys. Lett. 87, 38002 (2009)
11. A. Clauset, M.E.J. Newman, C. Moore, Phys. Rev. E 70, 066111 (2004)
12. M.E.J. Newman, Phys. Rev. E 74, 036104 (2006)
13. Z.P. Li, S.H. Zhang, R.S. Wang, X.S. Zhang, L. Chen, Phys. Rev. E 77, 036109 (2007)
14. T. Evans, R. Lambiotte, Eur. Phys. J. B 77, 265 (2010)
15. P.J. Mucha, T. Richardson, K. Macon, M.A. Porter, J.P. Onnela, Science 328, 876 (2010)
16. G. Palla, I. Derenyi, I. Farkas, T. Vicsek, Nature 435, 814 (2005)
17. H.J. Li, Y. Wang, L.Y. Wu, Z.P. Liu, L. Chen, X.S. Zhang, Europhys. Lett. 97, 48005 (2012)
18. J.S. Baras, P. Hovareshti, Proceedings of 47th IEEE Conference on Decision and Control (2008), pp. 2973–2978
19. D. Gfeller, J.C. Chappelier, P. De Los Rios, Phys. Rev. E 72, 056135 (2005)
20. G. Bianconi, P. Pin, M. Marsili, Proc. Natl. Acad. Sci. 106, 11433 (2009)
21. E. Ravasz, A.L. Barabasi, Phys. Rev. E 67, 026112 (2003)
22. D.B. Chen, M.S. Shang, Y. Fu, Physica A 389, 4177 (2010)
23. M.S. Shang, D.B. Chen, T. Zhou, Chin. Phys. Lett. 27, 058901 (2010)
24. A. Lancichinetti, S. Fortunato, Phys. Rev. E 80, 016118 (2009)
25. W.W. Zachary, J. Anthropol. Res. 33, 452 (1977)
26. M. Girvan, M.E.J. Newman, Proc. Natl. Acad. Sci. 99, 7821 (2002)
27. G. Palla, A.L. Barabasi, T. Vicsek, Nature 446, 664 (2007)
28. V.D. Blondel, J.L. Guillaume, R. Lambiotte, E. Lefebvre, J. Stat. Mech. 10, P10008 (2005)
29. M. Rosvall, C.T. Bergstrom, Proc. Natl. Acad. Sci. 105, 1118 (2008)
30. A. Lancichinetti, S. Fortunato, Phys. Rev. E 80, 056117 (2009)
31. M.E.J. Newman, Eur. Phys. J. B 38, 321 (2004)
32. J. Zhang, S. Zhang, X.S. Zhang, Physica A 387, 1675 (2008)