Fast Hierarchical Graph Clustering
Miguel Martins Duarte
Thesis to obtain the Master of Science Degree in
Informatics Engineering
Supervisors: Prof. Alexandre Paulo Lourenço Francisco, Prof. Pedro Manuel Pinto Ribeiro
Examination Committee
Chairperson: Prof. José Carlos Alves Pereira Monteiro
Supervisor: Prof. Alexandre Paulo Lourenço Francisco
Member of the Committee: Prof. Francisco João Duarte Cordeiro Correia dos Santos
October 2017
Acknowledgements
Firstly, I would like to thank my thesis advisors, Professor Alexandre Francisco and Professor Pedro
Ribeiro for their help and guidance.
I’m also grateful to my friends and colleagues, namely Pedro Paredes, Andre Fonseca and Samuel
Gomes, who were always available to share their experience and understanding.
Finally, I would like to thank my parents and brother, for their encouragement and support along the
way.
Resumo
Este trabalho estuda como é que métodos de clustering de grafos (ou pesquisa de comunidades em grafos) podem ser usados para resolver problemas em grafos, tais como clustering hierárquico e reordenação. Neste contexto, explora-se como é que o Faststep [1], um método recente para factorização de matrizes e clustering de grafos, pode ser adaptado para resolver estes problemas.
Desenvolveu-se um novo método, baseado no Faststep, e comparou-se o algoritmo proposto com métodos actuais de clustering: Layered Label Propagation e método de Louvain. Os resultados avaliaram-se num conjunto alargado de testes, incluindo comparação com comunidades reais conhecidas, redes geradas por modelos aleatórios e comparação directa de algoritmos. Este estudo tira partido do Webgraph, uma ferramenta para representação sucinta de grafos, e explora como as comunidades podem ser usadas para melhorar as taxas de compressão obtidas. Concluiu-se que o Faststep, apesar de desenhado para resolver um problema diferente, pode ainda assim obter resultados aceitáveis neste. Mostrou-se também que o método de Louvain, um método de clustering hierárquico, pode de facto ser bastante promissor na tarefa de reordenação de grafos.
Palavras-chave: Ciência de Redes, Redes Complexas, Clustering de grafos, Clustering hierárquico de grafos, Compressão de grafos
Abstract
This work studies how graph clustering methods can be used to solve other graph problems, such as hierarchical clustering and graph reordering. In this context, it explores how Faststep [1], a recent method for matrix factorization and graph clustering, can be adapted to solve these other problems.
Our study develops a new method for this task, based on Faststep, and compares the proposed algorithm with state-of-the-art clustering methods: Layered Label Propagation and the Louvain method. The results are evaluated through several tests, including comparison with ground-truth communities, networks generated by random models and direct comparison of algorithms. This study also takes advantage of Webgraph, a framework for succinct representation of graphs, and explores how clustering can be a powerful tool to improve compression rates. We conclude that Faststep, although designed to solve a different problem, can still obtain acceptable results for this one. We also show that the Louvain method, a hierarchical clustering method, obtains very promising results in the graph reordering task.
Keywords: Network Science, Graph Clustering, Hierarchical Graph Clustering, Graph Compression
Contents
Resumo 5
Abstract 7
List of Tables 11
List of Figures 12
1 Introduction 15
1.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 Background 19
2.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Graph representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 FastStep Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 FastStep formal goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 Obtaining the clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Extensions to Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Modularity Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.3 Louvain Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Classical Hierarchical Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.1 Betweenness-based divisive algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Approach 31
3.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Initial validation of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 Comparison Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Succinct representation as clustering metric . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Graph Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Faststep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Louvain Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3 Layered Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Graph Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Recursive Faststep implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Command Line tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Results 41
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Detection of communities in artificial networks . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Detection of communities in networks with ground-truth communities . . . . . . . . . . . 43
4.4 Comparison of clusterings obtained with different methods . . . . . . . . . . . . . . . . . . 47
4.5 Compression Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5 Final Remarks 55
References 57
A Shannon Entropy 61
List of Tables
1 Successor lists for graph 1a, using Webgraph format. . . . . . . . . . . . . . . . . . . . . . 21
2 How to use the fhgc tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3 Datasets used: source and information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Size of datasets after reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Space savings after reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6 Execution time of the algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
List of Figures
1 Examples of graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Adjacency matrix of a community found by FastStep. . . . . . . . . . . . . . . . . . . . . 21
3 Faststep values for a real network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Benchmark results for the NMI metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Benchmark results for the Jaccard metric . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Detection of ground-truth communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7 Detection of ground-truth communities: relation of the community size with detection
quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
8 Clustering comparison using NMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9 Clustering comparison using Jaccard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1 Introduction
1.1 Problem
Graph clustering or community finding is an important task to further understand real networks. Community finding can naturally be seen as finding groups of vertices in a network with denser connections between themselves and sparser connections with the rest of the network. This problem is computationally hard in general, and several greedy methods, with origins in multiple fields, have been proposed to solve it. These algorithms are usually based on general assumptions about community structure and are evaluated using ground-truth clusterings, i.e., graphs where the communities are already known.
For a more formal goal, we can consider quality measures, with the most commonly used being the
modularity [2]. Some algorithms are designed specifically to maximize some metric. For a more complete
discussion on community finding we refer the reader to a review by Fortunato [3].
The complexity of the problem depends on the specific definition of clustering and is tightly related to the quality measure being optimized. In general, this optimization is an NP-hard problem [4]. Moreover, it is usually not possible to obtain an approximate solution within defined quality bounds.
Hierarchical graph clustering is a similar problem, but here we look for a hierarchy of clusters, i.e., how larger communities can be divided into smaller ones, how each of these smaller ones can be divided in turn, and so on, until we reach an indivisible community. Some of the algorithms used for the non-hierarchical problem already follow an agglomerative or divisive approach, which implicitly generates a hierarchy. Other non-hierarchical algorithms can also be used for this task, provided they expose a parameter that defines the granularity of the communities to be detected.
Another important task is clustering refinement: given an initial clustering, nodes are moved between clusters to further improve the communities already obtained.
A major concern of research in this area is time efficiency. Real networks can have several millions of nodes, which demands a linear or near-linear time algorithm. Layered Label Propagation [5] and the Louvain method [6] are two greedy methods that solve the clustering problem in near-linear time and provide solutions of reasonable quality.
A first goal of this work is to study how FastStep [1], a recent method for boolean matrix factorization, can be used for graph clustering. It was designed to factorize a large, sparse matrix into a product of two smaller matrices of fixed width, while keeping the error as low as possible. This matrix factorization can also be interpreted as a clustering method with a fixed number of communities.
In fact, FastStep is a method which is able to find rich community structures and can run in near-linear
time. This method takes a parameter k, which can be used to define the granularity of the communities
intended.
A second goal is to study how to apply it to the hierarchical community finding problem and test
how it can be used for community refinement, based on the ideas used in the Layered Label Propagation
method [5]. This application is also directly related to graph reordering, which will also be addressed in
this context.
A third goal is to integrate multiple community finding algorithms into a common framework for
hierarchical graph clustering.
1.2 Context
Network science is a field with many notable developments in this century, with graphs being essential tools for the study of many complex systems. It has been found that real networks share many structural properties [7]. The distribution of edges per vertex is heterogeneous: the degree distribution usually follows a power-law, with many nodes having few connections and a few nodes having many. Moreover, the edges tend to be organized in communities [8]: a higher density of edges within a group of nodes, with fewer connections to the elements outside the group. This concept of community or cluster is independent of the origin of the graph; it appears to be an essential element of the structure of real networks. Society, for example, has multiple organizations of people: families, work groups, villages or countries. We can also find this structure in biology (food webs, protein-protein interaction networks, metabolic networks), computer science (social networks, customer-product relations), politics, economics, etc.
Developing methods for community finding is crucial for many applications. Besides their use in specific situations, these methods are tools that help further our understanding of real networks and their structure.
Another important task is to find not only the communities of a graph, but also their hierarchy. In many networks, it is possible to detect not only the communities, but also how a community includes smaller ones [8]. We can see, for example, the community of people working and studying at a school and how it relates to the community of people living in that town. It can also be interesting to devise a way of dividing this school group into three communities (teachers, students and staff).
A large number of methods for graph clustering and hierarchical graph clustering have already been introduced; the review by Fortunato surveys most of the classical approaches to the problem [3]. Currently, there is a focus on solving the problem for very large networks, where algorithms that take more than linear time are usually unusable. With big data being a major concern nowadays, real networks with millions of nodes and billions of edges present a difficult challenge that certainly needs to be tackled. Layered Label Propagation [5] and the Louvain method [6] are two efficient algorithms with good, well-known implementations, which work for reasonably large networks. However, they may return suboptimal clusterings, and alternative, slower methods may get better results.
FastStep [1] is a method for finding a factorization of a boolean matrix which offers better interpretability. FastStep is, in particular, able to find structures that go “beyond blocks”, providing more realistic representations. It can be applied directly to adjacency matrices (which are boolean matrices). The ability to find rich community structures demonstrated by its authors, together with a parameter that defines the granularity of the communities we are looking for, makes it a promising algorithm for the search for hierarchical communities.
One important application of clustering is graph compression, or the succinct representation of a graph. As described in [9], it is possible to obtain a smaller representation of a graph after a reordering of its vertices. Information about communities, or, even better, hierarchical communities, can help increase the compression rate. Based on this, it is possible to infer the quality of a clustering algorithm just by analysing how much the graph can be compressed using its output. We will also follow this approach in the evaluation of our results.
1.3 Document Outline
Section 1 describes the problem we are proposing to solve and gives some context about the most relevant
work done in this field. Section 2 starts with basic notation when working with graphs and describes
clustering algorithms. Includes classical methods and efficient algorithms, which are the main focus of
this work. Section 3 explains the solution chosen for the problem and the reasons behind it. It also
presents how the solution is evaluated. Section 4 presents the results obtained for the experimental tests,
defined in the previous section, and makes a detailed analysis of them. Section 5 concludes the document,
with general observations on the results obtained and on possible future work.
2 Background
2.1 Graphs
A graph is a pair (V,E) formed by a set of vertices V and a set of edges E. In this work, we define
n = |V | and m = |E|. Each edge is a pair (i, j) with i, j ∈ V . In an undirected graph, an edge (i, j) is
the same as the edge (j, i). In a directed graph, it means that there is a connection only from i to j. A
multigraph is a graph that can have multiple connections between i and j.
A community or cluster can be defined, in a simplified way, as a set of vertices of the graph with denser connections between each other and fewer connections to the other vertices of the graph. For a deeper review on graph clustering, see [3]. Communities are associated with the natural notion of a group, which can have a complex and intricate structure. A more precise definition of a cluster is difficult to give, because it depends on understanding the structure of real networks, a problem that is far from solved. Usually, a community is formalized by defining a specific quality measure for graph clustering, which is then optimized by some method. Also, a wide variety of problems that involve finding structures in a graph are NP-hard, and graph clustering is one of them: even with an exact quality measure to optimize, it cannot be solved efficiently by current computers and must be tackled heuristically or greedily.
(a) An undirected graph. (b) A directed graph.
Figure 1: Examples of graphs.
2.1.1 Graph representation
Usually, graphs are stored using adjacency matrices or adjacency lists. Some alternative formats can also
be used when space efficiency is a concern.
Adjacency matrices are matrices of size |V| × |V|. In an adjacency matrix M, Mij represents the connection between node i and node j. If Mij = 0, then there is no edge between i and j. If there is an edge between the vertices, then Mij = 1. For weighted graphs, the definition can be extended to Mij = k, where k is the weight of the edge that connects i to j. If a graph is undirected, then Mij = Mji, i.e., M = MT. An adjacency matrix allows one to check whether there is an edge between any two nodes in constant time. The space complexity of this data structure is O(n2), which may be prohibitive when dealing with large sparse graphs.
0 1 1 0
1 0 1 0
1 1 0 1
0 0 1 0
(2.1)
Adjacency matrix of graph 1a
0 1 1 0
0 0 1 0
0 0 0 1
0 0 0 0
(2.2)
Adjacency matrix of graph 1b
Adjacency lists store, for each vertex i, a list of its neighbours, i.e., the nodes connected to i by one edge. We can augment these lists with weights if we are working with a weighted graph. The space complexity of this structure is O(n + m). Checking for an edge between any two nodes can take O(n) time. If we sort each list, this bound can be lowered to O(log n) with binary search. Using a hashtable for each node, it is possible to further reduce the time complexity to O(1), without increasing the space complexity. Usually, adjacency lists are a good option for large sparse networks.
Compressed Sparse Row (in short, CSR) is a format that stores a sparse matrix in O(n + m) space and, for graphs without weights (i.e., matrices of zeros and ones), uses only two arrays. The first one (C), of size O(m), stores the column indexes of the non-zeros in row-major order (from left to right, top to bottom). The other array (R), of size O(n), stores, for each row i, the position in C where the elements of row i start.
For graph 1a, the CSR representation is:
C: 1 2 0 2 0 1 3 2
R: 0 2 4 7 8
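As a sketch of how these two arrays are built, the following Python fragment (illustrative only; the function name `to_csr` is ours) constructs C and R from a list of sorted neighbour lists:

```python
def to_csr(adj):
    """Build the CSR arrays (C, R) from adjacency lists.

    adj: list where adj[i] is the sorted neighbour list of node i.
    The neighbours of node i end up in C[R[i]:R[i+1]].
    """
    C, R = [], [0]
    for neighbours in adj:
        C.extend(neighbours)      # column indexes, row by row
        R.append(len(C))          # offset where the next row starts
    return C, R

# Graph 1a: edges 0-1, 0-2, 1-2, 2-3 (undirected, both directions stored)
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
C, R = to_csr(adj)                # C = [1, 2, 0, 2, 0, 1, 3, 2], R = [0, 2, 4, 7, 8]
```

Note that R has n + 1 entries in this sketch (the final entry being m), a common convention that makes the slice for the last row uniform.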
Webgraph is a compression format for succinct representation of graphs proposed by Boldi and Vigna [9]. The format can be seen as an improvement of adjacency lists, and is based on two principles observed in web graphs - the networks formed by the pages and links of the web:
• Locality - most links on a page point to pages of the same website, used in the context of navigation, so their URLs share a long common prefix; with pages ordered lexicographically by URL, a page and its successors get close indexes
• Similarity - pages that are close in lexicographic order tend to be similar and have many links in common
The first principle is exploited by storing the gaps between the node indexes in the adjacency list, instead of the indexes themselves. This saves space because the gaps are smaller numbers than the indexes, so each integer can be stored using less memory.
Let us say the successor list of node x is represented by S(x) = s1, s2, s3, ..., sk, where si < si+1. The indexes si can be quite large but, by the principle of locality, we expect si+1 − si to be small. So, S is stored as s1 − x, s2 − s1 − 1, s3 − s2 − 1, ..., sk − sk−1 − 1. The second principle is exploited by allowing the successor list of one node to be an incomplete copy of the successor list of another node. For that, a set of bits is used, where bit i set to 1 represents the copy of successor i; an extra list holds the successors not in the "copied" list.
Table 1 presents an example of the successor lists that we can obtain for graph 1a. For simplicity, copies are not considered.

Index  Original S   Smaller S
1      {2, 3}       {1, 0}
2      {1, 3}       {−1, 1}
3      {1, 2, 4}    {−2, 0, 1}
4      {3}          {−1}

Table 1: Successor lists for graph 1a, using the Webgraph format.
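The gap encoding defined above (ignoring reference copies) can be sketched in a few lines of Python; the function names are ours, and this is not the actual Webgraph implementation:

```python
def gap_encode(x, successors):
    """Encode the sorted successor list of node x as gaps:
    s1 - x, then s(i+1) - s(i) - 1 for each remaining successor."""
    gaps = [successors[0] - x]
    for prev, cur in zip(successors, successors[1:]):
        gaps.append(cur - prev - 1)
    return gaps

def gap_decode(x, gaps):
    """Invert gap_encode, recovering the original successor list."""
    successors = [x + gaps[0]]
    for g in gaps[1:]:
        successors.append(successors[-1] + g + 1)
    return successors

# Node 3 of graph 1a (1-indexed, as in Table 1): S(3) = {1, 2, 4}
print(gap_encode(3, [1, 2, 4]))   # [-2, 0, 1]
print(gap_decode(3, [-2, 0, 1]))  # [1, 2, 4]
```

Only the first gap can be negative; the remaining ones are non-negative and, by the locality principle, small, so they can be encoded with few bits.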
2.2 FastStep Algorithm
FastStep is an algorithm for matrix factorization proposed by Araujo et al. [1]. It factorizes a boolean matrix into two matrices with smaller dimensions and non-negative elements. It helps to uncover the underlying structure of a matrix, and can be useful for compression, prediction and denoising. The algorithm gives an approximate result with low error and runs in near-linear time in the number of non-zero elements.
Figure 2: Adjacency matrix of a community found by FastStep.

An important feature of this method is the ability to find rich community structures. The results obtained experimentally include hyperbolic clusters which, as shown in a recent work by Araujo et al. [10], are a good approximation to the internal organization of real communities. Figure 2 shows a community found by FastStep. Common algorithms only handle the specific case of finding rectangular, block-like structures in the matrix, which is why the authors present FastStep as a method that goes Beyond Blocks.
Another interesting property of FastStep is the strong interpretability of the boolean matrix decomposition. The use of non-negative factors in the matrices makes it possible to establish the importance of elements and is in fact what enables representations of non-block structures. Also, the boolean reconstruction allows clear predictions and explanations of the non-zeros.
2.2.1 FastStep formal goal
Let M be an n × m boolean matrix. The goal is to find an n × r non-negative matrix A and an m × r non-negative matrix B, so that the product ABT is a good approximation of M after thresholding.
\min_{A,B} \|M - u_\tau(AB^T)\|_F^2 = \sum_{i,j} \left(M_{ij} - u_\tau(AB^T)_{ij}\right)^2 \qquad (2.3)

where \|\cdot\|_F is the Frobenius norm and u_\tau(X) simply applies a step function to each element X_{ij}:

[u_\tau(X)]_{ij} = \begin{cases} 1 & \text{if } X_{ij} \ge \tau \\ 0 & \text{otherwise} \end{cases} \qquad (2.4)
The thresholding operator renders the objective function non-differentiable. In order to solve it, the
function can then be approximated by another with similar objective:
\min_{A,B} \sum_{i,j} \log\left(1 + e^{-M_{ij}\left(\sum_{k=1}^{r} A_{ik}B_{jk} - \tau\right)}\right) \qquad (2.5)
Here, M is a matrix with values in {−1, 1}, where zeros are represented by −1 and ones by 1. This
function is the objective function used by Faststep.
The objective function can then be optimized in several ways. The authors used the gradient descent method. Let us consider S_{ij} = \sum_{k=1}^{r} A_{ik}B_{jk}. The gradient of the objective function, with respect to A_{ik}, is then given by:

\frac{\partial F}{\partial A_{ik}} = \sum_{j=1}^{m} \frac{B_{jk}}{1 + e^{\tau - S_{ij}}} - \sum_{j \in M_i} B_{jk} \qquad (2.6)

where M_i denotes the set of columns with a non-zero in row i.
The update rules for B are similar.
Some implementation details are worth mentioning. Matrices A and B are projected after each iteration, and projected to a small value ε instead of 0, as A = B = 0 is a stationary point of the objective function from which there would be no improvement. The τ variable was set to 20 in the authors' implementation, as it allowed good results to be achieved.
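To make the update concrete, the following NumPy sketch performs one dense (non-approximated) gradient-descent step on objective (2.5), using gradient (2.6) and the ε-projection just described. It is an illustrative reconstruction from the equations, not the authors' implementation; the function name, learning rate and ε are our choices, while τ = 20 follows the authors.

```python
import numpy as np

def faststep_step(M, A, B, tau=20.0, lr=1e-3, eps=1e-6):
    """One dense gradient-descent step on the smooth objective (2.5).

    M: n x m matrix with entries in {-1, +1} (zeros recoded as -1).
    A: n x r, B: m x r, both non-negative.
    """
    S = A @ B.T                                    # S_ij = sum_k A_ik B_jk
    sig = 1.0 / (1.0 + np.exp(tau - S))            # sigmoid term of (2.6)
    pos = (M > 0).astype(float)                    # indicator of the non-zeros
    grad_A = sig @ B - pos @ B                     # equation (2.6), all i, k at once
    grad_B = sig.T @ A - pos.T @ A                 # symmetric rule for B
    A = np.maximum(A - lr * grad_A, eps)           # project to eps, never to 0
    B = np.maximum(B - lr * grad_B, eps)
    loss = np.log1p(np.exp(-M * (S - tau))).sum()  # objective (2.5)
    return A, B, loss
```

Iterating this step until the loss stabilizes reproduces the dense O(Tnmr) scheme discussed in the next subsection.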
2.2.2 Complexity
A straightforward implementation of the algorithm would take O(Tnmr²) time, where T is the number of iterations needed, n and m are the dimensions of the boolean matrix and r is the rank of the decomposition (the width of both A and B, as stated previously).
Using additional O(nm) memory to store S, updating it every iteration, reduces the complexity to O(Tnmr).
This algorithm runs in time linear in the number of matrix entries, O(nm), which is quadratic in n for square matrices. For a sparse matrix, it is possible to introduce some approximations to the algorithm that further reduce its complexity.
As we can see in equation 2.6, calculating the gradient entries for a factor k requires O(nm) time. Still, the gradient can be approximated with a total number of operations in the order of O(E), with E being the number of non-zero elements in the matrix. Real networks are generally sparse: the number of edges is of the same order of magnitude as the number of vertices. This approximation introduces some error in the decomposition, but it is considerably low.
Briefly, the first part of the gradient is a sigmoid, and for a term to have a significant impact on the gradient, both a high S_{ij} and a high B_{jk} are needed. So, pairs (i, j) with high A_{ik}B_{jk} are considered first, as they correspond to high values in both sigmoid parameters. As such, equation 2.6 can be approximated by:
\frac{\partial F}{\partial A_{ik}} \approx \sum_{(i,j) \in P(t)} \frac{B_{jk}}{1 + e^{\tau - S_{ij}}} - \sum_{j \in M_i} B_{jk} \qquad (2.7)
where P(t) is the set of elements considered to introduce the greatest gradient variation; r sets P_k(t) need to be kept, one for the calculation of each factor k of the gradient.
It is also important to note that, initially, only the non-zeros contribute to the gradient, which means it can be calculated using only the second summand of equation 2.7. As the iterations progress, the error will also move to some of the zeros. However, given M's sparsity and the symmetry of the error function - the error of misrepresenting a one is the same as that of misrepresenting a zero - |P(t)| can be kept small, in the order of O(rE).
To quickly find the top-t pairs (i, j) of P with highest A_{ik}B_{jk}, let a_k and b_k be columns k of matrices A and B, respectively. After sorting a_k and b_k, the biggest A_{ik}B_{jk} not currently in P_k can be selected from a very small set of candidates. Therefore, one can keep a priority queue with O(min(n,m)) elements, select a set of t non-zeros and approximate the gradient of all elements in factor k in O(t + n log n + m log m) operations.
In order to detect convergence, after each iteration of the gradient descent an estimate of the error F(A,B) is calculated by considering all the non-zeros and a uniform sample of the zeros of the matrix. The error is then scaled accordingly.
With these modifications, the total complexity can be revised to O(Tr(E + P log(min(n,m)) + n log n + m log m + S)), where S is the number of samples used to check convergence. When our matrix is the adjacency matrix of a graph (n = m), the complexity is:

O(Tr(E + (P + n) log n + S)) \qquad (2.8)
As we can see, the algorithm scales linearly with E, the number of non-zeros. However, when the number of communities we want to find gets close to n, the complexity of the method becomes quadratic in n. Also, there is no result establishing that T is independent of r, as the time the algorithm takes to converge may depend on the size of the matrices we want to obtain.
2.2.3 Obtaining the clusters
For community detection, i.e., to determine whether an element belongs to a factor k, membership can be derived directly from the principles of the decomposition.
For this specific purpose, we can replace the matrices A and B with one single matrix A, and use the matrix A instead of B in all the steps of the algorithm (i.e., B = A). This approach is acceptable, as we are looking for communities, and communities are considered undirected. The gradient and its minimization remain similar.
As such, a row element i belongs to a factor k if there is a non-zero in the reconstructed matrix in row i and this factor contributed a weight above τ/r:

A_{ik} \ge \frac{\tau}{r \max(b_k)} \quad \text{and} \quad S_{i,\arg\max b_k} \ge \tau \qquad (2.9)
Running the algorithm with different values of the rank r makes it possible to obtain communities with different granularities.
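Rule (2.9) is straightforward to apply once the factor matrix is known. The sketch below is our own illustrative code, with B = A as described above; it assumes strictly positive column maxima in A, as guaranteed by the ε-projection.

```python
import numpy as np

def factor_members(A, tau=20.0):
    """Assign nodes to factors using rule (2.9), with B = A (undirected case).

    A: n x r non-negative factor matrix with positive column maxima.
    Returns one array of node indices per factor.
    """
    n, r = A.shape
    S = A @ A.T                           # reconstructed scores S_ij
    clusters = []
    for k in range(r):
        bk = A[:, k]                      # column k of B (= A here)
        j_star = int(np.argmax(bk))       # strongest member of factor k
        threshold = tau / (r * bk.max())  # A_ik >= tau / (r max(b_k))
        members = np.where((bk >= threshold) & (S[:, j_star] >= tau))[0]
        clusters.append(members)
    return clusters
```

A factor whose strongest column never reaches the threshold simply yields an empty cluster, which matches the intuition that not every factor of the decomposition corresponds to a community.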
2.2.4 Results
The authors show two empirical results on real data. In their tests, FastStep achieved lower squared error than other scalable methods: SVD [11], NNMF [12] and HyCom [10]. The algorithm scales linearly in the number of non-zeros, as shown previously. In the context of graph data, which is the concern of the present work, the non-zero elements are the edges of the network, so the algorithm scales near-linearly with the number of edges. It also depends on the number of iterations needed for it to converge, which is assumed to be small.
2.3 Label Propagation
Label Propagation [13] (LP) is a near-linear time algorithm for finding communities in a graph. Its time complexity makes it useful for large networks, where algorithms that take more than linear time become impractical. The algorithm works as follows:
1. Initialize each node of the graph with a distinct label;
2. Update each node's label to the most frequent label among its neighbours. Ties are broken uniformly at random;
3. Repeat step 2 until the labels stabilize, i.e., there are no more updates to the labels of the graph.
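These steps can be sketched as a short Python routine; this is an illustrative version of the method described above (names and the fixed random seed are ours), not the authors' code. A node keeps its current label whenever it is already among the most frequent ones, which is one common way of making the stopping condition well defined.

```python
import random
from collections import Counter

def label_propagation(adj, max_iter=100, seed=42):
    """Plain Label Propagation. adj[i] is the neighbour list of node i.
    Returns one label per node; nodes sharing a label form a community."""
    rng = random.Random(seed)
    labels = list(range(len(adj)))              # step 1: distinct labels
    order = list(range(len(adj)))
    for _ in range(max_iter):
        rng.shuffle(order)                      # process nodes in random order
        changed = False
        for x in order:
            if not adj[x]:
                continue
            counts = Counter(labels[y] for y in adj[x])
            best = max(counts.values())
            ties = [lab for lab, c in counts.items() if c == best]
            if labels[x] in ties:               # already a maximal label
                continue
            labels[x] = rng.choice(ties)        # step 2: ties broken at random
            changed = True
        if not changed:                         # step 3: stop when stable
            break
    return labels

# Two triangles joined by a single edge
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
labels = label_propagation(adj)
```

On this toy graph the two triangles typically end up with one label each, though the random tie-breaking means the exact partition can vary between seeds.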
According to the authors, very few iterations are usually needed for the algorithm to stop. Each iteration takes linear time, but no bound on the number of iterations has been proven; only empirical results support this.
The communities are formed by nodes of the same label. The algorithm can only find non-hierarchical
and non-overlapping clusters. Due to the way ties are broken, successive runs of this algorithm produce
different partitions. The authors proposed a way to use successive runs of the algorithm, followed by an
aggregation method, to obtain a more informative clustering. The aggregation algorithm is quite simple:
nodes keep a set of all the labels obtained in the successive runs of the algorithm; clusters are formed by
the nodes that have the exact same label set. This aggregation method may produce many fine-grained communities if many runs of the algorithm are aggregated.
The Label Propagation algorithm, despite its very good performance, tends to place the majority of the nodes in the same community, due to the inherent topology of real networks. This problem gave rise to several Label Propagation variants that keep the original idea but change the label update rule.
2.3.1 Extensions to Label Propagation
One of the most interesting extensions to LP is the Absolute Potts Model algorithm (APM) [14]. In this algorithm, consider, for a given node x, that its neighbours have labels λi, with i between 1 and K, K being the number of distinct labels in the neighbourhood of x, and that there are ki neighbours with label λi. In the simple LP algorithm, the new label λi is chosen such that ki is maximum. APM, instead, maximizes the following expression:
ki − γ(vi − ki), (2.10)
where vi is the number of nodes in the whole graph with label λi currently assigned.
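The update rule can be sketched in a few lines. The data layout here is illustrative (a `volume` map tracking how many nodes currently hold each label is our assumption about the bookkeeping needed):

```python
from collections import Counter

def apm_best_label(neigh_labels, volume, gamma):
    """Return the label maximizing k_i - gamma * (v_i - k_i), where k_i
    counts occurrences of the label among the neighbours and v_i is the
    label's current volume in the whole graph."""
    counts = Counter(neigh_labels)
    return max(counts,
               key=lambda lab: counts[lab] - gamma * (volume[lab] - counts[lab]))

# gamma = 0 reduces to plain LP (most frequent neighbour label wins);
# a positive gamma penalizes labels that are large elsewhere in the graph
chosen = apm_best_label(['a', 'a', 'b'], {'a': 100, 'b': 3}, gamma=0.1)
```

With γ = 0 the rule picks the dominant label 'a'; with γ = 0.1 the large global volume of 'a' is penalized and 'b' wins, which is how APM discourages giant clusters.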
Another important work based on this method is Layered Label Propagation (LLP) [5], an
algorithm for graph reordering (equivalently, for reordering the rows and columns of the adjacency matrix).
Large networks require special storage concerns, as they usually cannot fit in the memory of a computer
with standard resources. To reduce the space used to store vertex indices, distances between indices are
stored instead. To obtain a better encoding, vertices are reordered to minimize these distance values. This
way, in an adjacency list, we expect to store smaller values, using less space. Intuitively, we can see
how clustering can help solve this problem. LLP is an algorithm that performs this reordering with this
goal in mind.
The ordering obtained by LLP is used in the Webgraph library, which, among other purposes, can be
used as an efficient framework for graph compression [9]. Webgraph obtains very good compression
ratios using this approach.
This algorithm uses the APM algorithm as a starting point. As presented by the authors, APM
has two problems: the parameter γ is hard to estimate from common graph measures, such as size
or density, and the clusters obtained by the algorithm usually follow heavy-tailed distributions, yielding
a few very large clusters and a huge number of small ones.
LLP works by running successive iterations of APM, with different γ values, and storing an ordering
of the nodes using both the ordering of the previous iteration and the clustering obtained in the current
iteration. This ordering is defined formally by:
x ≤k+1 y  iff  πk(λk(x)) < πk(λk(y))  or  (λk(x) = λk(y) ∧ πk(x) ≤ πk(y)),    (2.11)

where λk is the labelling function resulting from iteration k and πk is the index of the node in the ordering
obtained at iteration k.
In other words, elements with different labels are ordered with respect to these labels, elements with
the same label keep the ordering of the previous iteration.
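This update can be implemented with a single stable sort. The sketch below is our own illustration, taking a label's rank as the previous position of its first member; Python's stable sort then keeps the previous order inside each label class, as equation 2.11 requires:

```python
def refine_order(order, label):
    """One LLP refinement step: order is the previous node ordering (pi_k),
    label maps each node to its cluster label from the current APM run."""
    rank = {v: i for i, v in enumerate(order)}            # pi_k
    # rank of a label = previous position of its first occurrence
    label_rank = {}
    for v in order:
        label_rank.setdefault(label[v], rank[v])
    # stable sort: different labels ordered by label rank,
    # equal labels keep the previous relative order
    return sorted(order, key=lambda v: (label_rank[label[v]], rank[v]))

new_order = refine_order([3, 1, 2, 0], {3: 'A', 1: 'B', 2: 'A', 0: 'B'})
```

Here nodes 3 and 2 (label 'A') are pulled together in front, each pair preserving its previous relative order.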
Several choices for the γk values can be made. The authors obtained better results using a
random γk, chosen uniformly from the set {0} ∪ {2^−i : i = 0, ..., K}. They also tried using the same value
γ for all iterations, but the results were consistently worse. In our view, this might indicate that
combining communities of different granularities is a way of getting better compression, which suggests
that a good hierarchical clustering method can be used to obtain an even better compression rate.
2.4 Modularity Maximization
Modularity is a metric introduced by Girvan and Newman [2] to measure the quality of a clustering.
It was initially defined only for unweighted networks but, in later work, extended to weighted ones. The
Louvain method, which we present next, works with multigraphs, which can be represented as weighted
graphs. Therefore, we use the weighted definition to cover that case too.
This metric is based on the idea that a random graph does not have a community structure. For
that, we consider a null-model, a new graph where the degree distribution is the same as the original
graph, but the edges are rewired in a random way. The existence of a community can be discovered by
comparing, for a specific subgraph, the expected density of edges and the real density of edges. Naturally,
if the density of edges within a subgraph is much higher than expected, it means we might have found a
community.
2.4.1 Definitions
Let us consider a function f : V × V → N that assigns a non-negative edge weight to each pair of
vertices of the graph. For unweighted graphs, f can be defined as one when there is an edge between the
two nodes and zero otherwise. For multigraphs, we take the number of edges between the two nodes.
We also consider deg(v) = ∑u∈V f(u, v), the degree of node v. f and deg can be generalized to a set of
vertices C ⊆ V, with f(C, C) = ∑u∈C,v∈C f(u, v) and deg(C) = ∑v∈C deg(v). A graph clustering
C = {C1, C2, ..., Ck} partitions the vertices into k disjoint non-empty subsets Ci ⊆ V.
The expected fraction of edges within a community C is given by:

deg(C)² / deg(V)²    (2.12)

which is the expected fraction of edges of the graph that are within C, i.e., the fraction of edges of the
graph that are within C in the null-model.

The actual fraction of edges within a community C is given by:

f(C, C) / f(V, V)    (2.13)

which is the fraction of edges of the original graph that are within C.

The modularity of a clustering C is the sum, over all communities, of the difference between the
actual fraction of edges (2.13) and the expected fraction of edges (2.12). It is given by:

QC = ∑C∈C ( f(C, C) / f(V, V) − deg(C)² / deg(V)² )    (2.14)
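To make equation 2.14 concrete, the following sketch computes Q for an undirected, unweighted graph stored as adjacency lists. This is our own illustration; each edge is listed in both directions, so f(V, V) = deg(V) = 2m:

```python
from collections import defaultdict

def modularity(adj, comm):
    """Compute Q of Eq. (2.14).
    adj: node -> list of neighbours (each edge listed in both directions);
    comm: node -> community id."""
    deg = {v: len(nb) for v, nb in adj.items()}
    total = sum(deg.values())                  # f(V, V) = deg(V) = 2m
    internal = defaultdict(int)                # f(C, C) per community
    deg_c = defaultdict(int)                   # deg(C) per community
    for v, nb in adj.items():
        deg_c[comm[v]] += deg[v]
        for u in nb:
            if comm[u] == comm[v]:
                internal[comm[v]] += 1
    return sum(internal[c] / total - (deg_c[c] / total) ** 2 for c in deg_c)

# two triangles joined by a bridge, one community per triangle
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
comm = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
q = modularity(adj, comm)
```

For this graph Q = 5/14 ≈ 0.36, a positive value reflecting the two dense triangles.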
2.4.2 Algorithms
Maximizing the modularity of a graph is an NP-Hard problem [4]. Several algorithms were proposed to
solve the problem greedily. In an effort to organize existing solutions into a coherent design space,
Noack and Rotta [15] define two types of algorithms for modularity maximization: greedy coarsening
algorithms and refinement algorithms.
A greedy coarsening algorithm starts from singleton clusters and merges clusters iteratively, choos-
ing the merge with highest modularity increase. It can produce reasonable results. However, it was shown
to be biased towards merging larger clusters [16], [17]. Several modifications to the algorithm were pro-
posed, with other priority criteria for merges and with changes on the purely greedy merge strategy. The
results obtained by these coarsening algorithms can be further optimized by refinement algorithms.
Refinement algorithms iteratively move individual nodes between clusters. Here, we can choose the best
move in each iteration, i.e., the one with the highest modularity increase. This approach can be expensive
to compute. However, choosing the moves in an arbitrary order is much faster and not necessarily less
optimal.
For a review and proper comparison of these alternative approaches and different coarsening and
refinement techniques, check [15].
Modularity optimization may fail to identify modules smaller than a scale that depends on the
total number of links in the network, as shown by Fortunato et al. [18]. As such, modularity optimization
methods are considered to have a resolution limit: communities smaller than some specific size,
which varies with the graph considered, may not be found by these methods and are therefore absorbed
into larger ones.
2.4.3 Louvain Method
The Louvain Method [6] is a simple method for hierarchical graph clustering, based on the optimization
of the modularity.
The algorithm starts with a weighted network of N nodes, where each node i initially belongs to its
own community Ci.
1. For each node i in the network
1.1. For each neighbour j of i
1.1.1. move node i to community Cj (i.e., set Ci to Cj) if the modularity gain of this move is
positive
2. Merge the nodes inside each community into one single node
2.1. The edges between nodes of the community become a self-loop, with weight equal to the sum
of the weight of all those edges
2.2. All other edges are kept and merged if their endpoints are the same.
3. If step 2 made any change in the community structure, proceed to step 1. Otherwise, the algorithm
stops.
The modularity gain from moving an isolated node i into community C is given by:

∆Q = [ (f(C, C) + f(i, C)) / 2m − ((deg(C) + deg(i)) / 2m)² ] − [ f(C, C) / 2m − (deg(C) / 2m)² − (deg(i) / 2m)² ],    (2.15)
where m is the sum of the weights of all the links in the network. A similar expression can be derived
for the modularity change when node i is removed from its community. The algorithm then computes
the overall gain of removing i from its current cluster and moving it into a neighbouring cluster.
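The gain of equation 2.15 translates directly into code. The argument names below are ours (f_cc = f(C, C), f_ic = f(i, C), m = total edge weight); this is a sketch, not the authors' implementation:

```python
def delta_q(f_cc, f_ic, deg_c, deg_i, m):
    """Gain of Eq. (2.15) for moving an isolated node i into community C."""
    after = (f_cc + f_ic) / (2 * m) - ((deg_c + deg_i) / (2 * m)) ** 2
    before = f_cc / (2 * m) - (deg_c / (2 * m)) ** 2 - (deg_i / (2 * m)) ** 2
    return after - before

# a toy evaluation: community with internal weight 4, node i with 2 units
# of weight into it, degrees 6 and 3, total edge weight m = 10
gain = delta_q(f_cc=4, f_ic=2, deg_c=6, deg_i=3, m=10)
```

In a Louvain implementation this expression is evaluated once per candidate neighbouring community, keeping only the running sums f(C, C) and deg(C) per community.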
Usually, the first pass (running step 1 followed by step 2 at the beginning of the algorithm) is the
heaviest computational task and takes most of the computing time. The method then runs in near-
linear time, if we assume that the degree of a node is bounded by a constant independent of the size of
the graph. The number of hierarchy levels resulting from the algorithm is small, which implies few passes
before the algorithm terminates.
It is also important to note that the order in which nodes are visited in step 1 affects the results.
Even though the order does not seem to significantly affect the modularity of the resulting clustering,
it can have an impact on the computation time of the algorithm.
The authors [6] report several tests showing both high-quality results and good execution times. The
algorithm ran in 152 minutes on a large network with 118 million nodes and 1 billion edges.
The authors argue that the algorithm might partially avoid the resolution limit of modularity maxi-
mization, because it is highly unlikely that, in step 1 of the algorithm, all nodes from one community are
moved to another. In step 2, clusters are merged together, but the smaller ones that were combined are
kept deeper in the hierarchy.
2.5 Classical Hierarchical Algorithms
Classical hierarchical algorithms follow two basic approaches: agglomerative and divisive. These algo-
rithms produce a dendrogram that represents the entire hierarchy obtained.

Agglomerative algorithms merge clusters iteratively, joining the communities with the highest
similarity, until the desired number of clusters is reached or only one cluster remains.

Divisive algorithms remove edges from the graph, creating disconnected communities, which become
smaller and smaller as the algorithm runs. The edges to remove are chosen by some metric that tries to
separate nodes with low similarity.
2.5.1 Betweenness-based divisive algorithm
An interesting divisive algorithm was proposed by Newman and Girvan [2]. In their approach, the
edge with the highest betweenness value is removed next, with three different variants of betweenness
suggested by the authors: edge betweenness, random-walk betweenness and current-flow betweenness.
Edge betweenness of an edge is the number of shortest paths, between any two nodes of the graph,
that contain that edge. Random-walk betweenness is defined like edge betweenness, but considering
random walks from source to destination instead of shortest paths between every two nodes. For each
random walk between two nodes, there is some probability of choosing a path that includes a given edge e.
The random-walk betweenness of edge e is the sum, over every possible pair of nodes (n1, n2), of the
probability that e is included in a random walk between n1 and n2. Current-flow betweenness is based
on ideas from electrical circuits. It treats each edge of the graph as a unit resistor and each pair of nodes
as a source and a sink; more current flows along shorter paths. The betweenness value of each edge e is
then obtained by adding the flow through e over all possible (source, sink) pairs. The authors prove that
random-walk betweenness is equivalent to current-flow betweenness.
The divisive algorithm works as follows:
1. Compute betweenness values for all edges present in the graph (using the chosen betweenness
metric)
2. Remove edge of the graph with highest betweenness value
3. Verify stopping criteria. If not achieved yet, repeat from step 1
It is important to recalculate the betweenness after removing an edge, because successive deletions
can completely change the betweenness values of the remaining edges.

This recalculation is costly. Edge betweenness can be computed using the algorithm proposed by
Brandes [19], which runs in O(mn) time. This computation is executed at most m times, once after each
edge removal, resulting in O(m²n) total time. Removing the m edges themselves takes at most O(mn)
time, which does not dominate the running time, given the time needed to compute the betweenness.
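A compact version of the Brandes-style accumulation for edge betweenness on an unweighted, undirected graph might look as follows. This is a sketch of the O(mn) computation (one BFS per source, dependencies propagated backwards onto edges), not the authors' implementation; each unordered pair of nodes is counted once:

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph given as
    adjacency lists with integer node ids."""
    bet = {}
    nodes = list(adj)
    for s in nodes:
        # BFS from s, counting shortest paths (sigma) and predecessors
        dist = {s: 0}
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        order, preds = [], {v: [] for v in nodes}
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagate dependencies onto the edges
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                e = (min(v, w), max(v, w))
                bet[e] = bet.get(e, 0.0) + c
                delta[v] += c
    # every unordered pair was counted from both endpoints
    return {e: b / 2 for e, b in bet.items()}

# path 0 - 1 - 2: each edge lies on two shortest paths
bet = edge_betweenness({0: [1], 1: [0, 2], 2: [1]})
```

In the divisive algorithm above, the edge with the maximum value of this map is removed and the map is recomputed on the remaining graph.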
3 Approach
As stated before, the focus of this work was to study whether Faststep can be used to find hierarchical
communities and/or refine communities, comparing its results with LLP and the Louvain method. This
was the major goal of the project.

A first task consisted of understanding how Faststep can be used as a graph clustering tool. The
obtained clusterings could then be evaluated using common metrics; measuring the quality of a clustering
algorithm is addressed in subsection 3.1. An important part of the evaluation is also the comparison of
the results with those obtained by reference algorithms. We compared Faststep with the Louvain method
and Layered Label Propagation. Section 3.2 explains how Faststep was modified for this purpose and
compared with the other methods.
A second task relates to the idea that knowledge of the communities of a network can be used
to reorder its vertices and achieve better compression, when using a framework like Webgraph
[9]. Therefore, for a better evaluation of the clusterings, we used them for graph compression, and the
compression rates were compared across the same reference algorithms. Section 3.3 explains how a
reordering can be obtained from each of the methods used.
A secondary goal of the project was the creation of a framework for graph clustering. It aggregates
the developed method together with LLP and the Louvain method, our reference algorithms. The user of
this framework can choose any of the available methods. The tool outputs a clustering or a reordering of
the provided network. It also includes some functionality to ease the task of comparing and evaluating
clusterings and/or reorderings of the graph.
3.1 Evaluation
Testing the quality of a clustering algorithm was an essential step of this work. To make sure the
developed approaches are useful in practice, they must be validated systematically.

Complementary to the quality tests, we finish with a complete benchmark of the time used by the
developed method and by the reference algorithms.
3.1.1 Initial validation of the results
One simple way of assessing the quality of an algorithm is running the method on datasets where
the communities are known. In a first evaluation phase, it is common to use small networks where we can
easily see how the algorithm did and understand why it might have failed. Zachary's karate club [20] is
a small, classical example that is often used.

We used small graphs as a first validation of our work, but tests using large networks are of utmost
importance, as we wanted to make sure that our methods behaved well not only on small instances
but also on large real examples. SNAP [21] provides a collection of social networks with ground-truth
communities identified. These networks range from hundreds of thousands to millions of nodes. Another
good source of networks with communities is the set of benchmarks provided by Fortunato et al. [22].
These tools generate graphs and can be used to systematically create
tests for the developed methods.
Another way of evaluating results is by comparing them directly with other known and tested al-
gorithms. The Louvain method and Layered Label Propagation are the obvious target algorithms for
clustering validation. In these tests, we can use any network, as known communities are not necessary.
3.1.2 Comparison Metrics
It is important to note that both evaluation methods require comparing the clustering obtained by
the algorithm under test with either ground-truth communities or a clustering obtained by another
algorithm. Comparing different clusterings is itself an interesting problem. One simple way of doing it is
using the Jaccard index, which evaluates the similarity of two sets. Considering two sets A and B,

J(A, B) = |A ∩ B| / |A ∪ B|    (3.1)

When comparing two clusterings A and B, we want to find, for each cluster C in A, the cluster
in B with the smallest Jaccard distance to C (the Jaccard distance being one minus the Jaccard index).
This metric allows us to evaluate how well each cluster is matched between the two clusterings. For an
overall metric, we can average the smallest Jaccard distances over the clusters of A. In this work, when
referring to the Jaccard index in the context of comparing clusterings, it is assumed that we are
considering this average. One could argue that a variance should also be shown when averages are
reported. In our view, the Jaccard distance is used just as a control measure, as we use another metric,
presented next, for the proper comparison of clusters. Showing the variance as well would not help the
reader analyse the results, as it would probably be an excess of information. Also, the distribution of
the Jaccard distances can be seen in subsection 4.3.
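The average best-match comparison described above can be sketched as follows (clusterings given as lists of vertex sets; we compute the Jaccard index here, so the corresponding distance is one minus this value):

```python
def avg_best_jaccard(A, B):
    """For each cluster in A, find the most similar cluster in B by
    Jaccard index (Eq. 3.1), then average over the clusters of A."""
    def jaccard(x, y):
        return len(x & y) / len(x | y)
    return sum(max(jaccard(a, b) for b in B) for a in A) / len(A)

A = [{0, 1, 2}, {3, 4}]
B = [{0, 1}, {2, 3, 4}]
sim = avg_best_jaccard(A, B)
```

Note that the measure is not symmetric: averaging over A and averaging over B can give different values, one reason it is used here only as a control measure.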
We can also use a specific similarity metric to compare two clusterings. One of the most commonly
used is the Normalized Mutual Information (NMI).

NMI is based on Information Theory principles. The main idea is that if two clusterings are similar,
then little information is needed to infer one clustering from the other. Let us define two different
partitions (or clusterings) of the graph as X = (X1, X2, X3, ..., XnX) and Y = (Y1, Y2, Y3, ..., YnY),
where Xi and Yj are clusters (sets of vertices) of X and Y, respectively. n is the number of nodes of the
graph and nij is the number of nodes shared by clusters Xi and Yj. The community assignments {xi}
and {yi} give the cluster to which node i belongs in clusterings X and Y. We then consider the labels x
and y as values of two random variables X and Y, with joint distribution P(x, y) = P(X = x, Y = y) = nxy/n,
which implies P(x) = P(X = x) = nXx/n and P(y) = P(Y = y) = nYy/n. The mutual information of the
clusterings X and Y is then I(X, Y) = H(X) − H(X|Y), where H is the Shannon entropy (see Appendix A
for a definition).

Using the mutual information as a similarity measure does not work well, because all partitions
obtained from X by further partitioning its clusters have the same mutual information with X (even the
one where each node has its own cluster). Danon et al. [23] proposed the normalized mutual information
to avoid this problem:
Inorm(X, Y) = 2 I(X, Y) / (H(X) + H(Y))    (3.2)
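Equation 3.2 can be computed directly from the label lists of two partitions. A minimal sketch (natural logarithms, the base cancels in the ratio; it assumes at least one of the partitions has more than one cluster, so the denominator is non-zero):

```python
from collections import Counter
from math import log

def nmi(x, y):
    """Normalized mutual information of two partitions, given as label
    lists where x[i] is the cluster of node i in the first clustering."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    i_xy = sum(c / n * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
    hx = -sum(c / n * log(c / n) for c in px.values())
    hy = -sum(c / n * log(c / n) for c in py.values())
    return 2 * i_xy / (hx + hy)

score = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # identical up to relabelling
```

Identical partitions (even with permuted labels) score 1, while independent partitions score 0.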
We will use NMI to compare different clusterings in the evaluation of the developed algorithms.
The only problem with NMI is that it does not work for ground-truth communities where
nodes can belong to more than one cluster. Lancichinetti et al. [24] presented an extension of NMI,
which was further improved by McDaid et al. [25].
When extending the NMI to overlapping clusters, the previously used notation cannot be used directly,
as xi and yi assume that there is a direct assignment of a vertex to a cluster, which does not happen for
this problem, as a node can belong to more than one cluster.
Clusterings use a different notation in this situation. Each vertex v has an associated binary array of
size |P|, where the k-th bit, (xv)k, is 1 if v belongs to cluster k and 0 otherwise. Xk denotes the random
variable associated with the k-th bit of partition P. The probability distribution of Xk is:

p(Xk = 1) = nk/n,    (3.3)
p(Xk = 0) = 1 − nk/n,    (3.4)

where nk is the number of nodes in cluster Ck ∈ P, i.e., nk = |Ck|, and n is the total number of
nodes, n = |V|. We define, for a second partition P′, (yv)k, Yk and C′ in the same way as (xv)k, Xk and
C, respectively. The joint probability of Xk and Yl is given by:

p(Xk = 1, Yl = 1) = |Ck ∩ C′l| / n    (3.5)
p(Xk = 1, Yl = 0) = (|Ck| − |Ck ∩ C′l|) / n    (3.6)
p(Xk = 0, Yl = 1) = (|C′l| − |Ck ∩ C′l|) / n    (3.7)
p(Xk = 0, Yl = 0) = (n − |Ck ∪ C′l|) / n    (3.8)
We can also define the conditional entropy between Xk and Yl,

H(Xk|Yl) = H(Xk, Yl) − H(Yl)    (3.9)

and, since we are interested in the best possible matching between P and P′, we select the Yl that
minimizes the entropy,

H(Xk|Y) = min_l H(Xk|Yl)    (3.10)

We can then normalize equation 3.10 and average it over all Xk distributions:

H(X|Y)norm = (1/|P|) ∑k H(Xk|Y) / H(Xk)    (3.11)
H(Y|X)norm can be defined in the same way. We finally define the normalized mutual information,
as presented by Lancichinetti et al. [24],

ONMI_LFK(X|Y) = 1 − (1/2)(H(X|Y)norm + H(Y|X)norm)    (3.12)
McDaid et al. [25] argue that this definition of the NMI overestimates the similarity of the clusterings.
Considering two partitions P and P′, if P has only one cluster, which is exactly equal to one cluster
of P′, we would expect a low NMI value, but this definition yields at least 0.5. If we instead take a
partition P containing every possible non-empty subset of the vertices as a cluster, which corresponds to
2^n − 1 different clusters, the NMI relative to any partition P′ would also be greater than 0.5. The
authors propose an alternative definition that deals with these faults:

ONMI_max = I(X, Y) / max(H(X), H(Y))    (3.13)
3.1.3 Succinct representation as clustering metric
Another good way of evaluating clusterings is testing how well they work for graph compression or succinct
representation of a graph. As stated previously, Webgraph [9] uses this knowledge to obtain a reordering
of the vertices of the graph, which is useful to achieve a better compression of graph files. Our method
can then be evaluated by the achieved compression rates, which will also be compared with a random
permutation of the vertices and with the results obtained by the reference algorithms.
3.2 Graph Clustering
3.2.1 Faststep
As explained earlier, Faststep takes near-linear time in the size of the graph. However, the
algorithm scales quadratically with the number of factors of the matrix we want to obtain. With this
in mind, the algorithm is expected to become too slow as the number of factors approaches the size of
the graph. The reference algorithms considered, in contrast, can produce communities of high granularity.
An important goal of this project is also the use of Faststep as a reordering method to achieve better
graph compression, and high community granularity is, in fact, one of the properties that allows a
clustering algorithm to produce a good reordering of the graph.
To get a basic understanding of how Faststep is affected by an increasing number of factors,
we used the youtube dataset [26], a network with approximately 1 million nodes and 3 million
edges, and ran Faststep with different numbers of factors. Faststep's running time also depends on the
time needed for the gradient descent to converge. So, even though a larger value of k does not directly
imply a longer running time, we found that even for relatively small values of k (k = 32) the algorithm
would take days. To overcome this problem, we decided to create a new recursive method, based on
Faststep and using only small values of k. The method starts with the entire network. In the first step,
it runs Faststep with a single factor (k = 1), thereby trying to reconstruct only one community of the
graph. We then partition the graph into two: the obtained community and the
remaining vertices of the graph. In a second step, the method is called recursively on each of these two
subgraphs. It stops partitioning when a specific graph-size threshold is reached. The pseudo-code of the
algorithm is the following:
Function divide(graph)
    if graph.size < threshold then
        graph.save()
        return
    end if
    d = faststep(k=1)
    g1 = list()
    g2 = list()
    for i = 0 to graph.size do
        if belongsToCluster(d.rows[i][0]) then
            g1.add(i)
        else
            g2.add(i)
        end if
    end for
    divide(g1)
    divide(g2)
end
Algorithm 1: Recursive Faststep Method.
The presented method allows hierarchical communities to be obtained. The algorithm can also be
modified to work with k greater than 1, but it is not trivial to decide which k value to use, for two
reasons: with k greater than one we are looking for more than one community, which might, in fact,
not exist; and the obtained communities may overlap, which would also have to be resolved for the
algorithm to work.
The function belongsToCluster was initially defined using equation 2.9, which, according to the
authors of Faststep, determines whether a node belongs to a community. The results of this approach
were very unreliable: the communities found depended heavily on the number of iterations used internally
for each factor, and the behaviour varied with the network used. In most of the example graphs we used,
Faststep would find no communities, or only very small ones, even in large graphs. The algorithm would
find small communities, which it could usually divide hierarchically, while the majority of the graph was
left untouched, with no clusters found.
We decided to change the way communities are chosen, using directly the values obtained in the rows
of the matrix. With the number of factors equal to 1, there is a single row in the matrix, which we
can use to tell how relevant each node is to the community. We analysed the values of this row for real
networks and found that they usually follow a power law, as seen in Figure 3, which is expected. Our
idea was to keep only the elements whose values in the matrix were greater than some cutoff, which
depends directly on the values found. We obtained acceptable results using the average of the values as
the cutoff. That way, we managed to discard the elements in the long tail of the power law, keeping only
the ones with high values.
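The cutoff rule can be sketched in a few lines (the single factor row below is illustrative data, not actual Faststep output):

```python
def cluster_by_cutoff(row):
    """Keep the nodes whose score in the single factor row (k = 1)
    exceeds the mean score - the average-cutoff heuristic described
    above. Returns the indices of the selected nodes."""
    cutoff = sum(row) / len(row)
    return [i for i, v in enumerate(row) if v > cutoff]

# a power-law-like row: two dominant nodes and a long tail
core = cluster_by_cutoff([0.9, 0.8, 0.05, 0.02, 0.01])
```

Because the mean is pulled up by the few large values, only the head of the distribution survives the cutoff, which is exactly the intended effect.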
This strategy only approximates the community found. As such, even when there is a clear difference
in value between elements inside and outside the cluster, the algorithm is unlikely to find the exact
cluster. The only reason we adopted this method is that the initial equation, proposed by the authors,
simply did not return valid results, as explained.
Figure 3: Values of the matrix obtained by Faststep (single row) for a real network, in decreasing order.
The method was then modified to use an iterative approach where communities are kept in a priority
queue and larger communities are split first. When all communities currently in the queue are smaller
than the threshold, we can save this information as one valid clustering of the graph. The threshold can
then be further reduced, allowing the method to continue and retrieve more clusterings of higher
granularities. This approach keeps all the subgraphs in a heap, with the largest subgraph on top. The
largest subgraph can be removed from the top and split into two parts, which are added back to the
heap. Every time the size of the largest subgraph drops below the current threshold, the clustering
obtained can be stored. The entire structure of communities is kept in a tree, an auxiliary structure of
the algorithm.

One problem with this method is that it may force the clusters to have similar sizes, as they are
broken into smaller ones according to the number of nodes they have.
The complexity of this method is the same as Faststep's (see equation 2.8), except that the number of
executions of the algorithm is larger and cannot be directly estimated. We apply the algorithm to
graphs, so n = m, and r = 1 as defined above. The complexity is then:

O(IT(E + (P + n) log n + S)),    (3.14)

where I is the number of runs of Faststep needed for the method to terminate. This number cannot be
bounded directly and depends on the configuration of the network.
3.2.2 Louvain Method
The Louvain method is, as explained previously, a hierarchical clustering method. As such, when
looking for a clustering of the graph, we can get several valid clusterings with different levels of
granularity. Each valid clustering can also be partially decomposed: we may pick one cluster and replace
it with the clusters of a finer-grained clustering. Therefore, the clusters obtained by the Louvain method
cannot be compared directly with the clusterings obtained by another method. To simplify the process
and reduce the number of comparisons, we only consider the clusterings corresponding to the same depth
in the hierarchy of communities, i.e., the communities at the same depth of the dendrogram produced by
the algorithm.
3.2.3 Layered Label Propagation
LLP internally runs several iterations of APM with different values of γ. The results of these iterations
can be retrieved and used directly in the comparison of clusters.
3.3 Graph Compression
As stated previously, one of the goals of this work is to understand how hierarchical clustering algorithms
can be used to obtain good compression of graphs.
A key step is to obtain a reordering of the graph from the hierarchical communities. This can easily
be done by storing the hierarchical communities in a tree, where the leaves are the nodes of the graph
and each internal node represents a cluster. We then traverse the tree using a breadth-first search; the
position of each graph node in the obtained reordering is given by the order in which it occurs in the
search. The choice of which child node to visit first is arbitrary.

This approach guarantees that nodes that are close in the hierarchical structure get close positions in
the obtained reordering, which leads to better compression.
Therefore, any method that is capable of obtaining hierarchical communities also solves the problem
of reordering the graph. Both recursive Faststep and the Louvain method are hierarchical, so this
procedure can be directly applied to obtain the ordering.
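The traversal can be sketched as follows, with the community tree represented as nested lists (internal nodes) whose leaves are vertex ids, a representation we chose for illustration:

```python
from collections import deque

def order_from_hierarchy(root):
    """Breadth-first traversal of the community tree, as described above;
    graph nodes (leaves, stored as ints) are numbered in the order they
    are dequeued. Internal tree nodes are lists of children."""
    order, q = {}, deque([root])
    while q:
        t = q.popleft()
        if isinstance(t, list):          # internal node: a cluster
            q.extend(t)
        else:                            # leaf: a graph vertex id
            order[t] = len(order)
    return order

# two top-level clusters: {0, 2} and {1, {3, 4}}
pi = order_from_hierarchy([[0, 2], [1, [3, 4]]])
```

Children of the same cluster are enqueued together, so vertices belonging to the same community receive nearby positions in the resulting ordering.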
3.4 Implementation
3.4.1 Recursive Faststep implementation
Recursive Faststep was implemented in C++, using the code provided by the original authors. As
explained previously, we run Faststep with k = 1. The number of samples is set to 2 × E, which provides
a good compromise between running time and the quality of the results.
3.4.2 Command Line tool
The algorithms used in this project are implemented in different programming languages. The easiest
way to merge these methods was to provide a command line tool that internally makes system
calls to run them. Some extra features are also included in the tool, such as comparing clusterings,
checking the size of the resulting graph files and obtaining a labelling from any set of clusterings provided.
The project source files are hosted on Github [27].

Some parts of the tool are implemented in C++, where performance is necessary, and some in Python,
as it provides easier interaction with files without using operating-system-specific methods. The tool was
developed and works on Unix-like systems. It was not tested on Windows, but it should work if Python
and Java are installed and the program is properly compiled with this target in mind.
Graphs are stored using Webgraph’s .graph files. Reorderings and clusterings of the graphs are
encoded in plain text, after being converted from the original output files of the algorithms. Table 2
describes the fhgc command line tool and the available instructions.
create-dataset <raw-data> <graph-out> [options]
    Creates a Webgraph graph, with .graph, .properties and .offsets files, given a data file and an
    output base file.
    raw-data: path to a file containing one line ri per edge. Each line contains two space-separated
    integers ri,0 and ri,1, the endpoints of edge i of the graph. By default, the edges are interpreted
    as undirected and the numbering of the vertices starts at 1.
    graph-out: base path for the Webgraph files: .graph, .properties and .offsets.
    Options:
        --sub0: vertex numbering starts at 0.
        --directed: edge (ri,0, ri,1) is interpreted as a directed edge from ri,0 to ri,1.

clusters faststep/louvain/LLP <graph> <output-folder>
    Generates the clustering of a graph using the chosen method.
    faststep/louvain/LLP: available clustering methods.
    graph: graph base path.
    output-folder: output folder where the communities file will be written.

labels faststep/louvain/LLP <graph> <output-file>
    Generates the reordering of a graph using the chosen method.
    faststep/louvain/LLP: available reordering methods.
    graph: graph base path.
    output-file: output file where the reordering will be written.

clusterings2labels <graph> <clusterings-prefix> <output-file>
    Generates the reordering of a graph from a set of externally provided clusterings (which may or
    may not be hierarchical).
    graph: graph base path.
    clusterings-prefix: prefix of the files containing the clusterings. Each clustering file must
    have n lines, where line i contains a single integer ci, the community node i belongs to.
    output-file: output file where the reordering will be written.

reorder <graph-in> <graph-out> <indexes>
    Reorders a graph using the provided reordering file.
    graph-in: input graph base path.
    graph-out: output graph base path.
    indexes: path to the file containing the reordering to apply.

compare-clusters <cluster-file-1> <cluster-file-2> [options]
    Compares two clusterings using a metric: either the Jaccard distance or NMI (the default).
    cluster-file-1: clustering file path. The file must have n lines, where line i contains a single
    integer ci, the community node i belongs to.
    cluster-file-2: second clustering file path, in the same format.
    Options:
        --nmi: uses NMI as the clustering metric.
        --jaccard: uses the Jaccard distance as the clustering metric.

compare-clusters-list <regex-clusters-1> <regex-clusters-2> [options]
    Compares a set of clusterings with another set of clusterings, using a metric: either the Jaccard
    distance or NMI (the default).
    regex-clusters-1: multiple clustering files. This argument supports patterns (using the same
    syntax as the command line tool "ls"), which must be enclosed in quotation marks. Each clustering
    file must have n lines, where line i contains a single integer ci, the community node i belongs to.
    regex-clusters-2: multiple clustering files, in the same format.
    Options:
        --nmi: uses NMI as the clustering metric.
        --jaccard: uses the Jaccard distance as the clustering metric.

size <graph> <more graphs>
    Returns the size of one or more graph files.
    graph: graph base path.
    more graphs: base paths of additional graph files.

help
    Shows usage information.

Table 2: How to use the fhgc tool.
4 Results
4.1 Datasets
Multiple datasets were used for the evaluation of the algorithms. We were interested in experimenting
with networks from different origins and with different sizes. Another important aspect for their choice
was the availability of ground-truth communities, which allows us to see how the algorithms perform
in real situations.
The datasets used are presented in Table 3.
Airports [28]: network of airports, where airports are nodes and edges are relations of the type
"airport A has direct flights to airport B". 3425 nodes, 37595 edges; directed; no ground-truth
communities.

Amazon [29]: network of products sold at amazon.com. The connections between products represent the
relation "Customers who bought this item also bought", available on the website. 334863 nodes,
925872 edges; undirected; ground-truth communities available.

Web-google [30]: web graph released by Google as part of a programming contest. In a web graph, web
pages are nodes and links between them are directed edges. 875713 nodes, 5105039 edges; directed; no
ground-truth communities.

Youtube [26]: Youtube social network, where users can form friendship relations with each other and
create groups which other users can join. These groups are considered ground-truth communities.
1134890 nodes, 2987624 edges; undirected; ground-truth communities available.

Wiki [31]: web graph obtained from the Wikipedia pages belonging to the top categories (the
categories with the most pages), with links between pages as directed edges. The ground-truth
communities are the categories themselves. 1791489 nodes, 28511807 edges; directed; ground-truth
communities available.

Table 3: Datasets used: source and information.
4.2 Detection of communities in artificial networks
As a first basic test for the clustering algorithms, we generated random networks with a defined number
of communities and evaluated the results obtained.
The networks created had two fixed parameters: the number of nodes, equal to 10000, and the average
node degree k, equal to 50. We also considered two variable parameters: the number of communities C
and the probability of rewiring prewire. We then distributed the nodes randomly between communities,
each node belonging to one and only one community, with each community Ci having ni nodes. For each
community i, we then chose (k × ni)/2 edges at random. Each edge links a node chosen randomly inside
the community to another node that is either:
• with probability 1 − prewire, a node of the same community Ci, also chosen at random;
• with probability prewire, a node of another community, with the community chosen uniformly among
the other communities of the graph and the node chosen uniformly within it.
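The generation procedure above can be sketched as follows. This is a simplified sketch, assuming self-loops and duplicate edges are not filtered; parameter names follow the text, and the function name is ours.

```python
import random

def benchmark_graph(n=10000, k=50, C=10, p_rewire=0.1, seed=None):
    """Sketch of the benchmark generator: n nodes spread over C communities,
    with (k * n_i) / 2 random edges per community, rewired with probability p_rewire."""
    rng = random.Random(seed)
    community = {v: rng.randrange(C) for v in range(n)}   # one community per node
    members = [[] for _ in range(C)]
    for v, c in community.items():
        members[c].append(v)
    edges = []
    for c in range(C):
        if not members[c]:
            continue
        for _ in range(k * len(members[c]) // 2):         # (k x n_i) / 2 edges
            u = rng.choice(members[c])                    # endpoint inside community c
            if rng.random() < p_rewire:                   # rewired endpoint:
                other = rng.choice([d for d in range(C) if d != c and members[d]])
                v = rng.choice(members[other])            # uniform community, then node
            else:
                v = rng.choice(members[c])                # endpoint stays in community c
            edges.append((u, v))
    return edges, community
```

With p_rewire = 0, all edges stay inside their community, which is the easiest case for the clustering methods.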
These benchmarks are quite simple and only offer tests with communities of similar size. The
parameter C took values from 2 to 10, and from 10 to 50 in steps of 5, while prewire took
values from 0.0 to 0.9 in steps of 0.1. The results were evaluated against the original
communities, using both NMI and the Jaccard distance as defined in subsection 3.1. Each test was run on
5 random networks with the given parameters, and the average metric value was taken.
Figures 4 and 5 present the results for the three algorithms considered, for the NMI and Jaccard Index
metrics, respectively.
Figure 4: Benchmark results, using the NMI metric, for the tested methods: Louvain (left), LLP (center), Faststep (right).
Figure 5: Benchmark results, using the Jaccard Index metric, for the tested methods: Louvain (left), LLP (center), Faststep (right).
We can easily see that the Louvain method and LLP obtained similar results: when the communities
were not excessively degraded by the rewiring, they detected all communities with only small errors.
Faststep obtained poor results: the values of both the NMI and Jaccard metrics do not exceed 0.6 in
any of the tests considered. As expected, lower rewiring probabilities imply better reconstruction of
the clustering.
We can argue that the test graphs are very homogeneous networks, with a low clustering coefficient
relative to their number of edges. Faststep is designed for real networks, which are usually
scale-free, have a high clustering coefficient and contain hyperbolic communities. Also, as the
communities of these graphs are equally strong, i.e., have similar density and size, it is difficult
for Faststep to pick just one. The error accumulated by these mismatched assumptions may hurt the
performance of the algorithm.
This test presents strong indications that Faststep might not perform well in the next tests, but we
do expect better results when using real networks.
4.3 Detection of communities in networks with ground-truth communities
The Amazon, Youtube and Wiki datasets provide ground-truth communities, i.e., the datasets contain
information that allows us to infer known communities of these networks. In these datasets, the
ground-truth clusters have some degree of overlap, and different granularities and sizes, which makes
it difficult for the tested methods to obtain good results.
In Figure 6, we present the results for the three datasets using both the Overlapping NMI and the
average Jaccard distance between the ground-truth clusters and the found communities, directly
comparing the scores of the clusterings obtained by the different methods. We then find, for each
ground-truth community, the obtained cluster with the smallest Jaccard distance to it. In Figure 7,
we plot the sizes of the ground-truth communities against the Jaccard distance to the most similar
community found. This way, we can assess how biased the methods are towards finding communities of
specific sizes.
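This matching step follows directly from the definition of the Jaccard distance. A minimal illustration (function names are ours; the actual evaluation uses the metrics as defined in subsection 3.1):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two node sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def best_matches(ground_truth, found):
    """For each ground-truth community, pair its size with the smallest Jaccard
    distance to any found cluster (the quantities plotted in Figure 7)."""
    return [(len(gt), min(jaccard_distance(gt, cl) for cl in found))
            for gt in ground_truth]
```

A distance of 0 means a found cluster reproduces the ground-truth community exactly; a distance of 1 means no found cluster shares any node with it.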
For the Amazon dataset, using the Jaccard distance, LLP and the higher-granularity clusterings of
the Louvain method achieve results of around 80%. However, good results in the ONMI metric are much
harder to achieve: the results did not exceed 5%, but, in relative terms, the curves obtained for LLP
and Louvain agree with those for the Jaccard distance.
The Youtube dataset yielded low results for the Jaccard distance, both for LLP (up to 30%) and the
Louvain method (up to 20%). The ONMI results were very low, which suggests that the communities
found were not, in any way, similar to the ground-truth ones.
The results for the Wiki dataset are also quite low for both metrics. The Wiki dataset is a
naturally denser graph, which tends to make the found clusters larger. The Louvain method would
usually stop after 2-3 iterations, as the modularity would keep increasing on each merge. Also, this
dataset is a topic network and, as we know, topics are interdisciplinary and hierarchical, which
implies that many different clusterings can be accepted, though not when testing against these
ground-truth communities.
Faststep scored almost 0% on both metrics for all datasets. We can only suppose that the communities
found by the method are simply different from the known communities used.
Figure 7 also allows us to conclude that there is no correlation between the size of a ground-truth
community and the Jaccard distance of its most similar found cluster.
As an overall evaluation of this set of tests, LLP and the Louvain method detected the ground-truth
communities of the Amazon dataset, with some errors. The results for the other datasets were not as
good. It is also important to remember that these two methods run in near-linear time in the number
of nodes of the graph, which compromises the quality of the results obtained.
Faststep presented really low results in this test. Some degree of difference between the known
communities and the found clusterings could be accepted. However, the other methods performed much
better at finding the known communities, which implies that these can be partially approximated from
the structure of the graph, and Faststep is simply failing to detect them. The next results will
help clarify this situation.
Figure 6: Detection of ground-truth communities using different levels of granularity of the methods, using three different datasets: Amazon (top row), Youtube (middle row) and Wiki (bottom row). The results are evaluated with two different metrics: ONMI (plots on the left column) and Jaccard Distance (plots on the right column).
Figure 7: Detection of ground-truth communities: relation between the size of each ground-truth community and the quality with which the methods found it. The datasets used are Amazon (top row), Youtube (middle row) and Wiki (bottom row). The methods used are the Louvain method (left column), LLP (middle column) and Faststep (right column).
4.4 Comparison of clusterings obtained with different methods
Clusterings obtained by different algorithms can also be compared directly. For this test,
ground-truth communities are not needed, which allows us to use any graph. As done previously, we
compare each clustering of each algorithm against the others. Figure 8 presents the results using
the NMI metric, while Figure 9 presents the results for the Jaccard distance.
An important observation for analysing these results is that clusterings obtained with LLP usually
have higher granularity than those obtained with the Louvain method. In some cases, the
highest-granularity clustering from the Louvain method (first level of iterative modularity merges)
can have more clusters than the lowest-granularity one from LLP, but they usually have a similar
number of communities. With this in mind, we can expect the greatest similarity between clusterings
to occur between these two, for all datasets, which is in fact what happens.
There is a high similarity between the LLP and Louvain clusterings when considering the NMI metric.
For this metric, the worst results are for the Wiki dataset. As discussed earlier, this dataset is
much denser than the others, which makes the first iteration of the Louvain method merge more than
usual. We can see, for example, that for the Youtube dataset, which has a similar size, the first
iteration of the Louvain method yields around 200000 clusters, while for the Wiki dataset only 6655
are returned. For the Wiki dataset, the next iteration reduces the number of clusters to only 77.
LLP, in the iteration that outputs the fewest clusters, returns 7677, which is rather similar; the
next one returns 174698 clusters. There is thus a big difference between the granularities of the
clusterings obtained by the two methods, which is naturally reflected in the results.
The similarity between Faststep and the other two methods is rather low: the values for this measure
never exceed 0.8 in any of the tests.
We can also observe, in the tests between Faststep and LLP, that the NMI values are to some degree
independent of the LLP clusterings (see the results for Amazon or Web-google). This means that the
NMI is simply increasing with the granularity of the Faststep clusterings. This problem is related
to the way NMI is defined: a clustering with very high granularity can be seen as a further
partitioning of an already existing clustering, which yields a relatively high NMI value instead of
the low value one would expect. NMI is designed to penalize this type of mismatch between
clusterings, but the normalization is not always enough, and can also over-penalize cases where the
similarity is indeed considerable.
In the tests with the Louvain method this issue does not occur, because the granularity of the
Louvain clusterings is low. The mutual information between large clusters and really small clusters
is low, so we obtain many values near zero.
When considering the Jaccard metric, we can see that the values are much lower, except for the
Amazon dataset. It is interesting to note that both LLP and the Louvain method also obtained good
results for this dataset in the ground-truth test, which indicates that it might be a good testing
example, having features that make algorithms perform especially well on it. Probably, the relation
between Amazon products is very well structured, with one and only one natural clustering, which
makes the methods agree, as happens here. The results obtained for the other datasets were worse.
There is a large difference between the two metrics used. This can be accepted on the basis that NMI
is a metric specifically tailored for this type of test, while the Jaccard distance is a more
generic one. Also, NMI evaluates how much information two clusterings have in common, while the
Jaccard distance focuses on the number of common elements. Even so, we can observe, for the Jaccard
metric, that the most similar clusterings are the high-granularity ones obtained by Louvain and the
low-granularity ones obtained by LLP, as discussed previously.
Faststep results are extremely low when using the Jaccard metric.
Overall, when comparing Faststep with the other two methods, we have reasons to believe that they
indeed produce very different clusterings. The similarity measured with the Jaccard metric is quite
low, while the similarity measured with NMI is overall high but not very meaningful, as there is no
correlation between the granularity of the clusterings and the NMI value obtained. These results
indicate that the clusterings obtained by Faststep are quite different from those obtained with the
other methods. We can still evaluate the clusterings of each algorithm independently, taking
advantage of graph compression, which we do in the next test.
Figure 8: Comparison of the clusterings obtained, using the NMI metric. The rows, from top to bottom, present the results for the Airports, Amazon, Web-google, Wiki and Youtube datasets. The left column plots Louvain against LLP, the middle column Louvain against Faststep, and the right column Faststep against LLP.
Figure 9: Comparison of the clusterings obtained, using the Jaccard Distance metric. The rows, from top to bottom, present the results for the Airports, Amazon, Web-google, Wiki and Youtube datasets. The left column plots Louvain against LLP, the middle column Louvain against Faststep, and the right column Faststep against LLP.
4.5 Compression Results
As a final test, we can evaluate the quality of the algorithms by using their clusterings to obtain
a reordering of the graph, and seeing how much space is saved when the graph is compressed using
Webgraph. This test is a useful technique to measure the quality of a clustering. As seen
previously, Faststep does appear to find really different clusterings when compared with the other
methods or with known communities. This test can help us understand whether, although different, its
results are in fact acceptable.
The original permutations of the vertices of the datasets used are usually in some order that
already allows good compression with Webgraph. As such, the clustering/reordering algorithms are run
on a randomly permuted version of the original graphs, to eliminate the impact of the initial
ordering on the effective compression rate. The sizes of the compressed graphs are presented in
Table 4. The space savings of the methods are presented in Table 5.
             Random permutation   Reordered with LLP   Reordered with Louvain   Reordered with Faststep
Amazon       5530 kB              1881 kB              1745 kB                  4015 kB
Airports     50 kB                25 kB                33 kB                    37 kB
Web-google   15274 kB             3650 kB              3658 kB                  8221 kB
Wiki         79586 kB             51913 kB             62702 kB                 67879 kB
Youtube      16548 kB             8848 kB              10724 kB                 13652 kB

Table 4: Size occupied by each dataset after being reordered with the different methods.
             Reordered with LLP (%)   Reordered with Louvain (%)   Reordered with Faststep (%)
Airports     50.00                    34.00                        26.00
Amazon       65.99                    68.44                        27.40
Web-google   76.10                    76.05                        46.18
Wiki         34.77                    21.21                        14.71
Youtube      46.53                    35.19                        17.50

Table 5: Space savings in percentage, relative to the randomly permuted graph file.
LLP is an algorithm specifically designed for graph compression using Webgraph. Nevertheless, the
Louvain method managed to obtain similar results, winning by a small margin on the Amazon dataset
and essentially matching LLP on Web-google.
As discussed previously, the Louvain method usually obtains communities with lower granularity than
those provided by LLP. In some graphs more than in others, the first iteration of the Louvain method
merges a lot of nodes, achieving a lower granularity of communities. When compressing a graph, we
want a full reordering of its nodes. To achieve that, we need high-granularity clusterings: with
low-granularity clusterings, some sets of nodes simply stay in the same order, not improving
compression. We believe that, for some of the datasets used, that is exactly what is happening with
the Louvain method.
In some networks, for example the Wiki one, the number of clusters obtained by the Louvain method is
also too low. For better compression, not only do we need many clusterings with high granularity,
but it is also better if each cluster in the hierarchical structure has as few children as possible.
When a hierarchical clustering is transformed into a reordering of the graph, the traversal of the
structure has to decide which child to visit first. This choice is completely arbitrary, but it has
an impact on the compression obtained. If two clusters are siblings in the hierarchical tree, we do
not know which of them to traverse first; if one is the parent of the other, that ambiguity
disappears. Therefore, we can reduce this ambiguity if there are more levels of hierarchy, which,
for the Louvain method, sometimes there are not. This problem could easily be overcome by using the
entire hierarchical structure of the network generated internally by the method, which is not a
direct output of the implementation provided by the authors.
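The traversal just described can be sketched as a depth-first pass over the cluster tree. This is an illustrative sketch (the nested-list tree representation is ours); the order in which siblings are visited is exactly the arbitrary choice discussed above:

```python
def hierarchy_to_ordering(tree):
    """DFS over a hierarchical clustering, emitting graph nodes in visit order.
    `tree` is either a graph node (a leaf) or a list of child subtrees."""
    if not isinstance(tree, list):          # leaf: an actual graph node
        return [tree]
    order = []
    for child in tree:                      # arbitrary sibling order
        order.extend(hierarchy_to_ordering(child))
    return order
```

The position of each node in the returned list is its new index in the reordered graph, so nodes sharing a cluster end up with nearby indices.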
Faststep obtained reasonable results. Naturally, the compression obtained is higher when the
compression obtained by the other methods is also high, although the Faststep results are always
well below those of the other methods. In the previous tests we already introduced several reasons
why the performance of Faststep would probably not match the other algorithms. Firstly, Faststep
builds one community at a time. In that way, all nodes that belong to overlapping communities tend
to be included in the first community created, while nodes with fewer edges but some proximity to
the community being created are simply left out. Therefore, a large error is introduced in the first
step of the algorithm, and this error is successively amplified in the next steps. Also, the
threshold that decides whether a node belongs to the community is just a loose approximation.
However, we can still detect some compression of the graphs, which implies that Faststep does find
meaningful communities in the context of graph reordering and/or compression, something that
remained questionable in the previous tests. Still, we showed empirically that it is not the best
method for the job, and that there are much better alternatives for this purpose.
4.6 Performance Evaluation
For a final evaluation of the methods, we measured the execution time needed by each method for each
of the datasets used.
The following results were obtained on an Intel(R) Xeon(R) CPU E7-4830 @ 2.13GHz computer,
with 64GB of RAM, running Linux, kernel version 4.4.29.
             LLP      Louvain   Faststep
Airports     2.21     0.431     5.76
Amazon       32.76    36.00     949.8
Youtube      54.09    68.99     2443.03
Web-google   116.24   59.53     4138.67
Wiki         271.32   234.65    18270.34

Table 6: Execution time (in seconds) of each algorithm, for each dataset. Rows are sorted by the
number of edges in each dataset.
LLP and Louvain have very efficient implementations, and it is easy to observe that they have
similar execution times. It is also interesting that there is no clear "winner" between them: LLP is
faster on two datasets, the Louvain method on the other three.
Faststep was much slower. Its execution time can be justified by our approach to the problem.
Faststep allows setting both the minimum number of iterations and the final error we want to obtain;
until the defined criteria are met, the algorithm keeps iterating over the matrix. As seen in the
previous tests, the results were far from good, and we chose a very low error value, trying to
obtain the best results possible. It could be argued that we could have traded some quality of the
clusterings for a better running time. However, in our view, the most important contribution of this
work is to determine whether Faststep can be used for the problems under study, i.e., whether the
quality of its results is similar to that of the other tested algorithms. In that light, much less
importance is given to the running time of the algorithm. Had the quality of the results been
better, a greater effort could have been put into improving the overall performance of our Faststep
version.
5 Final Remarks
In this work, we proposed an adaptation of Faststep for the problem of hierarchical clustering. We
compared its results with two state-of-the-art algorithms, LLP and Louvain method. We tested how these
three algorithms could be used for graph reordering and compression of networks using the Webgraph
framework. Finally, we implemented a tool which allows the retrieval of graph clusterings and reorderings
and the use of these clusterings or reorderings in the compression of the graph.
Since communities could not be retrieved directly with the method provided by the authors, we
devised a possible approach to this problem. Our method picks one community at a time, in contrast
to other algorithms that propagate labels or merge clusters incrementally. This has an important
impact on overlapping nodes. Overlapping nodes are usually hubs and tend to have many valid
communities; since they cannot be placed in more than one community, Faststep places them in the
first one found. The other methods can somewhat balance this effect, as communities are constructed
incrementally and can gather the more important nodes to themselves. This can lead to large
variations between the communities found by Faststep and those found by another algorithm. Also, the
cut-off added to the method to decide whether a node belongs to a community introduces a lot of
error. As said previously, the simultaneous construction of the communities allows them to control
each other; Faststep, on the other hand, uses a much cruder approach, which can place too many nodes
in a cluster, but also too few. The method can perform reasonably when dealing with difficult
graphs, especially when communities are heterogeneous, as the boundaries between communities are not
well defined and the error is acceptable. The results obtained on graph compression through
reordering support this idea. However, when dealing with graphs with clear clusterings, even
perfectly bounded ones, the implemented cut-off is incapable of capturing this information and
introduces a lot of error.
The method obtained worse results than we expected at the beginning of this work. It was especially
inaccurate on artificial communities, which lack the features that characterize real networks. When
applied to real networks, the clusterings obtained were very different from those provided by LLP or
the Louvain method. Despite this, it managed to obtain reasonable results in graph
reordering/compression.
The performance of Faststep could benefit a lot from an improved method to decide whether a node
belongs to a cluster. However, we understood the limitations of the algorithm: Faststep was designed
to solve a different problem, matrix factorization. As such, its results, although interesting for
the comparison of the two problems, cannot be expected to surpass those obtained by algorithms
specifically proposed for this purpose.
We believe that our study of Faststep as a clustering, hierarchical clustering and reordering
algorithm allowed us to explore the possibilities of the method. Even though there are other
approaches to this problem, we consider that there is not much to be done beyond this point, as the
results are a long way from those of LLP and the Louvain method, and any possible improvement would
not, in our view, be enough to make it competitive in both quality and execution time.
However, we think Faststep could be improved in several ways, which might lead to much better
results. These improvements were mentioned throughout this work and are directly related to the
results of the tests performed. A proper technique for deciding which nodes belong to a community,
working for all types of networks, would be the most urgent. Currently, Faststep works with a fixed
number of communities; extending the algorithm to search for, or refine, the number of communities
in the network could possibly extend its usage to other interesting problems. Another useful
improvement would be increasing the efficiency of the algorithm, as it takes much longer to run than
other algorithms with similar time and space complexities.
In our opinion, the results obtained with the Louvain method for graph reordering seem promising
and, as previously suggested, they could benefit if the hierarchical structure of the graph,
obtained internally during modularity maximization, were used by the reordering algorithm. We also
observed that the clusterings obtained by LLP have, in general, higher granularity than those
obtained by the Louvain method, which presents an opportunity to study how their reorderings could
be combined for an improved result.
References
[1] M. Araújo, P. Ribeiro, and C. Faloutsos, “Faststep: Scalable boolean matrix decomposition,” in
Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 461–473, Springer Interna-
tional Publishing, 2016.
[2] M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical
review E, vol. 69, no. 2, p. 026113, 2004.
[3] S. Fortunato, “Community detection in graphs,” Physics reports, vol. 486, no. 3, pp. 75–174, 2010.
[4] U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski, and D. Wagner, “On
modularity clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2,
pp. 172–188, 2008.
[5] P. Boldi, M. Rosa, M. Santini, and S. Vigna, “Layered label propagation: A multiresolution
coordinate-free ordering for compressing social networks,” 2010.
[6] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities
in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10,
p. P10008, 2008.
[7] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Rev. Mod. Phys., vol. 74,
pp. 47–97, Jan 2002.
[8] M. A. Porter, J.-P. Onnela, and P. J. Mucha, “Communities in networks,” 2009.
[9] P. Boldi and S. Vigna, “The webgraph framework i: Compression techniques,” in Proceedings of the
13th International Conference on World Wide Web, WWW ’04, (New York, NY, USA), pp. 595–602,
ACM, 2004.
[10] M. Araújo, S. Günnemann, G. Mateos, and C. Faloutsos, “Beyond blocks: Hyperbolic commu-
nity detection,” in Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, pp. 50–65, Springer Berlin Heidelberg, 2014.
[11] G. Golub and W. Kahan, “Calculating the singular values and pseudo-inverse of a matrix,” Journal
of the Society for Industrial and Applied Mathematics Series B Numerical Analysis, vol. 2, no. 2,
pp. 205–224, 1965.
[12] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,”
Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[13] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community
structures in large-scale networks,” Phys. Rev. E, vol. 76, p. 036106, Sep 2007.
[14] P. Ronhovde and Z. Nussinov, “Local resolution-limit-free potts model for community detection,”
2008.
[15] A. Noack and R. Rotta, “Multi-level algorithms for modularity clustering,” CoRR,
vol. abs/0812.4073, 2008.
[16] K. Wakita and T. Tsurumi, “Finding community structure in mega-scale social networks,” CoRR,
vol. abs/cs/0702048, 2007.
[17] L. Danon, A. Díaz-Guilera, and A. Arenas, “The effect of size heterogeneity on community identifi-
cation in complex networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2006,
no. 11, p. P11010, 2006.
[18] S. Fortunato and M. Barthélemy, “Resolution limit in community detection,” Proceedings of the
National Academy of Sciences, vol. 104, no. 1, pp. 36–41, 2007.
[19] U. Brandes, “A faster algorithm for betweenness centrality,” The Journal of Mathematical Sociology,
vol. 25, no. 2, pp. 163–177, 2001.
[20] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of
Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
[21] “Stanford large network dataset collection.” https://snap.stanford.edu/data/index.html, 2017.
[Online; accessed 2-January-2017].
[22] S. Fortunato, “Benchmark graphs to test community detection algorithms.” https://sites.
google.com/site/santofortunato/inthepress2, 2017. [Online; accessed 6-January-2017].
[23] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, "Comparing community structure identification," Journal of Statistical Mechanics: Theory and Experiment, vol. 2005, no. 09, p. P09008, 2005.
[24] A. Lancichinetti, S. Fortunato, and J. Kertész, "Detecting the overlapping and hierarchical community structure in complex networks," New Journal of Physics, vol. 11, no. 3, p. 033015, 2009.
[25] A. F. McDaid, D. Greene, and N. Hurley, "Normalized mutual information to evaluate overlapping community finding algorithms," CoRR, vol. abs/1110.2515, Oct. 2011.
[26] “Stanford large network dataset collection, youtube dataset.” https://snap.stanford.edu/data/
com-Youtube.html, 2012. [Online; accessed 2-July-2017].
[27] M. M. Duarte, “Fast hierarchical graph clustering tool on github.” https://github.com/michel94/
fhgc-tool, 2017. [Online; created 13-October-2017].
[28] “Openflights: Airport and airline data.” https://openflights.org/data.html, 2017. [Online;
accessed 7-July-2017].
[29] “Stanford large network dataset collection, amazon dataset.” https://snap.stanford.edu/data/
com-Amazon.html, 2012. [Online; accessed 2-July-2017].
[30] “Stanford large network dataset collection, google web graph dataset.” https://snap.stanford.
edu/data/web-Google.html, 2009. [Online; accessed 2-July-2017].
[31] “Stanford large network dataset collection, wikipedia network of top categories dataset.” https:
//snap.stanford.edu/data/wiki-topcats.html, 2009. [Online; accessed 2-July-2017].
A Shannon Entropy
Entropy is the expected value of the information contained in a data source, measured in bits when the base-2 logarithm is used.
If X is a random data source and P(x) is the probability of occurrence of value x, the entropy of X is given by:

H(X) = -\sum_{x \in X} P(x) \log_2 P(x)
If Y is a second random variable and P(x, y) is the joint probability that X takes value x and Y takes value y, the entropy of Y given X is

H(Y|X) = -\sum_{x \in X, y \in Y} P(x, y) \log_2 \frac{P(x, y)}{P(x)}

which quantifies the amount of information needed to describe the outcome of Y once the value of X is known.
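As an illustrative sketch (not part of the thesis tooling), both quantities can be estimated from observed samples. The conditional entropy is computed here through the chain rule, H(Y|X) = H(X, Y) - H(X), which is equivalent to the definition above; the function and variable names are ours, chosen for clarity:

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy H(X) of a sequence of observations, in bits."""
    n = len(samples)
    counts = Counter(samples)  # frequency of each distinct value estimates P(x)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    """Empirical conditional entropy H(Y|X) from (x, y) observations,
    via the chain rule H(Y|X) = H(X, Y) - H(X)."""
    xs = [x for x, _y in pairs]          # marginal observations of X
    return entropy(pairs) - entropy(xs)  # entropy(pairs) estimates H(X, Y)

# A fair coin carries one bit of entropy.
print(entropy([0, 1, 0, 1]))  # 1.0
# If Y always equals X, knowing X leaves no uncertainty about Y.
print(conditional_entropy([(0, 0), (1, 1), (0, 0), (1, 1)]))
```

Note that these are plug-in estimates from relative frequencies; for small samples they are biased, but they suffice to illustrate the definitions.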