Fast Hierarchical Graph Clustering
Miguel Martins Duarte
Thesis to obtain the Master of Science Degree in
Informatics Engineering
Supervisors: Prof. Alexandre Paulo Lourenço Francisco, Prof. Pedro Manuel Pinto Ribeiro
Examination Committee
Chairperson: Prof. José Carlos Alves Pereira Monteiro
Supervisor: Prof. Alexandre Paulo Lourenço Francisco
Member of the Committee: Prof. Francisco João Duarte Cordeiro Correia dos Santos
October 2017
Acknowledgements
Firstly, I would like to thank my thesis advisors, Professor Alexandre Francisco and Professor Pedro
Ribeiro for their help and guidance.
I’m also grateful to my friends and colleagues, namely Pedro Paredes, Andre Fonseca and Samuel
Gomes, who were always available to share their experience and understanding.
Finally, I would like to thank my parents and brother, for their encouragement and support along the
way.
Resumo
Este trabalho estuda como é que métodos de clustering de grafos (ou pesquisa de comunidades em grafos) podem ser usados para resolver problemas em grafos, tais como clustering hierárquico e reordenação. Neste contexto, explora-se como é que o Faststep [1], um método recente para factorização de matrizes e clustering de grafos, pode ser adaptado para resolver estes problemas.
Desenvolveu-se um novo método, baseado no Faststep, e comparou-se o algoritmo proposto com métodos actuais de clustering: Layered Label Propagation e método de Louvain. Os resultados avaliaram-se num conjunto alargado de testes, incluindo comparação com comunidades reais conhecidas, redes geradas por modelos aleatórios e comparação directa de algoritmos. Este estudo tira partido do Webgraph, uma ferramenta para representação sucinta de grafos, e explora como as comunidades podem ser usadas para melhorar as taxas de compressão obtidas. Concluiu-se que o Faststep, apesar de desenhado para resolver um problema diferente, pode ainda assim obter resultados aceitáveis neste. Mostrou-se também que o método de Louvain, um método de clustering hierárquico, pode de facto ser bastante promissor na tarefa de reordenação de grafos.
Palavras-chave: Ciência de Redes, Redes Complexas, Clustering de grafos, Clustering hierárquico de grafos, Compressão de grafos
Abstract
This work studies how graph clustering methods can be used to solve other graph problems, such as hierarchical clustering and graph reordering. In this context, it explores how Faststep [1], a recent method for matrix factorization and graph clustering, can be adapted to solve these other problems.
Our study develops a new method for this task, based on Faststep, and compares the proposed algorithm with state-of-the-art clustering methods: Layered Label Propagation and the Louvain method. The results are evaluated through several tests, including comparison with ground-truth communities, networks generated by random models and direct comparison of algorithms. This study also takes advantage of Webgraph, a framework for succinct representation of graphs, and explores how clustering can be a powerful tool to improve compression rates. We conclude that Faststep, although designed to solve a different problem, can still obtain acceptable results for this one. We also show that the Louvain method, a hierarchical clustering method, obtains very promising results in the graph reordering task.
Keywords: Network Science, Graph Clustering, Hierarchical Graph Clustering, Graph Compression
Contents
Resumo 5
Abstract 7
List of Tables 11
List of Figures 12
1 Introduction 15
1.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Document Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2 Background 19
2.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.1 Graph representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 FastStep Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.1 FastStep formal goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.3 Obtaining the clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Extensions to Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Modularity Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.3 Louvain Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5 Classical Hierarchical Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.1 Betweenness-based divisive algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Approach 31
3.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Initial validation of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 Comparison Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Succinct representation as clustering metric . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Graph Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Faststep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Louvain Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.3 Layered Label Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3 Graph Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Recursive Faststep implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Command Line tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Results 41
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Detection of communities in artificial networks . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Detection of communities in networks with ground-truth communities . . . . . . . . . . . 43
4.4 Comparison of clusterings obtained with different methods . . . . . . . . . . . . . . . . . . 47
4.5 Compression Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5 Final Remarks 55
References 57
A Shannon Entropy 61
List of Tables
1 Successor lists for graph 1a, using Webgraph format. . . . . . . . . . . . . . . . . . . . . . 21
2 How to use the fhgc tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3 Datasets used: source and information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Size of datasets after reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Space savings after reordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6 Execution time of the algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
List of Figures
1 Examples of graphs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Adjacency matrix of a community found by FastStep. . . . . . . . . . . . . . . . . . . . . 21
3 Faststep values for a real network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Benchmark results for the NMI metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5 Benchmark results for the Jaccard metric . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Detection of ground-truth communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7 Detection of ground-truth communities: relation of the community size with detection
quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
8 Clustering comparison using NMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
9 Clustering comparison using Jaccard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
1 Introduction
1.1 Problem
Graph clustering or community finding is an important task to further understand real networks. Community finding can naturally be seen as finding groups of vertices in a network with denser connections between themselves and sparser connections with the rest of the network. This problem is computationally hard in general, and several greedy methods, with origins in multiple fields, have been proposed to solve it. These algorithms are usually based on general assumptions about community structure and are evaluated using ground-truth clusterings, i.e., graphs where the communities are already known.
For a more formal goal, we can consider quality measures, with the most commonly used being the
modularity [2]. Some algorithms are designed specifically to maximize some metric. For a more complete
discussion on community finding we refer the reader to a review by Fortunato [3].
The complexity of the problem depends on the specific definition of clustering and is tightly related to the quality measure being optimized. In general, this optimization is an NP-hard problem [4]. Moreover, it is usually not possible to obtain an approximate solution within defined quality bounds.
Hierarchical graph clustering is a similar problem, but here we look for a hierarchy of clusters, i.e., how larger communities can be divided into smaller ones, how each of these smaller ones can be divided in turn, and so on, until we reach an indivisible community. Some of the algorithms used for the non-hierarchical problem already follow an agglomerative or divisive approach, which implicitly generates a hierarchy. Other non-hierarchical algorithms can also be used for this task, provided they expose a parameter that defines the granularity of the communities to be detected.
Another important task is clustering refinement: given an initial clustering, nodes are moved between clusters to further improve the communities already obtained.
A major concern of research in this area is time efficiency. Real networks can have several millions of nodes, which demands a linear or near-linear time algorithm. Layered Label Propagation [5] and the Louvain method [6] are two greedy methods that solve the clustering problem in near-linear time and provide solutions of reasonable quality.
A first goal of this work is to study how FastStep [1], a recent method for boolean matrix factorization, can be used for graph clustering. It was designed to factorize a large, sparse matrix into a product of two smaller matrices of fixed width, while keeping the error as low as possible. This matrix factorization can also be interpreted as a clustering method with a fixed number of communities.
In fact, FastStep is a method which is able to find rich community structures and can run in near-linear
time. This method takes a parameter k, which can be used to define the granularity of the communities
intended.
A second goal is to study how to apply it to the hierarchical community finding problem and test
how it can be used for community refinement, based on the ideas used in the Layered Label Propagation
method [5]. This application is also directly related to graph reordering, which will also be addressed in
this context.
A third goal is to integrate multiple community finding algorithms into a common framework for
hierarchical graph clustering.
1.2 Context
Network science is a field with many notable developments in this century, with graphs being essential tools for the study of many complex systems. It has been found that real networks share many structural properties [7]. The distribution of edges per vertex is heterogeneous: the degree distribution usually follows a power-law, with many nodes having few connections and a few nodes having many. Moreover, the edges tend to be organized in communities [8]: a higher density of edges within a group of nodes, with fewer connections to the elements outside the group. This concept of community or cluster is independent of the origin of the graph; it appears to be an essential element of the structure of real networks. Society, for example, has multiple organizations of people: families, work groups, villages or countries. We can also find this structure in biology (food webs, protein-protein interaction networks, metabolic networks), computer science (social networks, customer-product relations), politics, economics, etc.
Developing methods for community finding is crucial for many applications. Besides their use in specific situations, these methods are tools that help further our understanding of real networks and their structure.
Another important task is to find not only the communities of a graph, but also their hierarchy. In many networks, it is possible to detect not only the communities, but also how a community includes smaller ones [8]. We can see, for example, the community of people working and studying at a school and how it relates to the community of people living in that town. It can also be interesting to devise a way of dividing this school group into three communities (teachers, students and staff).
A large number of methods for graph clustering and hierarchical graph clustering have already been introduced; the review by Fortunato surveys most of the classical approaches to the problem [3]. Currently, there is a focus on solving the problem for very large networks, where algorithms that take more than linear time are usually unusable. With big data being a major concern nowadays, real networks with millions of nodes and billions of edges present a difficult challenge that certainly needs to be tackled. Layered Label Propagation [5] and the Louvain method [6] are two efficient algorithms with good, well-known implementations, which work for reasonably large networks. However, they may return suboptimal clusterings, and alternative, slower methods may get better results.
FastStep [1] is a method for finding a factorization of a boolean matrix which offers better interpretability. FastStep is, in particular, able to find structures that go “beyond blocks”, providing more realistic representations. It can be applied directly to adjacency matrices (which are boolean matrices). The ability to find rich community structures demonstrated by its authors, together with a parameter that defines the granularity of the communities we are looking for, makes it a promising algorithm for the search for hierarchical communities.
One important application of clustering is graph compression, or the succinct representation of a graph. As described in [9], it is possible to obtain a smaller representation of a graph after a reordering of its vertices. Information about communities, or, even better, hierarchical communities, can help increase the compression rate. Based on this, it is possible to infer the quality of a clustering algorithm just by analysing how much the graph can be compressed using its output. We will also follow this approach in the evaluation of our results.
1.3 Document Outline
Section 1 describes the problem we are proposing to solve and gives some context about the most relevant
work done in this field. Section 2 starts with basic notation when working with graphs and describes
clustering algorithms. Includes classical methods and efficient algorithms, which are the main focus of
this work. Section 3 explains the solution chosen for the problem and the reasons behind it. It also
presents how the solution is evaluated. Section 4 presents the results obtained for the experimental tests,
defined in the previous section, and makes a detailed analysis of them. Section 5 concludes the document,
with general observations on the results obtained and on possible future work.
2 Background
2.1 Graphs
A graph is a pair (V,E) formed by a set of vertices V and a set of edges E. In this work, we define
n = |V | and m = |E|. Each edge is a pair (i, j) with i, j ∈ V . In an undirected graph, an edge (i, j) is
the same as the edge (j, i). In a directed graph, it means that there is a connection only from i to j. A
multigraph is a graph that can have multiple connections between i and j.
A community or cluster can be defined, in a simplified way, as a set of vertices of the graph with denser connections between each other and fewer connections to the other vertices of the graph. For a deeper review on graph clustering, see [3]. Communities are associated with the natural notion of a group, which can have a complex and intricate structure. A more precise definition of a cluster is difficult to give, because it depends on understanding the structure of real networks, a problem that is far from solved. Usually, a community is formalized by defining a specific quality measure for graph clustering, which is then optimized by some method. Also, a wide variety of problems that involve finding structures in a graph are NP-hard, and graph clustering is one of them: even with an exact quality measure to optimize, it cannot be solved efficiently by current computers and must be tackled heuristically or greedily.
(a) An undirected graph. (b) A directed graph.
Figure 1: Examples of graphs.
2.1.1 Graph representation
Usually, graphs are stored using adjacency matrices or adjacency lists. Some alternative formats can also
be used when space efficiency is a concern.
Adjacency matrices are matrices of size |V| × |V|. In an adjacency matrix M, Mij represents the connection between node i and node j. If Mij = 0, then there is no edge between i and j. If there is an edge between the vertices, then Mij = 1. For weighted graphs, the definition can be extended to Mij = k, where k is the weight of the edge that connects i to j. If a graph is undirected, then Mij = Mji, i.e., M = MT. An adjacency matrix allows one to check whether there is an edge between any two nodes in constant time. The space complexity of this data structure is O(n2), which may be prohibitive when dealing with large sparse graphs.
0 1 1 0
1 0 1 0
1 1 0 1
0 0 1 0
(2.1)
Adjacency matrix of graph 1a
0 1 1 0
0 0 1 0
0 0 0 1
0 0 0 0
(2.2)
Adjacency matrix of graph 1b
Adjacency lists store, for each vertex i, a list of its neighbours, i.e., the nodes connected to i by one edge. We can augment these lists with weights if we are working with a weighted graph. The space complexity of this structure is O(n + m). Checking for an edge between any two nodes can take O(n) time. If we sort each list, this bound can be lowered to O(log n) with binary search. Using a hashtable for each node, it is possible to further reduce the time complexity to O(1), without increasing the space complexity. Usually, adjacency lists are a good option for large sparse networks.
Compressed Sparse Row (in short, CSR) is a format that stores a sparse matrix in O(n + m) space and, for graphs without weights (i.e., matrices of zeros and ones), uses only two arrays. The first one (C), of size O(m), stores the column indexes of the non-zeros in row-major order (from left to right, top to bottom). The other array (R), of size O(n), stores, for each row i, the position in C where the elements of row i start.
For graph 1a, the CSR representation is:
C: 1 2 0 2 0 1 3 2
R: 0 2 4 7 8
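As a sketch of how these two arrays are built, the following Python fragment (illustrative only; the function name `to_csr` is ours) constructs C and R from a list of sorted neighbour lists:

```python
def to_csr(adj):
    """Build the CSR arrays (C, R) from adjacency lists.

    adj: list where adj[i] is the sorted neighbour list of node i.
    The neighbours of node i end up in C[R[i]:R[i+1]].
    """
    C, R = [], [0]
    for neighbours in adj:
        C.extend(neighbours)      # column indexes, row by row
        R.append(len(C))          # offset where the next row starts
    return C, R

# Graph 1a: edges 0-1, 0-2, 1-2, 2-3 (undirected, both directions stored)
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
C, R = to_csr(adj)                # C = [1, 2, 0, 2, 0, 1, 3, 2], R = [0, 2, 4, 7, 8]
```

Note that R has n + 1 entries in this sketch (the final entry being m), a common convention that makes the slice for the last row uniform.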
Webgraph is a compression format for succinct representation of graphs proposed by Boldi and Vigna [9]. The format can be seen as an improvement of adjacency lists, and is based on two principles observed in web graphs - the networks formed by the pages and links of the web:
• Locality - most links on a page point to pages of the same website, used in the context of navigation, so their URLs share a long common prefix; with pages ordered lexicographically by URL, a page and its successors get close indexes
• Similarity - pages that are close in lexicographic order tend to be similar and have many links in common
The first principle is exploited by storing the gaps between the node indexes in the adjacency list, instead of the indexes themselves. This saves space because the gaps are smaller numbers than the indexes, so each integer can be stored using less memory.
Let us say the successor list of node x is represented by S(x) = s1, s2, s3, ..., sk, where si < si+1. The indexes si can be quite large but, by the principle of locality, we expect si+1 − si to be small. So, S is stored as s1 − x, s2 − s1 − 1, s3 − s2 − 1, ..., sk − sk−1 − 1. The second principle is exploited by allowing the successor list of one node to be an incomplete copy of the successor list of another node. For that, a set of bits is used, where bit i set to 1 represents the copy of successor i; an extra list holds the successors not in the "copied" list.
Table 1 presents an example of the successor lists that we can obtain for graph 1a. For simplicity, copies are not considered.

Index  Original S   Smaller S
1      {2, 3}       {1, 0}
2      {1, 3}       {−1, 1}
3      {1, 2, 4}    {−2, 0, 1}
4      {3}          {−1}

Table 1: Successor lists for graph 1a, using the Webgraph format.
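The gap encoding defined above (ignoring reference copies) can be sketched in a few lines of Python; the function names are ours, and this is not the actual Webgraph implementation:

```python
def gap_encode(x, successors):
    """Encode the sorted successor list of node x as gaps:
    s1 - x, then s(i+1) - s(i) - 1 for each remaining successor."""
    gaps = [successors[0] - x]
    for prev, cur in zip(successors, successors[1:]):
        gaps.append(cur - prev - 1)
    return gaps

def gap_decode(x, gaps):
    """Invert gap_encode, recovering the original successor list."""
    successors = [x + gaps[0]]
    for g in gaps[1:]:
        successors.append(successors[-1] + g + 1)
    return successors

# Node 3 of graph 1a (1-indexed, as in Table 1): S(3) = {1, 2, 4}
print(gap_encode(3, [1, 2, 4]))   # [-2, 0, 1]
print(gap_decode(3, [-2, 0, 1]))  # [1, 2, 4]
```

Only the first gap can be negative; the remaining ones are non-negative and, by the locality principle, small, so they can be encoded with few bits.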
2.2 FastStep Algorithm
FastStep is an algorithm for matrix factorization proposed by Araujo et al. [1]. It factorizes a boolean matrix into two matrices with smaller dimensions and non-negative elements. It helps to uncover the underlying structure of a matrix, and can be useful for compression, prediction and denoising. The algorithm gives an approximate result with low error and runs in near-linear time in the number of non-zero elements.
Figure 2: Adjacency matrix of a community found by FastStep.

An important feature of this method is the ability to find rich community structures. The results obtained experimentally include hyperbolic clusters which, as shown in a recent work by Araujo et al. [10], are a good approximation to the internal organization of real communities. Figure 2 shows a community found by FastStep. Common algorithms only handle the specific case of finding rectangular, block-like structures in the matrix, which is why the authors present FastStep as a method that goes Beyond Blocks.
Another interesting property of FastStep is the strong interpretability of the boolean matrix decomposition. The use of non-negative factors in the matrices makes it possible to establish the importance of elements and is in fact what enables representations of non-block structures. Also, the boolean reconstruction allows clear predictions and explanations of the non-zeros.
2.2.1 FastStep formal goal
Let M be an n × m boolean matrix. The goal is to find an n × r non-negative matrix A and an m × r non-negative matrix B, so that the product ABT is a good approximation of M after thresholding.
\min_{A,B} \|M - u_\tau(AB^T)\|_F^2 = \sum_{i,j} \left(M_{ij} - u_\tau(AB^T)_{ij}\right)^2 \qquad (2.3)

where \|\cdot\|_F is the Frobenius norm and u_\tau(X) simply applies a step function to each element X_{ij}:

[u_\tau(X)]_{ij} = \begin{cases} 1 & \text{if } X_{ij} \ge \tau \\ 0 & \text{otherwise} \end{cases} \qquad (2.4)
The thresholding operator renders the objective function non-differentiable. In order to solve it, the
function can then be approximated by another with similar objective:
\min_{A,B} \sum_{i,j} \log\left(1 + e^{-M_{ij}\left(\sum_{k=1}^{r} A_{ik}B_{jk} - \tau\right)}\right) \qquad (2.5)
Here, M is a matrix with values in {−1, 1}, where zeros are represented by −1 and ones by 1. This
function is the objective function used by Faststep.
The objective function can then be optimized in several ways. The authors used the gradient descent method. Let us consider S_{ij} = \sum_{k=1}^{r} A_{ik}B_{jk}. The gradient of the objective function, with respect to A_{ik}, is then given by:

\frac{\partial F}{\partial A_{ik}} = \sum_{j=1}^{m} \frac{B_{jk}}{1 + e^{\tau - S_{ij}}} - \sum_{j \in M_i} B_{jk} \qquad (2.6)

where M_i denotes the set of columns with a non-zero in row i.
The update rules for B are similar.
Some implementation details are worth mentioning. Matrices A and B are projected after each iteration, and projected to a small value ε instead of 0, as A = B = 0 is a stationary point of the objective function from which there would be no improvement. The τ variable was set to 20 in the authors' implementation, as it allowed good results to be achieved.
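To make the update concrete, the following NumPy sketch performs one dense (non-approximated) gradient-descent step on objective (2.5), using gradient (2.6) and the ε-projection just described. It is an illustrative reconstruction from the equations, not the authors' implementation; the function name, learning rate and ε are our choices, while τ = 20 follows the authors.

```python
import numpy as np

def faststep_step(M, A, B, tau=20.0, lr=1e-3, eps=1e-6):
    """One dense gradient-descent step on the smooth objective (2.5).

    M: n x m matrix with entries in {-1, +1} (zeros recoded as -1).
    A: n x r, B: m x r, both non-negative.
    """
    S = A @ B.T                                    # S_ij = sum_k A_ik B_jk
    sig = 1.0 / (1.0 + np.exp(tau - S))            # sigmoid term of (2.6)
    pos = (M > 0).astype(float)                    # indicator of the non-zeros
    grad_A = sig @ B - pos @ B                     # equation (2.6), all i, k at once
    grad_B = sig.T @ A - pos.T @ A                 # symmetric rule for B
    A = np.maximum(A - lr * grad_A, eps)           # project to eps, never to 0
    B = np.maximum(B - lr * grad_B, eps)
    loss = np.log1p(np.exp(-M * (S - tau))).sum()  # objective (2.5)
    return A, B, loss
```

Iterating this step until the loss stabilizes reproduces the dense O(Tnmr) scheme discussed in the next subsection.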
2.2.2 Complexity
A straightforward implementation of the algorithm would take O(Tnmr²) time, where T is the number of iterations needed, n and m are the dimensions of the boolean matrix and r is the rank of the decomposition (the width of both A and B, as stated previously).
Using additional O(nm) memory to store S, updating it every iteration, reduces the complexity to O(Tnmr).
This algorithm runs in time linear in the number of matrix entries, O(nm), which is quadratic in n for square matrices. For a sparse matrix, it is possible to introduce some approximations to the algorithm that further reduce its complexity.
As we can see in equation 2.6, calculating the gradient entries for a factor k requires O(nm) time. Still, the gradient can be approximated with a total number of operations in the order of O(E), with E being the number of non-zero elements in the matrix. Real networks are generally sparse: the number of edges is of the same order of magnitude as the number of vertices. This approximation introduces some error in the decomposition, but it is considerably low.
Briefly, the first part of the gradient is a sigmoid, and for a term to have a significant impact on the gradient, both a high S_{ij} and a high B_{jk} are needed. So, pairs (i, j) with high A_{ik}B_{jk} are considered first, as they correspond to high values in both sigmoid parameters. As such, equation 2.6 can be approximated by:
\frac{\partial F}{\partial A_{ik}} \approx \sum_{(i,j) \in P(t)} \frac{B_{jk}}{1 + e^{\tau - S_{ij}}} - \sum_{j \in M_i} B_{jk} \qquad (2.7)
where P(t) is the set of elements considered to introduce the greatest gradient variation; r sets P_k(t) need to be kept, one for the calculation of each factor k of the gradient.
It is also important to note that, initially, only the non-zeros contribute to the gradient, which means it can be calculated using only the second summand of equation 2.7. As the iterations progress, the error will also move to some of the zeros. However, given M's sparsity and the symmetry of the error function - the error of misrepresenting a one is the same as that of misrepresenting a zero - |P(t)| can be kept small, in the order of O(rE).
To quickly find the top-t pairs (i, j) of P with highest A_{ik}B_{jk}, let a_k and b_k be columns k of matrices A and B, respectively. After sorting a_k and b_k, the biggest A_{ik}B_{jk} not currently in P_k can be selected from a very small set of candidates. Therefore, one can keep a priority queue with O(min(n,m)) elements, select a set of t non-zeros and approximate the gradient of all elements in factor k in O(t + n log n + m log m) operations.
In order to detect convergence, after each iteration of the gradient descent an estimate of the error F(A,B) is calculated by considering all the non-zeros and a uniform sample of the zeros of the matrix. The error is then scaled accordingly.
With these modifications, the total complexity can be revised to O(Tr(E + P log(min(n,m)) + n log n + m log m + S)), where S is the number of samples used to check convergence. When our matrix is the adjacency matrix of a graph (n = m), the complexity is:

O(Tr(E + (P + n) log n + S)) \qquad (2.8)
As we can see, the algorithm scales linearly with E, the number of non-zeros. However, when the number of communities we want to find gets close to n, the complexity of the method becomes quadratic in n. Also, there is no result establishing that T is independent of r, as the time the algorithm takes to converge may depend on the size of the matrices we want to obtain.
2.2.3 Obtaining the clusters
For community detection, i.e., to determine whether an element belongs to a factor k, membership can be derived directly from the principles of the decomposition.
For this specific purpose, we can replace the matrices A and B with one single matrix A, and use the matrix A instead of B in all the steps of the algorithm (i.e., B = A). This approach is acceptable, as we are looking for communities, and communities are considered undirected. The gradient and its minimization remain similar.
As such, a row element i belongs to a factor k if there is a non-zero in the reconstructed matrix in row i and this factor contributed a weight above τ/r:

A_{ik} \ge \frac{\tau}{r \max(b_k)} \quad \text{and} \quad S_{i,\arg\max b_k} \ge \tau \qquad (2.9)
Running the algorithm with different values of the rank r makes it possible to obtain communities with different granularities.
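Rule (2.9) is straightforward to apply once the factor matrix is known. The sketch below is our own illustrative code, with B = A as described above; it assumes strictly positive column maxima in A, as guaranteed by the ε-projection.

```python
import numpy as np

def factor_members(A, tau=20.0):
    """Assign nodes to factors using rule (2.9), with B = A (undirected case).

    A: n x r non-negative factor matrix with positive column maxima.
    Returns one array of node indices per factor.
    """
    n, r = A.shape
    S = A @ A.T                           # reconstructed scores S_ij
    clusters = []
    for k in range(r):
        bk = A[:, k]                      # column k of B (= A here)
        j_star = int(np.argmax(bk))       # strongest member of factor k
        threshold = tau / (r * bk.max())  # A_ik >= tau / (r max(b_k))
        members = np.where((bk >= threshold) & (S[:, j_star] >= tau))[0]
        clusters.append(members)
    return clusters
```

A factor whose strongest column never reaches the threshold simply yields an empty cluster, which matches the intuition that not every factor of the decomposition corresponds to a community.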
2.2.4 Results
The authors show two empirical results on real data. In their tests, FastStep achieved lower squared error than other scalable methods: SVD [11], NNMF [12] and HyCom [10]. The algorithm scales linearly in the number of non-zeros, as shown previously. In the context of graph data, which is the concern of the present work, the non-zero elements are the edges of the network, so the algorithm scales near-linearly with the number of edges. It also depends on the number of iterations needed for it to converge, which is assumed to be small.
2.3 Label Propagation
Label Propagation [13] (LP) is a near-linear time algorithm for finding communities in a graph. Its time complexity makes it useful for large networks, where algorithms that take more than linear time become impractical. The algorithm works as follows:
1. Initialize each node of the graph with a distinct label;
2. Update each node's label to the most frequent label among its neighbours. Ties are broken uniformly at random;
3. Repeat step 2 until the labels stabilize, i.e., there are no more updates to the labels of the graph.
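These steps can be sketched as a short Python routine; this is an illustrative version of the method described above (names and the fixed random seed are ours), not the authors' code. A node keeps its current label whenever it is already among the most frequent ones, which is one common way of making the stopping condition well defined.

```python
import random
from collections import Counter

def label_propagation(adj, max_iter=100, seed=42):
    """Plain Label Propagation. adj[i] is the neighbour list of node i.
    Returns one label per node; nodes sharing a label form a community."""
    rng = random.Random(seed)
    labels = list(range(len(adj)))              # step 1: distinct labels
    order = list(range(len(adj)))
    for _ in range(max_iter):
        rng.shuffle(order)                      # process nodes in random order
        changed = False
        for x in order:
            if not adj[x]:
                continue
            counts = Counter(labels[y] for y in adj[x])
            best = max(counts.values())
            ties = [lab for lab, c in counts.items() if c == best]
            if labels[x] in ties:               # already a maximal label
                continue
            labels[x] = rng.choice(ties)        # step 2: ties broken at random
            changed = True
        if not changed:                         # step 3: stop when stable
            break
    return labels

# Two triangles joined by a single edge
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4, 5], [3, 5], [3, 4]]
labels = label_propagation(adj)
```

On this toy graph the two triangles typically end up with one label each, though the random tie-breaking means the exact partition can vary between seeds.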
According to the authors, very few iterations are usually needed for the algorithm to stop. Each iteration takes linear time, but no bound on the number of iterations has been proven; only empirical results support this.
The communities are formed by nodes of the same label. The algorithm can only find non-hierarchical
and non-overlapping clusters. Due to the way ties are broken, successive runs of this algorithm produce
different partitions. The authors proposed a way to use successive runs of the algorithm, followed by an
aggregation method, to obtain a more informative clustering. The aggregation algorithm is quite simple:
nodes keep a set of all the labels obtained in the successive runs of the algorithm; clusters are formed by
the nodes that have the exact same label set. This aggregation method may produce many fine-grained communities if many runs of the algorithm are aggregated.
The Label Propagation algorithm, despite its very good performance, tends to place the majority of the nodes in the same community, due to the inherent topology of real networks. This problem gave rise to several Label Propagation variants that keep the original idea but change the label update rule.
2.3.1 Extensions to Label Propagation
One of the most interesting extensions to LP is the Absolute Potts Model algorithm (APM) [14]. In this algorithm, consider, for a given node x, that its neighbours have labels λi, with i between 1 and K, K being the number of distinct labels in the neighbourhood of x, and that there are ki neighbours with label λi. In the simple LP algorithm, the new label λi is chosen such that ki is maximum. APM, instead, maximizes the following expression:
ki − γ(vi − ki), (2.10)
where vi is the number of nodes in the whole graph with label λi currently assigned.
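The update rule can be sketched in a few lines. The data layout here is illustrative (a `volume` map tracking how many nodes currently hold each label is our assumption about the bookkeeping needed):

```python
from collections import Counter

def apm_best_label(neigh_labels, volume, gamma):
    """Return the label maximizing k_i - gamma * (v_i - k_i), where k_i
    counts occurrences of the label among the neighbours and v_i is the
    label's current volume in the whole graph."""
    counts = Counter(neigh_labels)
    return max(counts,
               key=lambda lab: counts[lab] - gamma * (volume[lab] - counts[lab]))

# gamma = 0 reduces to plain LP (most frequent neighbour label wins);
# a positive gamma penalizes labels that are large elsewhere in the graph
chosen = apm_best_label(['a', 'a', 'b'], {'a': 100, 'b': 3}, gamma=0.1)
```

With γ = 0 the rule picks the dominant label 'a'; with γ = 0.1 the large global volume of 'a' is penalized and 'b' wins, which is how APM discourages giant clusters.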
Another important work based on this method is Layered Label Propagation (LLP) [5], an
algorithm for graph reordering (equivalently, for reordering the rows and columns of the adjacency matrix).
Large networks require special storage concerns, as they usually cannot fit in the memory of a computer
with standard resources. To reduce the space used to store vertex indices, distances between indices are
stored instead. To obtain a better encoding, vertices are reordered to minimize these distance values. This
way, in an adjacency list, we expect to store smaller values, using less space. Intuitively, we can see
how clustering can help solve this problem. LLP is an algorithm that performs this reordering with this
goal in mind.
The ordering obtained by LLP is used in the Webgraph library, which, among other purposes, can be
used as an efficient framework for graph compression [9]. Webgraph obtains very good compression
ratios using this approach.
This algorithm uses the APM algorithm as a starting point. As presented by the authors, APM
has two problems: the parameter γ is hard to estimate from common graph measures, such as size
or density, and the clusters obtained by the algorithm usually follow heavy-tailed distributions, yielding
a few very large clusters and a huge number of small ones.
LLP works by running successive iterations of APM, with different γ values, and storing an ordering
of the nodes using both the ordering of the previous iteration and the clustering obtained in the current
iteration. This ordering is defined formally by:
x ≤k+1 y  iff  πk(λk(x)) < πk(λk(y))  or  (λk(x) = λk(y) ∧ πk(x) ≤ πk(y)),    (2.11)

where λk is the labelling function resulting from iteration k and πk is the index of the node in the ordering
obtained at iteration k.
In other words, elements with different labels are ordered with respect to these labels, elements with
the same label keep the ordering of the previous iteration.
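This update can be implemented with a single stable sort. The sketch below is our own illustration, taking a label's rank as the previous position of its first member; Python's stable sort then keeps the previous order inside each label class, as equation 2.11 requires:

```python
def refine_order(order, label):
    """One LLP refinement step: order is the previous node ordering (pi_k),
    label maps each node to its cluster label from the current APM run."""
    rank = {v: i for i, v in enumerate(order)}            # pi_k
    # rank of a label = previous position of its first occurrence
    label_rank = {}
    for v in order:
        label_rank.setdefault(label[v], rank[v])
    # stable sort: different labels ordered by label rank,
    # equal labels keep the previous relative order
    return sorted(order, key=lambda v: (label_rank[label[v]], rank[v]))

new_order = refine_order([3, 1, 2, 0], {3: 'A', 1: 'B', 2: 'A', 0: 'B'})
```

Here nodes 3 and 2 (label 'A') are pulled together in front, each pair preserving its previous relative order.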
Several choices for the γk values can be made. The authors obtained better results using a
random γk, chosen uniformly from the set {0} ∪ {2^−i : i = 0, ..., K}. They also tried using the same value
γ for all iterations, but the results were consistently worse. In our view, this might indicate that
combining communities of different granularities is a way of getting better compression, which suggests
that a good hierarchical clustering method can be used to obtain an even better compression rate.
2.4 Modularity Maximization
Modularity is a metric introduced by Girvan and Newman [2] to measure the quality of a clustering.
It was initially defined only for unweighted networks but, in later work, extended to weighted ones. The
Louvain method, which we present next, works with multigraphs, which can be represented as weighted
graphs. Therefore, we use the weighted definition to cover that case too.
This metric is based on the idea that a random graph does not have a community structure. For
that, we consider a null-model, a new graph where the degree distribution is the same as the original
graph, but the edges are rewired in a random way. The existence of a community can be discovered by
comparing, for a specific subgraph, the expected density of edges and the real density of edges. Naturally,
if the density of edges within a subgraph is much higher than expected, it means we might have found a
community.
2.4.1 Definitions
Let us consider a function f : V × V → N that assigns a non-negative edge weight to each pair of
vertices of the graph. For unweighted graphs, f can be defined as one when there is an edge between the
two nodes and zero otherwise. For multigraphs, we take the number of edges between the two nodes.
We also consider deg(v) = ∑u∈V f(u, v), the degree of node v. f and deg can be generalized to a set of
vertices C ⊆ V, with f(C, C) = ∑u∈C,v∈C f(u, v) and deg(C) = ∑v∈C deg(v). A graph clustering
C = {C1, C2, ..., Ck} partitions the vertices into k disjoint non-empty subsets Ci ⊆ V.
The expected fraction of edges within a community C is given by:

deg(C)² / deg(V)²    (2.12)

which is the expected fraction of edges of the graph that are within C, i.e., the fraction of edges of the
graph that are within C in the null-model.

The actual fraction of edges within a community C is given by:

f(C, C) / f(V, V)    (2.13)

which is the fraction of edges of the original graph that are within C.

The modularity of a clustering C is the sum, over all communities, of the difference between the
actual fraction of edges (2.13) and the expected fraction of edges (2.12). It is given by:

QC = ∑C∈C ( f(C, C) / f(V, V) − deg(C)² / deg(V)² )    (2.14)
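To make equation 2.14 concrete, the following sketch computes Q for an undirected, unweighted graph stored as adjacency lists. This is our own illustration; each edge is listed in both directions, so f(V, V) = deg(V) = 2m:

```python
from collections import defaultdict

def modularity(adj, comm):
    """Compute Q of Eq. (2.14).
    adj: node -> list of neighbours (each edge listed in both directions);
    comm: node -> community id."""
    deg = {v: len(nb) for v, nb in adj.items()}
    total = sum(deg.values())                  # f(V, V) = deg(V) = 2m
    internal = defaultdict(int)                # f(C, C) per community
    deg_c = defaultdict(int)                   # deg(C) per community
    for v, nb in adj.items():
        deg_c[comm[v]] += deg[v]
        for u in nb:
            if comm[u] == comm[v]:
                internal[comm[v]] += 1
    return sum(internal[c] / total - (deg_c[c] / total) ** 2 for c in deg_c)

# two triangles joined by a bridge, one community per triangle
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
comm = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
q = modularity(adj, comm)
```

For this graph Q = 5/14 ≈ 0.36, a positive value reflecting the two dense triangles.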
2.4.2 Algorithms
Maximizing the modularity of a graph is an NP-Hard problem [4]. Several algorithms were proposed to
solve the problem greedily. In an effort to organize existing solutions into a coherent design space,
Noack and Rotta [15] define two types of algorithms for modularity maximization: greedy coarsening
algorithms and refinement algorithms.
A greedy coarsening algorithm starts from singleton clusters and merges clusters iteratively, choos-
ing the merge with highest modularity increase. It can produce reasonable results. However, it was shown
to be biased towards merging larger clusters [16], [17]. Several modifications to the algorithm were pro-
posed, with other priority criteria for merges and with changes on the purely greedy merge strategy. The
results obtained by these coarsening algorithms can be further optimized by refinement algorithms.
Refinement algorithms iteratively move individual nodes between clusters. Here, we can choose the best
move in each iteration, i.e., the one with the highest modularity increase. This approach can be expensive
to compute. However, choosing the moves in an arbitrary order is much faster and not necessarily less
optimal.
For a review and proper comparison of these alternative approaches and different coarsening and
refinement techniques, check [15].
Modularity optimization may fail to identify modules smaller than a scale that depends on the
total number of links in the network, as shown by Fortunato et al. [18]. As such, modularity optimization
methods are considered to have a resolution limit: communities smaller than some specific size,
which varies with the graph considered, may not be found by these methods and are therefore absorbed
into larger ones.
2.4.3 Louvain Method
The Louvain Method [6] is a simple method for hierarchical graph clustering, based on the optimization
of the modularity.
The algorithm starts with a weighted network of N nodes, where each node i initially belongs to its
own community Ci.
1. For each node i in the network
1.1. For each neighbour j of i
1.1.1. move node i to community Cj (i.e., set Ci to Cj) if the modularity gain of this move is
positive
2. Merge the nodes inside each community into one single node
2.1. The edges between nodes of the community become a self-loop, with weight equal to the sum
of the weight of all those edges
2.2. All other edges are kept and merged if their endpoints are the same.
3. If step 2 made any change in the community structure, proceed to step 1. Otherwise, the algorithm
stops.
The modularity gain from moving an isolated node i into community C is given by:

∆Q = [ (f(C, C) + f(i, C)) / 2m − ((deg(C) + deg(i)) / 2m)² ] − [ f(C, C) / 2m − (deg(C) / 2m)² − (deg(i) / 2m)² ],    (2.15)
where m is the sum of the weights of all the links in the network. A similar expression can be derived
for the modularity change when node i is removed from its community. The algorithm then computes
the overall gain of removing i from its current cluster and moving it into a neighbouring cluster.
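The gain of equation 2.15 translates directly into code. The argument names below are ours (f_cc = f(C, C), f_ic = f(i, C), m = total edge weight); this is a sketch, not the authors' implementation:

```python
def delta_q(f_cc, f_ic, deg_c, deg_i, m):
    """Gain of Eq. (2.15) for moving an isolated node i into community C."""
    after = (f_cc + f_ic) / (2 * m) - ((deg_c + deg_i) / (2 * m)) ** 2
    before = f_cc / (2 * m) - (deg_c / (2 * m)) ** 2 - (deg_i / (2 * m)) ** 2
    return after - before

# a toy evaluation: community with internal weight 4, node i with 2 units
# of weight into it, degrees 6 and 3, total edge weight m = 10
gain = delta_q(f_cc=4, f_ic=2, deg_c=6, deg_i=3, m=10)
```

In a Louvain implementation this expression is evaluated once per candidate neighbouring community, keeping only the running sums f(C, C) and deg(C) per community.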
Usually, the first pass (running step 1 followed by step 2 at the beginning of the algorithm) is the
heaviest computational task and takes most of the computing time. The method then runs in near-
linear time, if we assume that the degree of a node is bounded by a constant independent of the size of
the graph. The number of hierarchy levels resulting from the algorithm is small, which implies few passes
before the algorithm terminates.
It is also important to note that the order in which nodes are visited in step 1 affects the results.
Even though the order does not seem to significantly affect the modularity of the resulting clustering,
it can have an impact on the computation time of the algorithm.
The authors [6] report several tests showing both high-quality results and good execution times. The
algorithm ran in 152 minutes on a large network with 118 million nodes and 1 billion edges.
The authors argue that the algorithm might partially avoid the resolution limit of modularity maxi-
mization, because it is highly unlikely that, in step 1 of the algorithm, all nodes from one community are
moved to another. In step 2, clusters are merged together, but the smaller ones that were combined are
kept deeper in the hierarchy.
2.5 Classical Hierarchical Algorithms
Classical hierarchical algorithms follow two basic approaches: agglomerative and divisive. These algo-
rithms produce a dendrogram that represents the entire hierarchy obtained.

Agglomerative algorithms merge clusters iteratively, joining the communities with the highest
similarity, until the desired number of clusters is reached or only one cluster remains.

Divisive algorithms remove edges from the graph, creating disconnected communities, which become
smaller and smaller as the algorithm runs. The edges to remove are chosen by some metric that tries to
separate nodes with low similarity.
2.5.1 Betweenness-based divisive algorithm
An interesting divisive algorithm was proposed by Newman and Girvan [2]. In their approach, the
edge with the highest betweenness value is removed next, with three different variants of betweenness
suggested by the authors: edge betweenness, random-walk betweenness and current-flow betweenness.
Edge betweenness of an edge is the number of shortest paths, between any two nodes of the graph,
that contain that edge. Random-walk betweenness is defined like edge betweenness, but considering
random walks from source to destination instead of shortest paths between every two nodes. For each
random walk between two nodes, there is some probability of choosing a path that includes a given edge e.
The random-walk betweenness of edge e is the sum, over every possible pair of nodes (n1, n2), of the
probability that e is included in a random walk between n1 and n2. Current-flow betweenness is based
on ideas from electrical circuits. It treats each edge of the graph as a unit resistor and each pair of nodes
as a source and a sink; more current flows along shorter paths. The betweenness value of each edge e is
then obtained by adding the flow through e over all possible (source, sink) pairs. The authors prove that
random-walk betweenness is equivalent to current-flow betweenness.
The divisive algorithm works as follows:
1. Compute betweenness values for all edges present in the graph (using the chosen betweenness
metric)
2. Remove edge of the graph with highest betweenness value
3. Verify stopping criteria. If not achieved yet, repeat from step 1
It is important to recalculate the betweenness after removing an edge, because successive deletions
can completely change the betweenness values of the remaining edges.

This recalculation is costly. Edge betweenness can be computed using the algorithm proposed by
Brandes [19], which runs in O(mn) time. This computation is executed at most m times, once after each
edge removal, resulting in O(m²n) total time. Removing the m edges themselves takes at most O(mn)
time, which does not dominate the running time, given the time needed to compute the betweenness.
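A compact version of the Brandes-style accumulation for edge betweenness on an unweighted, undirected graph might look as follows. This is a sketch of the O(mn) computation (one BFS per source, dependencies propagated backwards onto edges), not the authors' implementation; each unordered pair of nodes is counted once:

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph given as
    adjacency lists with integer node ids."""
    bet = {}
    nodes = list(adj)
    for s in nodes:
        # BFS from s, counting shortest paths (sigma) and predecessors
        dist = {s: 0}
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        order, preds = [], {v: [] for v in nodes}
        q = deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagate dependencies onto the edges
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                e = (min(v, w), max(v, w))
                bet[e] = bet.get(e, 0.0) + c
                delta[v] += c
    # every unordered pair was counted from both endpoints
    return {e: b / 2 for e, b in bet.items()}

# path 0 - 1 - 2: each edge lies on two shortest paths
bet = edge_betweenness({0: [1], 1: [0, 2], 2: [1]})
```

In the divisive algorithm above, the edge with the maximum value of this map is removed and the map is recomputed on the remaining graph.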
3 Approach
As stated before, the focus of this work was to study whether Faststep can be used to find hierarchical
communities and/or refine communities, comparing its results with LLP and the Louvain method. This
was the major goal of the project.

A first task consisted of understanding how Faststep can be used as a graph clustering tool. The
obtained clusterings could then be evaluated using common metrics; measuring the quality of a clustering
algorithm is addressed in subsection 3.1. An important part of the evaluation is also the comparison of
the results with those obtained by reference algorithms. We compared Faststep with the Louvain method
and Layered Label Propagation. Section 3.2 explains how Faststep was modified for this purpose and
compared with the other methods.
A second task relates to the idea that knowledge of the communities of a network can be used
to reorder its vertices and achieve better compression, when using a framework like Webgraph
[9]. Therefore, for a better evaluation of the clusterings, we used them for graph compression, and the
compression rates were compared across the same reference algorithms. Section 3.3 explains how a
reordering can be obtained from each of the methods used.
A secondary goal of the project was the creation of a framework for graph clustering. It aggregates
the developed method together with LLP and the Louvain method, our reference algorithms. The user of
this framework can choose any of the available methods. The tool outputs a clustering or a reordering of
the provided network. It also includes some functionality to ease the task of comparing and evaluating
clusterings and/or reorderings of the graph.
3.1 Evaluation
Testing the quality of a clustering algorithm was an essential step of this work. To make sure the
developed approaches are useful in practice, they must be validated systematically.

Complementary to the quality tests, we finish with a complete benchmark of the time used by the
developed method and by the reference algorithms.
3.1.1 Initial validation of the results
One simple way of assessing the quality of an algorithm is running the method on datasets where
the communities are known. In a first evaluation phase, it is common to use small networks where we can
easily see how the algorithm did and understand why it might have failed. Zachary's karate club [20] is
a small, classical example that is often used.

We used small graphs as a first validation of our work, but tests using large networks are of utmost
importance, as we wanted to make sure that our methods behaved well not only on small instances
but also on large real examples. SNAP [21] provides a collection of social networks with ground-truth
communities identified. These networks range from hundreds of thousands to millions of nodes. Another
good source of networks with communities is the set of benchmarks provided by Fortunato et al. [22].
These tools generate graphs and can be used to systematically create
tests for the developed methods.
Another way of evaluating results is by comparing them directly with other known and tested al-
gorithms. The Louvain method and Layered Label Propagation are the obvious target algorithms for
clustering validation. In these tests, we can use any network, as known communities are not necessary.
3.1.2 Comparison Metrics
It is important to note that both evaluation methods require comparing the clustering obtained by
the algorithm under test with either ground-truth communities or a clustering obtained by another
algorithm. Comparing different clusterings is itself an interesting problem. One simple way of doing it is
using the Jaccard index, which evaluates the similarity of two sets. Considering two sets A and B,

J(A, B) = |A ∩ B| / |A ∪ B|    (3.1)

When comparing two clusterings A and B, we want to find, for each cluster C in A, the cluster
in B with the smallest Jaccard distance to C (the Jaccard distance being one minus the Jaccard index).
This metric allows us to evaluate how well each cluster is matched between the two clusterings. For an
overall metric, we can average the smallest Jaccard distances over the clusters of A. In this work, when
referring to the Jaccard index in the context of comparing clusterings, it is assumed that we are
considering this average. One could argue that a variance should also be shown when averages are
reported. In our view, the Jaccard distance is used just as a control measure, as we use another metric,
presented next, for the proper comparison of clusters. Showing the variance as well would not help the
reader analyse the results, as it would probably be an excess of information. Also, the distribution of
the Jaccard distances can be seen in subsection 4.3.
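The average best-match comparison described above can be sketched as follows (clusterings given as lists of vertex sets; we compute the Jaccard index here, so the corresponding distance is one minus this value):

```python
def avg_best_jaccard(A, B):
    """For each cluster in A, find the most similar cluster in B by
    Jaccard index (Eq. 3.1), then average over the clusters of A."""
    def jaccard(x, y):
        return len(x & y) / len(x | y)
    return sum(max(jaccard(a, b) for b in B) for a in A) / len(A)

A = [{0, 1, 2}, {3, 4}]
B = [{0, 1}, {2, 3, 4}]
sim = avg_best_jaccard(A, B)
```

Note that the measure is not symmetric: averaging over A and averaging over B can give different values, one reason it is used here only as a control measure.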
We can also use a specific similarity metric to compare two clusterings. One of the most commonly
used is the Normalized Mutual Information (NMI).

NMI is based on Information Theory principles. The main idea is that if two clusterings are similar,
then little information is needed to infer one clustering from the other. Let us define two different
partitions (or clusterings) of the graph as X = (X1, X2, X3, ..., XnX) and Y = (Y1, Y2, Y3, ..., YnY),
where Xi and Yj are clusters (sets of vertices) of X and Y, respectively. n is the number of nodes of the
graph and nij is the number of nodes shared by clusters Xi and Yj. The community assignments {xi}
and {yi} give the cluster to which node i belongs in clusterings X and Y. We then consider the labels x
and y as values of two random variables X and Y, with joint distribution P(x, y) = P(X = x, Y = y) = nxy/n,
which implies P(x) = P(X = x) = nXx/n and P(y) = P(Y = y) = nYy/n. The mutual information of the
clusterings X and Y is then I(X, Y) = H(X) − H(X|Y), where H is the Shannon entropy (see Appendix A
for a definition).

Using the mutual information as a similarity measure does not work well, because all partitions
obtained from X by further partitioning its clusters have the same mutual information with X (even the
one where each node has its own cluster). Danon et al. [23] proposed the normalized mutual information
to avoid this problem:
Inorm(X, Y) = 2 I(X, Y) / (H(X) + H(Y))    (3.2)
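Equation 3.2 can be computed directly from the label lists of two partitions. A minimal sketch (natural logarithms, the base cancels in the ratio; it assumes at least one of the partitions has more than one cluster, so the denominator is non-zero):

```python
from collections import Counter
from math import log

def nmi(x, y):
    """Normalized mutual information of two partitions, given as label
    lists where x[i] is the cluster of node i in the first clustering."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    i_xy = sum(c / n * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
    hx = -sum(c / n * log(c / n) for c in px.values())
    hy = -sum(c / n * log(c / n) for c in py.values())
    return 2 * i_xy / (hx + hy)

score = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # identical up to relabelling
```

Identical partitions (even with permuted labels) score 1, while independent partitions score 0.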
We will use NMI to compare different clusterings in the evaluation of the developed algorithms.
The only problem with NMI is that it does not work for ground-truth communities where
nodes can belong to more than one cluster. Lancichinetti et al. [24] presented an extension of NMI,
which was further improved by McDaid et al. [25].
When extending the NMI to overlapping clusters, the previously used notation cannot be used directly,
as xi and yi assume that there is a direct assignment of a vertex to a cluster, which does not happen for
this problem, as a node can belong to more than one cluster.
Clusterings use a different notation in this situation. Each vertex v has an associated binary array of
size |P|, where the k-th bit, (xv)k, is 1 if v belongs to cluster k and 0 otherwise. Xk denotes the random
variable associated with the k-th bit of partition P. The probability distribution of Xk is:

p(Xk = 1) = nk/n,    (3.3)
p(Xk = 0) = 1 − nk/n,    (3.4)

where nk is the number of nodes in cluster Ck ∈ P, i.e., nk = |Ck|, and n is the total number of
nodes, n = |V|. We define, for a second partition P′, (yv)k, Yk and C′ in the same way as (xv)k, Xk and
C, respectively. The joint probability of Xk and Yl is given by:

p(Xk = 1, Yl = 1) = |Ck ∩ C′l| / n    (3.5)
p(Xk = 1, Yl = 0) = (|Ck| − |Ck ∩ C′l|) / n    (3.6)
p(Xk = 0, Yl = 1) = (|C′l| − |Ck ∩ C′l|) / n    (3.7)
p(Xk = 0, Yl = 0) = (n − |Ck ∪ C′l|) / n    (3.8)
We can also define the conditional entropy between Xk and Yl,

H(Xk|Yl) = H(Xk, Yl) − H(Yl)    (3.9)

and, since we are interested in the best possible matching between P and P′, we select the Yl that
minimizes the entropy,

H(Xk|Y) = min_l H(Xk|Yl)    (3.10)

We can then normalize equation 3.10 and average it over all Xk distributions:

H(X|Y)norm = (1/|P|) ∑k H(Xk|Y) / H(Xk)    (3.11)
H(Y|X)norm can be defined in the same way. We finally define the normalized mutual information,
as presented by Lancichinetti et al. [24],

ONMI_LFK(X|Y) = 1 − (1/2)(H(X|Y)norm + H(Y|X)norm)    (3.12)
McDaid et al. [25] argue that this definition of the NMI overestimates the similarity of the clusterings.
Considering two partitions P and P′, if P has only one cluster, which is exactly equal to one cluster
of P′, we would expect a low NMI value, but this definition yields at least 0.5. If we instead take a
partition P containing every possible non-empty subset of the vertices as a cluster, which corresponds to
2^n − 1 different clusters, the NMI relative to any partition P′ would also be greater than 0.5. The
authors propose an alternative definition that deals with these faults:

ONMI_max = I(X, Y) / max(H(X), H(Y))    (3.13)
3.1.3 Succinct representation as clustering metric
Another good way of evaluating clusterings is testing how well they work for graph compression or succinct
representation of a graph. As stated previously, Webgraph [9] uses this knowledge to obtain a reordering
of the vertices of the graph, which is useful to achieve a better compression of graph files. Our method
can then be evaluated by the achieved compression rates, which will also be compared with a random
permutation of the vertices and with the results obtained by the reference algorithms.
3.2 Graph Clustering
3.2.1 Faststep
As explained earlier, Faststep takes near-linear time in the size of the graph. However, the
algorithm scales quadratically with the number of factors of the matrix we want to obtain. With this
in mind, the algorithm is expected to become too slow as the number of factors approaches the size of
the graph. The reference algorithms considered, in contrast, can produce communities of high granularity.
An important goal of this project is also the use of Faststep as a reordering method to achieve better
graph compression, and high community granularity is, in fact, one of the properties that allows a
clustering algorithm to produce a good reordering of the graph.
To get a basic understanding of how Faststep is affected by an increasing number of factors,
we used the youtube dataset [26], a network with approximately 1 million nodes and 3 million
edges, and ran Faststep with different numbers of factors. Faststep's running time also depends on the
time needed for the gradient descent to converge. So, even though a larger value of k does not directly
imply a longer running time, we found that even for relatively small values of k (k = 32) the algorithm
would take days. To overcome this problem, we decided to create a new recursive method, based on
Faststep and using only small values of k. The method starts with the entire network. In the first step,
it runs Faststep with a single factor (k = 1), thereby trying to reconstruct only one community of the
graph. We then partition the graph into two: the obtained community and the
remaining vertices of the graph. In a second step, the method is called recursively on each of these two
subgraphs. It stops partitioning when a specific graph-size threshold is reached. The pseudo-code of the
algorithm is the following:
Function divide(graph)
    if graph.size < threshold then
        graph.save()
        return
    end if
    d = faststep(k=1)
    g1 = list()
    g2 = list()
    for i = 0 to graph.size do
        if belongsToCluster(d.rows[i][0]) then
            g1.add(i)
        else
            g2.add(i)
        end if
    end for
    divide(g1)
    divide(g2)
end
Algorithm 1: Recursive Faststep Method.
The presented method allows hierarchical communities to be obtained. The algorithm can also be
modified to work with k greater than 1, but it is not trivial to decide which k value to use, for two
reasons: with k greater than one we are looking for more than one community, which might, in fact,
not exist; and the obtained communities may overlap, which would also have to be resolved for the
algorithm to work.
The function belongsToCluster was initially defined using equation 2.9, which, according to the
authors of Faststep, determines whether a node belongs to a community. The results of this approach
were very unreliable: the communities found depended heavily on the number of iterations used internally
for each factor, and the behaviour varied with the network used. In most of the example graphs we used,
Faststep would find no communities, or only very small ones, even in large graphs. The algorithm would
find small communities, which it could usually divide hierarchically, while the majority of the graph was
left untouched, with no clusters found.
We decided to change the way communities are chosen, using directly the values obtained in the rows
of the matrix. With the number of factors equal to 1, there is a single row in the matrix, which we
can use to tell how relevant each node is to the community. We analysed the values of this row for real
networks and found that they usually follow a power law, as seen in Figure 3, which is expected. Our
idea was to keep only the elements whose values in the matrix were greater than some cutoff, which
depends directly on the values found. We obtained acceptable results using the average of the values as
the cutoff. That way, we managed to discard the elements in the long tail of the power law, keeping only
the ones with high values.
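The cutoff rule can be sketched in a few lines (the single factor row below is illustrative data, not actual Faststep output):

```python
def cluster_by_cutoff(row):
    """Keep the nodes whose score in the single factor row (k = 1)
    exceeds the mean score - the average-cutoff heuristic described
    above. Returns the indices of the selected nodes."""
    cutoff = sum(row) / len(row)
    return [i for i, v in enumerate(row) if v > cutoff]

# a power-law-like row: two dominant nodes and a long tail
core = cluster_by_cutoff([0.9, 0.8, 0.05, 0.02, 0.01])
```

Because the mean is pulled up by the few large values, only the head of the distribution survives the cutoff, which is exactly the intended effect.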
This strategy only approximates the community found. As such, even when there is a clear difference
in value between elements inside and outside the cluster, the algorithm is unlikely to find the exact
cluster. The only reason we adopted this method is that the initial equation, proposed by the authors,
simply did not return valid results, as explained.
Figure 3: Values of the matrix obtained by Faststep (single row) for a real network, in decreasing order.
The method was then modified to use an iterative approach where communities are kept in a priority
queue and larger communities are split first. When all communities currently in the queue are smaller
than the threshold, we can save this information as one valid clustering of the graph. The threshold can
then be further reduced, allowing the method to continue and retrieve more clusterings of higher
granularities. This approach keeps all the subgraphs in a heap, with the largest subgraph on top. The
largest subgraph can be removed from the top and split into two parts, which are added back to the
heap. Every time the size of the largest subgraph drops below the current threshold, the clustering
obtained can be stored. The entire structure of communities is kept in a tree, an auxiliary structure of
the algorithm.

One problem with this method is that it may force the clusters to have similar sizes, as they are
broken into smaller ones according to the number of nodes they have.
The complexity of this method is the same as Faststep's (see equation 2.8), except that the number of
executions of the algorithm is larger and cannot be directly estimated. We apply the algorithm to
graphs, so n = m, and r = 1 as defined above. The complexity is then:

O(IT(E + (P + n) log n + S)),    (3.14)

where I is the number of runs of Faststep needed for the method to terminate. This number cannot be
bounded directly and depends on the configuration of the network.
3.2.2 Louvain Method
The Louvain method is, as explained previously, a hierarchical clustering method. As such, when
looking for a clustering of the graph, we can get several valid clusterings with different levels of
granularity. Each valid clustering can also be partially decomposed: we may pick one cluster and replace
it with the clusters of a finer-grained clustering. Therefore, the clusters obtained by the Louvain method
cannot be compared directly with the clusterings obtained by another method. To simplify the process
and reduce the number of comparisons, we only consider the clusterings corresponding to the same depth
in the hierarchy of communities, i.e., the communities at the same depth of the dendrogram produced by
the algorithm.
3.2.3 Layered Label Propagation
LLP internally runs several iterations of APM with different values of γ. The results of these iterations
can be retrieved and used directly in the comparison of clusters.
3.3 Graph Compression
As stated previously, one of the goals of this work is to understand how hierarchical clustering algorithms
can be used to obtain good compression of graphs.
A key step is to obtain a reordering of the graph from the hierarchical communities. This can easily
be done by storing the hierarchical communities in a tree, where the leaves are the nodes of the graph
and each internal node represents a cluster. We then traverse the tree using a breadth-first search; the
position of each graph node in the obtained reordering is given by the order in which it occurs in the
search. The choice of which child node to visit first is arbitrary.

This approach guarantees that nodes that are close in the hierarchical structure get close positions in
the obtained reordering, which leads to better compression.
Therefore, any method that is capable of obtaining hierarchical communities also solves the problem
of reordering the graph. Both recursive Faststep and the Louvain method are hierarchical, so this
procedure can be directly applied to obtain the ordering.
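The traversal can be sketched as follows, with the community tree represented as nested lists (internal nodes) whose leaves are vertex ids, a representation we chose for illustration:

```python
from collections import deque

def order_from_hierarchy(root):
    """Breadth-first traversal of the community tree, as described above;
    graph nodes (leaves, stored as ints) are numbered in the order they
    are dequeued. Internal tree nodes are lists of children."""
    order, q = {}, deque([root])
    while q:
        t = q.popleft()
        if isinstance(t, list):          # internal node: a cluster
            q.extend(t)
        else:                            # leaf: a graph vertex id
            order[t] = len(order)
    return order

# two top-level clusters: {0, 2} and {1, {3, 4}}
pi = order_from_hierarchy([[0, 2], [1, [3, 4]]])
```

Children of the same cluster are enqueued together, so vertices belonging to the same community receive nearby positions in the resulting ordering.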
3.4 Implementation
3.4.1 Recursive Faststep implementation
Recursive Faststep was implemented in C++, using the code provided by the original authors. As
explained previously, we run Faststep with k = 1. The number of samples is set to 2 × E, which provides
a good compromise between running time and the quality of the results.
3.4.2 Command Line tool
The algorithms used in this project are implemented in different programming languages. The easiest
way to merge these methods was to provide a command line tool that internally makes system
calls to run them. Some extra features are also included in the tool, such as comparing clusterings,
checking the size of the resulting graph files and obtaining a labelling from any set of clusterings provided.
The project source files are hosted on Github [27].

Some parts of the tool are implemented in C++, where performance is necessary, and some in Python,
as it provides easier interaction with files without using operating-system-specific methods. The tool was
developed and works on Unix-like systems. It was not tested on Windows, but it should work if Python
and Java are installed and the program is properly compiled with this target in mind.
Graphs are stored using Webgraph’s .graph files. Reorderings and clusterings of the graphs are
encoded in plain text, after being converted from the original output files of the algorithms. Table 2
describes the fhgc command line tool and the available instructions.
create-dataset <raw-data> <graph-out> [options]
    Creates a Webgraph graph, with .graph, .properties and .offsets files, given a data file and an
    output base file.
    raw-data: path to a file containing one line ri per edge. Each line contains two space-separated
    integers ri,0 and ri,1, the endpoints of edge i of the graph. By default, the edges are interpreted
    as undirected and the numbering of the vertices starts at 1.
    graph-out: base path for the Webgraph files: .graph, .properties and .offsets.
    Options:
        --sub0: vertex numbering starts at 0.
        --directed: edge (ri,0, ri,1) is interpreted as a directed edge from ri,0 to ri,1.

clusters faststep/louvain/LLP <graph> <output-folder>
    Generates the clustering of a graph using the chosen method.
    faststep/louvain/LLP: available clustering methods.
    graph: graph base path.
    output-folder: output folder where the communities file will be written.

labels faststep/louvain/LLP <graph> <output-file>
    Generates the reordering of a graph using the chosen method.
    faststep/louvain/LLP: available reordering methods.
    graph: graph base path.
    output-file: output file where the reordering will be written.

clusterings2labels <graph> <clusterings-prefix> <output-file>
    Generates the reordering of a graph from a set of externally provided clusterings (which may or
    may not be hierarchical).
    graph: graph base path.
    clusterings-prefix: prefix of the files containing the clusterings. Each clustering file must
    have n lines, where line i contains a single integer ci, the community node i belongs to.
    output-file: output file where the reordering will be written.

reorder <graph-in> <graph-out> <indexes>
    Reorders a graph using the provided reordering file.
    graph-in: input graph base path.
    graph-out: output graph base path.
    indexes: path to the file containing the reordering to apply.

compare-clusters <cluster-file-1> <cluster-file-2> [options]
    Compares two clusterings using a metric: either the Jaccard distance or NMI (the default).
    cluster-file-1: clustering file path. The file must have n lines, where line i contains a single
    integer ci, the community node i belongs to.
    cluster-file-2: second clustering file path, in the same format.
    Options:
        --nmi: uses NMI as the clustering metric.
        --jaccard: uses the Jaccard distance as the clustering metric.

compare-clusters-list <regex-clusters-1> <regex-clusters-2> [options]
    Compares a set of clusterings with another set of clusterings, using a metric: either the Jaccard
    distance or NMI (the default).
    regex-clusters-1: multiple clustering files. This argument supports patterns (using the same
    syntax as the command line tool "ls"), which must be enclosed in quotation marks. Each clustering
    file must have n lines, where line i contains a single integer ci, the community node i belongs to.
    regex-clusters-2: multiple clustering files, in the same format.
    Options:
        --nmi: uses NMI as the clustering metric.
        --jaccard: uses the Jaccard distance as the clustering metric.

size <graph> <more graphs>
    Returns the size of one or more graph files.
    graph: graph base path.
    more graphs: base paths of additional graph files.

help
    Shows usage information.

Table 2: How to use the fhgc tool.
4 Results
4.1 Datasets
Multiple datasets were used for the evaluation of the algorithms. We were interested in experimenting
with networks from different origins and with different sizes. Another important aspect for their choice
was the availability of ground-truth communities, which allows us to see how the algorithms perform
in real situations.
The datasets used are presented in Table 3.
Airports [28]: network of airports, where airports are nodes and edges are relations of the type
"airport A has direct flights to airport B". 3425 nodes, 37595 edges; directed; no ground-truth
communities.

Amazon [29]: network of products sold at amazon.com. The connections between products represent the
relation "Customers who bought this item also bought", available on the website. 334863 nodes,
925872 edges; undirected; ground-truth communities available.

Web-google [30]: web graph released by Google as part of a programming contest. In a web graph, web
pages are nodes and links between them are directed edges. 875713 nodes, 5105039 edges; directed; no
ground-truth communities.

Youtube [26]: Youtube social network, where users can form friendship relations with each other and
create groups which other users can join. These groups are considered ground-truth communities.
1134890 nodes, 2987624 edges; undirected; ground-truth communities available.

Wiki [31]: web graph obtained from the Wikipedia pages belonging to the top categories (the
categories with the most pages), with links between pages as directed edges. The ground-truth
communities are the categories themselves. 1791489 nodes, 28511807 edges; directed; ground-truth
communities available.

Table 3: Datasets used: source and information.
4.2 Detection of communities in artificial networks
As a first basic test for the clustering algorithms, we generated random networks with a defined number
of communities and evaluated the results obtained.
The networks created had two fixed parameters: the number of nodes, equal to 10000, and the average
node degree k, equal to 50. We also considered two variable parameters: the number of communities C
and the probability of rewiring prewire. We then distributed the nodes randomly between communities,
each node belonging to one and only one community, with each community Ci having ni nodes. For each
community i, we then chose (k × ni)/2 edges at random. Each edge links a node chosen randomly inside
the community to another node that is either:
• with probability 1 − prewire, a node of the same community Ci, also chosen at random;
• with probability prewire, a node of another community, with the community chosen uniformly among
the other communities of the graph and the node chosen uniformly within it.
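The generation procedure above can be sketched as follows. This is a simplified sketch, assuming self-loops and duplicate edges are not filtered; parameter names follow the text, and the function name is ours.

```python
import random

def benchmark_graph(n=10000, k=50, C=10, p_rewire=0.1, seed=None):
    """Sketch of the benchmark generator: n nodes spread over C communities,
    with (k * n_i) / 2 random edges per community, rewired with probability p_rewire."""
    rng = random.Random(seed)
    community = {v: rng.randrange(C) for v in range(n)}   # one community per node
    members = [[] for _ in range(C)]
    for v, c in community.items():
        members[c].append(v)
    edges = []
    for c in range(C):
        if not members[c]:
            continue
        for _ in range(k * len(members[c]) // 2):         # (k x n_i) / 2 edges
            u = rng.choice(members[c])                    # endpoint inside community c
            if rng.random() < p_rewire:                   # rewired endpoint:
                other = rng.choice([d for d in range(C) if d != c and members[d]])
                v = rng.choice(members[other])            # uniform community, then node
            else:
                v = rng.choice(members[c])                # endpoint stays in community c
            edges.append((u, v))
    return edges, community
```

With p_rewire = 0, all edges stay inside their community, which is the easiest case for the clustering methods.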
These benchmarks are quite simple and only offer tests with communities of similar size. The
parameter C took values from 2 to 10, and from 10 to 50 in steps of 5, while prewire took
values from 0.0 to 0.9 in steps of 0.1. The results were evaluated against the original
communities, using both NMI and the Jaccard distance as defined in subsection 3.1. Each test was run on
5 random networks with the given parameters, and the average metric value was taken.
Figures 4 and 5 present the results for the three algorithms considered, for the NMI and Jaccard Index
metrics, respectively.
Figure 4: Benchmark results, using the NMI metric, for the tested methods: Louvain (left), LLP (center), Faststep (right).
Figure 5: Benchmark results, using the Jaccard Index metric, for the tested methods: Louvain (left), LLP (center), Faststep (right).
We can easily see that the Louvain method and LLP obtained similar results: when the communities
were not excessively degraded by the rewiring, they detected all communities with only small errors.
Faststep obtained poor results: the values of both the NMI and Jaccard metrics do not exceed 0.6 in
any of the tests considered. As expected, lower rewiring probabilities imply better reconstruction of
the clustering.
We can argue that the test graphs are very homogeneous networks, with a low clustering coefficient
relative to their number of edges. Faststep is designed for real networks, which are usually
scale-free, have a high clustering coefficient and contain hyperbolic communities. Also, as the
communities of these graphs are equally strong, i.e., have similar density and size, it is difficult
for Faststep to pick just one. The error accumulated by these mismatched assumptions may hurt the
performance of the algorithm.
This test presents strong indications that Faststep might not perform well in the next tests, but we
do expect better results when using real networks.
4.3 Detection of communities in networks with ground-truth communities
The Amazon, Youtube and Wiki datasets provide ground-truth communities, i.e., the datasets contain
information that allows us to infer known communities of these networks. In these datasets, the
ground-truth clusters have some degree of overlap, and different granularities and sizes, which makes
it difficult for the tested methods to obtain good results.
In Figure 6, we present the results for the three datasets using both the Overlapping NMI and the
average Jaccard distance between the ground-truth clusters and the found communities, directly
comparing the scores of the clusterings obtained by the different methods. We then find, for each
ground-truth community, the obtained cluster with the smallest Jaccard distance to it. In Figure 7,
we plot the sizes of the ground-truth communities against the Jaccard distance to the most similar
community found. This way, we can assess how biased the methods are towards finding communities of
specific sizes.
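This matching step follows directly from the definition of the Jaccard distance. A minimal illustration (function names are ours; the actual evaluation uses the metrics as defined in subsection 3.1):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two node sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def best_matches(ground_truth, found):
    """For each ground-truth community, pair its size with the smallest Jaccard
    distance to any found cluster (the quantities plotted in Figure 7)."""
    return [(len(gt), min(jaccard_distance(gt, cl) for cl in found))
            for gt in ground_truth]
```

A distance of 0 means a found cluster reproduces the ground-truth community exactly; a distance of 1 means no found cluster shares any node with it.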
For the Amazon dataset, using the Jaccard distance, LLP and the higher-granularity clusterings of
the Louvain method achieve results of around 80%. However, good results in the ONMI metric are much
harder to achieve: the results did not exceed 5%, but, in relative terms, the curves obtained for LLP
and Louvain agree with those for the Jaccard distance.
The Youtube dataset yielded low results for the Jaccard distance, both for LLP (up to 30%) and the
Louvain method (up to 20%). The ONMI results were very low, which suggests that the communities
found were not, in any way, similar to the ground-truth ones.
The results for the Wiki dataset are also quite low for both metrics. The Wiki dataset is a
naturally denser graph, which tends to make the found clusters larger. The Louvain method would
usually stop after 2-3 iterations, as the modularity would keep increasing on each merge. Also, this
dataset is a topic network and, as we know, topics are interdisciplinary and hierarchical, which
implies that many different clusterings can be accepted, though not when testing against these
ground-truth communities.
Faststep scored almost 0% on both metrics for all datasets. We can only suppose that the communities
found by the method are simply different from the known communities used.
Figure 7 also allows us to conclude that there is no correlation between the size of a ground-truth
community and the Jaccard distance of its most similar found cluster.
As an overall evaluation of this set of tests, LLP and the Louvain method detected the ground-truth
communities of the Amazon dataset, with some errors. The results for the other datasets were not as
good. It is also important to remember that these two methods run in near-linear time in the number
of nodes of the graph, which compromises the quality of the results obtained.
Faststep presented really low results in this test. Some degree of difference between the known
communities and the found clusterings could be accepted. However, the other methods performed much
better at finding the known communities, which implies that these can be partially approximated from
the structure of the graph, and Faststep is simply failing to detect them. The next results will
help clarify this situation.
Figure 6: Detection of ground-truth communities using different levels of granularity of the methods, using three different datasets: Amazon (top row), Youtube (middle row) and Wiki (bottom row). The results are evaluated with two different metrics: ONMI (plots on the left column) and Jaccard Distance (plots on the right column).
Figure 7: Detection of ground-truth communities: relation between the size of each ground-truth community and the quality with which the methods found it. The datasets used are Amazon (top row), Youtube (middle row) and Wiki (bottom row). The methods used are the Louvain method (left column), LLP (middle column) and Faststep (right column).
4.4 Comparison of clusterings obtained with different methods
Clusterings obtained by different algorithms can also be compared directly. For this test,
ground-truth communities are not needed, which allows us to use any graph. As done previously, we
compare each clustering of each algorithm against the others. Figure 8 presents the results using
the NMI metric, while Figure 9 presents the results for the Jaccard distance.
An important observation for analysing these results is that clusterings obtained with LLP usually
have higher granularity than those obtained with the Louvain method. In some cases, the
highest-granularity clustering from the Louvain method (first level of iterative modularity merges)
can have more clusters than the lowest-granularity one from LLP, but they usually have a similar
number of communities. With this in mind, we can expect the greatest similarity between clusterings
to occur between these two, for all datasets, which is in fact what happens.
There is a high similarity between the LLP and Louvain clusterings when considering the NMI metric.
For this metric, the worst results are for the Wiki dataset. As discussed earlier, this dataset is
much denser than the others, which makes the first iteration of the Louvain method merge more than
usual. We can see, for example, that for the Youtube dataset, which has a similar size, the first
iteration of the Louvain method yields around 200000 clusters, while for the Wiki dataset only 6655
are returned. For the Wiki dataset, the next iteration reduces the number of clusters to only 77.
LLP, in the iteration that outputs the fewest clusters, returns 7677, which is rather similar; the
next one returns 174698 clusters. There is thus a big difference between the granularities of the
clusterings obtained by the two methods, which is naturally reflected in the results.
The similarity between Faststep and the other two methods is rather low: the values for this measure
never exceed 0.8 in any of the tests.
We can also observe, in the tests between Faststep and LLP, that the NMI values are to some degree
independent of the LLP clusterings (see the results for Amazon or Web-google). This means that the
NMI is simply increasing with the granularity of the Faststep clusterings. This problem is related
to the way NMI is defined: a clustering with very high granularity can be seen as a further
partitioning of an already existing clustering, which yields a relatively high NMI value instead of
the low value one would expect. NMI is designed to penalize this type of mismatch between
clusterings, but the normalization is not always enough, and can also over-penalize cases where the
similarity is indeed considerable.
In the tests with the Louvain method this issue does not occur, because the granularity of the
Louvain clusterings is low. The mutual information between large clusters and really small clusters
is low, so we obtain many values near zero.
When considering the Jaccard metric, we can see that the values are much lower, except for the
Amazon dataset. It is interesting to note that both LLP and the Louvain method also obtained good
results for this dataset in the ground-truth test, which indicates that it might be a good testing
example, having features that make algorithms perform especially well on it. Probably, the relation
between Amazon products is very well structured, with one and only one natural clustering, which
makes the methods agree, as happens here. The results obtained for the other datasets were worse.
There is a large difference between the two metrics used. This can be accepted on the basis that NMI
is a metric specifically tailored for this type of test, while the Jaccard distance is a more
generic one. Also, NMI evaluates how much information two clusterings have in common, while the
Jaccard distance focuses on the number of common elements. Even so, we can observe, for the Jaccard
metric, that the most similar clusterings are the high-granularity ones obtained by Louvain and the
low-granularity ones obtained by LLP, as discussed previously.
Faststep results are extremely low when using the Jaccard metric.
Overall, when comparing Faststep with the other two methods, we have reasons to believe that they
indeed produce very different clusterings. The similarity measured with the Jaccard metric is quite
low, while the similarity measured with NMI is overall high but not very meaningful, as there is no
correlation between the granularity of the clusterings and the NMI value obtained. These results
indicate that the clusterings obtained by Faststep are quite different from those obtained with the
other methods. We can still evaluate the clusterings of each algorithm independently, taking
advantage of graph compression, which we do in the next test.
Figure 8: Comparison of the clusterings obtained, using the NMI metric. The rows, from top to bottom, present the results for the Airports, Amazon, Web-google, Wiki and Youtube datasets. The left column plots Louvain against LLP, the middle column Louvain against Faststep, and the right column Faststep against LLP.
Figure 9: Comparison of the clusterings obtained, using the Jaccard Distance metric. The rows, from top to bottom, present the results for the Airports, Amazon, Web-google, Wiki and Youtube datasets. The left column plots Louvain against LLP, the middle column Louvain against Faststep, and the right column Faststep against LLP.
4.5 Compression Results
As a final test, we can evaluate the quality of the algorithms by using their clusterings to obtain
a reordering of the graph, and seeing how much space is saved when the graph is compressed using
Webgraph. This test is a useful technique to measure the quality of a clustering. As seen
previously, Faststep does appear to find really different clusterings when compared with the other
methods or with known communities. This test can help us understand whether, although different, its
results are in fact acceptable.
The original permutations of the vertices of the datasets used are usually in some order that
already allows good compression with Webgraph. As such, the clustering/reordering algorithms are run
on a randomly permuted version of the original graphs, to eliminate the impact of the initial
ordering on the effective compression rate. The sizes of the compressed graphs are presented in
Table 4. The space savings of the methods are presented in Table 5.
             Random permutation   Reordered with LLP   Reordered with Louvain   Reordered with Faststep
Amazon       5530 kB              1881 kB              1745 kB                  4015 kB
Airports     50 kB                25 kB                33 kB                    37 kB
Web-google   15274 kB             3650 kB              3658 kB                  8221 kB
Wiki         79586 kB             51913 kB             62702 kB                 67879 kB
Youtube      16548 kB             8848 kB              10724 kB                 13652 kB

Table 4: Size occupied by each dataset after being reordered with the different methods.
             Reordered with LLP (%)   Reordered with Louvain (%)   Reordered with Faststep (%)
Airports     50.00                    34.00                        26.00
Amazon       65.99                    68.44                        27.40
Web-google   76.10                    76.05                        46.18
Wiki         34.77                    21.21                        14.71
Youtube      46.53                    35.19                        17.50

Table 5: Space savings in percentage, relative to the randomly permuted graph file.
LLP is an algorithm specifically designed for graph compression using Webgraph. Nevertheless, the
Louvain method managed to obtain similar results, winning by a small margin on the Amazon dataset
and essentially matching LLP on Web-google.
As discussed previously, the Louvain method usually obtains communities with lower granularity than
those provided by LLP. In some graphs more than in others, the first iteration of the Louvain method
merges a lot of nodes, achieving a lower granularity of communities. When compressing a graph, we
want a full reordering of its nodes. To achieve that, we need high-granularity clusterings: with
low-granularity clusterings, some sets of nodes simply stay in the same order, not improving
compression. We believe that, for some of the datasets used, that is exactly what is happening with
the Louvain method.
In some networks, for example the Wiki one, the number of clusters obtained by the Louvain method is
also too low. For better compression, not only do we need many clusterings with high granularity,
but it is also better if each cluster in the hierarchical structure has as few children as possible.
When a hierarchical clustering is transformed into a reordering of the graph, the traversal of the
structure has to decide which child to visit first. This choice is completely arbitrary, but it has
an impact on the compression obtained. If two clusters are siblings in the hierarchical tree, we do
not know which of them to traverse first; if one is the parent of the other, that ambiguity
disappears. Therefore, we can reduce this ambiguity if there are more levels of hierarchy, which,
for the Louvain method, sometimes there are not. This problem could easily be overcome by using the
entire hierarchical structure of the network generated internally by the method, which is not a
direct output of the implementation provided by the authors.
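The traversal just described can be sketched as a depth-first pass over the cluster tree. This is an illustrative sketch (the nested-list tree representation is ours); the order in which siblings are visited is exactly the arbitrary choice discussed above:

```python
def hierarchy_to_ordering(tree):
    """DFS over a hierarchical clustering, emitting graph nodes in visit order.
    `tree` is either a graph node (a leaf) or a list of child subtrees."""
    if not isinstance(tree, list):          # leaf: an actual graph node
        return [tree]
    order = []
    for child in tree:                      # arbitrary sibling order
        order.extend(hierarchy_to_ordering(child))
    return order
```

The position of each node in the returned list is its new index in the reordered graph, so nodes sharing a cluster end up with nearby indices.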
Faststep obtained reasonable results. Naturally, the compression obtained is higher when the
compression obtained by the other methods is also high, although the Faststep results are always
well below those of the other methods. In the previous tests we already introduced several reasons
why the performance of Faststep would probably not match the other algorithms. Firstly, Faststep
builds one community at a time. In that way, all nodes that belong to overlapping communities tend
to be included in the first community created, while nodes with fewer edges but some proximity to
the community being created are simply left out. Therefore, a large error is introduced in the first
step of the algorithm, and this error is successively amplified in the next steps. Also, the
threshold that decides whether a node belongs to the community is just a loose approximation.
However, we can still detect some compression of the graphs, which implies that Faststep does find
meaningful communities in the context of graph reordering and/or compression, something that
remained questionable in the previous tests. Still, we showed empirically that it is not the best
method for the job, and that there are much better alternatives for this purpose.
4.6 Performance Evaluation
For a final evaluation of the methods, we measured the execution time needed by each method for each
of the datasets used.
The following results were obtained on an Intel(R) Xeon(R) CPU E7-4830 @ 2.13GHz computer,
with 64GB of RAM, running Linux, kernel version 4.4.29.
             LLP      Louvain   Faststep
Airports     2.21     0.431     5.76
Amazon       32.76    36.00     949.8
Youtube      54.09    68.99     2443.03
Web-google   116.24   59.53     4138.67
Wiki         271.32   234.65    18270.34

Table 6: Execution time (in seconds) of each algorithm, for each dataset. Rows are sorted by the
number of edges in each dataset.
LLP and Louvain have very efficient implementations, and it is easy to observe that they have
similar execution times. It is also interesting that there is no clear "winner" between them: LLP is
faster on two datasets, the Louvain method on the other three.
Faststep was much slower. Its execution time can be justified by our approach to the problem.
Faststep allows setting both the minimum number of iterations and the final error we want to obtain;
until the defined criteria are met, the algorithm keeps iterating over the matrix. As seen in the
previous tests, the results were far from good, and we chose a very low error value, trying to
obtain the best results possible. It could be argued that we could have traded some quality of the
clusterings for a better running time. However, in our view, the most important contribution of this
work is to determine whether Faststep can be used for the problems under study, i.e., whether the
quality of its results is similar to that of the other tested algorithms. In that light, much less
importance is given to the running time of the algorithm. Had the quality of the results been
better, a greater effort could have been put into improving the overall performance of our Faststep
version.
5 Final Remarks
In this work, we proposed an adaptation of Faststep for the problem of hierarchical clustering. We
compared its results with two state-of-the-art algorithms, LLP and Louvain method. We tested how these
three algorithms could be used for graph reordering and compression of networks using the Webgraph
framework. Finally, we implemented a tool which allows the retrieval of graph clusterings and reorderings
and the use of these clusterings or reorderings in the compression of the graph.
Since communities could not be retrieved directly with the method provided by the authors, we
devised a possible approach to this problem. Our method picks one community at a time, in contrast
to other algorithms that propagate labels or merge clusters incrementally. This has an important
impact on overlapping nodes. Overlapping nodes are usually hubs and tend to have many valid
communities; since they cannot be placed in more than one community, Faststep places them in the
first one found. The other methods can somewhat balance this effect, as communities are constructed
incrementally and can gather the more important nodes to themselves. This can lead to large
variations between the communities found by Faststep and those found by another algorithm. Also, the
cut-off added to the method to decide whether a node belongs to a community introduces a lot of
error. As said previously, the simultaneous construction of the communities allows them to control
each other; Faststep, on the other hand, uses a much cruder approach, which can place too many nodes
in a cluster, but also too few. The method can perform reasonably when dealing with difficult
graphs, especially when communities are heterogeneous, as the boundaries between communities are not
well defined and the error is acceptable. The results obtained on graph compression through
reordering support this idea. However, when dealing with graphs with clear clusterings, even
perfectly bounded ones, the implemented cut-off is incapable of capturing this information and
introduces a lot of error.
The method obtained worse results than we expected at the beginning of this work. It was especially
inaccurate on artificial communities, which lack the features that characterize real networks. When
applied to real networks, the clusterings obtained were very different from those provided by LLP or
the Louvain method. Despite this, it managed to obtain reasonable results in graph
reordering/compression.
The performance of Faststep could benefit a lot from an improved method to decide whether a node
belongs to a cluster. However, we understood the limitations of the algorithm: Faststep was designed
to solve a different problem, matrix factorization. As such, its results, although interesting for
the comparison of the two problems, cannot be expected to surpass those obtained by algorithms
specifically proposed for this purpose.
We believe that our study of Faststep as a clustering, hierarchical clustering and reordering
algorithm allowed us to explore the possibilities of the method. Even though there are other
approaches to this problem, we consider that there is not much to be done beyond this point, as the
results are a long way from those of LLP and the Louvain method, and any possible improvement would
not, in our view, be enough to make it competitive in both quality and execution time.
However, we think Faststep could be improved in several ways, which might lead to much better
results. These improvements were mentioned throughout this work and are directly related to the
results of the tests performed. A proper technique for deciding which nodes belong to a community,
working for all types of networks, would be the most urgent. Currently, Faststep works with a fixed
number of communities; extending the algorithm to search for, or refine, the number of communities
in the network could possibly extend its usage to other interesting problems. Another useful
improvement would be increasing the efficiency of the algorithm, as it takes much longer to run than
other algorithms with similar time and space complexities.
In our opinion, the results obtained with the Louvain method for graph reordering seem promising
and, as previously suggested, they could benefit if the hierarchical structure of the graph,
obtained internally during modularity maximization, were used by the reordering algorithm. We also
observed that the clusterings obtained by LLP have, in general, higher granularity than those
obtained by the Louvain method, which presents an opportunity to study how their reorderings could
be combined for an improved result.
References
[1] M. Araújo, P. Ribeiro, and C. Faloutsos, “Faststep: Scalable boolean matrix decomposition,” in
Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 461–473, Springer Interna-
tional Publishing, 2016.
[2] M. E. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical
review E, vol. 69, no. 2, p. 026113, 2004.
[3] S. Fortunato, “Community detection in graphs,” Physics reports, vol. 486, no. 3, pp. 75–174, 2010.
[4] U. Brandes, D. Delling, M. Gaertler, R. Gorke, M. Hoefer, Z. Nikoloski, and D. Wagner, “On
modularity clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2,
pp. 172–188, 2008.
[5] P. Boldi, M. Rosa, M. Santini, and S. Vigna, “Layered label propagation: A multiresolution
coordinate-free ordering for compressing social networks,” 2010.
[6] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities
in large networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10,
p. P10008, 2008.
[7] R. Albert and A.-L. Barabási, “Statistical mechanics of complex networks,” Rev. Mod. Phys., vol. 74,
pp. 47–97, Jan 2002.
[8] M. A. Porter, J.-P. Onnela, and P. J. Mucha, “Communities in networks,” 2009.
[9] P. Boldi and S. Vigna, “The webgraph framework i: Compression techniques,” in Proceedings of the
13th International Conference on World Wide Web, WWW ’04, (New York, NY, USA), pp. 595–602,
ACM, 2004.
[10] M. Araújo, S. Günnemann, G. Mateos, and C. Faloutsos, “Beyond blocks: Hyperbolic commu-
nity detection,” in Joint European Conference on Machine Learning and Knowledge Discovery in
Databases, pp. 50–65, Springer Berlin Heidelberg, 2014.
[11] G. Golub and W. Kahan, “Calculating the singular values and pseudo-inverse of a matrix,” Journal
of the Society for Industrial and Applied Mathematics Series B Numerical Analysis, vol. 2, no. 2,
pp. 205–224, 1965.
[12] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,”
Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[13] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community
structures in large-scale networks,” Phys. Rev. E, vol. 76, p. 036106, Sep 2007.
[14] P. Ronhovde and Z. Nussinov, “Local resolution-limit-free potts model for community detection,”
2008.
[15] A. Noack and R. Rotta, “Multi-level algorithms for modularity clustering,” CoRR,
vol. abs/0812.4073, 2008.
[16] K. Wakita and T. Tsurumi, “Finding community structure in mega-scale social networks,” CoRR,
vol. abs/cs/0702048, 2007.
[17] L. Danon, A. Díaz-Guilera, and A. Arenas, “The effect of size heterogeneity on community identifi-
cation in complex networks,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2006,
no. 11, p. P11010, 2006.
[18] S. Fortunato and M. Barthélemy, “Resolution limit in community detection,” Proceedings of the
National Academy of Sciences, vol. 104, no. 1, pp. 36–41, 2007.
[19] U. Brandes, “A faster algorithm for betweenness centrality,” The Journal of Mathematical Sociology,
vol. 25, no. 2, pp. 163–177, 2001.
[20] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of
Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.
[21] “Stanford large network dataset collection.” https://snap.stanford.edu/data/index.html, 2017.
[Online; accessed 2-January-2017].
[22] S. Fortunato, “Benchmark graphs to test community detection algorithms.” https://sites.
google.com/site/santofortunato/inthepress2, 2017. [Online; accessed 6-January-2017].
[23] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, "Comparing community structure identification," Journal of Statistical Mechanics: Theory and Experiment, vol. 2005, no. 09, p. P09008, 2005.
[24] A. Lancichinetti, S. Fortunato, and J. Kertész, "Detecting the overlapping and hierarchical community structure in complex networks," New Journal of Physics, vol. 11, no. 3, p. 033015, 2009.
[25] A. F. McDaid, D. Greene, and N. Hurley, "Normalized mutual information to evaluate overlapping community finding algorithms," CoRR, vol. abs/1110.2515, Oct. 2011.
[26] “Stanford large network dataset collection, youtube dataset.” https://snap.stanford.edu/data/
com-Youtube.html, 2012. [Online; accessed 2-July-2017].
[27] M. M. Duarte, “Fast hierarchical graph clustering tool on github.” https://github.com/michel94/
fhgc-tool, 2017. [Online; created 13-October-2017].
[28] “Openflights: Airport and airline data.” https://openflights.org/data.html, 2017. [Online;
accessed 7-July-2017].
[29] “Stanford large network dataset collection, amazon dataset.” https://snap.stanford.edu/data/
com-Amazon.html, 2012. [Online; accessed 2-July-2017].
[30] “Stanford large network dataset collection, google web graph dataset.” https://snap.stanford.
edu/data/web-Google.html, 2009. [Online; accessed 2-July-2017].
[31] “Stanford large network dataset collection, wikipedia network of top categories dataset.” https:
//snap.stanford.edu/data/wiki-topcats.html, 2009. [Online; accessed 2-July-2017].
A Shannon Entropy
Entropy is the expected value of the information contained in a data source, measured in bits when the base-2 logarithm is used.
If X is a random data source and P(x) is the probability of occurrence of value x, the entropy of X is given by:

H(X) = -\sum_{x \in X} P(x) \log_2 P(x)
If Y is a second random variable and P(x, y) is the joint probability that X takes value x and Y takes value y, the entropy of Y given X is

H(Y|X) = -\sum_{x \in X, y \in Y} P(x, y) \log_2 \frac{P(x, y)}{P(x)}

which quantifies the amount of information needed to describe the outcome of Y once the value of X is known.
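As an illustrative sketch (not part of the thesis tooling), both quantities can be estimated from observed samples. The conditional entropy is computed here through the chain rule, H(Y|X) = H(X, Y) - H(X), which is equivalent to the definition above; the function and variable names are ours, chosen for clarity:

```python
import math
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy H(X) of a sequence of observations, in bits."""
    n = len(samples)
    counts = Counter(samples)  # frequency of each distinct value estimates P(x)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(pairs):
    """Empirical conditional entropy H(Y|X) from (x, y) observations,
    via the chain rule H(Y|X) = H(X, Y) - H(X)."""
    xs = [x for x, _y in pairs]          # marginal observations of X
    return entropy(pairs) - entropy(xs)  # entropy(pairs) estimates H(X, Y)

# A fair coin carries one bit of entropy.
print(entropy([0, 1, 0, 1]))  # 1.0
# If Y always equals X, knowing X leaves no uncertainty about Y.
print(conditional_entropy([(0, 0), (1, 1), (0, 0), (1, 1)]))
```

Note that these are plug-in estimates from relative frequencies; for small samples they are biased, but they suffice to illustrate the definitions.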