Branching Out: Quantifying Tree-like Structure in Complex Networks
Blair D. Sullivan
Complex Systems Group, Center for Engineering Science Advanced Research, Computer Science and Mathematics Division, Oak Ridge National Laboratory
MMDS, July 12, 2012
Joint work with Michael Mahoney & Aaron Adcock, Stanford University
2 Managed by UT-Battelle for the U.S. Department of Energy
Motivation
• Large networks are becoming ubiquitous in many domains, e.g. biology, physics, chemistry, infrastructure, communications, and sociology.
• Many methods probe structure at the very large scale (diameter) or the small scale (clustering coefficient); very few probe the intermediate scale (e.g. clusters of size 5K in a 5M-node network). Can we build good tools to understand and exploit this?
Image: A partial map of the Internet, January 15 2005.
Image: The US electric transmission system. Courtesy North American Electric Reliability Corporation.
Image: Drug-Target Network. Nature Biotechnology 25(10), October 2007.
Intermediate-Scale Structure
• Determines network evolution & dynamics of diffusion, other processes
• Implicitly affects applicability of common data analysis tools
• This is where all the “interesting stuff” happens.
Ising model (ferromagnetism): Temperature parameter controls scale of local correlations between magnetic spins.
The “intermediate-scale structure” is the coupling of local & global properties.
Prior empirical evidence
Claim: Many large complex networks are “tree-like” when viewed at intermediate scales:
• The Unreasonable Effectiveness of Tree-Based Theory for Networks with Clustering, Melnik, Hackett, Porter, Mucha, Gleeson. Physical Review E, Vol. 83, No. 3 (2011).
• Finding Hierarchy in Directed Online Social Networks, Gupta, Shankar, Li, Muthukrishnan, Iftode. WWW2011.
• “It was noted in recent years that the Internet structure has a highly connected core and long stretched tendrils, and that most of the routing paths between nodes in the tendrils pass through the core. Therefore, we suggest in this work, to embed the Internet distance metric in a hyperbolic space where routes are bent toward the center.” Shavitt, Tankel. Hyperbolic embedding of internet graph for distance estimation and overlay construction. IEEE/ACM Trans. Netw. 16, 1 (2008).
However, no consensus has been reached on defining and measuring this tree-like structure, making it difficult to exploit algorithmically.
Image credit: Munzner et al
What do you mean, “tree-like”?
[Figures: Arxiv GR-QC collaboration; Facebook: Caltech Network; Autonomous Systems]
Image credits: Traub, Kelsic, Mucha, Porter; Tim Davis; Graphics@Illinois
Hyperbolic Space
• Through a point not on a given line pass multiple lines parallel to it, and the angles in a triangle sum to less than 180 degrees.
• At right, see a {3,7}-tessellation of the hyperbolic plane by equilateral triangles, and the dual {7,3}-tessellation by regular heptagons. All triangles and heptagons are of the same hyperbolic size, but the size of their Euclidean representations decreases exponentially as a function of the distance from the center, while their number increases exponentially.
• In Euclidean space, a circle’s area grows polynomially with its diameter; in hyperbolic space, it grows exponentially. Think of growth as in a binary tree.
• The shortest paths in hyperbolic space are arcs through the disk, not paths around the exterior (much like travel in a rooted tree).
Image credit Krioukov et al.
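To make the growth comparison concrete, here is a minimal numeric sketch. It assumes the standard area formulas for a disk of radius r in the Euclidean plane and in the hyperbolic plane of curvature -1; the function names are illustrative and are not taken from the talk.

```python
import math


def euclidean_disk_area(r: float) -> float:
    """Area of a Euclidean disk of radius r: polynomial (quadratic) in r."""
    return math.pi * r * r


def hyperbolic_disk_area(r: float) -> float:
    """Area of a disk of radius r in the hyperbolic plane of curvature -1.

    2*pi*(cosh(r) - 1) grows like pi * e^r for large r.
    """
    return 2.0 * math.pi * (math.cosh(r) - 1.0)


def binary_tree_ball_size(r: int) -> int:
    """Number of vertices within r hops of the root of an infinite binary tree."""
    return 2 ** (r + 1) - 1


if __name__ == "__main__":
    for r in (1, 2, 4, 8):
        print(r, euclidean_disk_area(r), hyperbolic_disk_area(r), binary_tree_ball_size(r))
```

For small radii the two areas nearly agree; the exponential gap only appears as r grows, which is the sense in which the hyperbolic plane behaves like a tree.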
Hyperbolic Embedding and Greedy Routing
• Hyperbolic space gives us “extra room” to embed networks (as opposed to Euclidean space).
• A number of algorithms take advantage of this to devise greedy routing schemes
• Kleinberg uses a minimum spanning tree, embedded as a subset of a d-regular tree, where d is the maximum degree of the MST (d = 4 is shown at right)
Image credit Kleinberg
So is it good or bad?
Image credit M.C.Escher
A generative model
• Three-parameter model introduced by Krioukov et al. uses an underlying hyperbolic geometry and allows us to vary the curvature, degree heterogeneity, and density. (Physicists: this is basically fermions)
• Idea: place nodes in the hyperbolic plane (Poincare disk) and connect them with a probability that depends on their hyperbolic distance.
• Knob 1: Power law exponent: determines distribution of nodes in the disk – the higher the exponent, the more nodes go towards the center. This determines the curvature (and degree heterogeneity).
• Knob 2: Temperature: determines how much we ignore the underlying geometry when adding edges; at high temperatures, edge connections become essentially random (independent of distance).
• Knob 3: Average degree (target): approximately allows control over density.
Our test parameters:
Parameter      Values
Power Law      2.1    2.25   2.5
Temperature    20     1.5    0.5
Avg. Degree    5      10     20
Model regimes by curvature (Curv.) and temperature (Temp.):
                     Finite Temp.               Infinite Temp.
Finite Curv.         Random hyperbolic graphs   Classical random graphs (Erdos-Renyi)
Infinite Curv.       Random geometric graphs    Random graphs w/ given expected deg.
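The idea behind the model can be sketched in a few lines. This is not the code used for the experiments in this talk. It is a minimal illustration under stated assumptions: curvature is fixed at -1, the radial density is proportional to sinh(alpha * r) with alpha = (gamma - 1) / 2 (a low-temperature relation between alpha and the power-law exponent gamma), the connection probability is the Fermi-Dirac form 1 / (1 + exp((d - R) / (2T))), and the disk radius R is passed in directly instead of being solved for from a target average degree. All function and parameter names are hypothetical.

```python
import math
import random


def hyperbolic_distance(r1, theta1, r2, theta2):
    """Distance between two points given in polar coordinates (curvature -1)."""
    x = (math.cosh(r1) * math.cosh(r2)
         - math.sinh(r1) * math.sinh(r2) * math.cos(theta1 - theta2))
    return math.acosh(max(1.0, x))  # clamp guards against rounding below 1


def sample_hyperbolic_graph(n, gamma, temperature, disk_radius, seed=0):
    """Sample node coordinates and edges; returns (coords, edges)."""
    rng = random.Random(seed)
    alpha = (gamma - 1.0) / 2.0  # assumed relation, see lead-in
    coords = []
    for _ in range(n):
        # Inverse-CDF sampling of the radial density ~ sinh(alpha * r) on [0, R].
        u = rng.random()
        r = math.acosh(1.0 + (math.cosh(alpha * disk_radius) - 1.0) * u) / alpha
        coords.append((r, rng.uniform(0.0, 2.0 * math.pi)))
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = hyperbolic_distance(*coords[i], *coords[j])
            exponent = (d - disk_radius) / (2.0 * temperature)
            # Avoid overflow in exp for very low temperatures.
            p = 0.0 if exponent > 700.0 else 1.0 / (1.0 + math.exp(exponent))
            if rng.random() < p:
                edges.append((i, j))
    return coords, edges


if __name__ == "__main__":
    coords, edges = sample_hyperbolic_graph(200, gamma=2.1, temperature=0.5, disk_radius=9.0)
    print(len(edges), "edges on", len(coords), "nodes")
```

In this sketch, raising the temperature flattens the connection probability towards 1/2 for every pair, so distance matters less, while a temperature near zero connects exactly the pairs closer than R.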
Special Thanks
Special thanks to D. Krioukov for providing us code to generate networks according to the model described on the previous slide.
Image credit San Diego Reader
Hyperbolic Embedding for Inference
• Boguna, Krioukov, Papadopoulos have mapped “the internet” to hyperbolic space, and used the embedding to identify community structure (and offer suggested routing schemes).
• Their methods rely on iterative MLE methods, and do not appear to scale to “big data”.
Image credit Boguna, Krioukov, Papadopoulos
A geometric measure of tree-likeness
• Gromov’s δ-hyperbolicity arises from the geometry of metric spaces and δ measures the extent to which a (geodesic) metric space embeds in a tree metric.
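Example with δ = 0 (every pair of u, v, w, x is at distance 1):
d(u,v) + d(w,x) = 1 + 1 = 2
d(u,x) + d(v,w) = 1 + 1 = 2
d(u,w) + d(v,x) = 1 + 1 = 2
Example with δ = 1 (u and x are at distance 2, as are v and w):
d(u,v) + d(w,x) = 1 + 1 = 2
d(u,x) + d(v,w) = 2 + 2 = 4
d(u,w) + d(v,x) = 1 + 1 = 2
[Figure: the two four-vertex graphs on u, v, w, x]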
• Note: d(u,v) is the length of the shortest path between u and v in the graph.
• The minimum δ for which G is δ-hyperbolic can be computed (naively) in O(n^4) time.
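The naive O(n^4) computation can be sketched as follows. The sketch assumes the four-point convention in which δ for a quadruple is half the difference between the largest and second-largest of the three pairwise distance sums, which reproduces the δ = 0 and δ = 1 values in the two examples above. It also assumes a connected, unweighted graph given as an adjacency dictionary; the function names are illustrative.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path (hop) distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def four_point_delta(adj):
    """Maximum four-point delta over all quadruples (naive O(n^4) loop)."""
    d = {u: bfs_distances(adj, u) for u in adj}
    best = 0.0
    for u, v, w, x in combinations(adj, 4):
        sums = sorted((d[u][v] + d[w][x], d[u][w] + d[v][x], d[u][x] + d[v][w]))
        best = max(best, (sums[2] - sums[1]) / 2)
    return best


# The two four-vertex examples: all pairs adjacent, and a 4-cycle u-v-x-w-u.
complete = {a: [b for b in "uvwx" if b != a] for a in "uvwx"}
cycle = {"u": ["v", "w"], "v": ["u", "x"], "w": ["u", "x"], "x": ["v", "w"]}

if __name__ == "__main__":
    print(four_point_delta(complete), four_point_delta(cycle))
```

Storing all-pairs distances costs O(n^2) memory, which is one reason exact computation becomes impractical for large graphs.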
More on δ-hyperbolicity
• Viewing a graph as a geodesic metric space (replace each edge with a length-1 segment, with segments intersecting only at endpoints) provides another way to think of δ-hyperbolicity.
• For a geodesic triangle, there is a unique map to a tripod that is an isometry on each side of the triangle; except for the leaves, each point on the tripod has at least two pre-images on the triangle.
• A triangle is δ-thin if the pre-images of every tripod point have distance at most δ.
• A triangle is δ-slim if each of its sides is contained in the δ-neighborhood of the union of the other two sides.
• A graph is δ-hyperbolic if all its geodesic triangles are δ-thin (or δ-slim); each definition yields a slightly different minimum δ, and these are related to each other by small constant factors.
Image credit: Bridson, Haefliger Image credit: Chepoi, Dragan et al
Examples: Small world graphs & Ringed Trees
• Kleinberg’s small-world random graphs add long-range edges with probability proportional to 1/d_B(u,v)^p to a d-dimensional grid.
• Mahoney et al (2011) showed that even at the “sweet spot” of p = d, the small-world graphs are not logarithmically hyperbolic w.h.p. When p < d, the graphs are not hyperbolic, and for p > 3 and d = 1, the hyperbolic delta is polynomial in the size of the graph.
• Define a ringed tree to be a binary tree plus edges connecting all vertices at a given tree level into a ring (quasi-isometric to the Poincare disk).
• Adding long-range edges between the leaves of a ringed tree w/ probability decreasing:
– exponentially fast with the ring distance produces logarithmic hyperbolicity
– as a power-law with the ring distance produces non-hyperbolic random graphs
• Replace the ringed tree with a pure binary tree: none of the resulting graphs are hyperbolic.
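The ringed-tree construction itself is easy to write down. The sketch below builds only the base graph (the binary tree plus one ring per level) and leaves out the random long-range edges. The node labelling (level, position) and the function name are assumptions made for illustration.

```python
def ringed_tree(depth):
    """Edge set of a complete binary tree of the given depth with a ring per level.

    Nodes are (level, position) pairs with 0 <= position < 2**level.
    Edges are stored as frozensets so that each undirected edge appears once.
    """
    edges = set()
    for level in range(depth + 1):
        width = 2 ** level
        for pos in range(width):
            node = (level, pos)
            if level < depth:
                # Tree edges to the two children on the next level.
                edges.add(frozenset((node, (level + 1, 2 * pos))))
                edges.add(frozenset((node, (level + 1, 2 * pos + 1))))
            if width > 1:
                # Ring edge to the next vertex on the same level (wraps around).
                edges.add(frozenset((node, (level, (pos + 1) % width))))
    return edges


if __name__ == "__main__":
    print(len(ringed_tree(3)), "edges in a depth-3 ringed tree")
```

Dropping the `if width > 1` branch yields the pure binary tree from the last bullet, which makes it easy to compare the two base graphs.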
Image credit: Mahoney et al
Empirical Results: “Planar”
• Planar graphs have a very different distribution of delta over their quadruples, and very high diameters.
Empirical Results: “Hyperbolic”?
• Much more subtle differences when looking at non-planar graphs.
• Density seems to play a role, and most networks considered had very low diameter.
Computing δ: Sampling
• Due to the high computational complexity, a number of prior works have used sampling to estimate the hyperbolicity of large networks.
• Some prior work sampled at a rate of about 0.0002 percent (on their largest data). Although that sampling was biased towards pairs at larger distances, it could still easily miss the maximum delta, which is achieved on a very small subset of quadruplets (in our example, a fraction of about 2 x 10^-11, i.e. about 2 x 10^-9 percent). Sampling is, however, likely to be sufficient for computing average deltas.
• Example below is SNAP graph as20000101 (about 1600 nodes)
delta   Fraction of quadruplets   # of quadruplets
0.0     0.677473774788751         4577453756970
0.5     0.313235924997126         2116425779202
1.0     0.009262044976055         62580404070
1.5     0.000028008357243         189242691
2.0     0.000000246259522         1663890
2.5     0.000000000022835         154
Total   0.999999999401533         6756650846976
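A sampling estimator of the delta distribution might look like the sketch below. It draws quadruples uniformly at random, whereas the prior work mentioned above biased its samples towards distant pairs. It also assumes a connected graph small enough to hold all-pairs distances, and the same four-point convention as the table (half the gap between the two largest distance sums). The helper `miss_probability` assumes each quadruple is kept independently with the given rate. All names are illustrative.

```python
import random
from collections import Counter, deque


def all_pairs_distances(adj):
    """Hop distances between all pairs via one BFS per vertex."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = dist
    return result


def sample_delta_histogram(adj, num_samples, seed=0):
    """Estimate the fraction of quadruples at each delta from uniform samples."""
    rng = random.Random(seed)
    nodes = list(adj)
    d = all_pairs_distances(adj)
    counts = Counter()
    for _ in range(num_samples):
        u, v, w, x = rng.sample(nodes, 4)  # four distinct vertices
        sums = sorted((d[u][v] + d[w][x], d[u][w] + d[v][x], d[u][x] + d[v][w]))
        counts[(sums[2] - sums[1]) / 2] += 1
    return {delta: c / num_samples for delta, c in sorted(counts.items())}


def miss_probability(num_extremal, sampling_rate):
    """Chance that none of num_extremal quadruples is sampled (independent draws)."""
    return (1.0 - sampling_rate) ** num_extremal


if __name__ == "__main__":
    # 154 extremal quadruples, sampled at 0.0002 percent = 2e-6.
    print(miss_probability(154, 2e-6))
```

The histogram converges quickly for the common delta values, but `miss_probability` stays close to 1 when only a handful of quadruples attain the maximum, which is the point of the table above.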
K-core Decompositions
• Given a graph G = (V,E), the k-core of the graph, denoted H_k, is the maximal subgraph H of G such that deg_H(v) is at least k for all v in H.
Image credit: LaNet-vi
• The core number of a vertex v is defined to be the maximum k such that v is in H_k (so v is in H_k but not H_{k+1}).
• The set of nodes with core number k is called the k-shell of G.
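Core numbers can be computed by repeatedly peeling a vertex of minimum remaining degree. The sketch below is a simple quadratic-time version of that peeling idea, assuming a simple undirected graph given as an adjacency dictionary. Faster bucket-based variants exist. The function names are illustrative.

```python
def core_numbers(adj):
    """Core number of every vertex, computed by min-degree peeling."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel a vertex of minimum remaining degree.
        v = min(remaining, key=degree.__getitem__)
        # The core number never decreases along the peeling order.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


def k_shell(core, k):
    """Vertices whose core number is exactly k."""
    return {v for v, c in core.items() if c == k}


# A triangle a-b-c with a pendant vertex d attached to a.
example = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}

if __name__ == "__main__":
    print(core_numbers(example))
```

In the example, the pendant vertex forms the 1-shell and the triangle forms the 2-shell.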
Condensed Matter Collaboration Network
Empirical Results: Social Graphs
Facebook-Texas84
~36,000 nodes
~3x10^6 edges
soc-Epinions1
~47,000 nodes
~730,000 edges
Empirical Results: Autonomous Systems
AS19990820 ~5,500 nodes
~22,000 edges
AS19990818 ~5,500 nodes
~22,000 edges
Empirical Results: Collaboration Graphs
CA-AstroPhysics ~18,000 nodes
~394,000 edges
CA-GrQc ~4,000 nodes
~26,000 edges
Empirical results: Synthetic by power law exponent
Empirical results: Synthetic by temperature
Some (oversimplified) Summary Statistics
ca-AstroPhysics:
• ~0.6% of nodes (113 nodes) in two deepest cores (k = 55,56)
• ~1.8% of edges (~7,000 edges) leaving the deepest core (k = 56)
• ~1.8% of edges (~7000 edges) leaving next core (k = 55)
• Max average k-shell change is +12 (out of k = 56 max shell)
• Suggests collaborators tend to collaborate with people of similar coreness/peripheryness
• “Typical” for collaboration graphs (and other core-periphery graphs)
Texas84:
• ~8% of nodes (≥2400 nodes) in two deepest cores (k = 80,81)
• ~7% of edges (≥220K edges) leaving the deepest core (k = 81)
• ~17% of edges (≥510K edges) leaving the next core (k = 80)
• Max average k-shell change is +50 (out of k = 80 max shell)
• Suggests that the “periphery” nodes are more tightly connected to “core-like” nodes
• “Typical” for more social graphs (and Facebook in particular)
A combinatorial measure of tree-likeness
• A tree decomposition of a graph G = (V,E) is a pair (X = {X_1, X_2, ..., X_L}, T), with each X_i a subset of V and T a tree with nodes {1, …, L}, satisfying three conditions:
• The union of the sets in X is equal to V.
• For every edge (u,v) in G, {u,v} is a subset of some X_i.
• For every v in V, the indices of the sets X_i containing v form a (connected) sub-tree of T.
• We call the sets X_i the bags of the decomposition and max(|X_i|) - 1 the width. The tree-width of G is the minimum width over all valid tree decompositions.
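The three conditions translate directly into a checker. The sketch below assumes bags are given as a dictionary from tree-node index to a set of vertices, and that `tree_edges` really does form a tree on those indices (this is not verified). It reports the largest bag size; subtracting one gives the width under the usual convention. The names are illustrative.

```python
def is_tree_decomposition(vertices, graph_edges, bags, tree_edges):
    """Check the three tree-decomposition conditions for (bags, tree)."""
    # Condition 1: the bags cover every vertex (and nothing else).
    if set().union(*bags.values()) != set(vertices):
        return False
    # Condition 2: every graph edge lies inside some bag.
    for u, v in graph_edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False
    # Condition 3: the bags containing each vertex are connected in the tree.
    tree_adj = {i: set() for i in bags}
    for i, j in tree_edges:
        tree_adj[i].add(j)
        tree_adj[j].add(i)
    for v in vertices:
        holders = {i for i, bag in bags.items() if v in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in tree_adj[i]:
                if j in holders and j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != holders:
            return False
    return True


def largest_bag_size(bags):
    """Size of the largest bag in the decomposition."""
    return max(len(bag) for bag in bags.values())


if __name__ == "__main__":
    # A 4-cycle a-b-c-d split into two triangles' worth of bags.
    cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
    bags = {1: {"a", "b", "c"}, 2: {"a", "c", "d"}}
    print(is_tree_decomposition("abcd", cycle_edges, bags, [(1, 2)]), largest_bag_size(bags))
```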
Understanding FPT: “problems are easier on trees”
• Many NP-hard problems can be solved in polynomial time on trees (graphs with no cycles)
Example: Maximum Weighted Independent Set: Complexity O(|V|)
• We can generalize this dynamic programming approach to get polynomial algorithms (in graph size) on graphs where tree-width is bounded.
[Figure: a vertex-weighted tree annotated with a pair of dynamic-programming values at each vertex, e.g. (3,0), (7,5), (8,10), (17,15)]
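The linear-time dynamic program for Maximum Weighted Independent Set on a tree can be sketched as below. Each vertex gets a pair (best weight if the vertex is included, best weight if it is excluded) for its subtree. The input format (a `children` dictionary plus a root) and the example tree are assumptions chosen for illustration and are not the tree from the figure.

```python
def max_weight_independent_set(children, weight, root):
    """Return (optimum weight, table) where table[v] = (include v, exclude v)."""
    # Iterative preorder; reversing it visits children before their parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    table = {}
    for v in reversed(order):
        kids = children.get(v, [])
        # If v is included, none of its children may be included.
        include = weight[v] + sum(table[c][1] for c in kids)
        # If v is excluded, each child subtree takes its better option.
        exclude = sum(max(table[c]) for c in kids)
        table[v] = (include, exclude)
    return max(table[root]), table


# Hypothetical example tree: r has children a and b; a has leaf children c and d.
children = {"r": ["a", "b"], "a": ["c", "d"]}
weight = {"r": 1, "a": 7, "b": 4, "c": 3, "d": 2}

if __name__ == "__main__":
    best, table = max_weight_independent_set(children, weight, "r")
    print(best, table)
```

Each vertex is processed once and each tree edge is read a constant number of times, which gives the O(|V|) bound. On a bounded-width tree decomposition the same idea applies with one table entry per independent subset of a bag instead of two entries per vertex.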
Heuristics for low-width decompositions
• In numerical linear algebra, one often wants to permute the rows of a matrix before computing a factorization so that the resulting factors are as sparse as possible. The objective is to minimize the number of “fill edges” added.
[Figure: Comparison of width and fill from 6 heuristics on graphs known to have tw <= 30]
• For tree decompositions, we instead need to minimize the maximum clique size in the resulting chordal graph.
• Numerous implementations of common heuristics are available, and we tested several on a large set of random graphs with a fixed maximum width and varying sizes.
• Min-degree-based heuristics are orders of magnitude faster than min-fill, etc.
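A greedy minimum-degree elimination ordering can be sketched as follows. This is the plain textbook heuristic with arbitrary tie-breaking. It is not one of the tuned implementations (such as AMD) that were benchmarked, and the function name is illustrative. Eliminating a vertex turns its remaining neighbors into a clique, and the vertex together with those neighbors forms one bag, so the largest such bag upper-bounds the largest bag of an optimal decomposition.

```python
def min_degree_elimination(adj):
    """Greedy min-degree ordering; returns (largest bag size, elimination order)."""
    graph = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    order = []
    largest_bag = 0
    while graph:
        # Pick a vertex of minimum current degree (ties broken arbitrarily).
        v = min(graph, key=lambda u: len(graph[u]))
        nbrs = graph.pop(v)
        # The bag for v is v plus its current neighbors.
        largest_bag = max(largest_bag, len(nbrs) + 1)
        for a in nbrs:
            graph[a].discard(v)
            graph[a] |= nbrs - {a}  # fill edges: make the neighborhood a clique
        order.append(v)
    return largest_bag, order


if __name__ == "__main__":
    cycle5 = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
    print(min_degree_elimination(cycle5))
```

A min-fill variant would instead pick the vertex whose elimination adds the fewest fill edges. That requires examining pairs of neighbors at every step, which is consistent with the speed gap noted above.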
Empirical results: Synthetic
[Figures: MCS lower bounds and AMD upper bounds for the synthetic networks]
Empirical Results: Autonomous Systems
• A larger AS graph had similar results: its 600K nodes yielded a largest connected component of 200K nodes, with an upper bound of 5961 and a lower bound of 32.
Problems with Using Tree Decompositions
• Every bag in a tree decomposition is a vertex separator, so a low-width decomposition means many small separators.
• Treewidth is linear in n (Θ(n)) w/ high probability for many random graphs (Gao 2009):
– Erdos-Renyi graphs G(n,m) when m/n > 1.073
– Random intersection graphs G(n,m,p) on universe {1,…,m} with m = n^a, p at least 2/m and a > 0.
– Barabasi-Albert preferential attachment with at least 12 new edges for each additional vertex.
• Current heuristics get lost in “local noise”
Average k-cores on a tree decomposition
Temperature: 20 Power law exp: 2.1 Avg deg target: 5
Average k-cores on a tree decomposition
Temperature: 0.5 Power law exp: 2.1 Avg deg target: 20
What’s next?
• Clustering
• Diffusions
• Sparse Dimensionality Reduction
• Applications to Statistical Inference
Acknowledgements
Primary support for this work came through the ORNL Laboratory Directed Research & Development SEED Program.
These slides would not have been possible without many hours of hard work by Aaron Adcock.
Motivation for some improvements to min-degree
[Figure: Minimum Degree vs. Minimum Fill-In on an example graph; panel labels: “Eliminate 9”, “Eliminate 2”]
Tiebreaking with second neighbors
• Gloria investigated various strategies for breaking ties within min-degree and min-fill algorithms
• Her hypothesis was that including information about second-neighborhoods could improve the quality of these heuristics
• Even with optimizations, the improved algorithms were often significantly slower than random tie-breaking, due to the computation of additional information (fill or second-neighborhood sizes).
Joint work with Gloria D’Azevedo (ORHS student) and Chris Groer (ORNL).
An example where second neighbors help
[Figure panels: MIND; MIND+(0.5)(SEC)]