noah-bates
Graph reordering/partitioning with redundancy
Motivation
• 1. Distributed graph processing
  – Use redundancy to reduce costly communication
  – Reorder vertices so that vertices with consecutive positions in the order are grouped on the same machine
• 2. External storage of graph data
  – Use redundancy to reduce the disk I/O caused by cross-page access
  – Reorder vertices so that vertices with consecutive positions in the order are grouped in the same page
• Generalized
  – Reordering
  – Redundancy
    • Consider a vertex u with degree k: in the worst case, there will be k remote accesses or disk I/Os
    • By copying u to each machine on which one of its remote neighbors resides, we can avoid such remote accesses or disk I/Os
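The redundancy argument can be illustrated with a minimal sketch. It assumes a simple cost model that is not spelled out on the slide: each neighbor of u that lives on a machine without a copy of u costs one remote access. The names `remote_reads_of`, `machines_needing_copy`, `adj`, and `machine_of` are hypothetical.

```python
def remote_reads_of(u, adj, machine_of, copies=frozenset()):
    """Count neighbors of u that must fetch u from another machine.

    `copies` is the set of machines holding a replica of u (assumed model).
    """
    holders = {machine_of[u]} | set(copies)
    return sum(1 for v in adj[u] if machine_of[v] not in holders)


def machines_needing_copy(u, adj, machine_of):
    """Machines that host a remote neighbor of u but not u itself."""
    return {machine_of[v] for v in adj[u]} - {machine_of[u]}


# Star graph: hub 0 with degree k = 4, all leaves placed on other machines.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
machine_of = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}

before = remote_reads_of(0, adj, machine_of)           # worst case: k remote reads
targets = machines_needing_copy(0, adj, machine_of)    # where to replicate the hub
after = remote_reads_of(0, adj, machine_of, targets)   # reads left after copying
```

In this toy placement, two replicas of the hub replace four remote reads, which is the trade-off the slide describes.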
Rationality
• Why ordering instead of partitioning?
  – Ordering provides more information than partitioning
    • Ordering implies partitioning
      – Suppose we have a vertex sequence V1, V2, ..., Vm and want to partition these vertices into P parts. A simple solution is to cut the sequence into P consecutive parts.
  – Disk access
    • If two vertices (u, v) are logically close to each other, we want them placed in nearby regions on disk so that the disk seek time is reduced when v is accessed after u (or vice versa)
    • Two vertices are said to be logically close to each other if they reside in the same densely connected subgraph
  – In general, vertices are processed in a certain order. We expect to process vertices in the above-mentioned logical order
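As a minimal sketch of "ordering implies partitioning", the function below cuts a vertex sequence into P consecutive parts. Balancing the part sizes so they differ by at most one is an assumption; the slide only asks for consecutive parts. The name `split_consecutive` is hypothetical.

```python
def split_consecutive(order, p):
    """Cut `order` into p consecutive parts whose sizes differ by at most one."""
    if p <= 0:
        raise ValueError("p must be positive")
    base, extra = divmod(len(order), p)
    parts, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)  # the first `extra` parts get one more
        parts.append(order[start:start + size])
        start += size
    return parts


sequence = ["V1", "V2", "V3", "V4", "V5", "V6", "V7"]
parts = split_consecutive(sequence, 3)
```

Any ordering yields a partitioning this way, while a partitioning alone does not say how vertices are arranged inside each part.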
Problem definition 1 (p1)
• Overlapping graph partitioning
• Input: a graph G(V, E), an integer k, and a size constraint Z
• Problem: find k subsets of V, each of size Z, such that:
• Objective:
  – $G(\cdot) = \sum_{e \in E} f(e)$ is minimized
  – $f(e) = 0$ if both endpoints of e occur in the same subset; 1 otherwise
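The objective of p1 can be evaluated directly. The sketch below is a literal reading of the definition of f(e); the function name `cut_cost` and the small example graph are hypothetical.

```python
def cut_cost(edges, subsets):
    """Sum of f(e) over all edges: 0 if some subset holds both endpoints, else 1."""
    subsets = [set(s) for s in subsets]
    return sum(
        0 if any(u in s and v in s for s in subsets) else 1
        for u, v in edges
    )


edges = [(1, 2), (2, 3), (3, 4), (1, 3)]

# Disjoint parts (k = 2, Z = 2): edges (2, 3) and (1, 3) are cut.
disjoint_cost = cut_cost(edges, [{1, 2}, {3, 4}])

# Overlapping parts (k = 2, Z = 3): vertices 2 and 3 are stored twice.
overlapping_cost = cut_cost(edges, [{1, 2, 3}, {2, 3, 4}])
```

On this example, allowing the subsets to overlap removes every cut edge at the price of storing two vertices twice.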
Baseline model: Problem definition 2 (p2)
• Overlapping graph partitioning
• Input: a graph G(V, E) and an integer m
• Problem: find a sequence S(v1, …, vm) over V
• Objectives
  – Each $v \in V$ appears at least once in the sequence
  – $\sum_{e \in E} f(e)$ is minimized
  – $f(e(u, v))$ is defined as the minimal distance between u and v in the sequence
• Advantage
  – Compared to linearization, our model is independent of k; when a partition is smaller than k, linearization fails
  – If we find a solution under our model, a random partitioning is expected to be good enough (the total number of cross-part edges is minimized)
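A minimal sketch of the p2 objective follows. It assumes that "distance" means the difference of positions in the sequence and that, when a vertex appears several times, the closest pair of occurrences is used. The name `sequence_cost` and the star example are hypothetical.

```python
from collections import defaultdict


def sequence_cost(edges, seq):
    """Sum over edges of the minimal position distance between the endpoints."""
    positions = defaultdict(list)
    for i, v in enumerate(seq):
        positions[v].append(i)
    total = 0
    for u, v in edges:
        if u not in positions or v not in positions:
            raise ValueError("every vertex must appear at least once in the sequence")
        total += min(abs(i - j) for i in positions[u] for j in positions[v])
    return total


# Star graph: hub "h" connected to four leaves.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d")]

no_copy = sequence_cost(edges, ["a", "b", "h", "c", "d"])         # m = 5
with_copy = sequence_cost(edges, ["a", "h", "b", "c", "h", "d"])  # m = 6, hub stored twice
```

Here one extra slot in the sequence (m = 6 instead of 5) lowers the cost, because each leaf is adjacent to some copy of the hub.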
Relationship between p1 and p2
• Under the model of problem 2, if we get an optimal solution, then
  – does this hold if and only if, for a random partitioning P over the sequence, E[G(P)] is optimal (minimal)? (open question)
Solution to p2
• How to generate the order
  – Principle to guide the order generation
    • Traverse vertices in the same community first, then those outside the community
  – BFS
  – DFS
• How to select vertices to copy?
  – Quantify the benefit and the cost of copying a vertex?
  – Set lower/upper bounds on the degree of a copied vertex
    • If the degree of a vertex is 1 or 2, there is clearly no need to store its information multiple times; if the degree is relatively low, the benefit of copying the vertex is low too
    • If the degree of a vertex is too large, the cost of copying it is large (i.e., maybe that storage could instead give copies to a few other vertices for a larger overall benefit?)
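The two steps above can be sketched as follows, under stated assumptions: the order is produced by a plain iterative DFS (no community detection is attempted), and copy candidates are chosen by a degree window. The bounds `low` and `high`, the function names, and the example graph are hypothetical; the slides do not fix any values.

```python
def dfs_order(adj, start):
    """Preorder DFS from `start`, visiting neighbors in adjacency-list order."""
    order, seen, stack = [], set(), [start]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        # Push in reverse so the first listed neighbor is visited first.
        for v in reversed(adj[u]):
            if v not in seen:
                stack.append(v)
    return order


def copy_candidates(adj, low, high):
    """Vertices whose degree lies in [low, high] (assumed selection rule)."""
    return [v for v in adj if low <= len(adj[v]) <= high]


# Triangle 0-1-2 with a tail 0-3-4.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [3]}
order = dfs_order(adj, 0)
candidates = copy_candidates(adj, low=3, high=10)
```

A benefit/cost score could replace the fixed window, but that would need a concrete cost model the slides leave open.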
BFS or DFS?
• It seems that BFS is not a good choice
  – The distance between two neighboring vertices will be larger than in a DFS sequence, especially in a graph with many vertices of large (relatively large) degree
• If a back edge is found during DFS, should we copy the information of this vertex? Or are some other constraints needed?
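The BFS/DFS comparison can be checked on a small case. The sketch below measures the total position distance over all edges for both orders on a complete binary tree of depth 2. This is one hand-picked example, not evidence for the general claim; all names are hypothetical.

```python
from collections import deque


def bfs_order(adj, start):
    """Breadth-first visiting order from `start`."""
    order, seen, queue = [start], {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
    return order


def dfs_order(adj, start):
    """Preorder depth-first visiting order from `start`."""
    order, seen, stack = [], set(), [start]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        for v in reversed(adj[u]):
            if v not in seen:
                stack.append(v)
    return order


def total_edge_distance(adj, order):
    """Sum of |pos(u) - pos(v)| over undirected edges, each counted once."""
    pos = {v: i for i, v in enumerate(order)}
    return sum(abs(pos[u] - pos[v]) for u in adj for v in adj[u] if pos[u] < pos[v])


tree = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5, 6], 3: [1], 4: [1], 5: [2], 6: [2]}
bfs_cost = total_edge_distance(tree, bfs_order(tree, 0))
dfs_cost = total_edge_distance(tree, dfs_order(tree, 0))
```

On this tree the DFS order gives the smaller total, consistent with the observation above; other graphs may behave differently.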
Solution to problem P1
• Based on the solution to p2
• Partition the sequence into consecutive parts according to the size constraint Z (the naive solution)
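The naive solution can be sketched in a few lines, assuming Z counts sequence entries (so a vertex copied within one part occupies more than one slot) and that the last part may be shorter than Z. The name `partition_by_size` and the example sequence are hypothetical.

```python
def partition_by_size(seq, z):
    """Cut `seq` into consecutive parts of at most z entries each."""
    if z <= 0:
        raise ValueError("z must be positive")
    return [seq[i:i + z] for i in range(0, len(seq), z)]


# Sequence over a star graph in which the hub "h" appears twice.
seq = ["a", "h", "b", "c", "h", "d"]
parts = partition_by_size(seq, 3)
```

Because the hub appears once in each part, every star edge has both endpoints inside some part, so none of them counts toward the p1 objective.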