Graph reordering/partitioning with redundancy

Motivation
• 1. Distributed graph processing
  – Use redundancy to reduce costly communication
  – Reorder vertices so that vertices with consecutive positions in the order are grouped on the same machine
• 2. External storage of graph data
  – Use redundancy to reduce the disk I/O caused by cross-page accesses
  – Reorder vertices so that vertices with consecutive positions in the order are grouped in the same page
• Generalized
  – Reordering
  – Redundancy
    • Consider a vertex u with degree k: in the worst case, there will be k remote accesses or disk I/Os
    • By copying u to each machine on which one of its remote neighbors resides, we can avoid these remote accesses or disk I/Os
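
The effect of copying a vertex can be illustrated with a small sketch. It assumes a simple cost model that is not stated on the slide: every vertex reads each of its neighbors once, and a read is remote unless the neighbor is stored, or copied, on the reader's machine. The names `remote_reads`, `replica_targets`, `owner` and `replicas` are hypothetical.

```python
def remote_reads(adj, owner, replicas=None):
    """Count neighbor reads that have to leave the reading vertex's machine."""
    replicas = replicas or {}
    count = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            # u reads v; the read is local if v is stored, or copied, on u's machine.
            local = owner[v] == owner[u] or owner[u] in replicas.get(v, set())
            if not local:
                count += 1
    return count


def replica_targets(adj, owner, u):
    """Machines that host at least one remote neighbor of u."""
    return {owner[v] for v in adj[u] if owner[v] != owner[u]}


# Star graph: hub 0 has degree k = 3 and every vertex sits on its own machine.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
owner = {0: 0, 1: 1, 2: 2, 3: 3}

before = remote_reads(adj, owner)
replicas = {0: replica_targets(adj, owner, 0)}  # copy the hub next to each leaf
after = remote_reads(adj, owner, replicas)
print(before, after)
```

In this toy setting the k = 3 reads of the hub issued by its leaves become local once the hub is copied; the reads that remain are the hub reading its leaves.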

Rationality
• Why ordering instead of partitioning?
  – Ordering provides more information than partitioning
    • Ordering implies partitioning
      – Suppose we have a vertex sequence V1, V2, ..., Vm and want to partition these vertices into P parts. A simple solution is to cut the sequence into P consecutive parts.
  – Disk access
    • If two vertices (u, v) are logically close to each other, we want them placed in nearby regions on disk, so that disk seek time is reduced when v is accessed after u (or vice versa)
    • Two vertices are said to be logically close to each other if they reside in the same densely connected subgraph
  – In general, vertices are processed in a certain order. We expect to process vertices in the logical order described above
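
The claim that an ordering implies a partitioning can be sketched as follows. This is only one possible reading: the slide does not say how part sizes are chosen, so the sketch assumes parts that differ in size by at most one. The function name `split_consecutive` is hypothetical.

```python
def split_consecutive(order, p):
    """Split a vertex order into p consecutive parts whose sizes differ by at most one."""
    if not 1 <= p <= len(order):
        raise ValueError("p must be between 1 and len(order)")
    base, extra = divmod(len(order), p)
    parts, start = [], 0
    for i in range(p):
        # The first `extra` parts receive one additional vertex.
        size = base + (1 if i < extra else 0)
        parts.append(order[start:start + size])
        start += size
    return parts


parts = split_consecutive(["V1", "V2", "V3", "V4", "V5", "V6", "V7"], 3)
print(parts)
```

The reverse direction does not hold: a partitioning says nothing about the order of vertices inside a part, which is the extra information the ordering carries.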

Problem definition 1 (p1)
• Overlapping graph partitioning
• Input: a graph G(V, E), an integer k, and a size constraint Z
• Problem: find k subsets of V, each of size Z, such that
• Objective:
  – G(.) = \sum_{e \in E} f(e) is minimized
  – f(e) = 0 if both ends of e occur together in some subset; 1 otherwise
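
A minimal sketch of how the objective G could be evaluated for a given set of (possibly overlapping) subsets. The function name `cut_cost` and the example graph are hypothetical; the sketch only evaluates a candidate solution and does not search for one.

```python
def cut_cost(edges, subsets):
    """G = number of edges whose two ends never occur together in a subset."""
    subsets = [set(s) for s in subsets]
    return sum(
        0 if any(u in s and v in s for s in subsets) else 1
        for u, v in edges
    )


# Path a - b - c - d.
edges = [("a", "b"), ("b", "c"), ("c", "d")]

disjoint = cut_cost(edges, [{"a", "b"}, {"c", "d"}])               # k = 2, Z = 2
overlapping = cut_cost(edges, [{"a", "b", "c"}, {"b", "c", "d"}])  # k = 2, Z = 3
print(disjoint, overlapping)
```

With Z = 2 the edge (b, c) is cut; allowing the subsets to overlap (Z = 3) covers every edge in this example.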

Baseline model: Problem definition 2 (p2)

• Overlapping graph partitioning
• Input: a graph G(V, E) and an integer m
• Problem: find a sequence S(v1,…,vm) of vertices of V
• Objectives
  – Each v \in V appears at least once in the sequence
  – \sum_{e \in E} f(e) is minimized
  – f(e(u,v)) is defined as the minimal distance between u and v in the sequence
• Advantage
  – Compared to linearization, our model is independent of k; when a partition is smaller than k, linearization fails
  – If we find a solution under our model, a random partitioning is expected to be good enough (the total number of cross-part edges is minimized)
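
The objective of problem 2 can be evaluated with a short sketch. It assumes that "distance" means the difference between positions in the sequence and that, when a vertex appears several times, the closest pair of occurrences counts. The name `sequence_cost` is hypothetical.

```python
from collections import defaultdict


def sequence_cost(edges, seq):
    """Sum over edges of the minimal position distance between the two ends."""
    positions = defaultdict(list)
    for i, v in enumerate(seq):
        positions[v].append(i)
    total = 0
    for u, v in edges:
        if u not in positions or v not in positions:
            raise ValueError("every vertex must appear at least once in the sequence")
        # A vertex may occur several times; take the closest pair of occurrences.
        total += min(abs(i - j) for i in positions[u] for j in positions[v])
    return total


# Star graph: hub h with leaves a, b, c, d.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d")]

no_copy = sequence_cost(edges, ["a", "b", "h", "c", "d"])         # m = 5
with_copy = sequence_cost(edges, ["a", "h", "b", "c", "h", "d"])  # m = 6
print(no_copy, with_copy)
```

In this example, one extra copy of the hub (m = 6 instead of m = 5) places every leaf directly next to an occurrence of the hub.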

Relationship between p1 and p2

• Under the model of problem 2, if we get an optimal solution, then
  – If and only if, given a random partitioning P over the sequence, E[G(P)] is optimal (minimal) ???

Solution to p2
• How to generate the order
  – Principle to guide the order generation
    • Traverse the vertices inside the same community first, then those outside the community
      – Bfs
      – Dfs
• How to select vertices to copy?
  – Quantify the benefit and the cost of copying a vertex?
  – Set lower/upper bounds on the degree of a copied vertex
    • If the degree of a vertex is 1 or 2, we obviously need not store the information of this vertex multiple times; if the degree of a vertex is relatively low, the benefit of copying it is low too
    • If the degree of a vertex is too large, the cost of copying it is large (i.e., maybe that storage could instead give copies to a few other vertices and yield more benefit?)
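
The degree-bound rule for choosing copy candidates could look like the sketch below. The thresholds `low` and `high` are hypothetical parameters; the slides give no concrete values and leave open how benefit and cost should be quantified.

```python
from collections import defaultdict


def build_adj(edges):
    """Undirected adjacency lists from an edge list."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def copy_candidates(adj, low, high):
    """Vertices whose degree d satisfies low <= d <= high."""
    return sorted(v for v, nbrs in adj.items() if low <= len(nbrs) <= high)


# h has degree 5, m has degree 3, all other vertices have degree 1.
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("h", "m"),
         ("m", "x"), ("m", "y")]
adj = build_adj(edges)

candidates = copy_candidates(adj, low=3, high=4)
print(candidates)
```

Here the leaves fall below the lower bound and h exceeds the upper bound, so only m is kept as a candidate.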

Bfs or Dfs?

• It seems that Bfs is not a good choice
• The distance between two neighboring vertices will be larger than in a Dfs sequence, especially in a graph with many vertices of large (or relatively large) degree.
• If a back edge is found while doing Dfs, should we copy the information of this vertex??? Or are some other constraints needed???
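
To make the comparison concrete, the sketch below generates a Bfs order and a Dfs order on a small tree and sums the position distance over all edges. It is an illustration on one hand-picked graph without copies, not evidence for the general claim; the function names and the example graph are hypothetical.

```python
from collections import deque


def bfs_order(adj, root):
    """Vertices in breadth-first visiting order."""
    seen, order, queue = {root}, [], deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order


def dfs_order(adj, root):
    """Vertices in depth-first (preorder) visiting order, using an explicit stack."""
    seen, order, stack = set(), [], [root]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        # Push in reverse so neighbors are visited in adjacency-list order.
        for v in reversed(adj[u]):
            if v not in seen:
                stack.append(v)
    return order


def total_edge_distance(adj, order):
    """Sum of |pos(u) - pos(v)| over all undirected edges, each counted once."""
    pos = {v: i for i, v in enumerate(order)}
    return sum(abs(pos[u] - pos[v]) for u in adj for v in adj[u] if pos[u] < pos[v])


# Complete binary tree of depth 2 rooted at r.
adj = {
    "r": ["a", "b"],
    "a": ["r", "c", "d"],
    "b": ["r", "e", "f"],
    "c": ["a"], "d": ["a"], "e": ["b"], "f": ["b"],
}

bfs = bfs_order(adj, "r")
dfs = dfs_order(adj, "r")
print(bfs, total_edge_distance(adj, bfs))
print(dfs, total_edge_distance(adj, dfs))
```

On this tree the Dfs order keeps each subtree contiguous, so its total edge distance is smaller than that of the Bfs order.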

Solution to problem P1

• Based on the solution to p2
• Partition the sequence into consecutive parts by the size constraint Z (the naive solution)
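
A minimal sketch of the naive solution, assuming that Z counts sequence entries per part and that the last part may be shorter. The name `chunk_by_size` and the example sequence are hypothetical.

```python
def chunk_by_size(seq, z):
    """Cut a sequence into consecutive parts of at most z entries each."""
    if z < 1:
        raise ValueError("z must be positive")
    return [seq[i:i + z] for i in range(0, len(seq), z)]


# Sequence for a star graph (hub h, leaves a, b, c, d) with one extra copy of h.
seq = ["a", "h", "b", "c", "h", "d"]
edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d")]

parts = chunk_by_size(seq, 3)
# An edge is covered if both of its ends occur together in some part.
covered = all(any(u in p and v in p for p in parts) for u, v in edges)
print(parts, covered)
```

Because the hub occurs twice in the sequence, it lands in both parts, which is how copies in the p2 sequence turn into overlap between the p1 subsets.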
