Distributed algorithms for finding local clusters using heat
kernel pagerank
WAW 2015
EINDHOVEN, NETHERLANDS
Olivia Simpson, UC San Diego
Fan Chung, UC San Diego
Distributed graph processing
Allows for analysis of data too large to store on one machine
Need for adapting classical graph algorithms to distributed setting
Local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
• Use only local information; avoid querying the whole graph
Local cluster detection in the wild
• Identify competitors in an ad campaign
• Annotating protein structure
• Assign nodes to a particular clusterhead in a wireless sensor network
• Identify bottlenecks in a computer network
• Community detection
• Subroutines for bigger clustering tasks
• Global clustering (YouTube topic discovery)
• Overlapping communities (grow local communities)
Distributed computation
• Data is distributed across nodes (machines) of a network
• Nodes communicate over specified communication links in rounds
• Nodes are allowed to communicate small sized messages through the links
• Initially nodes know their identities and the identities of their neighbors
• Complete data is never known by any individual machine; no shared memory
Distributed algorithms
• Running time in terms of rounds of communication required for computation over arbitrary input
• Local communication is free
• Local computation is free
• Goal: optimize number of rounds
Note on notation
• “Data”
• Graph: vertices, edges, |V| = n, |E| = m
Undirected, uniformly weighted
• “Network”
• nodes (machines), links
Bidirectional communication
• A graph is input instance of a problem to be solved over machines of a network
CONGEST model
Communication links are the edges of the input graph
Vertices of the graph are mapped to dedicated machines
Only allowed to send messages of size O(log n) bits
[Pandurangan, Khan ‘10], [Peleg ‘00]: Introduced to simulate bandwidth restrictions across a network
Application: compute bottlenecks in a network
k-machine model
A number of vertices may be mapped to a single machine
Network is “fixed”
Each machine executes an instance of distributed algorithm
Solution to a full problem is a configuration of outputs of each of the machines
Model simulates distributed graph computation systems like Pregel, Dato
Roadmap
1. Local cluster detection in the CONGEST distributed model
2. Conversion Theorem of [Klauck et al. ‘15] to convert algorithm to k-machine model
Diffusion-based local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
• Use only local information; avoid querying the whole graph
Diffusion-based local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
Want a set S with small Cheeger ratio (conductance):
ɸ(S) = (# edges along the border of the set) / (sum of degrees within the set)
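This quantity is easy to compute directly; a minimal sketch, assuming the graph is given as an adjacency dict (the representation and function name are ours, not from the talk):

```python
def cheeger_ratio(graph, S):
    """Cheeger ratio ɸ(S): edges crossing the border of S divided by the
    volume (sum of degrees) of S. `graph` maps each vertex to its neighbor set."""
    S = set(S)
    boundary = sum(1 for v in S for u in graph[v] if u not in S)
    volume = sum(len(graph[v]) for v in S)
    return boundary / volume

# Example: a 4-cycle 0-1-2-3-0; S = {0, 1} has 2 border edges and volume 4.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(cheeger_ratio(cycle, {0, 1}))  # 0.5
```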
Diffusion-based local cluster detection
A local clustering algorithm is based on the following guarantee:
If there exists a set of vertices S such that ɸ(S) ≤ ϕ, then
many vertices in S may serve as seeds for finding a set T
with Cheeger ratio close to ϕ.
Diffusion-based local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
• Use only local information; avoid querying the whole graph
Use a diffusion process from the “seed” vertex to keep operations local
Diffusion-based local cluster detection
seed vertex
• “Score” vertices according to the probability that accumulates after some amount of diffusion from the seed vertex
The Sweep Algorithm
1. “Score” vertices according to the probability that accumulates after some amount of diffusion from the seed vertex
• Let N be the number of vertices with nonzero score, let s(v) denote the score of vertex v
2. Order vertices by score normalized by degree
s(v1)/d(v1) ≥ s(v2)/d(v2) ≥ … ≥ s(vN)/d(vN)
3. Check Cheeger ratio of each of the subsets induced by first j vertices (“sweep sets”) in the ordering
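The three steps above can be sketched in a centralized setting (our own sketch: `graph` is an adjacency-list dict, `scores` maps the N scored vertices to their diffusion probability, and only the N − 1 proper sweep-set prefixes are checked):

```python
def sweep(graph, scores):
    """Order vertices by score/degree, then return the sweep-set prefix
    with the smallest Cheeger ratio (border edges / volume)."""
    order = sorted(scores, key=lambda v: scores[v] / len(graph[v]), reverse=True)
    S, boundary, volume = set(), 0, 0
    best_set, best_phi = None, float("inf")
    for v in order[:-1]:  # the N-1 proper prefixes of the ordering
        internal = sum(1 for u in graph[v] if u in S)
        S.add(v)
        # v's edges into S become internal; its remaining edges join the border.
        boundary += len(graph[v]) - 2 * internal
        volume += len(graph[v])
        phi = boundary / volume
        if phi < best_phi:
            best_set, best_phi = set(S), phi
    return best_set, best_phi

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge 2-3: with
# scores concentrated on the first triangle, the sweep recovers it (ɸ = 1/7).
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = {0: 0.40, 1: 0.25, 2: 0.25, 3: 0.05, 4: 0.03, 5: 0.02}
print(sweep(g, scores))  # ({0, 1, 2}, 0.14285714285714285)
```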
Diffusions used to score vertices
• Lazy random walk [Lovász, Simonovitz ‘90, ‘93]
• Truncated lazy random walk [Spielman, Teng ‘04]
• PageRank [Andersen et al., ‘06]
• Evolving cluster sets with Markov chains [Andersen, Peres ‘09]
• Lazy random walks + evolving sets [Gharan, Trevisan ‘12]
The Distributed Sweep Algorithm
1. Compute scores for each vertex in some number of rounds
2. Upcast scores and broadcast ordering to every node in O(n) rounds
3. Upcast (place in order, Lj, Rj) to a master node in O(n) rounds
4. Compute Cheeger ratio of each of the n – 1 cuts locally at master node
[Diagram: sweep sets 𝑺𝒋₋₁ ⊂ 𝑺𝒋 with new vertex 𝒔𝒋 and edge counts Lj, Rj]
Distributed Sweep with PageRank
1. Compute scores for each vertex in some number of rounds
2. Upcast scores and broadcast ordering to every other node in O(n) rounds
3. Upcast (place in order, Lj, Rj) to a master node in O(n) rounds
4. Compute Cheeger ratio of each of the n – 1 cuts locally at master node
[Das Sarma et al. ‘15] with PageRank: O((1/α) log² n + n log n) rounds of communication for any reset constant 0 < α < 1
Our diffusion: PHKPR
Personalized heat kernel pagerank is the expected distribution of the following “heat kernel random walk” process:
“take k random walk steps from the seed vertex with probability e^(-t) t^k / k!”
A Monte Carlo method for computing PHKPR in a centralized setting
Since this is the expected distribution of a random walk process, approximate it by sampling random walks. Call these values PHKPR scores.
ϕ = desired Cheeger ratio
t = f(ϕ)
for r times:
perform a heat kernel random walk (t) from the seed
return the number of times a walk ends at vertex v divided by r as the PHKPR score for v
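In Python this sampler might look like the sketch below (not the authors' code: each walk length k is drawn from Poisson(t), matching the e^(-t) t^k / k! weights, via Knuth's sampler; all names are ours):

```python
import math
import random
from collections import Counter

def sample_poisson(t):
    """Draw k with probability e^(-t) t^k / k! (Knuth's method; fine for moderate t)."""
    threshold = math.exp(-t)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1

def approx_phkpr(graph, seed, t, r):
    """Monte Carlo PHKPR scores: run r heat kernel random walks from `seed`
    and return, for each vertex, the fraction of walks ending there."""
    counts = Counter()
    for _ in range(r):
        v = seed
        for _ in range(sample_poisson(t)):  # k steps of a simple random walk
            v = random.choice(graph[v])
        counts[v] += 1
    return {v: c / r for v, c in counts.items()}
```

With t = 0 every walk has length 0 and all probability stays at the seed; for t > 0 the returned scores form an empirical distribution summing to 1.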
Sweep with PHKPR in a centralized setting
[Chung, S. ‘14]
Sampling r = (16/ε³) log n random walks, each limited to at most
K = O(log(1/ε) / log log(1/ε)) steps, captures all PHKPR scores > ε.
In particular, at most 1/ε vertices have nonzero scores.
[Chung, S. ‘14]
WHP, a sweep using PHKPR will return a set with Cheeger ratio O(ϕ^(1/2)).
Distributed PHKPR scores
Launch r heat kernel random walks of length k in parallel
1. seed node initializes r tokens, each of which holds a random variable k and a counter
2. continue until the counter reaches k: nodes holding tokens pass tokens to random neighbors in rounds, incrementing the corresponding counter each time
3. at end of K rounds, each node counts the number of tokens it holds divided by r as its PHKPR score
K = O(log(1/ε) / log log(1/ε)) rounds
No congestion: in the worst case, all r = O(log n) messages are sent over a single edge
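The round structure of this token-passing scheme can be simulated sequentially (our sketch, not the paper's implementation: one list entry per token, all tokens advanced synchronously for K rounds, with walk lengths drawn from Poisson(t) and truncated at K):

```python
import math
import random
from collections import Counter

def sample_poisson(t):
    """Walk length k with probability e^(-t) t^k / k! (Knuth's method)."""
    threshold = math.exp(-t)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1

def distributed_phkpr_sim(graph, seed, t, r, K):
    """Simulate the token-passing rounds: r tokens start at `seed`, each
    carrying a remaining-step count; in every round each unfinished token
    moves to a random neighbor. After K rounds, a node's PHKPR score is
    (tokens it holds) / r."""
    tokens = [(seed, min(sample_poisson(t), K)) for _ in range(r)]
    for _ in range(K):  # K synchronous communication rounds
        tokens = [(random.choice(graph[v]), k - 1) if k > 0 else (v, 0)
                  for v, k in tokens]
    counts = Counter(v for v, _ in tokens)
    return {v: c / r for v, c in counts.items()}
```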
Distributed local cluster detection with PHKPR
1. Compute PHKPR scores for each vertex in O(K) rounds
• N = O(1/ε) nodes have non-zero scores
2. Upcast scores and broadcast ordering to every other node in O(N) rounds
3. Upcast (place in order, Lj, Rj) to a master node in O(N) rounds
4. Compute Cheeger ratio of each of the N – 1 cuts locally at master node
Distributed local cluster detection with PHKPR
This paper with PHKPR: O(K + N) rounds of communication if ϕ is given
[Das Sarma et al. ‘15] with PageRank: O((1/α) log n + n) rounds of communication for any reset constant 0 < α < 1 if ϕ is given
Speed to the next stop…
1. Local cluster detection in the CONGEST distributed model
2. Conversion Theorem of [Klauck et al. ‘15] to convert algorithm to k-machine model
Summary
• Two distributed models of computation:
  • Conversion Theorem transforms an algorithm in the CONGEST model into an equivalent algorithm in the k-machine model
• Local cluster detection:
  • Some number of rounds to compute scores
  • As long as messages are small, broadcasting and upcasting O(n) messages take O(n) rounds
  • Cheeger ratios for all cuts can be computed locally
• Computing scores:
  • Based on sampling short random walks, perfect for parallelizing
  • Bottleneck becomes the length of random walks, not the number of samples