Distributed algorithms for finding local clusters using heat
kernel pagerank
WAW 2015
EINDHOVEN, NETHERLANDS
Olivia Simpson, UC San Diego
Fan Chung, UC San Diego
Distributed graph processing
Allows for analysis of data too large to store on one machine
Need for adapting classical graph algorithms to distributed setting
Local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
• Use only local information; avoid querying the whole graph
Local cluster detection in the wild
• Identify competitors in an ad campaign
• Annotating protein structure
• Assign nodes to a particular clusterhead in a wireless sensor network
• Identify bottlenecks in a computer network
• Community detection
• Subroutines for bigger clustering tasks
• Global clustering (YouTube topic discovery)
• Overlapping communities (grow local communities)
Distributed computation
• Data is distributed across nodes (machines) of a network
• Nodes communicate over specified communication links in rounds
• Nodes are allowed to communicate small sized messages through the links
• Initially nodes know their identities and the identities of their neighbors
• Complete data is never known by any individual machine; no shared memory
Distributed algorithms
• Running time in terms of rounds of communication required for computation over arbitrary input
• Local communication is free
• Local computation is free
• Goal: optimize number of rounds
Note on notation
• “Data”
• Graph: vertices, edges, |V| = n, |E| = m
Undirected, uniformly weighted
• “Network”
• nodes (machines), links
Bidirectional communication
• A graph is input instance of a problem to be solved over machines of a network
CONGEST model
Communication links are the edges of the input graph
Vertices of the graph are mapped to dedicated machines
Only allowed to send messages of size O(log n) bits
[Pandurangan, Khan ‘10], [Peleg ‘00]: Introduced to simulate bandwidth restrictions across a network
Application: compute bottlenecks in a network
k-machine model
A number of vertices may be mapped to a single machine
Network is “fixed”
Each machine executes an instance of distributed algorithm
Solution to a full problem is a configuration of outputs of each of the machines
Model simulates distributed graph computation systems like Pregel, Dato
Roadmap
1. Local cluster detection in the CONGEST distributed model
2. Conversion Theorem of [Klauck et al. ‘15] to convert algorithm to k-machine model
Diffusion-based local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
• Use only local information; avoid querying the whole graph
Diffusion-based local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
Want a set S with small Cheeger ratio (conductance):
ɸ(S) = (# edges along the border of the set) / (sum of degrees within the set)
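This quantity is easy to compute directly; a minimal sketch, assuming the graph is given as an adjacency dict (the representation and function name are ours, not from the talk):

```python
def cheeger_ratio(graph, S):
    """Cheeger ratio ɸ(S): edges crossing the border of S divided by the
    volume (sum of degrees) of S. `graph` maps each vertex to its neighbor set."""
    S = set(S)
    boundary = sum(1 for v in S for u in graph[v] if u not in S)
    volume = sum(len(graph[v]) for v in S)
    return boundary / volume

# Example: a 4-cycle 0-1-2-3-0; S = {0, 1} has 2 border edges and volume 4.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(cheeger_ratio(cycle, {0, 1}))  # 0.5
```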
Diffusion-based local cluster detection
A local clustering algorithm is based on the following guarantee:
If there exists a set of vertices S such that ɸ(S) ≤ ϕ, then
many vertices in S may serve as seeds for finding a set T
with Cheeger ratio close to ϕ.
Diffusion-based local cluster detection
• Find a cluster of well-connected, well-separated vertices near a particular vertex
• Use only local information; avoid querying the whole graph
Use a diffusion process from the “seed” vertex to keep operations local
Diffusion-based local cluster detection
seed vertex
• “Score” vertices according to the probability that accumulates after some amount of diffusion from the seed vertex
The Sweep Algorithm
1. “Score” vertices according to the probability that accumulates after some amount of diffusion from the seed vertex
• Let N be the number of vertices with nonzero score, let s(v) denote the score of vertex v
2. Order vertices by score normalized by degree
s(v1)/d(v1) ≥ s(v2)/d(v2) ≥ … ≥ s(vN)/d(vN)
3. Check Cheeger ratio of each of the subsets induced by first j vertices (“sweep sets”) in the ordering
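The three steps above can be sketched in a centralized setting (our own sketch: `graph` is an adjacency-list dict, `scores` maps the N scored vertices to their diffusion probability, and only the N − 1 proper sweep-set prefixes are checked):

```python
def sweep(graph, scores):
    """Order vertices by score/degree, then return the sweep-set prefix
    with the smallest Cheeger ratio (border edges / volume)."""
    order = sorted(scores, key=lambda v: scores[v] / len(graph[v]), reverse=True)
    S, boundary, volume = set(), 0, 0
    best_set, best_phi = None, float("inf")
    for v in order[:-1]:  # the N-1 proper prefixes of the ordering
        internal = sum(1 for u in graph[v] if u in S)
        S.add(v)
        # v's edges into S become internal; its remaining edges join the border.
        boundary += len(graph[v]) - 2 * internal
        volume += len(graph[v])
        phi = boundary / volume
        if phi < best_phi:
            best_set, best_phi = set(S), phi
    return best_set, best_phi

# Two triangles {0,1,2} and {3,4,5} joined by the bridge edge 2-3: with
# scores concentrated on the first triangle, the sweep recovers it (ɸ = 1/7).
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = {0: 0.40, 1: 0.25, 2: 0.25, 3: 0.05, 4: 0.03, 5: 0.02}
print(sweep(g, scores))  # ({0, 1, 2}, 0.14285714285714285)
```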
Diffusions used to score vertices
• Lazy random walk [Lovász, Simonovitz ‘90, ‘93]
• Truncated lazy random walk [Spielman, Teng ‘04]
• PageRank [Andersen et al., ‘06]
• Evolving cluster sets with Markov chains [Andersen, Peres ‘09]
• Lazy random walks + evolving sets [Gharan, Trevisan ‘12]
The Distributed Sweep Algorithm
1. Compute scores for each vertex in some number of rounds
2. Upcast scores and broadcast ordering to every node in O(n) rounds
3. Upcast (place in order, Lj, Rj) to a master node in O(n) rounds
4. Compute Cheeger ratio of each of the n – 1 cuts locally at master node
[Diagram: sweep sets 𝑺𝒋₋₁ ⊂ 𝑺𝒋 with new vertex 𝒔𝒋 and edge counts Lj, Rj]
Distributed Sweep with PageRank
1. Compute scores for each vertex in some number of rounds
2. Upcast scores and broadcast ordering to every other node in O(n) rounds
3. Upcast (place in order, Lj, Rj) to a master node in O(n) rounds
4. Compute Cheeger ratio of each of the n – 1 cuts locally at master node
[Das Sarma et al. ‘15] with PageRank: O((1/α) log² n + n log n) rounds of communication for any reset constant 0 < α < 1
Our diffusion: PHKPR
Personalized heat kernel pagerank is the expected distribution of the following “heat kernel random walk” process:
“take k random walk steps from the seed vertex with probability e^(-t) t^k / k!”
A Monte Carlo method for computing PHKPR in a centralized setting
Since this is the expected distribution of a random walk process, approximate it by sampling random walks. Call these values PHKPR scores.
ϕ = desired Cheeger ratio
t = f(ϕ)
for r times:
perform a heat kernel random walk (t) from the seed
return the number of times a walk ends at vertex v divided by r as the PHKPR score for v
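In Python this sampler might look like the sketch below (not the authors' code: each walk length k is drawn from Poisson(t), matching the e^(-t) t^k / k! weights, via Knuth's sampler; all names are ours):

```python
import math
import random
from collections import Counter

def sample_poisson(t):
    """Draw k with probability e^(-t) t^k / k! (Knuth's method; fine for moderate t)."""
    threshold = math.exp(-t)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1

def approx_phkpr(graph, seed, t, r):
    """Monte Carlo PHKPR scores: run r heat kernel random walks from `seed`
    and return, for each vertex, the fraction of walks ending there."""
    counts = Counter()
    for _ in range(r):
        v = seed
        for _ in range(sample_poisson(t)):  # k steps of a simple random walk
            v = random.choice(graph[v])
        counts[v] += 1
    return {v: c / r for v, c in counts.items()}
```

With t = 0 every walk has length 0 and all probability stays at the seed; for t > 0 the returned scores form an empirical distribution summing to 1.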
Sweep with PHKPR in a centralized setting
[Chung, S. ‘14]
Sampling r = (16/ε³) log n random walks, each limited to at most
K = O(log(1/ε) / log log(1/ε)) steps, captures all PHKPR scores > ε.
In particular, at most 1/ε vertices have nonzero scores.
[Chung, S. ‘14]
WHP, a sweep using PHKPR will return a set with Cheeger ratio O(ϕ^(1/2)).
Distributed PHKPR scores
Launch r heat kernel random walks of length k in parallel
1. seed node initializes r tokens, each of which holds a random variable k and a counter
2. continue until the counter reaches k: nodes holding tokens pass tokens to random neighbors in rounds, incrementing the corresponding counter each time
3. at end of K rounds, each node counts the number of tokens it holds divided by r as its PHKPR score
K = O(log(1/ε) / log log(1/ε)) rounds
No congestion: in the worst case, all r = O(log n) messages are sent over a single edge
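The round structure of this token-passing scheme can be simulated sequentially (our sketch, not the paper's implementation: one list entry per token, all tokens advanced synchronously for K rounds, with walk lengths drawn from Poisson(t) and truncated at K):

```python
import math
import random
from collections import Counter

def sample_poisson(t):
    """Walk length k with probability e^(-t) t^k / k! (Knuth's method)."""
    threshold = math.exp(-t)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= threshold:
            return k - 1

def distributed_phkpr_sim(graph, seed, t, r, K):
    """Simulate the token-passing rounds: r tokens start at `seed`, each
    carrying a remaining-step count; in every round each unfinished token
    moves to a random neighbor. After K rounds, a node's PHKPR score is
    (tokens it holds) / r."""
    tokens = [(seed, min(sample_poisson(t), K)) for _ in range(r)]
    for _ in range(K):  # K synchronous communication rounds
        tokens = [(random.choice(graph[v]), k - 1) if k > 0 else (v, 0)
                  for v, k in tokens]
    counts = Counter(v for v, _ in tokens)
    return {v: c / r for v, c in counts.items()}
```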
Distributed local cluster detection with PHKPR
1. Compute PHKPR scores for each vertex in O(K) rounds
• N = O(1/ε) nodes have non-zero scores
2. Upcast scores and broadcast ordering to every other node in O(N) rounds
3. Upcast (place in order, Lj, Rj) to a master node in O(N) rounds
4. Compute Cheeger ratio of each of the N – 1 cuts locally at master node
Distributed local cluster detection with PHKPR
This paper with PHKPR: O(K + N) rounds of communication if ϕ is given
[Das Sarma et al. ‘15] with PageRank: O((1/α) log n + n) rounds of communication for any reset constant 0 < α < 1 if ϕ is given
Speed to the next stop…
1. Local cluster detection in the CONGEST distributed model
2. Conversion Theorem of [Klauck et al. ‘15] to convert algorithm to k-machine model
Summary
• Two distributed models of computation:
  • Conversion Theorem transforms an algorithm in the CONGEST model into an equivalent algorithm in the k-machine model
• Local cluster detection:
  • Some number of rounds to compute scores
  • As long as messages are small, broadcasting and upcasting O(n) messages take O(n) rounds
  • Cheeger ratios for all cuts can be computed locally
• Computing scores:
  • Based on sampling short random walks, perfect for parallelizing
  • Bottleneck becomes the length of random walks, not the number of samples