Distributed algorithms for finding local clusters using heat kernel pagerank WAW 2015 EINDHOVEN, NETHERLANDS Olivia Simpson, UC San Diego Fan Chung, UC San Diego

Page 1: Distributed algorithms for finding local clusters using ...cseweb.ucsd.edu/~osimpson/waw15_simpson.pdf · Distributed algorithms for finding local clusters using heat kernel pagerank

Distributed algorithms for finding local clusters using heat kernel pagerank

WAW 2015

EINDHOVEN, NETHERLANDS

Olivia Simpson, UC San Diego

Fan Chung, UC San Diego

Page 2

Distributed graph processing

Allows for analysis of data too large to store on one machine

Need for adapting classical graph algorithms to distributed setting

Page 3

Local cluster detection

• Find a cluster of well-connected, well-separated vertices near a particular vertex

• Use only local information; avoid querying the whole graph

Page 4

Local cluster detection in the wild

• Identify competitors in an ad campaign

• Annotate protein structure

• Assign nodes to a particular clusterhead in a wireless sensor network

• Identify bottlenecks in a computer network

• Community detection

• Subroutines for bigger clustering tasks

• Global clustering (YouTube topic discovery)

• Overlapping communities (grow local communities)

Page 5

Distributed computation

• Data is distributed across nodes (machines) of a network

• Nodes communicate over specified communication links in rounds

• Nodes may exchange only small messages over the links

• Initially nodes know their identities and the identities of their neighbors

• Complete data is never known by any individual machine; no shared memory

Page 6

Distributed algorithms

• Running time in terms of rounds of communication required for computation over arbitrary input

• Local communication is free

• Local computation is free

• Goal: optimize number of rounds

Page 7

Note on notation

• “Data”

• Graph: vertices, edges, |V| = n, |E| = m

Undirected, uniformly weighted

• “Network”

• nodes (machines), links

Bidirectional communication

• A graph is the input instance of a problem to be solved over the machines of a network

Page 8

CONGEST model

Communication links are the edges of the input graph

Vertices of the graph are mapped to dedicated machines

Only allowed to send messages of size O(log n) bits

[Pandurangan, Khan ‘10], [Peleg ‘00]: Introduced to simulate bandwidth restrictions across a network

Application: compute bottlenecks in a network

Page 9

k-machine model

A number of vertices may be mapped to a single machine

Network is “fixed”

Each machine executes an instance of distributed algorithm

Solution to a full problem is a configuration of outputs of each of the machines

Model simulates distributed graph computation systems like Pregel, Dato

Page 10

Roadmap

1. Local cluster detection in the CONGEST distributed model

2. Conversion Theorem of [Klauck et al. ‘15] to convert algorithm to k-machine model

Page 12

Diffusion-based local cluster detection

• Find a cluster of well-connected, well-separated vertices near a particular vertex

• Use only local information; avoid querying the whole graph

Page 13

Diffusion-based local cluster detection

• Find a cluster of well-connected, well-separated vertices near a particular vertex

Want a set S with small Cheeger ratio (conductance):

$$\Phi(S) = \frac{\text{\# edges along border of the set}}{\text{sum of degrees within the set}}$$
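As a rough illustration of the ratio above, here is a minimal sketch that computes it for a vertex set in an undirected, unweighted graph. The adjacency-list representation, the function name `cheeger_ratio`, and the toy graph are assumptions made for this example, not part of the talk.

```python
def cheeger_ratio(adj, subset):
    """Boundary edges of `subset` divided by the sum of degrees inside it."""
    inside = set(subset)
    # Count edges with exactly one endpoint in the set.
    boundary = sum(1 for u in inside for v in adj[u] if v not in inside)
    # Sum of degrees of the vertices in the set.
    volume = sum(len(adj[u]) for u in inside)
    return boundary / volume


# Toy graph (hypothetical): two triangles joined by the single edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
print(cheeger_ratio(adj, {0, 1, 2}))  # one boundary edge over degree sum 7
```

A triangle has a much smaller ratio than a single vertex here, which matches the intuition of "well-connected, well-separated".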

Page 14

Diffusion-based local cluster detection

A local clustering algorithm is based on the following guarantee:

If there exists a set of vertices S such that ɸ(S) ≤ ϕ, then many vertices in S may serve as seeds for finding a set T with Cheeger ratio close to ϕ.

Page 15

Diffusion-based local cluster detection

• Find a cluster of well-connected, well-separated vertices near a particular vertex

• Use only local information; avoid querying the whole graph

Use a diffusion process from the “seed” vertex to keep operations local

Page 16

Diffusion-based local cluster detection

seed vertex

• “Score” vertices according to the probability that accumulates after some amount of diffusion from the seed vertex

Page 17

The Sweep Algorithm

1. “Score” vertices according to the probability that accumulates after some amount of diffusion from the seed vertex

• Let N be the number of vertices with nonzero score, let s(v) denote the score of vertex v

2. Order vertices by score normalized by degree

s(v1)/d(v1) ≥ s(v2)/d(v2) ≥ … ≥ s(vN)/d(vN)

3. Check Cheeger ratio of each of the subsets induced by first j vertices (“sweep sets”) in the ordering
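The three steps can be sketched in a centralized setting as follows. This is a minimal sketch under assumptions: the graph is an adjacency list, `scores` is any dictionary of vertex scores, and the names `sweep` and `adj` are hypothetical. Skipping the sweep set that covers the whole graph is also an assumption of this sketch, since its ratio is trivially zero.

```python
def sweep(adj, scores):
    """Return (best_set, best_ratio) over the sweep sets induced by `scores`."""
    # Step 2: order scored vertices by score normalized by degree, descending.
    order = sorted(
        (v for v, s in scores.items() if s > 0),
        key=lambda v: scores[v] / len(adj[v]),
        reverse=True,
    )
    best_set, best_ratio = None, float("inf")
    prefix = set()
    # Step 3: check the Cheeger ratio of each prefix of the ordering.
    for v in order:
        prefix.add(v)
        if len(prefix) == len(adj):
            break  # assumption: ignore the trivial cut containing every vertex
        boundary = sum(1 for u in prefix for w in adj[u] if w not in prefix)
        volume = sum(len(adj[u]) for u in prefix)
        ratio = boundary / volume
        if ratio < best_ratio:
            best_set, best_ratio = set(prefix), ratio
    return best_set, best_ratio


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
scores = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}  # made-up scores for illustration
best_set, best_ratio = sweep(adj, scores)
print(best_set, best_ratio)
```

Recomputing the boundary for every prefix keeps the sketch short; an implementation that cares about running time would update the cut incrementally.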


Page 18

Diffusions used to score vertices

• Lazy random walk [Lovász, Simonovits ‘90, ‘93]

• Truncated lazy random walk [Spielman, Teng ‘04]

• PageRank [Andersen et al., ‘06]

• Evolving cluster sets with Markov chains [Andersen, Peres ‘09]

• Lazy random walks + evolving sets [Gharan, Trevisan ‘12]

Page 19

The Distributed Sweep Algorithm

1. Compute scores for each vertex in some number of rounds

2. Upcast scores and broadcast ordering to every node in O(n) rounds

3. Upcast (place in order, Lj, Rj) to a master node in O(n) rounds

4. Compute Cheeger ratio of each of the n – 1 cuts locally at master node

[Figure: the sweep ordering, showing sweep sets S_{j−1} and S_j, vertex s_j, and its labels L_j and R_j]
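One way step 4 could work at the master node is sketched below. It assumes, and the slides do not spell this out, that L_j is the number of neighbors of the j-th vertex that appear earlier in the ordering and R_j the number that appear later or are unscored. Under that assumption each cut follows from the previous one in constant time. The function name `cut_ratios` and the sample values are hypothetical.

```python
def cut_ratios(degrees, left, right):
    """Cheeger ratio of each sweep set S_j from per-vertex (d_j, L_j, R_j)."""
    ratios = []
    boundary = 0  # edges leaving the current sweep set
    volume = 0    # sum of degrees in the current sweep set
    for d, l, r in zip(degrees, left, right):
        # Adding the j-th vertex makes its L_j backward edges internal
        # and exposes its R_j forward edges as new boundary edges.
        boundary += r - l
        volume += d
        ratios.append(boundary / volume)
    return ratios


# Two triangles joined by the edge 2-3, vertices ordered 0, 1, ..., 5.
degrees = [2, 2, 3, 3, 2, 2]
left = [0, 1, 2, 1, 1, 2]
right = [2, 1, 1, 2, 1, 0]
ratios = cut_ratios(degrees, left, right)
print(ratios[:-1])  # the n - 1 proper cuts; the last sweep set is the whole graph
```

Only three small numbers per vertex need to reach the master, which is what keeps the upcast within the message-size limit.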

Page 20

Distributed Sweep with PageRank

1. Compute scores for each vertex in some number of rounds

2. Upcast scores and broadcast ordering to every other node in O(n) rounds

3. Upcast (place in order, Lj, Rj) to a master node in O(n) rounds

4. Compute Cheeger ratio of each of the n – 1 cuts locally at master node

[Das Sarma et al. ‘15] with PageRank: O((1/α) log² n + n log n) rounds of communication for any reset constant 0 < α < 1

Page 21

Our diffusion: PHKPR

Personalized heat kernel pagerank is the expected distribution of the following “heat kernel random walk” process:

“take k random walk steps from the seed vertex with probability $e^{-t}\frac{t^k}{k!}$”
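Written out, the process above corresponds to the usual series form of heat kernel pagerank. The symbols here are assumptions not shown on the slide: P stands for the random walk transition matrix and s for the row vector that puts all probability on the seed vertex.

```latex
\rho_{t,s} \;=\; \sum_{k=0}^{\infty} e^{-t}\,\frac{t^{k}}{k!}\; s P^{k},
\qquad
\sum_{k=0}^{\infty} e^{-t}\,\frac{t^{k}}{k!} \;=\; 1 .
```

The weights are exactly the Poisson probabilities with mean t, so the number of steps k can be drawn as a Poisson random variable.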

Page 22

A Monte Carlo method for computing PHKPR in a centralized setting

Since this is the expected distribution of a random walk process, approximate by sampling random walks. Call these values PHKPR scores.

    ϕ = desired Cheeger ratio
    t = f(ϕ)
    for r times:
        perform a heat kernel random walk (t) from the seed
    return the number of times a walk ends at vertex v divided by r as the PHKPR score for v
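A minimal runnable sketch of this estimator follows. It takes t as a plain parameter because the slide leaves the function f(ϕ) unspecified. The helper names (`sample_poisson`, `heat_kernel_walk`, `phkpr_scores`), the adjacency-list graph, and the use of Knuth's Poisson sampler are choices made for this example rather than details from the talk.

```python
import math
import random
from collections import Counter


def sample_poisson(t, rng):
    """Draw k with probability e^{-t} t^k / k! (Knuth's method, fine for small t)."""
    threshold = math.exp(-t)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def heat_kernel_walk(adj, seed, t, rng):
    """Take k ~ Poisson(t) uniform random walk steps from the seed; return the endpoint."""
    v = seed
    for _ in range(sample_poisson(t, rng)):
        v = rng.choice(adj[v])
    return v


def phkpr_scores(adj, seed, t, r, rng=None):
    """Estimate PHKPR scores as the fraction of r walks ending at each vertex."""
    rng = rng or random.Random()
    ends = Counter(heat_kernel_walk(adj, seed, t, rng) for _ in range(r))
    return {v: count / r for v, count in ends.items()}


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
scores = phkpr_scores(adj, seed=0, t=2.0, r=1000, rng=random.Random(0))
print(sorted(scores.items()))
```

The scores always sum to 1 because every walk ends somewhere; vertices that no walk reaches simply do not appear in the result.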

Page 23

Sweep with PHKPR in a centralized setting

[Chung, S. ‘14]

Sampling $r = \frac{16}{\varepsilon^3} \log n$ random walks, each limited to at most $K = O\left(\frac{\log(1/\varepsilon)}{\log\log(1/\varepsilon)}\right)$ steps, captures PHKPR scores > ε.

In particular, O(1/ε) vertices have non-zero scores.

[Chung, S. ‘14]

WHP, a sweep using PHKPR will return a set with Cheeger ratio $O(\phi^{1/2})$.
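To get a feel for the sizes involved, the following sketch evaluates r and K for concrete inputs. Two assumptions are made loudly here: logarithms are natural, and the constant hidden in the O(·) for K is taken to be 1. The function name `sample_parameters` is hypothetical, and the formula for K only makes sense when ε < 1/e.

```python
import math


def sample_parameters(n, eps):
    """Return (r, K) for n vertices and error eps, under the assumed constants."""
    # r = 16 / eps^3 * log n, rounded up to a whole number of walks.
    r = math.ceil(16 / eps**3 * math.log(n))
    # K = O(log(1/eps) / log log(1/eps)); hidden constant assumed to be 1.
    K = math.ceil(math.log(1 / eps) / math.log(math.log(1 / eps)))
    return r, K


r, K = sample_parameters(n=10**6, eps=0.1)
print(r, K)
```

Even for a million vertices the walks are only a handful of steps long, while the number of samples is large; this is the contrast the distributed version exploits.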

Page 24

Distributed PHKPR scores

Page 25

Distributed PHKPR scores

Launch r heat kernel random walks of length k in parallel

1. The seed node initializes r tokens, each of which holds a random variable k and a counter

2. Until each counter reaches its k: nodes holding tokens pass them to random neighbors in rounds, incrementing the corresponding counter each time

3. At the end of K rounds, each node takes the number of tokens it holds divided by r as its PHKPR score

$O(K) = O\left(\frac{\log(1/\varepsilon)}{\log\log(1/\varepsilon)}\right)$ rounds

No congestion: in the worst case, all r = O(log n) messages are sent over one edge
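The token-passing procedure can be simulated on a single machine to see how the rounds unfold. This is only a sequential sketch of the idea, not a real distributed implementation: the function names, the adjacency-list graph, the Poisson sampler, and the choice to cap each token's k at K are all assumptions of the example.

```python
import math
import random


def sample_poisson(t, rng):
    """Draw k with probability e^{-t} t^k / k! (Knuth's method, fine for small t)."""
    threshold = math.exp(-t)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def distributed_phkpr(adj, seed, t, r, K, rng=None):
    """Simulate K synchronous rounds of token passing; return PHKPR scores."""
    rng = rng or random.Random()
    # Step 1: the seed creates r tokens [k, counter], with k capped at K.
    tokens = {v: [] for v in adj}
    tokens[seed] = [[min(sample_poisson(t, rng), K), 0] for _ in range(r)]
    # Step 2: in each round, every unfinished token moves to a random neighbor.
    for _ in range(K):
        next_tokens = {v: [] for v in adj}
        for v, held in tokens.items():
            for k, counter in held:
                if counter < k:
                    next_tokens[rng.choice(adj[v])].append([k, counter + 1])
                else:
                    next_tokens[v].append([k, counter])  # finished: stays put
        tokens = next_tokens
    # Step 3: each node's score is the number of tokens it holds divided by r.
    return {v: len(held) / r for v, held in tokens.items() if held}


# Toy graph (hypothetical): two triangles joined by the edge 2-3.
adj = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
scores = distributed_phkpr(adj, seed=0, t=2.0, r=1000, K=6, rng=random.Random(0))
print(sorted(scores.items()))
```

All r walks advance in the same round, so the number of rounds depends on K alone and not on r.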

Page 26

Distributed local cluster detection with PHKPR

1. Compute PHKPR scores for each vertex in O(K) rounds

• N = O(1/ε) nodes have non-zero scores

2. Upcast scores and broadcast ordering to every other node in O(N) rounds

3. Upcast (place in order, Lj, Rj) to a master node in O(N) rounds

4. Compute Cheeger ratio of each of the N – 1 cuts locally at master node

[Figure: the sweep ordering, showing sweep sets S_{j−1} and S_j, vertex s_j, and its labels L_j and R_j]

Page 27

Distributed local cluster detection with PHKPR

This paper with PHKPR: O(K + N) rounds of communication if ϕ is given

[Das Sarma et al. ‘15] with PageRank: O((1/α) log n + n) rounds of communication for any reset constant 0 < α < 1 if ϕ is given

Page 28

Speed to the next stop…

1. Local cluster detection in the CONGEST distributed model

2. Conversion Theorem of [Klauck et al. ‘15] to convert algorithm to k-machine model

Page 29

Summary

• Two distributed models of computation:

  • Conversion Theorem transforms algorithm in CONGEST model to equivalent algorithm in k-machine model

• Local cluster detection:

  • Some number of rounds to compute scores

  • As long as messages are small, broadcasting and upcasting O(n) messages take O(n) rounds

  • Cheeger ratios for all cuts can be computed locally

• Computing scores:

  • Based on sampling short random walks, perfect for parallelizing

  • Bottleneck becomes length of random walks, not number of samples

Page 30

Distributed algorithms for finding local clusters using heat kernel pagerank

WAW 2015

EINDHOVEN, NETHERLANDS

Olivia Simpson, UC San Diego
