Weiren Yu 1 , Jiajin Le 2 , Xuemin Lin 1 , Wenjie Zhang 1 On the Efficiency of Estimating Penetrating Rank on Large Graphs 1 University of New South Wales & NICTA, Australia 2 Donghua University, China SSDBM 2012


Page 1: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Weiren Yu1, Jiajin Le2, Xuemin Lin1, Wenjie Zhang1

On the Efficiency of Estimating Penetrating Rank on Large Graphs

1 University of New South Wales & NICTA, Australia

2 Donghua University, China

SSDBM 2012

Page 2: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Contents

1. Introduction

2. Problem Definition

3. Optimization Techniques

4. Experimental Results

Page 3: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

P-Rank : A New Link-based Similarity Measure

Structural Similarity Measure

PageRank [Page et al., 1999]

SimRank [Jeh and Widom, KDD 02]

P(enetrating)-Rank similarity

A new promising structural measure [Zhao et al., CIKM 09]

An extension of SimRank metrics

Basic Philosophy

Two entities are similar, if

(1) they are referenced by similar entities

(2) they reference similar entities

Page 4: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

P-Rank Overview

Features

Avoiding the “limited information problem” of SimRank

--- By taking account of both in- and out-links

Defined recursively and computed iteratively

Applicable to any domain with object-to-object relationships

Challenges

Costly to compute P-Rank on large graphs

Naïve Iteration: O(Kn^4) [Zhao et al., CIKM 09]

Partial Sums Amortization: O(Kn^3) [Lizorkin et al., PVLDB 08]

Hard to estimate the error for P-Rank approximation

Radius- and category-based Pruning Rule: O(Kd^2n^2) [Zhao et al., CIKM 09]

Page 5: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

P-Rank Formulation

Mathematical Formula

Iterative Paradigm
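
For reference, here is a sketch of the P-Rank formula and its iterative form as they are usually stated following Zhao et al. [CIKM 09]. The symbols are assumptions for illustration, not taken from this slide: I(a) and O(a) are the in- and out-neighbor sets of a, λ ∈ [0, 1] weighs in-links against out-links, and C_in, C_out ∈ (0, 1) are damping factors.

```latex
% Mathematical formula (recursive definition), with s(a, a) = 1:
\begin{align*}
s(a, b) ={}& \lambda \cdot \frac{C_{\mathrm{in}}}{|I(a)|\,|I(b)|}
             \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\big(I_i(a), I_j(b)\big) \\
          &+ (1 - \lambda) \cdot \frac{C_{\mathrm{out}}}{|O(a)|\,|O(b)|}
             \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s\big(O_i(a), O_j(b)\big)
\end{align*}

% Iterative paradigm: start from the identity and apply the formula repeatedly.
\begin{align*}
s^{(0)}(a, b) ={}& \begin{cases} 1, & a = b \\ 0, & a \neq b \end{cases} \\
s^{(k+1)}(a, b) ={}& \lambda \cdot \frac{C_{\mathrm{in}}}{|I(a)|\,|I(b)|}
             \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s^{(k)}\big(I_i(a), I_j(b)\big) \\
          &+ (1 - \lambda) \cdot \frac{C_{\mathrm{out}}}{|O(a)|\,|O(b)|}
             \sum_{i=1}^{|O(a)|} \sum_{j=1}^{|O(b)|} s^{(k)}\big(O_i(a), O_j(b)\big),
\qquad s^{(k+1)}(a, a) = 1
\end{align*}
```

Setting λ = 1 keeps only the in-link term, which reduces P-Rank to SimRank.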

Page 6: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Contributions

Characterizing P-Rank as two forms

matrix inversion --- deterministic optimization

power series --- probabilistic computation

Deterministic optimization (off-line)

eliminating neighborhood structure redundancy

quadratic-time with an error bound

Probabilistic computation (on-line)

a sampling approach

linear-time with controlled accuracy

Page 7: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1
Page 8: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

What is P-Rank?

The similarity in a domain can be modeled as graphs.

[ vertices → objects , edges → relationships ]

SimRank is an important similarity measure that exploits the relationships between vertices on web graphs.

(Glen Jeh & Jennifer Widom, ’02)

Basic intuition:

Two objects are similar if their neighbors are similar.

(the recursive definition)

Objects are maximally similar to themselves.

(the base case )

Page 9: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Existing Similarity measures

Textual-Content Similarity (text-based)

Vector-cosine similarity, Pearson correlation in IR

Structural-Context Similarity (link-based)

PageRank

• One page’s authority is decided by its neighbors’ authorities.

SimRank

• Two objects are similar if they are referenced by similar objects.

Page 10: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Contents

1. Introduction

2. Problem Definition

3. Optimization Techniques

4. Experimental Results

Page 11: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

G vs. G^2 Model

Basic Graph Model: G = (V, E)

For each vertex v ∈ V, we define:

N(v): all the neighbors of vertex v

N_i(v): an individual member of N(v)

Node-pair Graph: G^2 = (V^2, E^2)

∀ (a, b) ∈ V^2 represents a pair (a, b) of nodes in G.

∀ ⟨(a1, b1), (a2, b2)⟩ ∈ E^2 denotes that the edges ⟨a1, a2⟩ and ⟨b1, b2⟩ exist in G.

[Figure: vertices v and u with their neighbor sets N(v) and N(u)]

SimRank propagating similarity from node to node in G is associated with the propagation from pair to pair in G^2.
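
The node-pair graph can be built directly from its definition. The following is a minimal sketch, not the authors' code; the function name `node_pair_graph` and the toy graph are hypothetical.

```python
from itertools import product


def node_pair_graph(vertices, edges):
    """Build the node-pair graph (V^2, E^2) of a directed graph G = (V, E).

    An edge ((a1, b1), (a2, b2)) is in E^2 exactly when both
    (a1, a2) and (b1, b2) are edges of G.
    """
    edge_set = set(edges)
    v2 = list(product(vertices, repeat=2))  # every ordered pair of nodes
    e2 = [((a1, b1), (a2, b2))
          for (a1, a2) in edge_set
          for (b1, b2) in edge_set]
    return v2, e2


# Hypothetical toy graph: u points to both v and w.
vertices = ["u", "v", "w"]
edges = [("u", "v"), ("u", "w")]
v2, e2 = node_pair_graph(vertices, edges)
```

Since |V^2| = n^2 and |E^2| = |E|^2, the node-pair graph grows quickly, which is one way to see why exact similarity computation is costly on large graphs.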

Page 12: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

SimRank Equation

Definition 1 (SimRank similarity)

Let s: V^2 → [0, 1] be a similarity function on G^2

• if a = b, s(a, b) = 1,

• if N(a) = ∅ or N(b) = ∅, s(a, b) = 0,

• otherwise:

$$s(a, b) = \frac{c}{|N(a)|\,|N(b)|} \sum_{i=1}^{|N(a)|} \sum_{j=1}^{|N(b)|} s\big(N_i(a), N_j(b)\big)$$

where c is a decay factor btw. 0 & 1

Similarity btw. a & b is the average similarity btw. neighbors of a and neighbors of b.
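
A direct, unoptimized reading of Definition 1 can be sketched as a fixed-point iteration. This is an illustrative sketch only: the function name `simrank`, the dictionary-based graph representation, and the star-graph example are assumptions, and the loop is the naïve O(Kn^4) scheme rather than any optimized method from the talk.

```python
def simrank(neighbors, c=0.8, iterations=10):
    """Naive SimRank iteration over a dict mapping node -> list of neighbors."""
    nodes = list(neighbors)
    # Base case: every node is maximally similar to itself.
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        nxt = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    nxt[(a, b)] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    nxt[(a, b)] = 0.0
                else:
                    # Average similarity over all neighbor pairs, decayed by c.
                    total = sum(s[(x, y)]
                                for x in neighbors[a]
                                for y in neighbors[b])
                    nxt[(a, b)] = c * total / (len(neighbors[a]) * len(neighbors[b]))
        s = nxt
    return s


# Hypothetical star graph: node 0 is linked to nodes 1 and 2.
star = {0: [1, 2], 1: [0], 2: [0]}
scores = simrank(star, c=0.8, iterations=5)
```

In this example the two leaves share their only neighbor, so their similarity equals c, while the center and a leaf never meet and stay at 0.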

Page 13: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Existing Techniques for SimRank Optimization

Deterministic Method [VLDB J. ’10, EDBT ’10, APWEB ’10, etc]

( computing s(∙, ∙) iteratively for finding a fixed point )

Advantage: accurate

Disadvantage: high time complexity (O(Kn^3) in the worst case)

Probabilistic Method [WWW ’05, SIGIR ’06]

( estimating s(∙, ∙) stochastically by using Monte-Carlo )

s(a, b) = E(c^{T(a,b)}), where T(a, b) is the first meeting time btw. a & b

Advantage: scalable (linear time)

Disadvantage: low similarity quality

$$s_{k+1}(a, b) = \frac{c}{|N(a)|\,|N(b)|} \sum_{i=1}^{|N(a)|} \sum_{j=1}^{|N(b)|} s_k\big(N_i(a), N_j(b)\big)$$
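
The probabilistic characterization s(a, b) = E(c^T(a,b)) suggests a simple Monte-Carlo estimator: run two coupled random walks from a and b, record the first step at which they meet, and average c raised to that step. The sketch below is an assumption-laden illustration, not the method of the cited papers; `mc_simrank`, its parameters, and the truncation at `max_steps` are hypothetical choices.

```python
import random


def mc_simrank(neighbors, a, b, c=0.8, walks=1000, max_steps=20, seed=0):
    """Estimate s(a, b) as the sample mean of c**T over coupled random walks.

    T is the first step at which the two walks occupy the same node.
    Walks that have not met after max_steps contribute 0 (truncation).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        x, y = a, b
        for t in range(max_steps + 1):
            if x == y:
                total += c ** t
                break
            if not neighbors[x] or not neighbors[y]:
                break  # a walk is stuck, so the pair can never meet
            # Both walks move to a uniformly chosen neighbor at the same time.
            x = rng.choice(neighbors[x])
            y = rng.choice(neighbors[y])
    return total / walks


# Hypothetical star graph: both leaves step to the center and meet at T = 1.
star = {0: [1, 2], 1: [0], 2: [0]}
estimate = mc_simrank(star, 1, 2, c=0.8)
```

Each walk costs time independent of the graph size, which is the source of the scalability noted above; the price is sampling error on graphs where meeting times vary.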

Page 14: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Existing Techniques for SimRank Deterministic Computation

Jeh and Widom first proposed a SimRank model [KDD ’02], taking O(Kn^4) worst-case time.

Li et al. proposed a non-iterative approximate algorithm [EDBT ’10], yielding O(r^4 n^2) time for dynamic information networks.

Lizorkin et al. used a partial sum function [VLDB ’10], reducing the time to O(Kn^3) in the worst case.

Yu et al. showed a fast matrix multiplication for digraphs [APWEB ’10], requiring O(K·min(m·n, n^r)) time, where 2 < r < log_2 7.

Page 15: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Motivations

The time required for SimRank deterministic algorithms is still about cubic in the number of vertices for each iteration, which is costly over large graphs.

As for SimRank deterministic computation, parallel implementation has not been addressed in scientific literature yet.

Page 16: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Our Contributions

We present an efficient spectral decomposition based algorithm for SimRank computation over undirected graphs, which reduces the time complexity from O(Kn^3) to O(n^3 + Kn^2).

We develop a block partition technique in combination with the Parallel Linear Algebra Package (PLAPACK) to parallelize the SimRank algorithm on distributed memory multi-processors.

We perform extensive evaluations of our proposed methods, demonstrating the efficiency and effectiveness of our algorithms.

Page 17: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Contents

1. Introduction

2. Problem Definition

3. Optimization Techniques

4. Experimental Results

Page 18: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Efficient and Parallel SimRank Optimizations on Undirected Graphs


3.1 AUG-SimRank

Page 19: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Graph Spectrum

Definition 1. (Graph Spectrum) Given a web graph G, let Q_G denote its transition probability matrix. The spectrum of G is defined to be the set of the eigenvalues of Q_G. In symbols,

$$\sigma(G) = \{\lambda : \det(\lambda I - Q_G) = 0\}$$

[Figure: two 6-vertex example graphs, one directed and one undirected, each shown with its transition probability matrix Q_G and its spectrum σ(G). The spectrum of the directed graph contains complex eigenvalues; the spectrum of the undirected graph is real.]

Page 20: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Graph Spectrum

For a digraph G,

some elements in σ(G) might be complex numbers.

For an undirected graph G,

all elements in σ(G) must be real numbers.

Theorem 1. Given an undirected graph G, all the eigenvalues of

its transition probability matrix QG are real numbers associated

with a complete set of orthonormal eigenvectors.
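
The contrast between directed and undirected spectra is easy to check numerically. The sketch below assumes Q_G is the row-normalized adjacency matrix D^{-1}A, which the slides do not spell out; the two small graphs are hypothetical examples. For an undirected graph, D^{-1}A is similar to the symmetric matrix D^{-1/2} A D^{-1/2}, so its eigenvalues must be real.

```python
import numpy as np

# Hypothetical undirected path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)

Q = A / deg[:, None]                 # assumed transition matrix D^{-1} A
N = A / np.sqrt(np.outer(deg, deg))  # symmetric D^{-1/2} A D^{-1/2}, similar to Q

eig_Q = np.linalg.eigvals(Q)         # general solver: may return complex values
eig_N = np.linalg.eigvalsh(N)        # symmetric solver: real, ascending order

# Hypothetical directed 3-cycle 0 -> 1 -> 2 -> 0 for comparison.
C = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
eig_C = np.linalg.eigvals(C)         # the three cube roots of unity
```

Here `eig_Q` has zero imaginary parts and matches `eig_N`, while `eig_C` contains a complex-conjugate pair.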

Page 21: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Graph Spectrum

Theorem 2. For an undirected graph G, let Q = U·Λ·U^{-1} be a complete spectral decomposition of Q, where

U is an orthogonal matrix with real entries whose columns are eigenvectors of Q,

Λ is a real diagonal matrix whose diagonal entries give the corresponding eigenvalues.

Then we can construct the following iteration:

And SimRank similarity can be thereby obtained as follows:

Page 22: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Key Observations

0. SimRank Matrix Representation

1. Spectral Predecomposition [Theorem 1]: O(n^3)

2. Iterative Element-wise Matrix Multiplication: O(Kn^2)

3. SimRank Matrix Computation: O(n^3)

Page 23: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Key Observations (cont.)

Notice that Λ·S·Λ = [diag(Λ)·diag(Λ)^T] ⊙ S is our trick to reduce the time complexity from O(n^3) to O(n^2) per iteration.
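
The identity behind this trick follows from (Λ·S·Λ)_ij = λ_i · S_ij · λ_j, which is an entry-wise product with the rank-1 matrix of eigenvalue products. A quick numerical check, using random data and hypothetical variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
lam = rng.uniform(-1.0, 1.0, size=n)  # diagonal of Lambda as a vector
S = rng.random((n, n))                # any n x n matrix

# Two dense matrix products: O(n^3) if done naively.
dense = np.diag(lam) @ S @ np.diag(lam)

# One outer product plus one element-wise (Hadamard) product: O(n^2).
hadamard = np.outer(lam, lam) * S
```

Both expressions give the same matrix, but the second touches each entry only once.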

Page 24: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

AUG-SimRank Algorithm

Theorem 3. For undirected graphs, SimRank can be performed for K iterations in O(n^3 + K·n^2) time in the worst case, where n is the number of vertices, and n ≫ K.

Preconditioning techniques may be adopted when we calculate diag(Λ)·diag(Λ)^T. Once computed, this rank-1 matrix is memoized and therefore need not be recomputed when subsequently required.
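
The three phases can be sketched end to end under stated assumptions. This is not the authors' algorithm as published: it assumes the matrix form S = c·Q·S·Q^T + (1 − c)·I for SimRank and a symmetric Q (true for regular undirected graphs), so that `numpy.linalg.eigh` yields an orthogonal U. The names `aug_simrank` and `direct_simrank` are hypothetical.

```python
import numpy as np


def aug_simrank(Q, c=0.8, K=10):
    """Spectral sketch: O(n^3) setup, O(n^2) per iteration, O(n^3) finish."""
    n = Q.shape[0]
    lam, U = np.linalg.eigh(Q)        # phase 1: Q = U diag(lam) U^T (Q symmetric)
    M = c * np.outer(lam, lam)        # rank-1 matrix, computed once and reused
    V = (1 - c) * np.eye(n)           # V_0 = U^T S_0 U with S_0 = (1 - c) I
    for _ in range(K):                # phase 2: element-wise updates only
        V = M * V + (1 - c) * np.eye(n)
    return U @ V @ U.T                # phase 3: map back to S_K


def direct_simrank(Q, c=0.8, K=10):
    """Reference iteration with two dense products per step, O(K n^3)."""
    n = Q.shape[0]
    S = (1 - c) * np.eye(n)
    for _ in range(K):
        S = c * Q @ S @ Q.T + (1 - c) * np.eye(n)
    return S


# Hypothetical 6-cycle: 2-regular, so Q = A / 2 is symmetric.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
Q = A / 2.0

S_fast = aug_simrank(Q)
S_ref = direct_simrank(Q)
```

Both routines return the same matrix; the spectral version pays the cubic cost twice in total rather than once per iteration, which matches the O(n^3 + K·n^2) bound of Theorem 3.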

Page 25: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Efficient and Parallel SimRank Optimizations on Undirected Graphs


3.2 PAUG-SimRank

Page 26: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Parallel AUG-SimRank

To parallelize the AUG-SimRank algorithm, we utilize PLAPACK in

combination with matrix partition techniques on distributed memory

architectures.

PLAPACK is a parallel linear algebra package based on MPI (Message Passing Interface) for constructing parallel linear algebra libraries.

It provides a high-level object-oriented programming interface.

The coding of parallel linear algebra routines becomes a

straightforward translation of algorithms.

Page 27: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Parallel AUG-SimRank

In the spectral predecomposition phase,

Use a PLAPACK eigen-solver to decompose Q → U·Λ·U^{-1}.

Partition the row vector diag(Λ)^T → (Λ^(1) Λ^(2) ··· Λ^(N)).

In the iterative element-wise matrix multiplication phase,

Initialize the upper triangular part of M^(i) ← c·diag(Λ)·Λ^(i).

Partition the similarity matrix as

Calculate each partition in parallel as
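
The block-partition idea can be imitated on a single machine. The sketch below is only an analogy: it uses Python threads and NumPy rather than PLAPACK or MPI, it updates full column blocks rather than just the upper triangular part, and it assumes each block update is the element-wise product of M^(i) with the matching column block of the matrix being iterated. `blockwise_update` and its parameters are hypothetical.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def blockwise_update(lam, V, c=0.8, num_blocks=3):
    """Compute (c * lam lam^T) ⊙ V one column block at a time."""
    n = V.shape[0]
    col_blocks = np.array_split(np.arange(n), num_blocks)

    def update(cols):
        # M^(i) = c * diag(Lambda) * Lambda^(i): eigenvalues times one slice.
        M_i = c * np.outer(lam, lam[cols])
        return M_i * V[:, cols]

    # Blocks are independent, so they can be processed concurrently.
    with ThreadPoolExecutor(max_workers=num_blocks) as pool:
        blocks = list(pool.map(update, col_blocks))
    return np.hstack(blocks)


rng = np.random.default_rng(1)
lam = rng.uniform(-1.0, 1.0, size=7)
V = rng.random((7, 7))
parallel_result = blockwise_update(lam, V)
serial_result = 0.8 * np.outer(lam, lam) * V
```

Because the update is element-wise, no block needs data from another block, which is what makes this phase straightforward to distribute.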

Page 28: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Parallel AUG-SimRank (cont.)

In the SimRank matrix computation phase,

parallel computation of Sk can be performed by the following substeps:

A symmetric matrix-matrix multiplication

can be parallelized in PLAPACK.

The upper (or lower) triangular part of Sk can be updated as

hence, Sk can be computed as

Page 29: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Contents

1. Introduction

2. Problem Definition

3. Optimization Techniques

4. Experimental Results

Page 30: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Experimental Studies

Hardware

2.0GHz Pentium(R) Dual-Core / 2GB RAM

Windows Vista OS / Visual C++ 6.0

Data Sets

Synthetic

10 sample adjacency matrices of size 1K to 10K,

with ξ ∼ uniform[0, 16] out-links on each row,

giving graphs with an average of 8 links per page.

Real-life

Wikipedia (3.2M articles with 110M intra-wiki links / Oct. ’07)

We choose the relationship “a category contains an article” to be a link btw. the category and the article.
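
Synthetic inputs of the kind described under Data Sets can be generated in a few lines. This sketch assumes ξ is drawn as an integer uniformly from 0 to 16 inclusive and that link targets are chosen uniformly without repetition (self-links are not excluded); the slides do not give these details, and `synthetic_adjacency` is a hypothetical helper.

```python
import numpy as np


def synthetic_adjacency(n, max_out=16, seed=0):
    """Random 0/1 adjacency matrix with a uniform number of out-links per row."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n), dtype=np.uint8)
    for row in range(n):
        k = rng.integers(0, max_out + 1)               # out-degree in 0..max_out
        targets = rng.choice(n, size=k, replace=False)  # distinct link targets
        A[row, targets] = 1
    return A


A = synthetic_adjacency(1000)
out_degrees = A.sum(axis=1)
```

With out-degrees uniform on 0 to 16, the expected number of links per row is 8, consistent with the stated average.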

Page 31: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Experimental Studies

Algorithms for Comparison

SimRank with partial sums. [VLDB ’10]

SOR SimRank. [APWEB ’10]

AUG SimRank. & Parallel AUG SimRank.

Evaluation Measures

CPU time : computational complexity

absolute speedup : parallel efficiency

Parameter Settings

c = 0.8, ω = 1.3, ϵ = 0.05

• p: number of processors

• T_1: execution time of the sequential algorithm on one processor

• T_p: time taken on p processors
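
Assuming the usual definition is intended here, the absolute speedup on p processors and the corresponding parallel efficiency relate these quantities as follows (the symbols S_p and E_p are introduced only for illustration):

```latex
\[
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p} = \frac{T_1}{p \cdot T_p}
\]
```

For example, a run that takes 100 s sequentially and 30 s on 4 processors has S_4 ≈ 3.33 and E_4 ≈ 0.83.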

Page 32: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Time Efficiency Evaluation

Page 33: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Time Efficiency Evaluation

Page 34: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Time Efficiency Evaluation

Page 35: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Time Efficiency Evaluation

Page 36: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Time Efficiency Evaluation

Page 37: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Parallel Efficiency Evaluation

Page 38: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1

Parallel Efficiency Evaluation

Page 39: Weiren  Yu 1 ,   Jiajin  Le 2 ,   Xuemin  Lin 1 ,   Wenjie  Zhang 1