Lecture 8: Graph mining (Book Ch 17)


Image source: https://www.jamiesheffield.com/2013/05/concept-mapping-my-protagonists-world.html

MDM course Aalto 2020 – p.1/34

Graphs occur everywhere!


Image sources: Leskovec et al. (2009), Fan & Simeon (2000)

Graph mining

Data may consist of

1. multiple small graphs → today

e.g., chemical compounds, biological pathways, program control flows, consumer behaviour, ..., even (HTML) documents can be represented as graphs!

2. one large graph → later

e.g., the internet, a social network

Information to mine: interesting substructures, similarities, communities, clusters

Concept map of methods (this lecture)

[Figure: concept map of the lecture topics. Arrows between the boxes are labelled "uses" and "needs".]

Graph matching

Distance between graphs

Graph matching based methods
− MCG distances
− graph edit distance

Transformation based methods
− type transport
− topological descr.
− kernel similarity

Mining frequent subgraphs

Graph clustering

Distance based methods
− K-medoids
− methods using similarity graphs

Frequent subgraph based methods
− type transport
− representatives from frequent subgraphs

Graph notations

G = (V,E) graph

V = {v1, . . . , vn} = set of vertices or nodes

|V| = number of nodes

node label l(vi)

E = {e1, . . . , em} = set of edges, ei = (v, u), v, u ∈ V

|E| = number of edges

For now we assume that edges are undirected and do not have labels

Graph matching = graph isomorphism

Two graphs G1 = (V1, E1) and G2 = (V2, E2) are matching or isomorphic iff there is a 1:1 correspondence between their nodes such that

(i) Corresponding nodes vi ∈ V1 and vj ∈ V2 have the same labels: l(vi) = l(vj).

(ii) Let [v1, u1] be a node pair in G1 and [v2, u2] the corresponding pair in G2. Then edge (v1, u1) ∈ E1 ⇔ (v2, u2) ∈ E2.

Note: no polynomial-time algorithms are known (except for special cases)
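Conditions (i) and (ii) can be checked by brute force on very small graphs. The sketch below is only an illustration, not the algorithm from the book: the function name `matchings` and the representation (a dict from node to label plus a list of undirected edges) are assumptions made here. It tries all n! node correspondences, so it is usable only for a handful of nodes.

```python
from itertools import permutations

def matchings(labels1, edges1, labels2, edges2):
    """Yield every 1:1 node mapping that satisfies conditions (i) and (ii).

    labels*: dict node -> label, edges*: iterable of undirected (v, u) pairs.
    (Representation chosen for this sketch only.)
    """
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    if len(labels1) != len(labels2) or len(e1) != len(e2):
        return  # different sizes: no matching can exist
    nodes1 = list(labels1)
    for perm in permutations(labels2):
        mapping = dict(zip(nodes1, perm))
        # (i) corresponding nodes must have the same label
        if any(labels1[v] != labels2[mapping[v]] for v in nodes1):
            continue
        # (ii) the mapped edge set of G1 must equal the edge set of G2
        if {frozenset(mapping[x] for x in e) for e in e1} == e2:
            yield mapping

# A triangle whose nodes all carry the label "C" matches itself in 3! = 6 ways.
triangle = ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2), (0, 2)])
n_matchings = len(list(matchings(*triangle, *triangle)))
```

Two graphs are isomorphic exactly when the generator yields at least one mapping; counting the yielded mappings shows why there can be many matchings between the same pair of graphs.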


There can be many matchings!

Two matchings for molecules 1 and 2. In total there are 4! = 24 matchings!

Image source: Aggarwal Ch 17

Subgraph isomorphism

Does a certain query graph Gq match a part of another graph G?

Query graph Gq = (Vq, Eq) is a subgraph isomorphism of G = (V, E) if its nodes can be matched to distinct nodes of G such that

(i) For all vq ∈ Vq the matched node v ∈ V has the same label, l(vq) = l(v); and

(ii) If [v1, u1] is a pair in Vq and [v2, u2] the matching pair in V, then (v1, u1) ∈ Eq ⇔ (v2, u2) ∈ E.

Sometimes a weaker condition suffices for (ii): (v1, u1) ∈ Eq ⇒ (v2, u2) ∈ E
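The difference between the strict condition (⇔) and the weaker one (⇒) is easy to see in a small brute-force sketch. This is not the algorithm of Aggarwal Ch 17.2.1; the function name, the `induced` flag and the graph representation (label dict plus edge list, no self-loops) are assumptions made for illustration.

```python
from itertools import combinations, permutations

def is_subgraph_isomorphic(q_labels, q_edges, labels, edges, induced=True):
    """Brute-force test whether the query graph matches a part of G.

    induced=True  uses the strict condition (ii):  edge in Gq <=> edge in G
    induced=False uses the weaker condition:       edge in Gq  => edge in G
    """
    eq = {frozenset(e) for e in q_edges}
    eg = {frozenset(e) for e in edges}
    q_nodes = list(q_labels)
    # Try every way of mapping the query nodes to distinct nodes of G.
    for image in permutations(labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        # (i) matched nodes must have equal labels
        if any(q_labels[v] != labels[m[v]] for v in q_nodes):
            continue
        ok = True
        for v, u in combinations(q_nodes, 2):
            in_q = frozenset((v, u)) in eq
            in_g = frozenset((m[v], m[u])) in eg
            if (in_q and not in_g) or (induced and in_g and not in_q):
                ok = False
                break
        if ok:
            return True
    return False

# Query: path C-C-C. Data graph: triangle of C nodes.
path = ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2)])
triangle = ({0: "C", 1: "C", 2: "C"}, [(0, 1), (1, 2), (0, 2)])
strict = is_subgraph_isomorphic(*path, *triangle, induced=True)   # False
weak = is_subgraph_isomorphic(*path, *triangle, induced=False)    # True
```

The path is not matched under the strict condition because the triangle has an extra edge between the two end nodes; under the weaker condition extra edges in G are allowed.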


Subgraph isomorphism: example

Algorithm: see Aggarwal Ch 17.2.1

Image source: Aggarwal Ch 17

Maximum common subgraph (MCG)

Problem: Given G1 and G2, find G0 = (V0,E0) such that

(i) G0 is a subgraph isomorphism of both G1 and G2 and

(ii) |V0| is as large as possible.

+ useful for comparing graphs

– NP-hard (like subgraph isomorphism)

Algorithm: see Aggarwal Ch 17.2.2
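For intuition, |V0| can be found by exhaustive search on tiny graphs. The sketch below is not the book's algorithm: it assumes the strict (⇔) edge condition, does not require the common subgraph to be connected, and uses a hypothetical function name and graph representation (label dict plus edge list, no self-loops).

```python
from itertools import combinations, permutations

def mcs_size(labels1, edges1, labels2, edges2):
    """Number of nodes in a maximum common subgraph, by exhaustive search."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    # Try the largest possible size first and stop at the first success.
    for k in range(min(len(labels1), len(labels2)), 0, -1):
        for subset in combinations(labels1, k):
            for image in permutations(labels2, k):
                m = dict(zip(subset, image))
                if any(labels1[v] != labels2[m[v]] for v in subset):
                    continue
                # edges between chosen nodes must agree in both graphs
                if all((frozenset((v, u)) in e1) == (frozenset((m[v], m[u])) in e2)
                       for v, u in combinations(subset, 2)):
                    return k
    return 0

# Paths A-B-C and A-B-D share the part A-B, so the size is 2.
g1 = ({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)])
g2 = ({0: "A", 1: "B", 2: "D"}, [(0, 1), (1, 2)])
size = mcs_size(*g1, *g2)
```

The nested loops over node subsets and permutations make the exponential cost visible, which is why this is feasible only for small graphs.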


Next: distances between graphs

Distances based on maximum common subgraph (MCG)

Let's assume graph size = number of nodes, i.e., for G = (V, E) we write |G| = |V|

Let MCS(G1, G2) = maximum common subgraph of G1 and G2 and |MCS(G1, G2)| = its size

1. Unnormalized non-matching measure:

U(G1, G2) = |G1| + |G2| − 2 · |MCS(G1, G2)|

= number of non-matching nodes

Problem: what if the graphs have very different sizes?

Distances based on MCG

2. Union-normalized distance Udist ∈ [0, 1]

Udist(G1, G2) = 1 − |MCS(G1, G2)| / (|G1| + |G2| − |MCS(G1, G2)|)

= number of non-matching nodes normalized by union size

3. Max-normalized distance Mdist ∈ [0, 1]

Mdist(G1, G2) = 1 − |MCS(G1, G2)| / max{|G1|, |G2|}

• metric

MCG-based distances can be computed efficiently only for small graphs!
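Once |MCS(G1, G2)| is known, the three measures are simple arithmetic. A small sketch with made-up sizes (|G1| = 4, |G2| = 6, |MCS| = 3); the function name is an assumption and both graphs are assumed non-empty:

```python
def mcs_distances(size1, size2, mcs):
    """Return (U, Udist, Mdist) from the node counts |G1|, |G2| and |MCS|."""
    u = size1 + size2 - 2 * mcs                 # unnormalized non-matching measure
    udist = 1 - mcs / (size1 + size2 - mcs)     # union-normalized distance
    mdist = 1 - mcs / max(size1, size2)         # max-normalized distance
    return u, udist, mdist

u, udist, mdist = mcs_distances(4, 6, 3)
# u = 4, udist = 1 - 3/7 = 4/7, mdist = 1 - 3/6 = 0.5
```

Note that Udist equals U divided by the union size: (4 + 6 − 2·3) / (4 + 6 − 3) = 4/7.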


Graph edit distance

What is the minimum cost of edit operations to transform G1 into G2?

(i) node insertion

(ii) node deletion (deletes also incident edges)

(iii) edge insertion

(iv) edge deletion

(v) label substitution of nodes

application-specific costs

may be exponentially many possible edit paths!

NP-hard
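A brute-force sketch for tiny graphs, under assumptions that are not from the slides: every operation costs 1, each edge incident to a deleted node is charged as one edge deletion, and the search runs over node mappings instead of explicit edit paths (under these unit costs the cheapest mapping is taken as the edit distance). Names and representation are hypothetical; node identifiers must not collide with the padding tuples.

```python
from itertools import combinations, permutations

def graph_edit_distance(labels1, edges1, labels2, edges2):
    """Exhaustive unit-cost edit distance for very small undirected graphs."""
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with dummy nodes; real <-> dummy = deletion/insertion.
    nodes1 = list(labels1) + [("pad", 1, i) for i in range(n - len(labels1))]
    nodes2 = list(labels2) + [("pad", 2, i) for i in range(n - len(labels2))]
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    best = None
    for perm in permutations(nodes2):
        m = dict(zip(nodes1, perm))
        cost = 0
        for v in nodes1:
            if v in labels1 and m[v] in labels2:
                cost += labels1[v] != labels2[m[v]]   # label substitution
            else:
                cost += 1                             # node insertion or deletion
        for v, u in combinations(nodes1, 2):
            # edge present on exactly one side: one edge insertion or deletion
            cost += (frozenset((v, u)) in e1) != (frozenset((m[v], m[u])) in e2)
        if best is None or cost < best:
            best = cost
    return best

# A-B versus A-B-C: insert node C and edge (B, C), so the distance is 2.
g1 = ({0: "A", 1: "B"}, [(0, 1)])
g2 = ({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)])
d = graph_edit_distance(*g1, *g2)
```

The loop over all permutations reflects the note above: the number of alternatives grows factorially, so application-specific costs and heuristics are needed in practice.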


Graph edit distance: example


Transformation-based distances

Idea: Transform graphs into a new space where distances are easier to calculate

a) Type transport using frequent subgraphs

b) Topological descriptors

c) Kernel similarity


Type transport using frequent subgraphs

1. Search frequent subgraphs (involves an NP-hard subproblem: subgraph isomorphism)

2. Choose subgraphs that don't overlap too much

3. Create new features f1, . . . , fd for the remaining subgraphs

4. Represent graphs in vector space using f1, . . . , fd

fi = number of times the ith subgraph occurs in G (or a binary or tf-idf representation)

5. Use text similarity measures
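After the transformation each graph is just a vector, so a text similarity measure such as cosine similarity applies directly. A minimal sketch; the occurrence counts below are invented for illustration and the function names are assumptions:

```python
import math

def to_binary(counts):
    """Binary representation: 1 if the subgraph occurs at all, else 0."""
    return [1 if c > 0 else 0 for c in counts]

def cosine_similarity(f, g):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(f, g))
    norm = math.sqrt(sum(a * a for a in f)) * math.sqrt(sum(b * b for b in g))
    return dot / norm if norm else 0.0

# Hypothetical counts of d = 3 chosen frequent subgraphs in two graphs.
f_g1 = [2, 0, 1]
f_g2 = [1, 1, 0]
sim_counts = cosine_similarity(f_g1, f_g2)                          # 2 / sqrt(10)
sim_binary = cosine_similarity(to_binary(f_g1), to_binary(f_g2))    # 1 / 2
```

Computing the counts fi is the expensive part (it needs subgraph isomorphism); comparing the resulting vectors is cheap.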


Topological descriptors

Idea: calculate different kinds of indices from graphs ⇒ new numerical features ⇒ use distances for numerical data

structural information lost

utility domain-specific (e.g., good in chemical domain)

e.g., Wiener index:

W(G) = ∑_{v,u ∈ V} d(v, u)

d(v, u) = length of shortest path from v to u

more in Aggarwal Ch 17.3.2
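The Wiener index can be computed with one breadth-first search per node. The sketch assumes a connected, unweighted, undirected graph and sums over unordered node pairs (summing over ordered pairs would give twice the value); the function name and input format are choices made here.

```python
from collections import deque

def wiener_index(nodes, edges):
    """Sum of shortest-path lengths over all unordered node pairs."""
    adj = {v: set() for v in nodes}
    for v, u in edges:
        adj[v].add(u)
        adj[u].add(v)
    total = 0
    for source in adj:
        # BFS gives shortest-path lengths d(source, .) in an unweighted graph
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        if len(dist) != len(adj):
            raise ValueError("graph must be connected")
        total += sum(dist.values())
    return total // 2  # every unordered pair was counted twice

w_path = wiener_index([0, 1, 2], [(0, 1), (1, 2)])        # 1 + 1 + 2 = 4
w_star = wiener_index([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])  # 3*1 + 3*2 = 9
```

Two graphs with the same index can still differ structurally, which is the "structural information lost" caveat above.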


Kernel similarity

Idea:

Assume transformation Φ such that similarity of G1 and G2 can be measured by Φ(G1) · Φ(G2)

Design kernel function K such that K(G1, G2) = Φ(G1) · Φ(G2) and use it as a similarity measure (without computing the transformation explicitly)

e.g. shortest path kernel (O(n^4)) and random walk kernel (O(n^6))

practical for small graphs

more in Aggarwal Ch 17.3.3
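To make the idea K(G1, G2) = Φ(G1) · Φ(G2) concrete, the sketch below uses a deliberately simplified, label-free variant of a shortest path kernel: Φ(G) is the histogram of shortest-path lengths over node pairs, and K is the dot product of two histograms. This simplification, the function names and the input format are assumptions made here, not the kernel definition from the book.

```python
from collections import Counter, deque

def path_length_histogram(nodes, edges):
    """Phi(G): number of unordered node pairs at each shortest-path length."""
    adj = {v: set() for v in nodes}
    for v, u in edges:
        adj[v].add(u)
        adj[u].add(v)
    hist = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        for d in dist.values():
            if d > 0:
                hist[d] += 1
    # each unordered pair was seen from both of its endpoints
    return Counter({d: c // 2 for d, c in hist.items()})

def shortest_path_kernel(g1, g2):
    """K(G1, G2) = Phi(G1) . Phi(G2) for the histogram feature map above."""
    h1, h2 = path_length_histogram(*g1), path_length_histogram(*g2)
    return sum(h1[d] * h2[d] for d in h1)

path = ([0, 1, 2], [(0, 1), (1, 2)])               # Phi = {1: 2, 2: 1}
triangle = ([0, 1, 2], [(0, 1), (1, 2), (0, 2)])   # Phi = {1: 3}
k_pt = shortest_path_kernel(path, triangle)        # 2 * 3 = 6
```

Here Φ is computed explicitly because it is small; the point of a kernel is that K can often be evaluated without ever building Φ.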


Next: frequent subgraph discovery

Frequent subgraph discovery: Motivation

Predict:

anti−HIV activity

toxicity of compounds

binding ability with Anthrax toxin

Image source: https://slideplayer.com/slide/5894097/

Frequent subgraph discovery

Task: Given a graph database, search for frequent subgraphs given a threshold min_fr.

Search idea: utilize monotonicity of frequency!

If G1 is a subgraph of G2, then fr(G1) ≥ fr(G2)

similar algorithms as for frequent itemsets, but more complex

two variants: size of graph may refer to a) number of nodes or b) number of edges ⇒ this determines how new candidates are generated
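Monotonicity can be checked on a toy database. In this sketch fr(G) is taken to be the fraction of database graphs that contain G, and "contains" uses the weaker matching condition (every pattern edge must exist in the data graph); both choices, the function names and the three-graph database are assumptions for illustration.

```python
from itertools import permutations

def occurs_in(pattern, graph):
    """True if the pattern matches a part of the graph (brute force)."""
    p_labels, p_edges = pattern
    labels, edges = graph
    eg = {frozenset(e) for e in edges}
    p_nodes = list(p_labels)
    for image in permutations(labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if (all(p_labels[v] == labels[m[v]] for v in p_nodes)
                and all(frozenset((m[v], m[u])) in eg for v, u in p_edges)):
            return True
    return False

def frequency(pattern, database):
    """fr(pattern): fraction of database graphs in which the pattern occurs."""
    return sum(occurs_in(pattern, g) for g in database) / len(database)

database = [
    ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),   # C-C-O
    ({0: "C", 1: "O"}, [(0, 1)]),                   # C-O
    ({0: "C", 1: "C"}, [(0, 1)]),                   # C-C
]
small = ({"a": "C", "b": "O"}, [("a", "b")])                        # C-O
big = ({"a": "C", "b": "C", "c": "O"}, [("a", "b"), ("b", "c")])    # C-C-O

fr_small = frequency(small, database)   # occurs in 2 of 3 graphs
fr_big = frequency(big, database)       # occurs in 1 of 3 graphs
```

Since the pattern C-O is a subgraph of C-C-O, its frequency cannot be lower; this is what allows pruning every supergraph of an infrequent pattern.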


GraphApriori algorithm

Fi = frequent subgraphs of size i, Ci = candidates

F1 = {G | |G| = 1, fr(G) ≥ min_fr}; i = 1

while Fi ≠ ∅

    generate candidates Ci+1 from Fi

    prune G ∈ Ci+1 if G has a subgraph G′ such that |G′| = i and G′ ∉ Fi (= monotonicity criterion)

    count frequencies fr(G), G ∈ Ci+1

    set Fi+1 = {G ∈ Ci+1 | fr(G) ≥ min_fr}

    i = i + 1

return ∪i Fi

GraphApriori: Candidate generation

For all G1, G2 ∈ Fi, |G1| = |G2| = i

1. determine if G1 and G2 have a common subgraph G0 of size i − 1

may be many isomorphic matchings ⇒ many alternative G0s!

2. for each G0 create candidate graphs of size i + 1

node-based: include all common + 2 non-matching nodes (with extra edge or not)

edge-based: include all i − 1 common edges and 2 unique edges (with extra node or not)

same subgraphs may be generated multiple times ⇒ redundancy checking

Example of node-based join

Image source: Aggarwal Ch 17

Example of edge-based join

Image source: Aggarwal Ch 17

Why is this heavy?

number of candidate patterns may be huge!

subgraph isomorphism to identify pairs of subgraphs for joining

graph isomorphism for redundancy checking

subgraph isomorphism for monotonicity pruning

subgraph isomorphism for frequency counting

Easier if

many unique node labels

only small subgraphs are searched

edge-based join is used (usually fewer candidates)

Next: graph clustering

Distance-based clustering methods

Common approaches:

1. K-medoids (needs just a distance function)

2. Spectral and other graph-based methods

construct a nearest neighbour/similarity graph of graph objects

cluster nodes of the new graph

Remember: graph distance measures are very expensive to compute! → suitable for smaller graphs
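K-medoids needs nothing but pairwise distances, so any graph distance can be plugged in as a precomputed matrix. The sketch below is a simple alternating variant (assign, then re-pick medoids) with given initial medoids; the function names and the 5 × 5 distance matrix are made up for illustration, and objects are assumed to be distinct (positive distances between different objects).

```python
def assign_to_medoids(dist, medoids):
    """Map each medoid index to the list of object indices closest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
    return clusters

def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate assignment and medoid update on a precomputed distance matrix."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign_to_medoids(dist, medoids)
        # New medoid: the member with the smallest total distance to its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign_to_medoids(dist, medoids)

# Hypothetical graph distances: objects 0-2 are similar, objects 3-4 are similar.
D = [
    [0, 1, 2, 8, 9],
    [1, 0, 1, 8, 9],
    [2, 1, 0, 8, 9],
    [8, 8, 8, 0, 1],
    [9, 9, 9, 1, 0],
]
medoids, clusters = k_medoids(D, initial_medoids=[0, 3])
```

With n graphs the matrix needs n(n − 1)/2 distance computations, each of which may itself be NP-hard, so the matrix is usually the bottleneck rather than the clustering loop.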


Methods based on frequent subgraphs

Approach 1. Type transport: graphs → multidimensional

1. Search frequent subgraphs (involves an NP-hard subproblem: subgraph isomorphism)

2. Choose subgraphs that don't overlap too much

3. Create new features f1, . . . , fd for the remaining subgraphs

4. Represent graphs in vector space using f1, . . . , fd

fi = number of times the ith subgraph occurs in G (or a binary or tf-idf representation)

5. Use text clustering methods

Methods based on frequent subgraphs

Approach 2. XProj: cluster representatives = sets of frequent subgraphs

Initialization: Create K random clusters C1, . . . , CK

for all Ci: Fi = set of frequent subgraphs (of a given size) from Ci

repeat until convergence:

assign each Gj to Ci where sim(Gj, Fi) is largest

for all Ci determine new Fi

sim(Gj, Fi) = fraction of frequent graphs in Fi that occur in Gj
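The similarity and the assignment step can be sketched under a strong simplifying assumption that is not part of XProj itself: node labels are unique within each graph, so a graph is just a set of label-pair edges and "occurs in" reduces to a subset test. All names and the toy data are hypothetical.

```python
def edge_set(*pairs):
    """Graph or pattern as a frozenset of undirected label-pair edges."""
    return frozenset(frozenset(p) for p in pairs)

def sim(graph, frequent_patterns):
    """Fraction of the patterns in F_i that occur in the graph G_j."""
    if not frequent_patterns:
        return 0.0
    hits = sum(1 for pattern in frequent_patterns if pattern <= graph)
    return hits / len(frequent_patterns)

def assign(graphs, representatives):
    """For each graph, the index i of the cluster whose F_i is most similar."""
    return [
        max(range(len(representatives)), key=lambda i: sim(g, representatives[i]))
        for g in graphs
    ]

g = edge_set(("A", "B"), ("B", "C"), ("C", "D"))
F1 = [edge_set(("A", "B")), edge_set(("B", "C")), edge_set(("X", "Y"))]
F2 = [edge_set(("X", "Y"))]

s1 = sim(g, F1)              # 2 of the 3 patterns occur in g
labels = assign([g], [F1, F2])
```

Without unique labels the subset test would have to be replaced by a subgraph isomorphism test, and recomputing each Fi in the update step requires a frequent subgraph miner.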


Summary

[Figure: concept map of methods]

Graph matching; mining frequent subgraphs

Distance between graphs: graph matching based methods (MCG distances, graph edit distance) and transformation based methods (type transport, topological descr., kernel similarity)

Graph clustering: distance based methods (K-medoids, methods using similarity graphs) and frequent subgraph based methods (type transport, representatives from frequent subgraphs)

Image sources

Leskovec et al.: Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6, 2008.

Fan & Simeon: Integrity Constraints for XML. Principles of database systems 2000.

Gudes: Graph and Web Mining – Motivation, Applications and Algorithms, Data mining seminar, University of Helsinki 2010.
