Lecture 8: Graph mining (Book Ch 17)

Image source: https://www.jamiesheffield.com/2013/05/concept-mapping-my-protagonists-world.html

MDM course Aalto 2020

Graphs occur everywhere!

Image sources: Leskovec et al. (2009), Fan & Simeon (2000)

Graph mining

Data may consist of

1. multiple small graphs → today

e.g., chemical compounds, biological pathways, program control flows, consumer behaviour, ...; even (HTML) documents can be represented as graphs!

2. one large graph → later

e.g., the internet, a social network

Information to mine: interesting substructures, similarities, communities, clusters


Concept map of methods (this lecture)

[Figure: concept map; boxes are linked by "uses" and "needs" arrows]

Graph matching

Mining frequent subgraphs

Distance between graphs

− graph matching based methods: MCG distances, graph edit distance

− transformation based methods: type transport, topological descriptors, kernel similarity

Graph clustering

− distance based methods: K-medoids, methods using similarity graphs

− frequent subgraph based methods: type transport, representatives from frequent subgraphs


Graph notations

G = (V, E) graph

V = {v1, . . . , vn} = set of vertices or nodes

|V| = number of nodes

node label l(vi)

E = {e1, . . . , em} = set of edges, ei = (v, u), v, u ∈ V

|E| = number of edges

For now, we assume that edges are undirected and have no labels


Graph matching = graph isomorphism

Two graphs G1 = (V1, E1) and G2 = (V2, E2) are matching or isomorphic iff there is a 1:1 correspondence between their nodes such that

(i) Corresponding nodes vi ∈ V1 and vj ∈ V2 have the same labels: l(vi) = l(vj).

(ii) Let [v1, u1] be a node pair in G1 and [v2, u2] the corresponding pair in G2. Then edge (v1, u1) ∈ E1 ⇔ (v2, u2) ∈ E2.

Note: no polynomial-time algorithms are known (except for special cases)
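Conditions (i) and (ii) can be checked by brute force on tiny graphs. The following is a minimal sketch, not an algorithm from the book: it tries every 1:1 correspondence, so its running time grows factorially. The representation (a dict from node to label, edges as a set of frozensets) and the names `undirected` and `matchings` are assumptions made for this illustration.

```python
from itertools import permutations

def undirected(*pairs):
    """Build an undirected edge set from (v, u) pairs."""
    return {frozenset(p) for p in pairs}

def matchings(labels1, edges1, labels2, edges2):
    """Yield every node mapping that satisfies conditions (i) and (ii)."""
    if len(labels1) != len(labels2) or len(edges1) != len(edges2):
        return
    nodes1 = list(labels1)
    for image in permutations(labels2):
        m = dict(zip(nodes1, image))
        if any(labels1[v] != labels2[m[v]] for v in nodes1):
            continue  # condition (i) fails: some labels differ
        if {frozenset(m[v] for v in e) for e in edges1} == edges2:
            yield m   # condition (ii): edges map exactly onto edges

# Toy star graph (an assumption, not the molecules of the slides):
# one centre labelled C with four leaves labelled H.
star_labels = {0: "C", 1: "H", 2: "H", 3: "H", 4: "H"}
star_edges = undirected((0, 1), (0, 2), (0, 3), (0, 4))
print(len(list(matchings(star_labels, star_edges, star_labels, star_edges))))  # 24
```

Matching the star with itself gives 4! = 24 matchings, because the four equally labelled leaves can be permuted freely.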


There can be many matchings!

Two matchings for molecules 1 and 2. In total there are 4! = 24 matchings!

Image source: Aggarwal Ch 17

Subgraph isomorphism

Does a certain query graph Gq match a part of another graph G?

Query graph Gq = (Vq, Eq) is a subgraph isomorphism of G = (V, E) if

(i) For all vq ∈ Vq there is a matching node v ∈ V (distinct for different vq) such that l(vq) = l(v); and

(ii) If [v1, u1] is a pair in Vq and [v2, u2] the matching pair in V, then (v1, u1) ∈ Eq ⇔ (v2, u2) ∈ E.

Sometimes a weaker condition suffices for (ii): (v1, u1) ∈ Eq ⇒ (v2, u2) ∈ E
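The difference between the strict and the weaker version of condition (ii) can be made concrete with a brute-force sketch. This is an illustration for tiny graphs only, not the algorithm of Aggarwal Ch 17.2.1. The graph representation, the function name `subgraph_matchings` and the flag `induced` are assumptions of this sketch.

```python
from itertools import combinations, permutations

def subgraph_matchings(q_labels, q_edges, labels, edges, induced=True):
    """Yield mappings of query nodes to distinct graph nodes.

    induced=True uses the strict condition (edge <=> edge),
    induced=False the weaker one (edge => edge).
    """
    q_nodes = list(q_labels)
    for image in permutations(labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if any(q_labels[v] != labels[m[v]] for v in q_nodes):
            continue  # condition (i) fails
        ok = True
        for v, u in combinations(q_nodes, 2):
            in_q = frozenset((v, u)) in q_edges
            in_g = frozenset((m[v], m[u])) in edges
            if in_q and not in_g:
                ok = False  # a query edge is missing in G
            if induced and in_g and not in_q:
                ok = False  # G has an extra edge between matched nodes
        if ok:
            yield m

# Toy data: a triangle A-B-B and a query path B-A-B.
g_labels = {1: "A", 2: "B", 3: "B"}
g_edges = {frozenset((1, 2)), frozenset((1, 3)), frozenset((2, 3))}
q_labels = {"p": "B", "q": "A", "r": "B"}
q_edges = {frozenset(("p", "q")), frozenset(("q", "r"))}

print(len(list(subgraph_matchings(q_labels, q_edges, g_labels, g_edges))))         # 0
print(len(list(subgraph_matchings(q_labels, q_edges, g_labels, g_edges, False))))  # 2
```

The path matches the triangle only under the weaker condition, because the triangle has an edge between the two B nodes that the query lacks.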


Subgraph isomorphism: example

Algorithm: see Aggarwal Ch 17.2.1


Maximum common subgraph (MCG)

Problem: Given G1 and G2, find G0 = (V0, E0) such that

(i) G0 is a subgraph isomorphism of both G1 and G2 and

(ii) |V0| is as large as possible.

+ useful for comparing graphs

– NP-hard (like subgraph isomorphism)

Algorithm: see Aggarwal Ch 17.2.2
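For intuition only, |V0| can be found on tiny graphs by exhaustive search. The sketch below is not the algorithm of Aggarwal Ch 17.2.2. It assumes the strict (edge ⇔ edge) matching condition and does not require the common subgraph to be connected. The names `embeds` and `mcs_size` are hypothetical.

```python
from itertools import combinations, permutations

def embeds(q_labels, q_edges, labels, edges):
    """True if the query graph matches some part of the other graph (edge <=> edge)."""
    q_nodes = list(q_labels)
    for image in permutations(labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if all(q_labels[v] == labels[m[v]] for v in q_nodes) and all(
            (frozenset((v, u)) in q_edges) == (frozenset((m[v], m[u])) in edges)
            for v, u in combinations(q_nodes, 2)
        ):
            return True
    return False

def mcs_size(labels1, edges1, labels2, edges2):
    """Largest number of nodes of G1 whose induced subgraph also occurs in G2."""
    for k in range(min(len(labels1), len(labels2)), 0, -1):
        for subset in combinations(labels1, k):
            sub_labels = {v: labels1[v] for v in subset}
            sub_edges = {e for e in edges1 if e <= set(subset)}
            if embeds(sub_labels, sub_edges, labels2, edges2):
                return k
    return 0

# Toy data: path A-B-C versus path A-B-D share the part A-B.
l1 = {1: "A", 2: "B", 3: "C"}
e1 = {frozenset((1, 2)), frozenset((2, 3))}
l2 = {"a": "A", "b": "B", "d": "D"}
e2 = {frozenset(("a", "b")), frozenset(("b", "d"))}
print(mcs_size(l1, e1, l2, e2))  # 2
```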


Next to distances

[Figure: the concept map of methods shown again]


Distances based on maximum common subgraph (MCG)

Let's assume graph size = number of nodes, i.e., for G = (V, E) we write |G| = |V|

Let MCS(G1, G2) = maximum common subgraph of G1 and G2 and |MCS(G1, G2)| = its size

1. Unnormalized non-matching measure:

U(G1, G2) = |G1| + |G2| − 2 · |MCS(G1, G2)|

= number of non-matching nodes

Problem: what if graphs have very different sizes?


Distances based on MCG

2. Union-normalized distance Udist ∈ [0, 1]

Udist(G1, G2) = 1 − |MCS(G1, G2)| / (|G1| + |G2| − |MCS(G1, G2)|)

= number of non-matching nodes normalized by union size

3. Max-normalized distance Mdist ∈ [0, 1]

Mdist(G1, G2) = 1 − |MCS(G1, G2)| / max{|G1|, |G2|}

• metric

MCG-based distances can be computed efficiently only for small graphs!
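Once |MCS(G1, G2)| is known, the three measures are simple arithmetic. The sketch below takes the sizes as plain integers; the function names are chosen for this illustration. Computing the MCS size itself is the expensive part and is not shown here.

```python
def unnormalized(n1, n2, mcs):
    """U(G1, G2): number of non-matching nodes."""
    return n1 + n2 - 2 * mcs

def union_normalized(n1, n2, mcs):
    """Udist(G1, G2): non-matching nodes divided by the union size."""
    return 1 - mcs / (n1 + n2 - mcs)

def max_normalized(n1, n2, mcs):
    """Mdist(G1, G2): one minus the MCS share of the larger graph."""
    return 1 - mcs / max(n1, n2)

# Two graphs with 3 nodes each and a common subgraph of 2 nodes.
print(unnormalized(3, 3, 2))      # 2
print(union_normalized(3, 3, 2))  # 0.5

# Very different sizes: a 2-node graph fully contained in a 20-node graph.
print(unnormalized(2, 20, 2))     # 18
```

In the second case the small graph occurs completely inside the large one, yet U = 18, which is the problem that motivates the normalized variants.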


Graph edit distance

What is the minimum total cost of edit operations that transform G1 into G2?

(i) node insertion

(ii) node deletion (also deletes incident edges)

(iii) edge insertion

(iv) edge deletion

(v) label substitution of nodes

costs are application-specific

there may be exponentially many possible edit paths!

NP-hard
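For very small graphs the edit distance can be evaluated by enumerating node mappings instead of edit paths. The sketch below rests on several assumptions that the slides leave open: every operation costs 1, a deleted node takes its incident edges with it at no extra cost, and inserted nodes arrive isolated with the right label. The function name and the graph representation are also assumptions, and the search is exponential.

```python
from itertools import product

def graph_edit_distance(labels1, edges1, labels2, edges2):
    """Brute-force edit distance with unit costs (tiny graphs only)."""
    nodes1 = list(labels1)
    targets = list(labels2) + [None]          # None means: delete this node
    best = None
    for image in product(targets, repeat=len(nodes1)):
        used = [t for t in image if t is not None]
        if len(used) != len(set(used)):       # kept nodes need distinct targets
            continue
        m = dict(zip(nodes1, image))
        cost = image.count(None)              # node deletions
        cost += len(labels2) - len(used)      # node insertions
        cost += sum(1 for v in nodes1         # label substitutions
                    if m[v] is not None and labels1[v] != labels2[m[v]])
        kept = 0
        for e in edges1:
            v, u = tuple(e)
            if m[v] is None or m[u] is None:
                continue                      # edge vanished with its endpoint
            if frozenset((m[v], m[u])) in edges2:
                kept += 1
            else:
                cost += 1                     # edge deletion
        cost += len(edges2) - kept            # edge insertions
        if best is None or cost < best:
            best = cost
    return best

# Path A-B-C versus path A-B-D: one label substitution is enough.
l1 = {1: "A", 2: "B", 3: "C"}
e1 = {frozenset((1, 2)), frozenset((2, 3))}
l2 = {"x": "A", "y": "B", "z": "D"}
e2 = {frozenset(("x", "y")), frozenset(("y", "z"))}
print(graph_edit_distance(l1, e1, l2, e2))  # 1
```

With application-specific costs, the constant 1 in each cost term would be replaced by the corresponding cost value.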


Graph edit distance: example


Transformation-based distances

Idea: Transform graphs into a new space where distances are easier to calculate

a) Type transport using frequent subgraphs

b) Topological descriptors

c) Kernel similarity


Type transport using frequent subgraphs

1. Search frequent subgraphs (uses subgraph isomorphism)

2. Choose subgraphs that don't overlap too much

3. Create new features f1, . . . , fd for the remaining subgraphs

fi = number of times the ith subgraph occurs in G (or a binary or tf-idf representation)

4. Represent graphs in vector space using f1, . . . , fd

5. Use text similarity measures

involves an NP-hard subproblem
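After the transformation each graph is just a vector of feature values, so any text similarity measure applies. The sketch below uses cosine similarity, one common choice for text vectors, not one prescribed by the slides. The count vectors are made-up toy values, and finding the subgraph occurrences behind them is not shown.

```python
import math

def cosine_similarity(f, g):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(f, g))
    norm = math.sqrt(sum(a * a for a in f)) * math.sqrt(sum(b * b for b in g))
    return dot / norm if norm else 0.0

def binarize(f):
    """Binary representation: 1 if the subgraph occurs at all, else 0."""
    return [1 if a > 0 else 0 for a in f]

# Hypothetical occurrence counts of d = 3 chosen subgraphs in two graphs.
f_g1 = [2, 0, 1]
f_g2 = [1, 0, 0]
print(round(cosine_similarity(f_g1, f_g2), 3))                      # 0.894
print(round(cosine_similarity(binarize(f_g1), binarize(f_g2)), 3))  # 0.707
```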


Topological descriptors

Idea: calculate different kinds of indices from graphs ⇒ new numerical features ⇒ use distances for numerical data

structural information lost

utility domain-specific (e.g., good in chemical domain)

e.g., Wiener index:

W(G) = Σ_{v,u ∈ V} d(v, u)

d(v, u) = length of shortest path from v to u

more in Aggarwal Ch 17.3.2
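The Wiener index can be computed with one breadth-first search per node. The sketch below assumes a connected, unweighted graph given as an adjacency dict, and it counts each unordered node pair once; both are conventions chosen for this illustration.

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search: number of edges on a shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist

def wiener_index(adj):
    """Sum of d(v, u) over all unordered node pairs of a connected graph."""
    total = 0
    for v in adj:
        dist = shortest_path_lengths(adj, v)
        total += sum(dist[u] for u in adj if u != v)
    return total // 2  # every pair was visited from both ends

path4 = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
star3 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wiener_index(path4))  # 10
print(wiener_index(star3))  # 9
```

Both toy graphs have four nodes and three edges but different indices, so the single number separates a chain from a star while discarding the rest of the structure.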


Kernel similarity

Idea:

Assume a transformation Φ such that the similarity of G1 and G2 can be measured by Φ(G1) · Φ(G2)

Design a kernel function K such that K(G1, G2) = Φ(G1) · Φ(G2) and use it as a similarity measure (without computing the transformation explicitly)

e.g. shortest path kernel (O(n^4)) and random walk kernel (O(n^6))

practical for small graphs

more in Aggarwal Ch 17.3.3
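To make the relation K(G1, G2) = Φ(G1) · Φ(G2) concrete, here is a strongly simplified path-length kernel. It is an assumption-laden toy, not the shortest path kernel of the book: Φ(G) is taken to be the histogram of shortest-path lengths over unordered node pairs, node labels are ignored, and Φ is computed explicitly.

```python
from collections import Counter, deque

def path_length_histogram(adj):
    """Phi(G): number of unordered node pairs at each shortest-path distance."""
    hist = Counter()
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        for d in dist.values():
            if d > 0:
                hist[d] += 1
    # each pair was counted once from either endpoint
    return Counter({d: c // 2 for d, c in hist.items()})

def path_length_kernel(adj1, adj2):
    """K(G1, G2) = Phi(G1) . Phi(G2) for the histogram feature map above."""
    h1, h2 = path_length_histogram(adj1), path_length_histogram(adj2)
    return sum(h1[d] * h2[d] for d in h1)

path3 = {1: [2], 2: [1, 3], 3: [2]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(path_length_kernel(path3, path3))        # 5
print(path_length_kernel(path3, triangle))     # 6
print(path_length_kernel(triangle, triangle))  # 9
```

Raw kernel values grow with graph size, so in practice they are often normalized, e.g. by dividing by the square root of K(G1, G1) · K(G2, G2).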


Next to frequent subgraph discovery

[Figure: the concept map of methods shown again]


Frequent subgraph discovery: Motivation

Predict:

anti-HIV activity

toxicity of compounds

binding ability with Anthrax toxin

Image source: https://slideplayer.com/slide/5894097/

Frequent subgraph discovery

Task: Given a graph database, search for frequent subgraphs given a threshold min_fr.

Search idea: utilize monotonicity of frequency!

If G1 is a subgraph of G2, then fr(G1) ≥ fr(G2)

the algorithms are similar to those for frequent itemsets, but more complex

two variants: size of graph may refer to a) number of nodes b) number of edges ⇒ this determines how new candidates are generated


GraphApriori algorithm

Fi = frequent subgraphs of size i, Ci = candidates

F1 = {G | |G| = 1, fr(G) ≥ min_fr}; i = 1

while Fi ≠ ∅

  generate candidates Ci+1 from Fi

  prune G ∈ Ci+1 if G has a subgraph G′ such that |G′| = i and G′ ∉ Fi (= monotonicity criterion)

  count frequencies fr(G), G ∈ Ci+1

  set Fi+1 = {G ∈ Ci+1 | fr(G) ≥ min_fr}

  i = i + 1

return ∪i Fi
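The level-wise loop can be sketched in runnable form under one strong simplifying assumption: every node label occurs at most once per graph. Then a labelled edge identifies itself, a pattern is just a set of labelled edges, and all isomorphism tests reduce to subset tests. This is therefore not the general GraphApriori. The sketch uses the edge-based size, requires patterns to be connected, and all function names are hypothetical.

```python
from itertools import combinations

def edge_set(*pairs):
    """A graph (or pattern) as a set of edges between uniquely labelled nodes."""
    return frozenset(frozenset(p) for p in pairs)

def is_connected(pattern):
    """True if the edges of a non-empty pattern form one connected piece."""
    edges = list(pattern)
    reached, remaining = set(edges[0]), edges[1:]
    changed = True
    while changed:
        changed = False
        for e in list(remaining):
            if e & reached:
                reached |= e
                remaining.remove(e)
                changed = True
    return not remaining

def frequency(pattern, database):
    """fr(G): fraction of database graphs that contain the pattern."""
    return sum(pattern <= g for g in database) / len(database)

def graph_apriori(database, min_fr):
    singles = {frozenset([e]) for g in database for e in g}
    level = {p for p in singles if frequency(p, database) >= min_fr}   # F1
    frequent, i = set(level), 1
    while level:
        # join two frequent patterns of size i that share i - 1 edges
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == i + 1 and is_connected(a | b)}
        # monotonicity: every connected sub-pattern of size i must be frequent
        candidates = {c for c in candidates
                      if all(c - {e} in level
                             for e in c if is_connected(c - {e}))}
        level = {c for c in candidates if frequency(c, database) >= min_fr}
        frequent |= level
        i += 1
    return frequent

database = [edge_set("AB", "BC"), edge_set("AB", "BC", "CD"), edge_set("AB", "CD")]
result = graph_apriori(database, min_fr=0.6)
print(len(result))  # 4: the edges AB, BC, CD and the path A-B-C
```

Without the unique-label assumption, the join, the redundancy check, the pruning and the counting each need (sub)graph isomorphism tests.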


GraphApriori: Candidate generation

For all G1, G2 ∈ Fi, |G1| = |G2| = i

1. determine if G1 and G2 have a common subgraph G0 of size i − 1

there may be many isomorphic matchings ⇒ many alternative G0s!

2. for each G0 create candidate graphs of size i + 1

node-based: include all common nodes + 2 non-matching nodes (with an extra edge or not)

edge-based: include all i − 1 common edges and 2 unique edges (with an extra node or not)

the same subgraphs may be generated multiple times ⇒ redundancy checking


Example of node-based join


Example of edge-based join


Why is this heavy?

number of candidate patterns may be huge!

subgraph isomorphism to identify pairs of subgraphs for joining

graph isomorphism for redundancy checking

subgraph isomorphism for monotonicity pruning

subgraph isomorphism for frequency counting

Easier if

many unique node labels

only small subgraphs are searched

edge-based join is used (usually fewer candidates)

Next to graph clustering

[Figure: the concept map of methods shown again]


Distance-based clustering methods

Common approaches:

1. K-medoids (needs just a distance function)

2. Spectral and other graph-based methods

construct a nearest neighbour/similarity graph of graph objects

cluster nodes of the new graph

Remember: graph distance measures are very expensive to compute! → suitable for smaller graphs
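K-medoids needs nothing but pairwise distances, so any of the graph distances can be plugged in. The sketch below is a simple alternating variant. It assumes the distances between the graph objects have already been computed into a matrix and that the initial medoids are given; the toy matrix is invented.

```python
def assign(dist, medoids):
    """Attach every object to its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for x in range(len(dist)):
        clusters[min(medoids, key=lambda m: dist[x][m])].append(x)
    return clusters

def k_medoids(dist, medoids, max_iter=100):
    """dist: symmetric matrix (list of lists); medoids: initial medoid indices."""
    medoids = list(medoids)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # new medoid = the member with the smallest total distance to its cluster
        new = [min(members, key=lambda c: sum(dist[c][x] for x in members))
               if members else m
               for m, members in clusters.items()]
        if set(new) == set(medoids):
            break
        medoids = new
    return assign(dist, medoids)

# Hypothetical precomputed distances between four graphs.
dist = [[0, 1, 4, 5],
        [1, 0, 3, 4],
        [4, 3, 0, 1],
        [5, 4, 1, 0]]
print(k_medoids(dist, medoids=[0, 1]))  # {0: [0, 1], 2: [2, 3]}
```

Each medoid is itself one of the graphs, so no averaging of graphs is needed, but filling the matrix already costs one expensive distance computation per pair.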


Methods based on frequent subgraphs

Approach 1. Type transport: graphs → multidimensional

1. Search frequent subgraphs (uses subgraph isomorphism)

2. Choose subgraphs that don't overlap too much

3. Create new features f1, . . . , fd for the remaining subgraphs

fi = number of times the ith subgraph occurs in G (or a binary or tf-idf representation)

4. Represent graphs in vector space using f1, . . . , fd

5. Use text clustering methods

involves an NP-hard subproblem


Methods based on frequent subgraphs

Approach 2. XProj: cluster representatives = sets of frequent subgraphs

Initialization: Create K random clusters C1, . . . , CK

for all Ci: Fi = set of frequent subgraphs (of a given size) from Ci

repeat until convergence:

  assign each Gj to the Ci for which sim(Gj, Fi) is largest

  for all Ci determine new Fi

sim(Gj, Fi) = fraction of frequent subgraphs in Fi that occur in Gj
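The iteration can be sketched under heavy simplifications that are assumptions of this example, not part of XProj itself: node labels are unique within a graph, the "frequent subgraphs of a given size" are single edges, and the initial clusters are passed in rather than drawn at random. All names are hypothetical.

```python
def edge_set(*pairs):
    """A graph as a set of edges between uniquely labelled nodes."""
    return frozenset(frozenset(p) for p in pairs)

def frequent_edges(cluster, min_fr):
    """Fi: edges that occur in at least a min_fr fraction of the cluster."""
    edges = {e for g in cluster for e in g}
    return {e for e in edges
            if sum(e in g for g in cluster) / len(cluster) >= min_fr}

def sim(graph, representatives):
    """Fraction of the representative patterns that occur in the graph."""
    if not representatives:
        return 0.0
    return sum(e in graph for e in representatives) / len(representatives)

def xproj_sketch(graphs, assignment, k, min_fr, max_iter=20):
    assignment = list(assignment)
    for _ in range(max_iter):
        reps = []
        for i in range(k):
            cluster = [g for g, a in zip(graphs, assignment) if a == i]
            reps.append(frequent_edges(cluster, min_fr) if cluster else set())
        new = [max(range(k), key=lambda i: sim(g, reps[i])) for g in graphs]
        if new == assignment:
            break  # converged: no graph changed its cluster
        assignment = new
    return assignment

graphs = [edge_set("AB", "BC"), edge_set("AB", "BC", "CD"),
          edge_set("XY", "YZ"), edge_set("XY", "YZ", "ZW")]
print(xproj_sketch(graphs, assignment=[0, 0, 0, 1], k=2, min_fr=0.6))  # [0, 0, 1, 1]
```

In the real method the representatives are larger frequent subgraphs, so both updating Fi and evaluating sim involve subgraph isomorphism.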


Summary

[Figure: the concept map of methods shown again]


Image sources

Leskovec et al.: Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6, 2008.

Fan & Simeon: Integrity Constraints for XML. Principles of database systems 2000.

Gudes: Graph and Web Mining – Motivation, Applications and Algorithms. Data mining seminar, University of Helsinki 2010.