Lecture 8: Graph mining (Book Ch 17)
Image source: https://www.jamiesheffield.com/2013/05/concept-mapping-my-protagonists-world.html
MDM course Aalto 2020
Graphs occur everywhere!
Image sources: Leskovec et al. (2009), Fan & Simeon (2000)
Graph mining
Data may consist of
1. multiple small graphs → today
   e.g., chemical compounds, biological pathways, program control flows, consumer behaviour, ...; even (HTML) documents can be represented as graphs!
2. one large graph → later
   e.g., the internet, a social network
Information to mine: interesting substructures, similarities,
communities, clusters
Concept map of methods (this lecture)
[Figure: concept map. Boxes and their contents:
 Graph matching
 Mining frequent subgraphs
 Distance between graphs
   Graph matching based methods: MCG distances, graph edit distance
   Transformation based methods: type transport, topological descriptors, kernel similarity
 Graph clustering
   Distance based methods: K-medoids, methods using similarity graphs
   Frequent subgraph based methods: type transport, representatives from frequent subgraphs
 Arrows labelled "uses" and "needs" connect the boxes.]
Graph notations
G = (V, E): graph
V = {v1, ..., vn}: set of vertices (nodes)
|V| = number of nodes
l(vi) = label of node vi
E = {e1, ..., em}: set of edges, ei = (v, u) with v, u ∈ V
|E| = number of edges
For now we assume that edges are undirected and have no labels.
Graph matching = graph isomorphism
Two graphs G1 = (V1, E1) and G2 = (V2, E2) are matching, or isomorphic, iff there is a 1:1 correspondence between their nodes such that
(i) Corresponding nodes vi ∈ V1 and vj ∈ V2 have the same labels: l(vi) = l(vj).
(ii) Let [v1, u1] be a node pair in G1 and [v2, u2] the corresponding pair in G2. Then (v1, u1) ∈ E1 ⇔ (v2, u2) ∈ E2.
Note: no polynomial-time algorithm is known (except for special cases).
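Conditions (i) and (ii) can be checked directly on very small graphs by trying every 1:1 correspondence. The sketch below is a brute-force illustration only, not the algorithm from the book; the representation (a dict from node to label plus a list of undirected edges) and the function name `matchings` are assumptions made for this example.

```python
from itertools import permutations

def matchings(labels1, edges1, labels2, edges2):
    """Return every 1:1 node correspondence satisfying (i) and (ii).

    Brute force over all permutations, so usable only for tiny graphs.
    """
    nodes1, nodes2 = list(labels1), list(labels2)
    if len(nodes1) != len(nodes2):
        return []
    e2 = {frozenset(e) for e in edges2}
    found = []
    for perm in permutations(nodes2):
        m = dict(zip(nodes1, perm))
        # (i) corresponding nodes must carry the same label
        if any(labels1[v] != labels2[m[v]] for v in nodes1):
            continue
        # (ii) the image of G1's edge set must equal G2's edge set
        if {frozenset((m[a], m[b])) for a, b in edges1} == e2:
            found.append(m)
    return found

# A path A-B-A matches itself in two ways (the two end nodes can be swapped).
path_labels = {1: "A", 2: "B", 3: "A"}
path_edges = [(1, 2), (2, 3)]
print(len(matchings(path_labels, path_edges, path_labels, path_edges)))
```

With four identically labelled, fully connected nodes every permutation is a matching, so the same function returns 4! = 24 correspondences.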
There can be many matchings!
Two matchings for molecules 1 and 2. In total there are 4! = 24 matchings!
Image source: Aggarwal Ch 17
Subgraph isomorphism
Does a certain query graph Gq match a part of another graph G?
Query graph Gq = (Vq, Eq) is a subgraph isomorphism of G = (V, E), if
(i) For every vq ∈ Vq there is a distinct matching node v ∈ V such that l(vq) = l(v); and
(ii) If [v1, u1] is a pair in Vq and [v2, u2] the matching pair in V, then (v1, u1) ∈ Eq ⇔ (v2, u2) ∈ E.
Sometimes a weaker condition suffices for (ii): (v1, u1) ∈ Eq ⇒ (v2, u2) ∈ E.
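The difference between the strict condition (ii) and the weaker one is easiest to see on a tiny example. The following is a brute-force sketch, not the algorithm of Aggarwal Ch 17.2.1; the graph representation (label dict plus edge list) and the flag name `strict` are assumptions of this example.

```python
from itertools import combinations, permutations

def is_subgraph_isomorphic(q_labels, q_edges, labels, edges, strict=True):
    """Check whether the query graph matches a part of the larger graph.

    strict=True  uses (v1,u1) in Eq <=> (v2,u2) in E,
    strict=False uses the weaker (v1,u1) in Eq  => (v2,u2) in E.
    """
    q_nodes = list(q_labels)
    eq = {frozenset(e) for e in q_edges}
    eg = {frozenset(e) for e in edges}
    # Try every assignment of distinct graph nodes to the query nodes.
    for image in permutations(labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if any(q_labels[v] != labels[m[v]] for v in q_nodes):
            continue
        ok = True
        for a, b in combinations(q_nodes, 2):
            in_q = frozenset((a, b)) in eq
            in_g = frozenset((m[a], m[b])) in eg
            violated = (in_q != in_g) if strict else (in_q and not in_g)
            if violated:
                ok = False
                break
        if ok:
            return True
    return False

triangle = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (1, 3)])
path = ({"x": "A", "y": "B", "z": "C"}, [("x", "y"), ("y", "z")])
print(is_subgraph_isomorphic(*path, *triangle, strict=True))   # False
print(is_subgraph_isomorphic(*path, *triangle, strict=False))  # True
```

The path A-B-C fails the strict test against the triangle because the triangle has an extra A-C edge, but it passes the weaker test, which only requires the query's edges to be present.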
Subgraph isomorphism: example
Algorithm: see Aggarwal Ch 17.2.1
Image source: Aggarwal Ch 17
Maximum common subgraph (MCG)
Problem: Given G1 and G2, find G0 = (V0, E0) such that
(i) G0 is a subgraph isomorphism of both G1 and G2 and
(ii) |V0| is as large as possible.
+ useful for comparing graphs
– NP-hard (like subgraph isomorphism)
Algorithm: see Aggarwal Ch 17.2.2
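For very small graphs the size |V0| can be found by exhaustive search: try the largest node subsets first and stop at the first one that matches. This is only an illustrative sketch, not the algorithm of Aggarwal Ch 17.2.2. It assumes the strict (⇔) edge condition and does not require the common subgraph to be connected; the function name `mcs_size` is made up for this example.

```python
from itertools import combinations, permutations

def mcs_size(labels1, edges1, labels2, edges2):
    """Number of nodes in a maximum common subgraph (exhaustive search)."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    for k in range(min(len(labels1), len(labels2)), 0, -1):
        for part in combinations(labels1, k):          # k nodes of G1
            for image in permutations(labels2, k):     # k distinct nodes of G2
                m = dict(zip(part, image))
                if any(labels1[v] != labels2[m[v]] for v in part):
                    continue
                # Edge present in G1 <=> edge present between the images in G2.
                if all((frozenset((a, b)) in e1) == (frozenset((m[a], m[b])) in e2)
                       for a, b in combinations(part, 2)):
                    return k
    return 0

path = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])
triangle = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (1, 3)])
print(mcs_size(*path, *triangle))  # 2
```

The path and the triangle share the labelled edge A-B (two nodes); all three nodes cannot be matched because the A-C edge exists only in the triangle.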
Next to distances
[Figure: concept map of methods, repeated from the overview slide]
Distances based on maximum common subgraph (MCG)
Let's take graph size = number of nodes, i.e., for G = (V, E) write |G| = |V|.
Let MCS(G1, G2) = maximum common subgraph of G1 and G2, and |MCS(G1, G2)| = its size.
1. Unnormalized non-matching measure:
\[ U(G_1, G_2) = |G_1| + |G_2| - 2 \cdot |MCS(G_1, G_2)| \]
= number of non-matching nodes
Problem: what if the graphs have very different sizes?
Distances based on MCG
2. Union-normalized distance Udist ∈ [0, 1]
\[ Udist(G_1, G_2) = 1 - \frac{|MCS(G_1, G_2)|}{|G_1| + |G_2| - |MCS(G_1, G_2)|} \]
= number of non-matching nodes normalized by union size
3. Max-normalized distance Mdist ∈ [0, 1]
\[ Mdist(G_1, G_2) = 1 - \frac{|MCS(G_1, G_2)|}{\max\{|G_1|, |G_2|\}} \]
• metric
MCG-based distances can be computed efficiently only for small graphs!
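Once |MCS(G1, G2)| is known, the three measures are simple arithmetic. The sketch below takes the graph sizes and the MCS size as plain numbers (finding the MCS itself is the expensive part and is not shown); the function name and the example sizes are made up for illustration.

```python
def mcs_distances(size1, size2, mcs):
    """Return (U, Udist, Mdist) for two non-empty graphs.

    size1, size2: numbers of nodes |G1| and |G2|
    mcs:          number of nodes in their maximum common subgraph
    """
    u = size1 + size2 - 2 * mcs
    udist = 1 - mcs / (size1 + size2 - mcs)
    mdist = 1 - mcs / max(size1, size2)
    return u, udist, mdist

# Two small graphs sharing half of their nodes ...
print(mcs_distances(4, 4, 2))       # U = 4, Udist = 0.667, Mdist = 0.5
# ... and two large, almost identical graphs.
print(mcs_distances(100, 100, 98))  # U = 4, Udist = 0.039, Mdist = 0.02
```

Both pairs have U = 4 non-matching nodes, yet the first pair is clearly less similar than the second. The normalized variants make that difference visible.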
Graph edit distance
What is the minimum cost of edit operations needed to transform G1 into G2?
(i) node insertion
(ii) node deletion (deletes also incident edges)
(iii) edge insertion
(iv) edge deletion
(v) label substitution of nodes
application-specific costs
there may be exponentially many possible edit paths!
NP-hard
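For tiny graphs the minimum can be found by enumerating every way of mapping the nodes of G1 to nodes of G2 (or to "deleted") and pricing the edits each mapping implies. The sketch below assumes unit costs for all five operations and, unlike operation (ii) above, charges the incident edges of a deleted node as separate edge deletions. Both choices are assumptions of this example, as is the function name.

```python
def graph_edit_distance(labels1, edges1, labels2, edges2):
    """Exhaustive graph edit distance with unit costs (tiny graphs only)."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    nodes1 = list(labels1)

    def cost(mapping):
        c = 0
        for v, w in mapping.items():
            if w is None:
                c += 1                          # node deletion
            elif labels1[v] != labels2[w]:
                c += 1                          # label substitution
        mapped = {w for w in mapping.values() if w is not None}
        c += len(labels2) - len(mapped)         # node insertions
        kept = 0                                # edges preserved by the mapping
        for a, b in (tuple(e) for e in e1):
            ma, mb = mapping[a], mapping[b]
            if ma is not None and mb is not None and frozenset((ma, mb)) in e2:
                kept += 1
        return c + (len(e1) - kept) + (len(e2) - kept)  # edge deletions + insertions

    def search(i, mapping, free):
        if i == len(nodes1):
            return cost(mapping)
        v = nodes1[i]
        best = search(i + 1, {**mapping, v: None}, free)   # delete v
        for w in free:                                      # or map v to a free node
            best = min(best, search(i + 1, {**mapping, v: w}, free - {w}))
        return best

    return search(0, {}, frozenset(labels2))

path = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])
triangle = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3), (1, 3)])
print(graph_edit_distance(*path, *triangle))  # 1 (insert the edge A-C)
```

The number of mappings grows exponentially with graph size, which is one way to see why exact computation is limited to small graphs.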
Graph edit distance: example
Transformation-based distances
Idea: Transform graphs into a new space where distances are easier to calculate
a) Type transport using frequent subgraphs
b) Topological descriptors
c) Kernel similarity
Type transport using frequent subgraphs
1. Search frequent subgraphs (uses subgraph isomorphism; involves an NP-hard subproblem)
2. Choose subgraphs that don't overlap too much
3. Create new features f1, ..., fd for the remaining subgraphs
4. Present graphs in vector space using f1, ..., fd
   fi = number of times the ith subgraph occurs in G (or a binary or tf-idf representation)
5. Use text similarity measures
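The vector-space step can be illustrated with the simplest possible subgraphs: single labelled edges. In the sketch below each graph becomes a count vector over edge types, and two graphs are compared with cosine similarity, a common text similarity measure. Using edge types as features is a stand-in for properly mined, non-overlapping frequent subgraphs, and all names here are assumptions of the example.

```python
from collections import Counter
from math import sqrt

def edge_type_counts(labels, edges):
    """f_i = number of times the i-th edge type (label pair) occurs in G."""
    return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)

def cosine(f, g):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(f[t] * g[t] for t in f.keys() & g.keys())
    norm = sqrt(sum(x * x for x in f.values())) * sqrt(sum(x * x for x in g.values()))
    return dot / norm if norm else 0.0

g1 = edge_type_counts({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)])
g2 = edge_type_counts({1: "C", 2: "O"}, [(1, 2)])
g3 = edge_type_counts({1: "N", 2: "N"}, [(1, 2)])
print(round(cosine(g1, g2), 3))  # 0.707: they share the C-O edge type
print(cosine(g1, g3))            # 0.0: no shared edge type
```

With larger subgraphs as features, counting occurrences requires subgraph isomorphism tests, which is where the NP-hard subproblem enters.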
Topological descriptors
Idea: calculate different kinds of indices from graphs ⇒ new numerical features ⇒ use distances for numerical data
structural information is lost
utility is domain-specific (e.g., good in the chemical domain)
e.g., Wiener index:
\[ W(G) = \sum_{v, u \in V} d(v, u) \]
d(v, u) = length of the shortest path from v to u
more in Aggarwal Ch 17.3.2
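The Wiener index can be computed with one breadth-first search per node. The sketch below assumes an unweighted, undirected, connected graph and counts each unordered node pair once; pairs with no connecting path would simply be left out of the sum. The function name is chosen for this example.

```python
from collections import deque

def wiener_index(nodes, edges):
    """Sum of shortest-path lengths over all unordered node pairs."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    total = 0
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
    return total // 2                     # every pair was counted from both ends

path4 = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
star4 = ([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)])
print(wiener_index(*path4))  # 10
print(wiener_index(*star4))  # 9
```

Both graphs have four nodes and three edges, yet the index separates the stretched-out path from the compact star. Two structurally different graphs can still share the same index, which is the information loss noted above.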
Kernel similarity
Idea:
Assume a transformation Φ such that the similarity of G1 and G2 can be measured by Φ(G1) · Φ(G2)
Design a kernel function K such that K(G1, G2) = Φ(G1) · Φ(G2) and use it as a similarity measure (without computing the transformation)
e.g. shortest path kernel (O(n^4)) and random walk kernel (O(n^6))
practical for small graphs
more in Aggarwal Ch 17.3.3
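To make the identity K(G1, G2) = Φ(G1) · Φ(G2) concrete, the sketch below uses a deliberately simplified, label-free variant of a shortest-path kernel: Φ(G) is the histogram of shortest-path lengths, and K is the dot product of two histograms. Here Φ is small enough to compute explicitly, which a real kernel method would avoid. The simplification and all names are assumptions of this example, not the kernel defined in the book.

```python
from collections import Counter, deque

def path_length_histogram(nodes, edges):
    """Phi(G): number of unordered node pairs at each shortest-path distance."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    hist = Counter()
    for source in nodes:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        hist.update(d for d in dist.values() if d > 0)
    return Counter({d: c // 2 for d, c in hist.items()})  # pairs counted twice

def kernel(g1, g2):
    """K(G1, G2) = Phi(G1) . Phi(G2) for the histogram features above."""
    h1, h2 = path_length_histogram(*g1), path_length_histogram(*g2)
    return sum(h1[d] * h2[d] for d in h1)

path4 = ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
star4 = ([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4)])
print(path_length_histogram(*path4))  # distances 1, 2, 3 occur 3, 2, 1 times
print(kernel(path4, star4))           # 3*3 + 2*3 = 15
```

The raw value is not normalized, so it grows with graph size; dividing by sqrt(K(G1, G1) · K(G2, G2)) is one common way to obtain a similarity in [0, 1].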
Next to frequent subgraph discovery
[Figure: concept map of methods, repeated from the overview slide]
Frequent subgraph discovery: Motivation
Predict:
anti-HIV activity
toxicity of compounds
binding ability with Anthrax toxin
Image source: https://slideplayer.com/slide/5894097/
Frequent subgraph discovery
Task: Given a graph database, search for frequent subgraphs given a threshold min_fr.
Search idea: utilize the monotonicity of frequency!
If G1 is a subgraph of G2, then fr(G1) ≥ fr(G2)
algorithms are similar to those for frequent itemsets, but more complex
two variants: the size of a graph may refer to a) the number of nodes or b) the number of edges ⇒ this determines how new candidates are generated
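The monotonicity property can be checked on a toy database. The sketch below defines fr(G) as the fraction of database graphs that contain the pattern, and uses a brute-force containment test with the weaker (⇒) edge condition. The database, the patterns, and the function names are all made up for illustration.

```python
from itertools import permutations

def occurs_in(pattern, graph):
    """True if the pattern's labelled nodes and edges can be embedded in graph."""
    (p_labels, p_edges), (g_labels, g_edges) = pattern, graph
    g_edge_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_labels, len(p_labels)):
        m = dict(zip(p_labels, image))
        if (all(p_labels[v] == g_labels[m[v]] for v in p_labels)
                and all(frozenset((m[a], m[b])) in g_edge_set for a, b in p_edges)):
            return True
    return False

def frequency(pattern, database):
    """fr(pattern): fraction of database graphs containing the pattern."""
    return sum(occurs_in(pattern, g) for g in database) / len(database)

database = [
    ({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),   # C-C-O
    ({1: "C", 2: "O"}, [(1, 2)]),                   # C-O
    ({1: "N", 2: "N"}, [(1, 2)]),                   # N-N
]
small = ({"a": "C", "b": "O"}, [("a", "b")])                        # C-O
large = ({"a": "C", "b": "C", "c": "O"}, [("a", "b"), ("b", "c")])  # C-C-O

print(frequency(small, database), frequency(large, database))  # 2/3 and 1/3
```

Because `small` is a subgraph of `large`, every graph containing `large` also contains `small`, so fr(small) can never be lower. This is what allows a search to discard all supergraphs of an infrequent pattern.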
GraphApriori algorithm
Fi = frequent subgraphs of size i, Ci = candidates
F1 = {G | |G| = 1, fr(G) ≥ min_fr}; i = 1
while Fi ≠ ∅
    generate candidates Ci+1 from Fi
    prune G ∈ Ci+1 if G has a subgraph G′ such that |G′| = i and G′ ∉ Fi (= monotonicity criterion)
    count frequencies fr(G), G ∈ Ci+1
    set Fi+1 = {G ∈ Ci+1 | fr(G) ≥ min_fr}
    i = i + 1
return ∪i Fi
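The control flow of the loop is independent of what a "pattern" is, so it can be sketched with the graph-specific steps passed in as functions. In the sketch below, `generate`, `subpatterns` and `frequency` are hypothetical callables, and the demo plugs in itemsets as a stand-in for graphs. For real graphs those callables would need the join and the isomorphism tests, and membership in Fi would be an isomorphism check rather than a hash lookup.

```python
from itertools import combinations

def levelwise(f1, generate, subpatterns, frequency, min_fr):
    """Apriori-style level-wise search over hashable patterns."""
    frequent, level = set(f1), set(f1)
    while level:
        candidates = generate(level)
        # Monotonicity pruning: all sub-patterns one size smaller must be frequent.
        candidates = {c for c in candidates
                      if all(s in level for s in subpatterns(c))}
        level = {c for c in candidates if frequency(c) >= min_fr}
        frequent |= level
    return frequent

# Toy stand-in: itemsets instead of graphs.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]

def frequency(c):
    return sum(c <= t for t in db) / len(db)

def generate(level):
    size = len(next(iter(level))) + 1
    return {a | b for a in level for b in level if len(a | b) == size}

def subpatterns(c):
    return [frozenset(s) for s in combinations(c, len(c) - 1)]

min_fr = 0.5
f1 = {frozenset([x]) for x in "abc" if frequency(frozenset([x])) >= min_fr}
result = levelwise(f1, generate, subpatterns, frequency, min_fr)
print(sorted("".join(sorted(p)) for p in result))  # ['a', 'ab', 'ac', 'b', 'c']
```

The candidate {a, b, c} is never counted: its sub-pattern {b, c} is infrequent, so the monotonicity criterion prunes it first.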
GraphApriori: Candidate generation
For all G1, G2 ∈ Fi, |G1| = |G2| = i
1. determine whether G1 and G2 have a common subgraph G0 of size i − 1
   there may be many isomorphic matchings ⇒ many alternative G0s!
2. for each G0 create candidate graphs of size i + 1
   node-based: include all common nodes + 2 non-matching nodes (with or without an extra edge)
   edge-based: include all i − 1 common edges and 2 unique edges (with or without an extra node)
the same subgraph may be generated multiple times ⇒ redundancy checking
Example of node-based join
Image source: Aggarwal Ch 17
Example of edge-based join
Image source: Aggarwal Ch 17
Why is this heavy?
number of candidate patterns may be huge!
subgraph isomorphism to identify pairs of subgraphs for joining
graph isomorphism for redundancy checking
subgraph isomorphism for monotonicity pruning
subgraph isomorphism for frequency counting
Easier if
many unique node labels
only small subgraphs are searched
edge-based join is used (usually fewer candidates)
Next to graph clustering
[Figure: concept map of methods, repeated from the overview slide]
Distance-based clustering methods
Common approaches:
1. K-medoids (needs just a distance function)
2. Spectral and other graph-based methods
   construct a nearest neighbour/similarity graph of the graph objects
   cluster the nodes of the new graph
Remember: graph distance measures are very expensive to compute! → suitable for smaller graphs
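K-medoids only ever looks up distances between objects, so it can run on a precomputed matrix of graph distances. The sketch below is a basic alternating variant (assign, then re-pick each medoid). The distance values are invented, and the initial medoids are passed in explicitly instead of being drawn at random so that the run is reproducible.

```python
def assign(dist, medoids):
    """Label every object with its nearest medoid."""
    return [min(medoids, key=lambda m: dist[i][m]) for i in range(len(dist))]

def k_medoids(dist, medoids, max_iter=100):
    """Alternate assignment and medoid update on a distance matrix."""
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for m in medoids:
            members = [i for i, lab in enumerate(labels) if lab == m]
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda c: sum(dist[c][j] for j in members))
                           if members else m)
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)

# Made-up pairwise graph distances (e.g., Mdist values) for four graphs.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
medoids, labels = k_medoids(dist, medoids=[0, 1])
print(medoids, labels)
```

Even with a poor start (both initial medoids in the same group) the run ends with graphs 0 and 1 in one cluster and graphs 2 and 3 in the other. Filling the matrix is the costly part, since each entry needs an MCS or edit-distance computation.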
Methods based on frequent subgraphs
Approach 1. Type transport: graphs → multidimensional
1. Search frequent subgraphs (uses subgraph isomorphism; involves an NP-hard subproblem)
2. Choose subgraphs that don't overlap too much
3. Create new features f1, ..., fd for the remaining subgraphs
4. Present graphs in vector space using f1, ..., fd
   fi = number of times the ith subgraph occurs in G (or a binary or tf-idf representation)
5. Use text clustering methods
Methods based on frequent subgraphs
Approach 2. XProj: cluster representatives = sets of frequent subgraphs
Initialization: Create K random clusters C1, ..., CK
for all Ci: Fi = set of frequent subgraphs (of a given size) from Ci
repeat until convergence:
    assign each Gj to the Ci for which sim(Gj, Fi) is largest
    for all Ci determine new Fi
sim(Gj, Fi) = fraction of frequent subgraphs in Fi that occur in Gj
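The alternating structure of this approach can be sketched with single labelled edges as the "subgraphs of a given size", which avoids a full frequent subgraph miner. That simplification, the per-cluster threshold `min_fr`, and passing the initial assignment explicitly instead of drawing it at random are all assumptions of this example; it is not the published XProj implementation.

```python
from collections import Counter

def edge_types(labels, edges):
    """Set of labelled edges (sorted label pairs) occurring in a graph."""
    return {tuple(sorted((labels[u], labels[v]))) for u, v in edges}

def frequent_types(cluster, min_fr):
    """F_i: edge types occurring in at least min_fr of the cluster's graphs."""
    if not cluster:
        return set()
    counts = Counter(t for g in cluster for t in g)
    return {t for t, c in counts.items() if c / len(cluster) >= min_fr}

def sim(g, rep):
    """Fraction of the representative's subgraphs that occur in g."""
    return len(g & rep) / len(rep) if rep else 0.0

def xproj_sketch(graphs, assignment, k, min_fr=0.5, max_iter=50):
    assignment = list(assignment)
    for _ in range(max_iter):
        reps = [frequent_types([g for g, a in zip(graphs, assignment) if a == c], min_fr)
                for c in range(k)]
        updated = [max(range(k), key=lambda c: sim(g, reps[c])) for g in graphs]
        if updated == assignment:
            break
        assignment = updated
    return assignment

g_co = edge_types({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)])
g_ns = edge_types({1: "N", 2: "N", 3: "S"}, [(1, 2), (2, 3)])
graphs = [g_co, g_co, g_ns, g_ns]
print(xproj_sketch(graphs, assignment=[0, 0, 0, 1], k=2))  # [0, 0, 1, 1]
```

The outcome depends on the start: a perfectly mixed initial assignment such as [0, 1, 0, 1] gives both clusters the same representative, and ties then collapse everything into one cluster.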
Summary
[Figure: concept map of methods, repeated as a summary]
Image sources
Leskovec et al.: Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6, 2008.
Fan & Simeon: Integrity Constraints for XML. Principles of Database Systems, 2000.
Gudes: Graph and Web Mining – Motivation, Applications and Algorithms. Data mining seminar, University of Helsinki, 2010.