Lecture 8: Graph mining (Book Ch 17)

Image source: https://www.jamiesheffield.com/2013/05/concept-mapping-my-protagonists-world.html

MDM course Aalto 2020 – p.1/34

Graphs occur everywhere!


Image sources: Leskovec et al. (2009), Fan & Simeon (2000)

Graph mining

Data may consist of

1. multiple small graphs → today

e.g., chemical compounds, biological pathways, program control flows, consumer behaviour, ...; even (HTML) documents can be represented as graphs!

2. one large graph → later

e.g., the internet, a social network

Information to mine: interesting substructures, similarities, communities, clusters

Concept map of methods (this lecture)

[Figure: concept map linking the topics of this lecture]

Graph matching

Mining frequent subgraphs (uses matching)

Distance between graphs

− Graph matching based methods: MCG dist, graph edit dist

− Transformation based methods: type transport, topological descr., kernel similarity

Clustering

− Distance based methods: K-medoids, methods using similarity graphs

− Frequent subgraph based methods: type transport, representatives from frequent subgraphs

Graph notations

G = (V,E) graph

V = {v1, . . . , vn} = set of vertices or nodes

|V| = number of nodes

node label l(vi)

E = {e1, . . . , em} = set of edges, ei = (v, u), v, u ∈ V

|E| = number of edges

For now, we assume that edges are undirected and have no labels

Graph matching = graph isomorphism

Two graphs G1 = (V1,E1) and G2 = (V2,E2) are matching or isomorphic iff there is a 1:1 correspondence between their nodes such that

(i) Corresponding nodes vi ∈ V1 and vj ∈ V2 have the same labels: l(vi) = l(vj).

(ii) Let [v1, u1] be a node pair in G1 and [v2, u2] the corresponding pair in G2. Then edge (v1, u1) ∈ E1 ⇔ (v2, u2) ∈ E2.

Note: no polynomial-time algorithm is known (except for special cases)
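Conditions (i) and (ii) can be checked directly by trying every 1:1 node correspondence. The sketch below is only an illustration for very small graphs, not the algorithm from the book: the function name `is_isomorphic` and the representation (a dict from node to label plus a list of undirected edges without self-loops) are assumptions made here.

```python
from itertools import permutations

def is_isomorphic(labels1, edges1, labels2, edges2):
    """Brute-force test of conditions (i) and (ii) over all 1:1 correspondences."""
    nodes1, nodes2 = list(labels1), list(labels2)
    # Undirected edges are stored as frozensets so that (v, u) equals (u, v).
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    if len(nodes1) != len(nodes2) or len(e1) != len(e2):
        return False
    for perm in permutations(nodes2):
        m = dict(zip(nodes1, perm))
        # Condition (i): corresponding nodes carry the same label.
        if any(labels1[v] != labels2[m[v]] for v in nodes1):
            continue
        # Condition (ii): the edge counts are equal and m is 1:1, so mapping
        # every edge of G1 onto an edge of G2 already gives the equivalence.
        if all(frozenset(m[v] for v in e) in e2 for e in e1):
            return True
    return False

# Chain C-C-O versus the same chain with renamed nodes, and versus C-O-C.
g1 = ({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)])
g2 = ({"a": "O", "b": "C", "c": "C"}, [("a", "b"), ("b", "c")])
g3 = ({"a": "C", "b": "O", "c": "C"}, [("a", "b"), ("b", "c")])
print(is_isomorphic(*g1, *g2), is_isomorphic(*g1, *g3))
```

The loop visits up to |V|! correspondences, which is why this is usable only for toy inputs.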

There can be many matchings!

Two matchings for molecules 1 and 2. In total there are 4! = 24 matchings!

Image source: Aggarwal Ch 17

Subgraph isomorphism

Does a certain query graph Gq match a part of another graph G?

Query graph Gq = (Vq,Eq) is a subgraph isomorphism of G = (V,E) if each node of Vq can be matched to a distinct node of V such that

(i) For every vq ∈ Vq, the matched node v ∈ V has the same label: l(vq) = l(v); and

(ii) If [v1, u1] is a node pair in Vq and [v2, u2] the matching pair in V, then (v1, u1) ∈ Eq ⇔ (v2, u2) ∈ E.

Sometimes a weaker condition suffices for (ii): (v1, u1) ∈ Eq ⇒ (v2, u2) ∈ E
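The difference between the strict (⇔) and the weaker (⇒) version of condition (ii) is easy to see in code. The following is a minimal brute-force sketch, not the algorithm of Aggarwal Ch 17.2.1; the name `find_subgraph_match` and the `induced` flag are hypothetical, and graphs are assumed to be given as a node-to-label dict plus a list of undirected edges.

```python
from itertools import combinations, permutations

def find_subgraph_match(q_labels, q_edges, labels, edges, induced=True):
    """Return one mapping of query nodes to distinct nodes of G, or None.

    induced=True enforces the equivalence in condition (ii);
    induced=False enforces only the weaker implication.
    """
    eq = {frozenset(e) for e in q_edges}
    eg = {frozenset(e) for e in edges}
    q_nodes = list(q_labels)
    for image in permutations(labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        # Condition (i): labels of matched nodes agree.
        if any(q_labels[v] != labels[m[v]] for v in q_nodes):
            continue
        ok = True
        for v, u in combinations(q_nodes, 2):
            in_q = frozenset((v, u)) in eq
            in_g = frozenset((m[v], m[u])) in eg
            # A query edge must exist in G; in the strict version an
            # extra edge in G between matched nodes is also forbidden.
            if (in_q and not in_g) or (induced and in_g and not in_q):
                ok = False
                break
        if ok:
            return m
    return None

# A three-node chain is found in a triangle only under the weaker condition.
chain = ({"x": "C", "y": "C", "z": "C"}, [("x", "y"), ("y", "z")])
triangle = ({1: "C", 2: "C", 3: "C"}, [(1, 2), (2, 3), (1, 3)])
print(find_subgraph_match(*chain, *triangle, induced=True))
print(find_subgraph_match(*chain, *triangle, induced=False))
```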

Subgraph isomorphism: example

Algorithm: see Aggarwal Ch 17.2.1

Maximum common subgraph (MCG)

Problem: Given G1 and G2, find G0 = (V0,E0) such that

(i) G0 is a subgraph isomorphism of both G1 and G2, and

(ii) |V0| is as large as possible.

+ useful for comparing graphs

– NP-hard (like subgraph isomorphism)

Algorithm: see Aggarwal Ch 17.2.2
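For intuition, |V0| can be found by exhaustive search on tiny graphs. This is a hedged sketch and not the algorithm of Aggarwal Ch 17.2.2: the name `mcs_size` is made up here, the strict (⇔) edge condition is used, and G0 is not required to be connected.

```python
from itertools import combinations, permutations

def mcs_size(labels1, edges1, labels2, edges2):
    """Size (number of nodes) of a maximum common subgraph, by brute force."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    # Try the largest candidate size first and stop at the first success.
    for k in range(min(len(labels1), len(labels2)), 0, -1):
        for subset in combinations(labels1, k):
            for image in permutations(labels2, k):
                m = dict(zip(subset, image))
                if any(labels1[v] != labels2[m[v]] for v in subset):
                    continue
                # Every node pair must agree on having or not having an edge.
                if all((frozenset((v, u)) in e1) == (frozenset((m[v], m[u])) in e2)
                       for v, u in combinations(subset, 2)):
                    return k
    return 0

# Chain C-C-O versus a triangle with labels C, C, N: they share one C-C edge.
print(mcs_size({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)],
               {"a": "C", "b": "C", "c": "N"}, [("a", "b"), ("b", "c"), ("a", "c")]))
```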

Next: distances

[Figure: concept map of methods, repeated from the lecture overview]

Distances based on maximum common subgraph

Let’s assume graph size = number of nodes, i.e., for G = (V,E) we denote |G| = |V|

Let MCS(G1,G2) = maximum common subgraph of G1 and G2, and |MCS(G1,G2)| = its size

1. Unnormalized non-matching measure:

U(G1,G2) = |G1| + |G2| − 2 · |MCS(G1,G2)|

= number of non-matching nodes

Problem: what if graphs have very different sizes?

Distances based on MCG

2. Union-normalized distance Udist ∈ [0, 1]

Udist(G1,G2) = 1 − |MCS(G1,G2)| / (|G1| + |G2| − |MCS(G1,G2)|)

= number of non-matching nodes normalized by union size

3. Max-normalized distance Mdist ∈ [0, 1]

Mdist(G1,G2) = 1 − |MCS(G1,G2)| / max{|G1|, |G2|}

• metric

MCG-based distances can be computed efficiently only for small graphs!
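Once |MCS(G1,G2)| is known, the three measures are simple arithmetic. The sketch below takes the sizes as plain integers; the function names are illustrative only, and at least one of the graphs is assumed to be non-empty.

```python
def unnormalized(n1, n2, mcs):
    """U(G1,G2) = |G1| + |G2| - 2 * |MCS(G1,G2)|."""
    return n1 + n2 - 2 * mcs

def union_normalized(n1, n2, mcs):
    """Udist(G1,G2) = 1 - |MCS| / (|G1| + |G2| - |MCS|)."""
    return 1 - mcs / (n1 + n2 - mcs)

def max_normalized(n1, n2, mcs):
    """Mdist(G1,G2) = 1 - |MCS| / max(|G1|, |G2|)."""
    return 1 - mcs / max(n1, n2)

# Example with |G1| = 4, |G2| = 6 and |MCS| = 3:
# U = 4, Udist = 1 - 3/7 = 4/7, Mdist = 1 - 3/6 = 0.5
print(unnormalized(4, 6, 3), union_normalized(4, 6, 3), max_normalized(4, 6, 3))
```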

Graph edit distance

What is the minimum cost of edit operations to transform G1 to G2?

(i) node insertion

(ii) node deletion (also deletes incident edges)

(iii) edge insertion

(iv) edge deletion

(v) label substitution of nodes

costs are application-specific

there may be exponentially many possible edit paths!

NP-hard
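To make the five operations concrete, here is a brute-force sketch for very small graphs. It rests on assumptions that are not in the slides: every operation costs 1, deleting a node removes its incident edges at no extra cost, and the search runs over node mappings (each node of G1 is either kept and mapped to a distinct node of G2, or deleted). The function name is hypothetical.

```python
from itertools import permutations

def graph_edit_distance(labels1, edges1, labels2, edges2):
    """Minimum unit-cost edit distance over all node mappings (toy sizes only)."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    nodes1 = list(labels1)
    # Each G1 node goes to a distinct G2 node or to None (= node deletion).
    targets = list(labels2) + [None] * len(nodes1)
    best = None
    for image in set(permutations(targets, len(nodes1))):
        kept = {v: t for v, t in zip(nodes1, image) if t is not None}
        cost = len(nodes1) - len(kept)      # (ii) node deletions
        cost += len(labels2) - len(kept)    # (i) node insertions
        cost += sum(labels1[v] != labels2[t] for v, t in kept.items())  # (v)
        # Images of the G1 edges whose endpoints are both kept.
        mapped = {frozenset(kept[v] for v in e)
                  for e in e1 if all(v in kept for v in e)}
        cost += len(mapped - e2)            # (iv) edge deletions
        cost += len(e2 - mapped)            # (iii) edge insertions
        best = cost if best is None else min(best, cost)
    return best

# Chain C-C-O to triangle C-C-N: one label substitution plus one edge insertion.
print(graph_edit_distance({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)],
                          {1: "C", 2: "C", 3: "N"}, [(1, 2), (2, 3), (1, 3)]))
```

With application-specific costs, the unit increments above would be replaced by per-operation cost functions.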

Graph edit distance: example

Transformation-based distances

Idea: Transform graphs into a new space where distances are easier to calculate

a) Type transport using frequent subgraphs

b) Topological descriptors

c) Kernel similarity

Type transport using frequent subgraphs

[Flowchart, reconstructed from the slide]

1. Search frequent subgraphs (subgraph isomorphism)

2. Choose subgraphs that don’t overlap too much

3. Create new features f1, . . . , fd for remaining subgraphs

4. Present graphs in vector space using f1, . . . , fd

fi = number of times ith subgraph occurs in G, or binary or tf-idf presentation

5. Use text similarity measures

involves an NP-hard subproblem
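Once each graph is a vector of subgraph counts, any text similarity measure applies. As one possible choice (the slides do not name a specific measure), the sketch below uses cosine similarity; the count vectors are invented for illustration and would in practice come from the frequent subgraph search.

```python
import math

def cosine_similarity(f, g):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(f, g))
    norm_f = math.sqrt(sum(a * a for a in f))
    norm_g = math.sqrt(sum(b * b for b in g))
    if norm_f == 0 or norm_g == 0:
        return 0.0  # convention chosen here for graphs with no selected subgraph
    return dot / (norm_f * norm_g)

# Hypothetical counts f1..f4 of four selected subgraphs in three graphs.
graph_a = [2, 0, 1, 3]
graph_b = [1, 0, 1, 2]
graph_c = [0, 4, 0, 0]
print(cosine_similarity(graph_a, graph_b), cosine_similarity(graph_a, graph_c))
```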

Topological descriptors

Idea: calculate different kinds of indices from graphs ⇒ new numerical features ⇒ use distances for numerical data

structural information lost

utility domain-specific (e.g., good in chemical domain)

e.g., Wiener index:

W(G) = ∑_{v,u∈V} d(v, u)

d(v, u) = length of shortest path from v to u
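The Wiener index can be computed with one breadth-first search per node. This sketch assumes an unweighted, undirected, connected graph and counts each unordered node pair once; the function name and the input format are choices made here.

```python
from collections import deque

def wiener_index(nodes, edges):
    """Sum of shortest-path lengths over all unordered node pairs."""
    adj = {v: set() for v in nodes}
    for v, u in edges:
        adj[v].add(u)
        adj[u].add(v)
    total = 0
    for source in adj:
        # Breadth-first search gives shortest-path lengths from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        if len(dist) != len(adj):
            raise ValueError("graph must be connected")
        total += sum(dist.values())
    # Every pair was counted from both endpoints.
    return total // 2

# Path with 4 nodes: distances 1, 1, 1, 2, 2, 3 sum to 10.
print(wiener_index([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]))
```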
