Discovering Similar Frequent Fragments in Drug Design: A Clustering-Based Approach

Burcu Yılmaz and Mehmet Göktürk

Gebze Institute of Technology, Department of Computer Engineering, Gebze, Kocaeli, Turkey
{byilmaz,gokturk}@gyte.edu.tr

Abstract. Designing new medical drugs requires the analysis of many molecules that are active against a specific disease. The main goal of these extensive analyses is to discover the active substructures (fragments) that account for the activity of these molecules. Once these fragments are discovered, they are used to synthesize new drugs for the disease. Current approaches for discovering active fragments rely heavily on frequent subgraph mining algorithms, which search for exactly repeating morphological substructures within a graph database. In this paper, however, we argue that in many settings active fragments do not repeat exactly but with some fine differences, which prevents frequent subgraph mining approaches from discovering them. We propose a clustering-based approach to discover similar substructures that repeat in the active molecules of a molecular graph database. We have experimentally compared our approach with current methods using real-life and synthesized datasets. Our experiments show that the proposed approach succeeds in determining the fragments responsible for the desired biological activity and that, unlike other methods, it can determine frequent substructures that repeat in the graphs with some fine differences.

1 Introduction

Molecular fragment mining is an application area of graph mining. Using graph theory [1], a molecule is represented as a graph whose nodes and edges represent atoms and bonds, respectively. Graph mining methods are then used to extract significant active fragments from molecular data. An active fragment (pharmacophore) is a set of similar structural features, present in the structure of most of the active molecules (or medicines), that is responsible for their biological activity. In graph-theoretic terms, an active fragment is a subgraph of the graph that represents an active molecule. The extracted fragments are then used as a key in the design of new medicines. In recent years, as the number of projects in the health industry has increased, many graph mining methods have been proposed and applied to molecular data mining for drug design.

One of the basic deficiencies of the methods proposed in graph-based data mining is their narrow view of the problem. These methods search for exactly identical common fragments that occur in a high percentage of a large number of molecules. However, molecules may have fragments that give them activity against a specific disease, yet these fragments may not repeat exactly in each active molecule. Instead, they may repeat with some deviations, which implies the need for subgraph mining methods that discover not only exactly repeating substructures but also those that repeat with some deviations.

In this paper, we focus on the problem of finding similar fragments in graph-based molecular datasets. More specifically, we propose a clustering-based graph mining method for similar frequent substructure extraction, in which similar frequent fragments are discovered in addition to exactly repeating ones. In our approach, graph representations of molecules are projected into a 3D feature space (named the atom-bond-atom space). Each bond of a molecule (an edge in the graph) is represented as a point in this space. When we project all of the molecules with a certain activity into this space, the resulting points form clusters of bonds that are repeated in the structures of the active molecules. These clusters can then be discovered by various clustering algorithms. Once the clusters are discovered, the similar frequent substructures of the active molecules are computed. We test the proposed method on the tuberculosis dataset from the literature [2, 3]. We also use synthetic datasets to which varying amounts of noise are applied, so that our approach is tested under noisy conditions close to those of real-life datasets whose features are measurements. We empirically compare our approach with six other approaches from the literature. Our experiments show that our approach can correctly determine the active fragments that account for the activity of the molecules and can precisely find similar frequent substructures.

The rest of the paper is organized as follows. Section 2 introduces the frequent subgraph mining problem with definitions and examples, and then extends these definitions to define a new problem: Frequent Similar Subgraph Mining. In Section 3, we describe the proposed approach in detail with examples. In Section 4, we experimentally evaluate our approach using a real-life dataset and synthetic datasets, and discuss its performance with respect to other approaches from the literature. Section 5 concludes the paper with further research directions.

2 Mining Molecular Graphs For Similar Substructures

In this section, we introduce formal definitions related to our approach and define the problem of mining similar substructures in molecular graphs. For this purpose, we first describe the classical frequent subgraph mining problem and then extend it to the problem of finding frequent similar subgraphs.

2.1 Frequent Subgraph Mining

An undirected labeled graph is defined as a five-element tuple G = (V, E, L_V, L_E, ℓ), where V is a set of vertices and E ⊆ V × V is a set of undirected edges. L_V and L_E are the sets of vertex and edge labels, respectively. The labeling function ℓ defines the mappings V → L_V and E → L_E. The labels do not have to be unique; hence the same label can be assigned to many vertices (or edges). Labels can be nominal values (e.g., Oxygen, Zinc), integers (e.g., 1, . . . , n) or real numbers. A graph is a subgraph of another graph if and only if all of its edges and nodes are covered by the latter, as described formally in Definition 1.

Definition 1 Given G = (V, E, L_V, L_E, ℓ) and G′ = (V′, E′, L′_V, L′_E, ℓ′), G is a subgraph of G′ if and only if

1. V ⊆ V′,
2. ∀u ∈ V : ℓ(u) = ℓ′(u),
3. E ⊆ E′,
4. ∀(u, v) ∈ E : ℓ(u, v) = ℓ′(u, v).

Two graphs G and G′ are isomorphic if they are topologically identical to each other (see Definition 2). A graph G is subgraph isomorphic to another graph G′ (denoted as G ⊆ G′) if and only if there exists a subgraph G′′ of G′ such that G is isomorphic to G′′. Consider the example in Figure 1. In the figure, there are two molecular graphs p and q, where the nodes represent atoms or compounds and the edges represent the bonds between them. In this example, the nodes are labeled with the type of the corresponding atom or compound, while the edges are labeled as either single bond or double bond. The graph q is isomorphic to several subgraphs of p. There are six different subgraphs of p that are isomorphic to q (e.g., {p2,p4,p6}, {p4,p2,p3} and so on). Therefore, we can conclude that q is subgraph isomorphic to p.

Fig. 1. Subgraph isomorphism example. (Figure not reproduced: graph p with nodes p1 to p8 and graph q with nodes q1 to q3; node labels are C, Cl and OH.)

Definition 2 A labeled graph G = (V, E, L_V, L_E, ℓ) is isomorphic to another labeled graph G′ = (V′, E′, L′_V, L′_E, ℓ′) iff there exists a bijection f : V → V′ such that

1. ∀u ∈ V : ℓ(u) = ℓ′(f(u)),
2. ∀u, v ∈ V : (u, v) ∈ E ⇔ (f(u), f(v)) ∈ E′,
3. ∀(u, v) ∈ E : ℓ(u, v) = ℓ′(f(u), f(v)).

Given a graph database DB and a threshold σ (0 < σ ≤ 1), the graph G is frequent in DB iff sup_G ≥ σ, where sup_G is the support value of G, defined as the ratio of graphs in DB to which G is subgraph isomorphic:

\[ sup_G = \frac{|\{G' \in DB \mid G \subseteq G'\}|}{|DB|} \]

In light of this definition, the frequent subgraph mining problem is defined as finding all frequent subgraphs in a graph database DB given a threshold σ.
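As an illustration of the definitions in this subsection, the following Python sketch tests subgraph isomorphism by exhaustive search and computes the support of a pattern. It is not one of the mining algorithms compared in this paper. The graph encoding (a pair of dictionaries for vertex and edge labels), the function names and the toy database are assumptions made only for this example, and the brute-force search is practical only for very small graphs.

```python
from itertools import permutations

def is_subgraph_isomorphic(g, h):
    """Return True if labeled graph g is subgraph isomorphic to h.

    A graph is a pair (vertex_labels, edge_labels), where
    vertex_labels maps vertex -> label and
    edge_labels maps frozenset({u, v}) -> label (undirected edges).
    """
    g_nodes, g_edges = g
    h_nodes, h_edges = h
    g_list = list(g_nodes)
    # Try every injective mapping f from the vertices of g into those of h.
    for image in permutations(h_nodes, len(g_list)):
        f = dict(zip(g_list, image))
        if any(g_nodes[u] != h_nodes[f[u]] for u in g_list):
            continue  # a vertex label differs
        # Every edge of g must map to an edge of h with the same label.
        if all(h_edges.get(frozenset(f[u] for u in e)) == lab
               for e, lab in g_edges.items()):
            return True
    return False

def support(g, database):
    """Ratio of graphs in the database to which g is subgraph isomorphic."""
    return sum(is_subgraph_isomorphic(g, h) for h in database) / len(database)

# Hypothetical toy data: the pattern is a C-C single bond.
pattern = ({1: "C", 2: "C"}, {frozenset({1, 2}): "single"})
mol_a = ({"a": "C", "b": "C", "c": "O"},
         {frozenset({"a", "b"}): "single", frozenset({"b", "c"}): "double"})
mol_b = ({"a": "C", "b": "O"}, {frozenset({"a", "b"}): "single"})
database = [mol_a, mol_b]
```

With a threshold of σ = 0.5, the pattern above would count as frequent in this toy database, because it occurs in one of the two graphs.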

2.2 Frequent Similar Subgraph Mining

Frequent subgraph mining is computationally hard, because it relies on subgraph isomorphism testing, which is an NP-complete problem, and it remains an important challenge for researchers. It has many diverse real-life application areas, such as molecular data mining, drug design and social analysis. In molecular data mining and drug design, molecules are represented as graphs and frequent subgraphs are mined to find the substructures that give the molecules a certain property. Similarly, social scientists use frequent subgraph mining to reveal interesting common interaction patterns between the individuals in different communities.

Frequent fragment mining searches for exactly identical morphological structures that repeat within a graph database (using subgraph isomorphism). In many real-life settings, however, a pattern or substructure may repeat frequently in different graphs with some fine differences. If these differences are not significant for the domain or the application, these similar substructures may still be of interest. Frequent subgraph mining algorithms cannot be used to discover such repeating patterns.

Drug design is an example of a domain where frequent substructures may repeat with some differences. In Figure 1, we represent a molecule as a graph whose nodes and edges are labeled according to their types; hence, repeating subgraphs can be determined easily from these nominal labels. When mining molecules for frequent patterns, however, researchers do not consider only the types of the atoms or compounds a molecule contains. They also use features of these atoms and compounds, as well as the properties of the bonds between them. Hence, they create graphs representing molecules in which the labels are not nominal values but real numbers representing the features of the atoms, compounds and bonds within the molecule. Features of the atoms and bonds within a molecule may change slightly in different settings (e.g., depending on the temperature, orientation and so on). Therefore, the graphs derived from the same molecule under different physical conditions may not be exactly the same. Similarly, atoms of the same type within a molecule may have different properties depending on their positions within the molecule. Consider Figure 2, where the molecules of Figure 1 are represented using labels that refer to properties of the atoms, compounds and bonds (e.g., electrical properties, Wiberg's index of bonds) instead of their types. When we compare graphs p and q, we cannot directly see any exactly repeating pattern. Therefore, a frequent graph mining approach will not find any subgraph of p matching q. On the other hand, if we carefully examine p and q, we immediately recognize that some subgraphs of p are not identical but very similar to q (e.g., {p7,p5,p3} ≈ {q1,q2,q3}). Similar patterns may give researchers very important clues about the structure-activity relationships of the molecules, whereas considering only exactly identical patterns is misleading and incomplete.

Fig. 2. Subgraph similarity example. (Figure not reproduced: the graphs p and q of Figure 1 with real-valued labels; node labels range from -0.26 to 0.195 and edge labels from 0.906 to 1.455.)

In this paper, we argue that discovering similar substructures is as important as discovering exactly identical substructures in many real-life settings and application domains. Therefore, we extend the problem of frequent subgraph mining to define a new problem, Frequent Similar Subgraph Mining. For this purpose, we first introduce a definition of similarity between two graphs in Definition 3.

Definition 3 A labeled graph G = (V, E, L_V, L_E, ℓ) is similar to another labeled graph G′ = (V′, E′, L′_V, L′_E, ℓ′) iff there exists a bijection f : V → V′ such that

1. ∀u ∈ V : f_sim(ℓ(u), ℓ′(f(u))) ≥ α,
2. ∀u, v ∈ V : (u, v) ∈ E ⇔ (f(u), f(v)) ∈ E′,
3. ∀(u, v) ∈ E : f_sim(ℓ(u, v), ℓ′(f(u), f(v))) ≥ α,

where f_sim is a function that computes the similarity between two labels and 0 ≤ α ≤ 1 is a threshold that determines whether two labels are similar or not. The similarity between labels x and y is defined as a real number within the range [0, 1], where f_sim(x, y) = 1 means that x is equal to y, while f_sim(x, y) = 0 implies that x and y are completely dissimilar. Hence, if α = 1.0, only isomorphic graphs are similar.

A graph G is subgraph similar to another graph G′ with respect to a threshold α (denoted as $G \overset{\alpha}{\subseteq} G'$) if and only if there exists a subgraph G′′ of G′ such that G is similar to G′′ with respect to α. Based on this, we introduce the notion of a frequent similar subgraph in Definition 4. That is, a graph is a frequent similar subgraph if it is subgraph similar to a significant number of the graphs in a molecular database. The frequent similar subgraph mining problem is defined as finding all frequent similar subgraphs in a graph database DB given thresholds σ and α. Note that, when α = 1, frequent similar subgraph mining reduces to frequent subgraph mining, because by Definition 3 only isomorphic graphs are similar in that case.

Definition 4 Given a graph database DB and the thresholds σ and α (0 < σ, α ≤ 1), the graph G is a frequent similar subgraph in DB iff sup^α_G ≥ σ, where sup^α_G is the support value of G, defined as the ratio of graphs in DB to which G is subgraph similar with respect to α:

\[ sup^{\alpha}_G = \frac{|\{G' \in DB \mid G \overset{\alpha}{\subseteq} G'\}|}{|DB|} \]
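To make the role of the threshold α concrete, the following Python sketch computes the α-support of a pattern with real-valued labels by exhaustive search. The label similarity used here (one minus the absolute difference, clipped at zero) is an assumption chosen for illustration, since the definitions above leave f_sim open; the graph encoding, the function names and the numeric labels are likewise hypothetical.

```python
from itertools import permutations

def f_sim(x, y, scale=1.0):
    """Assumed label similarity: 1 for equal labels, decreasing linearly to 0."""
    return max(0.0, 1.0 - abs(x - y) / scale)

def is_subgraph_similar(g, h, alpha, sim=f_sim):
    """Return True if g is subgraph similar to h with respect to alpha.

    A graph is a pair (vertex_labels, edge_labels) with real-valued labels;
    edges are keyed by frozenset({u, v}).
    """
    g_nodes, g_edges = g
    h_nodes, h_edges = h
    g_list = list(g_nodes)
    # Try every injective mapping f from the vertices of g into those of h.
    for image in permutations(h_nodes, len(g_list)):
        f = dict(zip(g_list, image))
        if any(sim(g_nodes[u], h_nodes[f[u]]) < alpha for u in g_list):
            continue  # some vertex labels are not similar enough
        ok = True
        for e, lab in g_edges.items():
            mapped = frozenset(f[u] for u in e)
            if mapped not in h_edges or sim(lab, h_edges[mapped]) < alpha:
                ok = False
                break
        if ok:
            return True
    return False

def similar_support(g, database, alpha, sim=f_sim):
    """Ratio of graphs in the database to which g is subgraph similar."""
    return sum(is_subgraph_similar(g, h, alpha, sim) for h in database) / len(database)

# Hypothetical labels: one exact occurrence, one near occurrence, one unrelated graph.
pattern = ({1: 0.02, 2: -0.20}, {frozenset({1, 2}): 1.45})
exact = ({"a": 0.02, "b": -0.20, "c": 0.19},
         {frozenset({"a", "b"}): 1.45, frozenset({"b", "c"}): 1.01})
near = ({"a": 0.025, "b": -0.18}, {frozenset({"a", "b"}): 1.44})
other = ({"a": 0.9, "b": 0.5}, {frozenset({"a", "b"}): 1.0})
database = [exact, near, other]
```

In this toy database, the pattern is supported by one graph out of three when α = 1 (exact matching) and by two graphs out of three when α = 0.95, which shows how lowering α admits occurrences that repeat with small deviations.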

3 Discovering Frequent Similar Substructures of Molecules

Previous studies mostly use frequent subgraph mining methods to discover the frequent active fragments that account for the activity of the molecules against a specific disease [4, 5]. However, these methods can only find exactly repeating fragments, and in many settings active fragments do not repeat exactly. In this section, we propose a data mining approach to discover similar frequent fragments of active molecules. Our approach depends on similarity-based clustering of the atoms and bonds in a graph-based molecular dataset and on statistical analysis of the resulting clusters. For this purpose, we first transform the graph representations into a 3D feature space called the atom-bond-atom space and then use clustering to determine similar substructures repeating in the graphs.

3.1 Transformation from Graph Representations to a 3D Coordinate System

There are different models for representing molecular graphs. One of these is the Electron-Topological Matrix of Conjugency (ETMC) model, which is used effectively in drug design applications [6]. In this study, we assume that molecules are represented as graphs using the ETMC model; however, the proposed approach can also work with other molecular graph representation models. In the ETMC model, a molecule with n atoms is represented with a number of upper triangular, fully weighted ETMC matrices, each representing a different characteristic of the molecule. The form of an ETMC matrix is shown in Equation 1. The main advantages of the ETMC are that its molecular representation reflects a molecule's properties, such as its electric and 3D conformational properties, and that it does not depend on the numbers and types of the atoms the molecule contains.

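\[
\mathrm{ETMC} =
\begin{bmatrix}
a_{1,1} & a_{1,2} & \dots & a_{1,n-1} & a_{1,n} \\
 & a_{2,2} & \dots & a_{2,n-1} & a_{2,n} \\
 & & \ddots & \vdots & \vdots \\
 & & & a_{n-1,n-1} & a_{n-1,n} \\
 & & & & a_{n,n}
\end{bmatrix}
\tag{1}
\]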

In our study, the diagonal elements a_{i,i} (i = j) of an ETMC matrix contain information about the electrical properties of atoms, and the non-diagonal elements a_{i,j} (i ≠ j) contain information about the chemical properties of the bonds between the corresponding atoms. If there is no bond, the distance between the two atoms is used instead. For other characteristics of molecules (e.g., bond properties such as the value of Wiberg's index), additional ETMC matrices can be formed in the same manner using different features. Hence, different descriptors of molecules can be examined for their contribution to the activity (against a specific disease).

For all integer values of i and j (1 ≤ i, j ≤ n), the vectors [a_{i,j}, min(a_{i,i}, a_{j,j}), max(a_{i,i}, a_{j,j})] are obtained from the ETMC matrices. Hence, the corresponding molecules are split into pieces (bonds), where each bond is represented with a vector that includes information about the bond (a_{i,j}) and about the two atoms at its ends (a_{i,i} and a_{j,j}). Each piece is plotted as a point (a_{i,j}, min(a_{i,i}, a_{j,j}), max(a_{i,i}, a_{j,j})) using a transformation from the ETMC matrices to a 3D coordinate system, as shown in Figure 3. The resulting 3D-space representation of the graphs enables us to apply a range of clustering and data mining approaches to graph datasets. In the following section, we propose a data mining approach that uses the points derived from the ETMC matrices.
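The transformation described above can be sketched in a few lines of Python. The nested-list matrix layout, the function name, the explicit list of bonded index pairs and the numeric values are assumptions made for this example; they are not part of the ETMC model itself.

```python
def atom_bond_atom_points(etmc, bonds):
    """Map each bonded atom pair (i, j) to a point in atom-bond-atom space.

    etmc  : n x n nested list; etmc[i][i] holds an atom feature and
            etmc[i][j] with i < j holds a bond feature (upper triangle).
    bonds : iterable of index pairs (i, j) that are chemically bonded
            (an assumed input that selects which pairs become points).
    """
    points = []
    for i, j in bonds:
        i, j = min(i, j), max(i, j)  # always read from the upper triangle
        a_i, a_j = etmc[i][i], etmc[j][j]
        # Taking min and max makes the point independent of the atom order.
        points.append((etmc[i][j], min(a_i, a_j), max(a_i, a_j)))
    return points

# Hypothetical 3-atom molecule: atoms 0-1 and 1-2 are bonded, 0-2 are not,
# so the entry 2.500 is a distance and does not become a point.
etmc = [
    [0.025, 1.442, 2.500],
    [0.0,  -0.180, 1.013],
    [0.0,   0.0,   0.195],
]
points = atom_bond_atom_points(etmc, [(0, 1), (1, 2)])
```

Because the two atom features are ordered by min and max, a bond listed as (1, 0) yields the same point as the bond (0, 1), so identical bonds from different molecules coincide regardless of atom numbering.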

Fig. 3. Transformation of data to 3D space

3.2 Extracting Similar Fragments Using Clustering

We determine the active fragments of the active molecules as follows. First, the bonds of each active molecule are transformed into the 3D atom-bond-atom space, where the features of each bond (and of the atoms at each of its ends) are represented as a single point. Hence, an active molecule with m remaining bonds is represented with m points. Each point has the form (a_{i,j}, min(a_{i,i}, a_{j,j}), max(a_{i,i}, a_{j,j})), where a_{i,j}, a_{i,i} and a_{j,j} are values from the ETMC matrix of the molecule and correspond to the feature value of a bond and the feature values of the atoms connected by this bond, respectively. In this way, the bonds of the active molecules are transformed onto the 3D atom-bond-atom space. The result is a set of points distributed over the space, in which the points representing bonds with similar features lie close together. Figure 4 shows an example where the data from the active molecules in the anti-tuberculosis dataset [2] is plotted onto the 3D space. This dataset is composed of active and inactive molecules, but the figure shows only the points derived from active molecules. Data from inactive molecules is used, as explained below, to eliminate frequent fragments that do not have any influence on the activity.

Fig. 4. Transformation of the active molecules in the anti-tuberculosis dataset [2] onto the 3D atom-bond-atom space.

After the transformation, the points are not distributed randomly in the space; rather, groups of points representing bonds with similar properties appear. We call these groups of points candidate activity clusters, and we use them to derive candidate active fragments. In order to find the candidate activity clusters, we use the average-link clustering method [7, 8]. In this method, each point in the space is initially regarded as an individual cluster. Then, clusters are merged iteratively according to the distances between the cluster centers. Two clusters are merged only if the distance between any two points in the resulting cluster does not exceed a predefined threshold. Hence, the points within the same cluster are similar to one another with respect to the similarity measure; moreover, the molecular pieces (atoms and bonds) corresponding to these points are also similar in terms of their features. The similarity measure depends on the function f_sim and the similarity threshold α. Clustering ends when no two clusters can be merged.
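The merging procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in our experiments: clusters are compared by the Euclidean distance between their centers, a merge is allowed only if the largest pairwise distance in the merged cluster stays within the threshold, and the threshold is given directly as a distance instead of being derived from f_sim and α. All function names are hypothetical.

```python
from itertools import combinations
from math import dist

def centroid(cluster):
    """Center of a cluster of 3D points."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(3))

def diameter(cluster):
    """Largest distance between any two points of a cluster."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def cluster_points(points, threshold):
    """Agglomerative clustering with a bound on the cluster diameter."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            if diameter(clusters[i] + clusters[j]) > threshold:
                continue  # merging would put two points too far apart
            d = dist(centroid(clusters[i]), centroid(clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
        if best is None:
            return clusters  # no two clusters can be merged any more
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i remains valid

# Hypothetical points forming two well-separated groups.
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
          (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
clusters = cluster_points(points, threshold=0.5)
```

The exhaustive pair search in each round keeps the sketch short; for datasets with many bonds, a library implementation of hierarchical clustering would be the more practical choice.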

Each cluster found is composed of points that represent similar bond patterns of active molecules. Hence, the clusters represent the similar substructures that exist in active molecules. However, not all of these similar substructures can be considered active fragments, because some of them may also be repeated in the inactive molecules. Active fragments should be substructures that occur frequently in active molecules but rarely in inactive molecules. This means that we are looking for clusters composed of bond patterns that are not repeated in the inactive molecules. Therefore, after determining the clusters, we compute for each cluster the percentage of active molecules and the percentage of inactive molecules that contain the molecular pieces in the cluster. The clusters that include a high percentage of active molecule bonds and a small percentage of inactive molecule bonds are regarded as activity clusters, and the bonds enclosed by the activity clusters are expected to compose active fragments.
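The selection of activity clusters can be illustrated with the short Python sketch below. The two cut-off values (0.8 and 0.2) are hypothetical, since the text above does not fix particular percentages, and the `contains` argument stands for any test that decides whether a point belongs to a given cluster.

```python
def molecule_coverage(molecules, contains):
    """Fraction of molecules with at least one bond point inside the cluster.

    molecules : list of molecules, each a list of 3D bond points.
    contains  : function point -> bool, the cluster membership test.
    """
    if not molecules:
        return 0.0
    hits = sum(any(contains(p) for p in mol) for mol in molecules)
    return hits / len(molecules)

def is_activity_cluster(active, inactive, contains,
                        min_active=0.8, max_inactive=0.2):
    """Keep clusters that are frequent in active and rare in inactive molecules.

    min_active and max_inactive are assumed cut-offs for illustration only.
    """
    return (molecule_coverage(active, contains) >= min_active
            and molecule_coverage(inactive, contains) <= max_inactive)

# Hypothetical cluster test and molecules (one list of bond points each).
in_cluster = lambda p: p[0] < 1.0
active = [[(0.5, 0.0, 0.0)], [(0.2, 0.0, 0.0), (3.0, 0.0, 0.0)], [(0.9, 0.1, 0.0)]]
inactive = [[(5.0, 0.0, 0.0)], [(6.0, 0.0, 0.0)]]
```

Here every active molecule has a bond inside the hypothetical cluster and no inactive molecule does, so the cluster would be kept as an activity cluster.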

In order to determine which molecules fall into the clusters, we model each cluster using a minimum volume ellipsoid that encloses every point within the cluster. For this purpose, we use Khachiyan's algorithm [9]. After computing the minimum volume enclosing ellipsoid, we can easily determine whether a point p = (p_x, p_y, p_z) falls into the cluster or not using Equation 2, where r = [r_x, r_y, r_z] and c = [c_x, c_y, c_z] are the parameters of the ellipsoid. Figure 5 shows the clusters derived from the points in Figure 4 and the corresponding minimum volume ellipsoids.

\[
\frac{(p_x - c_x)^2}{r_x} + \frac{(p_y - c_y)^2}{r_y} + \frac{(p_z - c_z)^2}{r_z} \le 1
\tag{2}
\]
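The membership test of Equation 2 can be written as a small Python function. The sketch assumes an axis-aligned ellipsoid and uses the parameters r exactly as they appear in the denominators of Equation 2; if r instead holds semi-axis lengths, each denominator would have to be squared. Treating an axis with r_k = 0 as an exact-match requirement is a choice made only for this sketch, and the function name is hypothetical.

```python
def in_ellipsoid(p, c, r):
    """Return True if point p satisfies Equation 2 for center c and parameters r."""
    total = 0.0
    for p_k, c_k, r_k in zip(p, c, r):
        if r_k == 0:
            # Degenerate axis: avoid dividing by zero and accept only points
            # that coincide with the center along this axis (sketch-only rule).
            if p_k != c_k:
                return False
            continue
        total += (p_k - c_k) ** 2 / r_k
    return total <= 1.0
```

For example, with c = (0, 0, 0) and r = (1, 1, 1), the point (0.5, 0, 0) lies inside the ellipsoid, while the point (2, 0, 0) lies outside.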

Fig. 5. Clusters for active molecules and the corresponding minimum volume ellipsoids.

4 Evaluation and Discussion

In order to demonstrate our approach, we design realistic experiments with real-life data and simulations on synthesized graphs. In our experiments, we use the anti-tuberculosis dataset [2, 3]. This dataset is composed of 33 molecules (13 active and 20 inactive). In order to examine our approach extensively, we also test it with synthetic datasets of varying sizes. We empirically compare our approach with the following well-known approaches from the literature: SUBDUE [10], FSG [11], gSpan [12], Gaston [13], MoFa [14], and FFSM [15]. We present our results in Section 4.1 and Section 4.2.

4.1 Benchmark Comparisons: Anti-tuberculosis Dataset

In this section, using the anti-tuberculosis dataset, we compare our approach with two well-known approaches from the literature: SUBDUE and FSG. These approaches are chosen because they are often used in the literature to find frequent substructures of graphs and molecules. Like our approach, SUBDUE uses molecular graphs labeled as active or inactive. FSG, however, cannot use classified graphs. Therefore, we have used only the active molecules to determine active fragments with FSG and have neglected the inactive molecules. In our experiments, we use the publicly available original implementations of SUBDUE¹ and FSG² in order to increase reliability and repeatability. As the basis of comparison, we use a set of active fragments (F*_A) that were determined for anti-tuberculosis activity through extensive analysis in [2].

In order to compare SUBDUE, FSG and the proposed approach, we first compute the active fragments for the anti-tuberculosis dataset using each of these approaches. Let F^subdue_A, F^fsg_A, and F^P_A represent the sets of active fragments computed by SUBDUE, FSG and the proposed approach, respectively. Then, we compare the most significant fragments in F^subdue_A, F^fsg_A, and F^P_A with the one in F*_A. By the most significant fragment, we mean the fragment that has the largest support value, that is, the largest percentage of active molecules in the dataset that include this fragment. Let S^subdue_A, S^fsg_A, S^P_A and S*_A denote the most significant fragments in F^subdue_A, F^fsg_A, F^P_A, and F*_A, respectively.

To compare the methods, we use two measures: recall and precision. Recall is defined in Equation 3 as the ratio of the bonds of S*_A that are correctly discovered by an approach. In the equation, f(b, S) is a function that returns 1 if the bond b is contained in the fragment S (b ∈ S) and 0 otherwise. A high recall value for an approach implies that it can correctly find most of the bonds in S*_A, which is the most significant active fragment for tuberculosis.

\[
recall(X) = \frac{\sum_{b \in S^{*}_{A}} f(b, S^{X}_{A})}{|S^{*}_{A}|}
\tag{3}
\]

Using the recall metric alone, we cannot measure the success of an approach in determining active fragments, because this measure does not account for the excess bonds found by the approach. That is, the recall of an approach X is 1 if S^X_A ≡ S*_A or S*_A ⊂ S^X_A, where in the latter case X also finds bonds that are not part of the active fragment S*_A. Therefore, we introduce the precision metric in Equation 4. The precision value of an approach X is high if all of the bonds in S^X_A are included in S*_A. If both the recall and the precision values are close to 1.0 for an approach, the approach can correctly and precisely find the most significant active fragment for tuberculosis.

1 http://ailab.wsu.edu/subdue
2 http://glaros.dtc.umn.edu/gkhome/pafi/overview

\[
precision(X) = \frac{\sum_{b \in S^{X}_{A}} f(b, S^{*}_{A})}{|S^{X}_{A}|}
\tag{4}
\]
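Both measures reduce to set operations when a fragment is stored as a set of bonds. The Python sketch below shows this; representing a bond as a pair of atom identifiers, the function names and the example fragments are assumptions made for illustration and are not taken from the anti-tuberculosis dataset.

```python
def recall(found, reference):
    """Equation 3: share of the reference bonds that the approach discovered."""
    return len(found & reference) / len(reference)

def precision(found, reference):
    """Equation 4: share of the discovered bonds that belong to the reference."""
    return len(found & reference) / len(found)

# Hypothetical fragments; each bond is a pair of atom identifiers.
reference = {("c1", "c2"), ("c2", "c3"), ("c3", "n1"), ("n1", "o1"), ("o1", "c4")}
found = {("c1", "c2"), ("c2", "c3"), ("c3", "n1"), ("c4", "c5")}
```

In this example the approach recovers three of the five reference bonds (recall 0.6) and reports one excess bond among its four (precision 0.75), which illustrates why both measures are needed.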

The results of our experiments are shown in Table 1. The table indicates that our approach can correctly determine the most significant active fragment, because its recall and precision values are both close to 1.0. The recall and precision values of SUBDUE are 0.80 and 0.75, respectively. Although the performance of SUBDUE is also high, our approach outperforms it on both measures. Unlike SUBDUE and the proposed approach, FSG performs poorly; its recall and precision values are 0.40 and 0.67, respectively. This performance difference is intuitive, because FSG cannot use the class information (e.g., active molecule, inactive molecule) while determining fragments. Although FSG can find fragments belonging to active molecules, these fragments may not be active fragments, because they may also repeat in the structure of inactive molecules. In contrast, SUBDUE, like the proposed approach, uses class information while determining frequent fragments. Hence, it can determine frequent fragments that exist in active molecules but not in inactive molecules. Therefore, in our experiments, the performance of SUBDUE is higher than that of FSG. These findings imply that inactive molecules are of significant importance in determining active fragments. Hence, an approach that uses both active and inactive molecules should be used for active fragment discovery in drug design.

Table 1. Performance of the proposed approach, SUBDUE and FSG with respect to the recall and precision metrics

            Proposed Approach   SUBDUE   FSG
Recall      0.95                0.80     0.40
Precision   0.97                0.75     0.67

4.2 Benchmark Comparisons: Synthetic Datasets

In this part of our experiments, we use the ParMol³ package for frequent subgraph mining [16], which includes publicly available implementations of gSpan [12], Gaston [13], MoFa [14], and FFSM [15]. We compare our approach with these well-known methods using synthetic datasets. A synthetic graph dataset is composed of two sets of graphs: the graphs representing active molecules (S_a) and the graphs representing inactive molecules (S_i). In these graphs, labels are real numbers, where the nodes and edges take their labels in the ranges [0, 4] and [0, 6], respectively. In order to compute the similarity between nodes (or edges), we use the distance between their labels. That is, the closer the labels of two nodes (or edges) are on the number scale, the more similar they are.

3 http://www2.cs.fau.de/Forschung/Projekte/ParMol/

Before creating the graphs in S_a or S_i, we first create three different sets of patterns: the subgraphs repeating only in active molecules (P_a), the subgraphs repeating only in inactive molecules (P_i), and the subgraphs repeating in both active and inactive molecules (P_ai). These three sets of patterns are inspired by real-life datasets. Each graph G ∈ S_a is created so that it includes subgraphs from both P_a and P_ai. However, before adding these subgraphs to G, we modify them according to a parameter p_n, called the probability of noise. If p_n = 0, we directly add subgraphs from P_a and P_ai to G. If p_n > 0, we modify these subgraphs with probability p_n by adding some noise to the labels of their nodes and edges before adding them to G. The amount of noise to be added is determined randomly in the range [0, 0.25]. We also add m other nodes to G, where the number m and the labels of these nodes are chosen randomly. In order to ensure the connectivity of the graph, we add edges randomly between these nodes and the others. Lastly, we randomly give a unique ID to each node in the graph. Similarly, we create graphs for inactive molecules, but this time the graphs are produced using the patterns from P_i and P_ai. Using this procedure, we generate a number of graphs representing active and inactive molecules. In the resulting graph dataset, some similar patterns repeat in almost every graph, while others repeat only in the graphs representing either the active molecules or the inactive molecules. The situation is similar in real-life datasets, where only the substructures repeating in active molecules are responsible for the activity of the molecule, while the substructures repeating in almost every molecule or only in the inactive molecules do not have any significant effect on the activity.
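The noise step of this procedure can be sketched in Python as follows. The sketch makes several assumptions that the description above leaves open: the noise is additive and non-negative, one value is drawn independently for every node and edge label, the decision to add noise is made once per pattern, and labels are not clipped back into their original ranges. The graph encoding and the function name are hypothetical.

```python
import random

def add_noise(pattern, p_n, rng, max_noise=0.25):
    """Return a copy of a pattern whose labels are perturbed with probability p_n.

    pattern : pair (node_labels, edge_labels) of dicts with real-valued labels.
    p_n     : probability of noise.
    rng     : random.Random instance, passed in for reproducibility.
    """
    nodes, edges = pattern
    if rng.random() >= p_n:
        return dict(nodes), dict(edges)  # pattern is added unchanged
    # Assumed noise model: add a value drawn uniformly from [0, max_noise].
    noisy_nodes = {v: lab + rng.uniform(0.0, max_noise) for v, lab in nodes.items()}
    noisy_edges = {e: lab + rng.uniform(0.0, max_noise) for e, lab in edges.items()}
    return noisy_nodes, noisy_edges

# Hypothetical pattern with node labels in [0, 4] and an edge label in [0, 6].
pattern = ({1: 1.0, 2: 2.0}, {frozenset({1, 2}): 3.0})
rng = random.Random(0)
clean_copy = add_noise(pattern, 0.0, rng)
noisy_copy = add_noise(pattern, 1.0, rng)
```

With p_n = 0 the pattern is always copied unchanged, and with p_n = 1 every copy is perturbed, which corresponds to the two extremes of the probability of noise used in our datasets.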

We created seven different datasets using different numbers of active and inactive molecules as well as different values for the probability of noise. Each graph in our datasets has 20 nodes and 40 edges on average. For each dataset, we know exactly which subgraphs repeat only in the graphs representing active molecules. Note that these subgraphs do not repeat exactly but with some small differences (called noise). Hence, we can quantitatively measure how successful our approach is compared to the other approaches in finding these subgraphs. Figure 6 summarizes our results for the competing subgraph mining methods on our datasets.

Our experiments show that the graph mining methods gSpan, Gaston, MoFa and FFSM can find all of the active substructures correctly when there is no noise (pn = 0); however, our approach can find only 83% of these substructures. The inferior performance of our approach in this case can be explained as follows. When pn = 0, there are cases where all of the points in an active cluster have the same coordinates; this means that these points are identical rather than merely similar. In these cases, the computed minimum volume ellipsoids for these clusters have a radius of zero in each dimension (r = [0, 0, 0]). Therefore, we cannot correctly determine which molecules have pieces falling into these clusters using Equation 2.
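The degenerate case can be illustrated with a small sketch. Equation 2 is not reproduced here, so the membership test below is only an assumption: it uses the usual normalized-distance test for an axis-aligned ellipsoid, and the function names are hypothetical. The second function shows one possible workaround (flooring each radius at a small `eps`), which is not part of the evaluated approach.

```python
def inside_ellipsoid(point, center, radii):
    """Assumed test: sum(((x - c) / r) ** 2) <= 1 for an axis-aligned ellipsoid."""
    return sum(((x - c) / r) ** 2 for x, c, r in zip(point, center, radii)) <= 1.0

def inside_ellipsoid_safe(point, center, radii, eps=1e-6):
    """Hypothetical workaround: floor each radius at eps so identical points still match."""
    return sum(((x - c) / max(r, eps)) ** 2
               for x, c, r in zip(point, center, radii)) <= 1.0

center = (0.2, 0.5, 0.1)
zero_radii = [0.0, 0.0, 0.0]  # all cluster points coincide, so r = [0, 0, 0]

try:
    inside_ellipsoid(center, center, zero_radii)
except ZeroDivisionError:
    # The test is undefined even for a point that sits exactly on the cluster.
    print("membership test undefined for r = [0, 0, 0]")

print(inside_ellipsoid_safe(center, center, zero_radii))           # identical point
print(inside_ellipsoid_safe((0.3, 0.5, 0.1), center, zero_radii))  # different point
```

Under this assumed form of the test, a zero radius makes the normalized distance undefined, which is consistent with the lower recall observed when pn = 0.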

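| Dataset # | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| # of active molecules | 10 | 10 | 10 | 10 | 50 | 50 | 100 |
| # of negative molecules | 10 | 10 | 10 | 10 | 50 | 50 | 100 |
| Probability of noise [0-1] | 0 | 0.35 | 0.5 | 1 | 1 | 0 | 1 |
| Proposed Method, P | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Proposed Method, R | 0.83 | 0.89 | 1 | 1 | 1 | 0.94 | 0.95 |
| gSpan, P | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| gSpan, R | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| Gaston, P | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| Gaston, R | 1 | 0.06 | 0.06 | 0.05 | 0 | 1 | 0.05 |
| MoFa, P | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| MoFa, R | 1 | 0.2 | 0 | 0 | 0 | 1 | 0 |
| FFSM, P | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| FFSM, R | 1 | 0 | 0 | 0 | 0 | 1 | 0 |

Fig. 6. Experimental results for the synthetic datasets (P refers to precision and R refers to recall).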
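For reference, precision and recall over sets of patterns can be computed as in the sketch below. The function name, the pattern identifiers, and the convention of reporting a precision of 0 when a method returns no patterns are assumptions made for illustration; the example values are hypothetical and are not taken from the experiments.

```python
def precision_recall(found, relevant):
    """Precision and recall of the patterns a method reports.

    found    -- patterns reported by the method
    relevant -- ground-truth patterns that repeat only in active graphs
    """
    found, relevant = set(found), set(relevant)
    hits = len(found & relevant)
    # Assumed convention: precision is 0 when nothing is reported.
    precision = hits / len(found) if found else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: six ground-truth patterns, five recovered, none spurious.
truth = {"a1", "a2", "a3", "a4", "a5", "a6"}
reported = {"a1", "a2", "a3", "a4", "a5"}
p, r = precision_recall(reported, truth)
print(f"P = {p:.2f}, R = {r:.2f}")
```

A method that reports only correct patterns but misses some of them therefore has a precision of 1 and a recall below 1.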

An increase in the probability of noise results in a dramatic performance decrease in the graph mining methods gSpan, Gaston, MoFa and FFSM. For example, when pn = 0.35, both the recall and the precision of gSpan and FFSM fall to zero, while the recall values of Gaston and MoFa approach zero. The situation becomes more dramatic when pn becomes 1.0; both the precision and recall become zero for gSpan, Gaston, MoFa and FFSM. In all these cases, our approach can find all of the active substructures correctly; it has precision and recall values almost equal to 1.0 when pn > 0. The performance of our approach does not change with the size of the datasets. These experiments show that our approach can successfully find similar frequent substructures repeating in active molecules, but not in inactive molecules; however, the other methods can find these substructures correctly only if they repeat exactly in the molecules (i.e., pn = 0).
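The effect of noise on exact matching can be seen in a toy example. The sketch below is a deliberate simplification and not the clustering procedure of the proposed approach: it treats a pattern as a tuple of numeric labels and compares exact equality against a per-label tolerance. The function names, the label values, and the tolerance are hypothetical.

```python
def exact_support(pattern, occurrences):
    """Number of occurrences whose labels equal the pattern exactly."""
    return sum(1 for occ in occurrences if occ == pattern)

def similar_support(pattern, occurrences, tol=0.25):
    """Number of occurrences whose labels all lie within tol of the pattern."""
    return sum(
        1 for occ in occurrences
        if len(occ) == len(pattern)
        and all(abs(a - b) <= tol for a, b in zip(occ, pattern))
    )

pattern = (0.10, 0.40, 0.70)
occurrences = [
    (0.10, 0.40, 0.70),  # exact copy, as when pn = 0
    (0.15, 0.52, 0.71),  # noisy copy
    (0.30, 0.45, 0.90),  # noisy copy
]

print(exact_support(pattern, occurrences))    # counts only the exact copy
print(similar_support(pattern, occurrences))  # counts the noisy copies as well
```

With exact matching the support of the pattern drops as soon as its labels are perturbed, so it may fall below the frequency threshold, while a similarity-based count still sees every occurrence.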

5 Conclusion

An important step of drug design is the analysis of active molecules for a specific disease and the discovery of substructures that are responsible for the activity of these molecules. These discovered structures then constitute the basis of new drugs for the disease. Hence, mining molecules for those substructures is crucial for designing new drugs. Current techniques for mining molecules are usually based on graph representations of these molecules and on frequent subgraph mining methods. However, in many real-life settings, these methods cannot find the most informative substructures, which do not repeat exactly in the molecules but exist in them with some fine differences.

To the best of our knowledge, the problem of discovering frequent similar subgraphs has not been addressed before in the literature. In this paper, we have extended the definition of frequent subgraph mining to formalize this new problem. We also propose an approach for finding these similar subgraphs (e.g., substructures and patterns) repeating in the graphs representing active molecules. We have evaluated the performance of our approach using experiments on an anti-tuberculosis dataset. Our experiments show that our approach can correctly determine active fragments of molecules that account for the activity of those molecules. We also measure the performance of our approach for finding similar frequent substructures using synthetic datasets. Our extensive comparisons with other approaches from the literature confirm the success of our approach in finding frequent patterns and frequent similar patterns in graphs and molecules.

As future work, we plan to make more detailed comparisons. In this study, we use an anti-tuberculosis dataset to demonstrate and evaluate our approach. We want to evaluate our approach using datasets for other diseases as well. Also, the performance of our approach is lower when there is no noise in the dataset; this creates an opportunity for further research to improve our approach.

References

1. Fischer, I., Meinl, T.: Graph based molecular data mining – an overview. In: Proceedings of IEEE International Conference on Systems, Man and Cybernetics. (2004) 4578–4582

2. Macaev, F., Rusu, G., Pogrebnoi, S., Gudima, A., Stingaci, E., Vlad, L., Shvets, N., Kandemirli, F., Dimoglo, A., Reynolds, R.: Synthesis of novel 5-aryl-2-thio-1,3,4-oxadiazoles and the study of their structure-anti-mycobacterial activities. Bioorganic and Medicinal Chemistry 13 (2005) 4842–4850

3. Nayyar, A., Monga, V., Malde, A., Coutinho, E., Jain, R.: Synthesis, anti-tuberculosis activity, and 3D-QSAR study of 4-(adamantan-1-yl)-2-substituted quinolines. Bioorganic and Medicinal Chemistry 15 (2007) 626–640

4. Borgelt, C., Berthold, M.R.: Mining molecular fragments: Finding relevant substructures of molecules. In: Proceedings of the IEEE International Conference on Data Mining (ICDM). (2002) 51–58

5. Washio, T., Motoda, H.: State of the art of graph-based data mining. ACM SIGKDD Explorations Newsletter 5 (2003) 59–68

6. Shvets, N., Dimoglo, A.S.: The electron-topological method (ETM): Its further development and use in the problems of SAR study. In Gundertofte, K., Jorgensen, F.S., eds.: Molecular Modeling and Prediction of Bioactivity. (1999) 418–429

7. Alpaydin, E.: Introduction to Machine Learning. MIT Press (2001)

8. Perner, P.: Data Mining on Multimedia Data. Springer-Verlag, New York (2003)

9. Todd, M.J., Yıldırım, E.A.: On Khachiyan’s algorithm for the computation of minimum-volume enclosing ellipsoids. Discrete Applied Mathematics 155 (2007) 1731–1744

10. Cook, D.J., Holder, L.B.: Graph-based data mining. IEEE Intelligent Systems 15 (2000) 32–41

11. Kuramochi, M., Karypis, G.: Frequent subgraph discovery. In: Proceedings of the IEEE International Conference on Data Mining (ICDM). (2001) 313–320

12. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: Proceedings of the IEEE International Conference on Data Mining (ICDM). (2002) 721–724

13. Nijssen, S., Kok, J.N.: A quickstart in frequent structure mining can make a difference. In: KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. (2004) 647–652

14. Borgelt, C., Berthold, M.R.: Mining molecular fragments: Finding relevant substructures of molecules. In: ICDM ’02: Proceedings of the 2002 IEEE International Conference on Data Mining. (2002) 51–58

15. Huan, J., Wang, W., Prins, J.: Efficient mining of frequent subgraphs in the presence of isomorphism. In: ICDM ’03: Proceedings of the Third IEEE International Conference on Data Mining. (2003) 549–552

16. Meinl, T., Worlein, M., Urzova, O., Fischer, I., Philippsen, M.: The parmol package for frequent subgraph mining. Electronic Communications of the EASST 1 (2007) 1–12