
FGMAC: Frequent subGraph Mining with Arc Consistency

Brahim Douar and Michel Liquiere
LIRMM, 161 rue Ada, 34392 Montpellier, France
{douar,liquiere}@lirmm.fr

Chiraz Latiri and Yahya Slimani
URPAH Team, Faculty of Sciences of Tunis
[email protected], [email protected]

Abstract—With the growing need to analyze large amounts of structured data such as chemical compounds, protein structures and XML documents, to cite but a few, graph mining has become an attractive track and a real challenge in the data mining field. Among the various kinds of graph patterns, frequent subgraphs seem to be relevant for characterizing graph sets, discriminating different groups of sets, and classifying and clustering graphs. Because of the NP-completeness of the subgraph isomorphism test as well as the huge search space, fragment miners are exponential in runtime and/or memory consumption. In this paper we study a new polynomial projection operator named AC-projection, based on a key technique of constraint programming, namely Arc Consistency (AC). It is intended to replace the exponential subgraph isomorphism test. We study the relevance of frequent AC-reduced graph patterns for classification and show that an important performance gain can be achieved with little or no loss in the quality of the discovered patterns.

Index Terms—Graph mining; AC-projection; Graph classification.

I. INTRODUCTION

In front of the urgent need to analyze large amounts of structured data such as chemical compounds, protein structures and XML documents, to cite but a few, graph mining has become a compelling issue in the data mining field. Indeed, discovering frequent subgraphs, i.e., subgraphs which occur frequently enough over the entire set of graphs, is a real challenge due to their exponential number: based on the APRIORI principle [1], a frequent n-edge graph may contain 2^n frequent subgraphs. This raises a serious problem related to the exponential search space as well as to the counting of complete sub-patterns, since the kernel of frequent subgraph mining is the subgraph isomorphism test, which has been proved NP-complete [2].

In this paper, we study an innovative projection operator intended to replace the costly subgraph isomorphism. In the second section we give a brief literature review of the subgraph mining field. Then, we present the AC-projection operator initially introduced in [3], as well as its very interesting properties. We propose an efficient graph mining algorithm using the AC-projection operator. Finally, we study the relevance of the AC-reduced patterns for supervised graph classification.

II. FREQUENT SUBGRAPH MINING

Given a database consisting of small graphs, for example molecular graphs, the problem of mining frequent subgraphs is to find all subgraphs that are subgraph isomorphic with a large number of example graphs in the database. In this section we define preliminary concepts and give a brief review of the literature related to frequent subgraph mining.

A. Preliminary Concepts

Definition II.1 (Labeled Graph) A labeled graph can be represented by a 4-tuple, G = (V, A, L, l), where
• V is a set of vertices,
• A ⊆ V × V is a set of edges,
• L is a set of labels,
• l : V ∪ A → L is a function assigning labels to the vertices and the edges.

Definition II.2 (Isomorphism, Subgraph Isomorphism) Given two graphs G1 and G2, an isomorphism is a bijective function f : V(G1) → V(G2) such that
∀x ∈ V(G1), l(x) = l(f(x)), and
∀(x, y) ∈ A(G1), (f(x), f(y)) ∈ A(G2) and l(x, y) = l(f(x), f(y)).
A subgraph isomorphism from G1 to G2 is an isomorphism from G1 to a subgraph of G2.
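To make these definitions concrete, the following Python sketch (our own illustration, not code from the paper) represents a labeled graph as two dictionaries and tests subgraph isomorphism by brute force; the factorial enumeration over vertex images is precisely the exponential cost discussed throughout this paper.

from dataclasses import dataclass
from itertools import permutations

@dataclass
class Graph:
    vlabel: dict   # vertex -> label (the function l restricted to V)
    alabel: dict   # (u, v) -> label (the function l restricted to A), A ⊆ V × V

def subgraph_isomorphic(g1, g2):
    """Brute-force Definition II.2: search for an injective f : V(g1) -> V(g2)
    preserving vertex labels, arcs, and arc labels. Exponential in |V(g1)|."""
    v1 = list(g1.vlabel)
    for image in permutations(g2.vlabel, len(v1)):
        f = dict(zip(v1, image))
        if all(g1.vlabel[x] == g2.vlabel[f[x]] for x in v1) and \
           all((f[x], f[y]) in g2.alabel and g2.alabel[(f[x], f[y])] == lab
               for (x, y), lab in g1.alabel.items()):
            return True
    return False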

Definition II.3 (Frequent Subgraph Mining) Given a graph dataset GS = {Gi | i = 0, ..., n} and a minimal support (minSup), let

ς(g, G) = 1 if there is a projection from g to G, and 0 otherwise.

σ(g, GS) = Σ_{Gi ∈ GS} ς(g, Gi)

σ(g, GS) denotes the occurrence frequency of g in GS, i.e., the support of g in GS. Frequent subgraph mining is to find every graph g such that σ(g, GS) is greater than or equal to minSup.

Known frequent subgraph miners are based on this definition and deal with the special case where the projection operator is subgraph isomorphism.
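Definition II.3 transcribes directly into code. A minimal sketch, where the projection operator is a parameter (subgraph isomorphism here, AC-projection later in the paper):

def support(g, GS, projects=subgraph_isomorphic):
    # σ(g, GS): number of graphs in GS into which g projects
    return sum(1 for G in GS if projects(g, G))

def is_frequent(g, GS, min_sup, projects=subgraph_isomorphic):
    return support(g, GS, projects) >= min_sup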


B. Related Works

Algorithms for frequent subgraph mining are based on two pattern discovery paradigms, namely breadth-first search and depth-first search. They aim to find the connected subgraphs that have a sufficient number of edge-disjoint embeddings in a single large undirected labeled sparse graph. Most of these algorithms use different methods for determining the number of edge-disjoint embeddings of a subgraph and employ different ways of generating candidates and counting support. An interesting quantitative comparison of the most cited subgraph miners is given in [4].

The novel graph mining approach that we present in this paper builds on a breadth-first approach intensively cited in the literature. The following section is intended to present this approach, named FSG [5].

C. The FSG Algorithm

Principal breadth-first approaches take advantage of the APRIORI [1] levelwise approach. The FSG algorithm finds frequent subgraphs using the same level-by-level expansion. FSG uses a sparse graph representation which minimizes both storage and computation, and it increases the size of frequent subgraphs by adding one edge at a time, allowing candidates to be generated efficiently. Various optimizations have been proposed for candidate generation and counting, which allowed FSG to scale to large graph databases. For problems in which a moderately large number of different types of vertices and edges exist, FSG was able to achieve good performance and to scale linearly with the database size. For problems where the number of edge and vertex labels was small, the performance of FSG was worse, as the exponential complexity of graph isomorphism dominates the overall performance.

In this paper, we are particularly interested in the FSG algorithm. Indeed, we propose a basic subgraph mining approach which is a modified FSG version using a novel operator for the support counting process.

D. Critical Discussion

Developing algorithms that discover all frequently occurring subgraphs in a large graph database is computationally expensive, as graph and subgraph isomorphisms play a key role throughout the computations. Since subgraph isomorphism testing is a hard problem, fragment miners are exponential in runtime. Many frequent subgraph miners have tried to avoid the NP-completeness of the subgraph isomorphism problem by storing all embeddings in embedding lists, which consist of a mapping of the vertices and edges of a fragment to the corresponding vertices and edges of the graph it occurs in. With this trick, excessive subgraph isomorphism tests can be avoided when counting fragment support, thereby avoiding exponential runtime. However, these approaches face exponential memory consumption instead. They merely trade time for storage, and can cause problems if not enough memory is available or if the memory throughput is not high enough. The authors in [4], after an extensive experimental study of different subgraph miners, conclude that embedding lists do not considerably speed up the search for frequent fragments: even though GSPAN [6] does not use them, it is competitive with GASTON [7] and FFSM [8], at least for fragments that are not too big.

So, it seems that a better way to avoid exponential runtime and/or memory consumption is to use another projection operator instead of subgraph isomorphism. This projection has to have polynomial time complexity as well as polynomial memory consumption.

In [3], the author introduced an interesting projection operator named AC-projection, which has good properties and ensures polynomial time and memory consumption. The forthcoming sections present this operator with its many interesting properties and show an optimized algorithm for computing it.

III. AC-PROJECTION

The approach suggested in [3] advocates a projection operator based on the arc consistency algorithm. This projection method has the required properties: polynomiality, local validation, parallelization, structural interpretation.

A. AC-projection And Arc Consistency

Definition III.1 (Labeling) Let G1 and G2 be two graphs. We call a labeling from G1 into G2 a mapping I : V(G1) → 2^{V(G2)} such that ∀x ∈ V(G1), ∀y ∈ I(x), l(x) = l(y).

Thus, for a vertex x ∈ V(G1), I(x) is a set of vertices of G2 with the same label l(x). We can say that I(x) is the set of “possible images” of the vertex x in G2.

This first labeling is trivial but can be refined using the neighborhood relations between vertices.

Definition III.2 (AC-compatibility y) Let G be a graph, V1 ⊆ V(G), V2 ⊆ V(G).
V1 is AC-compatible with V2 iff
1) ∀xk ∈ V1, ∃yp ∈ V2 | (xk, yp) ∈ A(G);
2) ∀yq ∈ V2, ∃xm ∈ V1 | (xm, yq) ∈ A(G).
We note V1 y V2.
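In Python, and assuming arcs are stored as pairs as in the sketches above, Definition III.2 is a direct double check:

def ac_compatible(V1, V2, arcs):
    """V1 y V2 (Definition III.2): every vertex of V1 has a successor in V2
    and every vertex of V2 has a predecessor in V1, over the arc set `arcs`."""
    return (all(any((x, y) in arcs for y in V2) for x in V1)
            and all(any((x, y) in arcs for x in V1) for y in V2))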

Definition III.3 (Consistency for one arc) Let G1 and G2 be two graphs. We say that a labeling I : V(G1) → 2^{V(G2)} is consistent with an arc (x, y) ∈ A(G1) iff I(x) y I(y).

Definition III.4 (AC-labeling) Let G1 and G2 be two graphs. A labeling I from G1 into G2 is an AC-labeling iff I is consistent with all the arcs e ∈ A(G1).

Definition III.5 (AC-projection ⇁) Let G1 and G2 be two graphs. An AC-labeling I : V(G1) → 2^{V(G2)} is an AC-projection iff for every AC-labeling I′ : V(G1) → 2^{V(G2)} and ∀x ∈ V(G1), I′(x) ⊆ I(x). We note it G1 ⇁ G2.


Algorithm 1: AC-projection
Input: Two graphs G1 and G2
Output: An AC-projection I from G1 into G2 if there is one, otherwise an empty set

// Initialisation
foreach x ∈ V(G1) do
    I(x) ← {y ∈ V(G2) | l(x) = l(y)};
S ← A(G1); P ← ∅;
while S ≠ ∅ do
    Choose an arc (x, y) from S; // in general the first element of S
    I′ ← ReviseArc((x, y), I, G2);
    // If for one vertex x ∈ V(G1) we have I′(x) = ∅, then there is no arc consistency
    if (I′(x) = ∅) or (I′(y) = ∅) then
        return ∅;
    // I′ is now consistent with the arc (x, y), but it can be inconsistent with some
    // other previously tested arcs, so we have to verify and restore (if necessary)
    // the consistency of all these arcs.
    if I(x) ≠ I′(x) then
        R ← {(x′, y′) ∈ P | x′ = x or y′ = x};
        S ← S ∪ R; P ← P \ R;
    if I(y) ≠ I′(y) then
        R ← {(x′, y′) ∈ P | x′ = y or y′ = y};
        S ← S ∪ R; P ← P \ R;
    S ← S \ {(x, y)}; P ← P ∪ {(x, y)}; I ← I′;
return I;

B. AC-projection: Improved Algorithm

We give an improved AC-projection algorithm for graphs, based on the AC3 algorithm [9]. The AC-projection algorithm takes two graphs G1 and G2 and tests whether there is an AC-projection from G1 into G2 (see Algorithm 1). It begins by creating a first rough labeling I and reduces, for each vertex x, the given list I(x) to a consistent list using the function ReviseArc. The consistency check fails if some I(x) becomes empty; otherwise it succeeds and the algorithm returns the labeling I, which is an AC-projection G1 ⇁ G2. Like the AC3 algorithm, this AC-projection algorithm has a worst-case time complexity of O(e × d^3) and a space complexity of O(e), where e is the number of arcs and d is the size of the largest domain. In our case, the size of the largest domain is the size of the largest subset of nodes with the same label.
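A compact Python sketch of Algorithm 1 and its ReviseArc helper, reusing the Graph class above. It is our simplified variant: instead of maintaining the set P of already-processed arcs, the worklist simply re-enqueues every arc incident to a vertex whose candidate set shrank; both variants compute the same greatest arc-consistent labeling.

def revise_arc(x, y, I, g2):
    """ReviseArc: keep in I(x)/I(y) only the vertices supported along arc (x, y)."""
    Ix = {xp for xp in I[x] if any((xp, yp) in g2.alabel for yp in I[y])}
    Iy = {yp for yp in I[y] if any((xp, yp) in g2.alabel for xp in I[x])}
    return Ix, Iy

def ac_projection(g1, g2):
    """AC3-style test for G1 ⇁ G2 (Algorithm 1): returns the AC-projection I,
    i.e. the largest arc-consistent labeling, or None if some I(x) empties."""
    I = {x: {y for y in g2.vlabel if g1.vlabel[x] == g2.vlabel[y]}
         for x in g1.vlabel}                        # initial rough labeling
    if any(not dom for dom in I.values()):
        return None                                 # some vertex has no possible image
    S = set(g1.alabel)                              # arcs of g1 left to check
    while S:
        (x, y) = S.pop()
        Ix, Iy = revise_arc(x, y, I, g2)
        if not Ix or not Iy:
            return None                             # no AC-projection exists
        for v, new in ((x, Ix), (y, Iy)):
            if new != I[v]:                         # domain shrank: recheck
                I[v] = new                          # arcs touching v
                S |= {a for a in g1.alabel if v in a and a != (x, y)}
    return I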

C. AC-projection And Reduction

The following definition introduces an equivalence relation between graphs w.r.t. AC-projection.

Function ReviseArc
Input: A graph G2, a labeling I from G1 into G2, an arc (x, y) ∈ A(G1)
Output: A new labeling I′ from G1 into G2

I′ ← I;
I′(x) ← I(x) \ {x′ ∈ V(G2) | ∄ y′ ∈ I(y) with (x′, y′) ∈ A(G2)};
I′(y) ← I(y) \ {y′ ∈ V(G2) | ∄ x′ ∈ I(x) with (x′, y′) ∈ A(G2)};
return I′;

Definition III.6 (AC-equivalent graphs) Two graphs G1 and G2 are AC-equivalent iff both G1 ⇁ G2 and G2 ⇁ G1 hold. We note G1 ≡ G2.

We thus have an equivalence relation between graphs using the AC-projection. In this paragraph we study the properties of this operation and search for a reduced element in an equivalence class of graphs. This element will be the unique representative of its equivalence class, which we call the “AC-reduced graph”.

Fig. 1. AC-equivalent graphs and the associated AC-reduced one (extremeright)

1) Auto AC-projection And AC-reduction: We study the auto AC-projection operation (G ⇁ G), which we will use to find the minimal graph of an equivalence class of graphs; we prove in the following that the obtained graph is minimal.

Proposition III.7 Given an AC-projection I : G ⇁ G′, x′ ∈ I(x) iff for each tree T(VT, AT) (where VT is the set of vertices of T and AT its set of arcs) and each t ∈ VT we have: if there is a morphism from T to G which associates t to x, then there is a morphism from T to G′ which associates t to x′. [10]

Proposition III.8 (Order relation on I) For an AC-projection I : G ⇁ G, if xi ∈ I(x) then I(xi) ⊆ I(x).

Proof: If we have xi ∈ I(x), it means that for every tree T having a morphism in G which associates t to x, there is a morphism from T in G which associates t to xi (Proposition III.7). We call T(x,t) this set of trees.
Let us now see whether we can have xj ∈ I(xi) and xj ∉ I(x). For xj ∈ I(xi), according to Proposition III.7, all trees of T(x,t) associate the vertex xi to t. Since xj ∈ I(xi), the same holds for xj, so xj ∈ I(x).


We conclude that we cannot have xj ∉ I(x) when xj ∈ I(xi), so I(xi) ⊆ I(x).

Proposition III.9 Given a graph G, an AC-projection I : G ⇁ G, and a vertex x ∈ V(G) with |I(x)| > 1. If we have xi ∈ I(x), the graph G′ formed by merging x and xi is AC-equivalent to G.

Proof: To prove that G ≡ G′ we have to prove that G ⇁ G′ and G′ ⇁ G. Since G′ ⇁ G by construction, we only have to prove that G ⇁ G′: we construct this AC-projection by replacing x by xi in the auto AC-projection G ⇁ G. Since I(xi) ⊆ I(x), this is indeed an AC-projection. We conclude that G ⇁ G′.

Now, we want to find the smallest element of an equivalence class of graphs. For two AC-equivalent graphs G and G′, we will consider that G < G′ iff |V(G)| < |V(G′)|.

Proposition III.10 (Minimality) A graph G is minimal in its equivalence class iff for the AC-projection I : G ⇁ G, ∀x ∈ V(G), I(x) = {x}.

Proof: According to Proposition III.9, it is clear that if there were a vertex x such that |I(x)| > 1, then we would be able to do another reduction.
Now, the question is: can a graph G′ = G \ x be AC-equivalent to G? If this were true, then we would have an AC-projection from G to G′, meaning that x in G has another image x′ in G′. So, x′ would have to be in I(x), which contradicts the initial hypothesis.

Algorithm 3: AC-reduce
Input: A graph G
Output: G′ = AC-reduced G

G′ ← G;
I ← AC-projection(G, G);
Q ← V(G);
Sort Q such that x comes before y if |I(x)| < |I(y)|;
foreach v in Q do
    foreach i in I(v) do
        if (i ≠ v) then
            N(v) ← N(v) ∪ N(i); // if v and i are neighbors, then we get a reflexive arc
            Q ← Q \ {i};
            V(G′) ← V(G′) \ {i};
return G′;

2) AC-reduce Algorithm: The AC-reduce algorithm is based on the properties given in the section above; these properties allow the AC-reduced graph to be constructed from any graph G. To do this, we simply perform an auto AC-projection G ⇁ G and then make the necessary merges. This algorithm is therefore very simple and has polynomial complexity, since the AC-projection's complexity is polynomial.
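A possible Python rendering of Algorithm 3, reusing ac_projection from above. It is only a sketch of the merging step: a removed vertex's arcs are redirected to its surviving representative (which also produces the reflexive arcs mentioned in the algorithm's comment), and label conflicts on merged arcs are ignored.

def ac_reduce(g):
    """Merge each vertex with the other vertices of its candidate set in the
    auto AC-projection g ⇁ g, yielding an AC-reduced graph (sketch)."""
    I = ac_projection(g, g)         # never None: the identity labeling is consistent
    merged = {}                     # removed vertex -> its survivor
    def rep(u):
        while u in merged:
            u = merged[u]
        return u
    for v in sorted(g.vlabel, key=lambda u: len(I[u])):
        if rep(v) != v:
            continue                # v was already merged away
        for i in I[v]:
            if i != v and rep(i) == i:
                merged[i] = v       # merge i into v
    vlabel = {u: lab for u, lab in g.vlabel.items() if u not in merged}
    alabel = {(rep(u), rep(w)): lab for (u, w), lab in g.alabel.items()}
    return Graph(vlabel, alabel)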

IV. FGMAC: FREQUENT SUBGRAPH MINING WITH ARC CONSISTENCY

In this section, we present FGMAC, a modified version of the FSG algorithm [5] based on the AC-projection operator. In this version we have changed the support counting part: instead of subgraph isomorphism, the AC-projection is used to verify whether a candidate graph appears in a transaction or not.

A. The Algorithm

The FGMAC algorithm initially enumerates all the frequent single and double edge graphs. Then, based on those two sets, it starts the main computational loop. During each iteration it first generates candidate subgraphs whose size is one edge greater than the previous frequent ones (Algorithm 4, line 5). Next, it counts the frequency of each of these candidates and prunes the subgraphs that do not satisfy the support constraint (Algorithm 4, lines 6-11). Discovered frequent subgraphs satisfy the downward closure property of the support condition, which allows us to effectively prune the lattice of frequent subgraphs.

FGMAC's particularity is to return only frequent AC-reduced graphs (Algorithm 4, line 11), which form a subset of the whole frequent isomorphic pattern set.

In the following we present the three key steps of the FGMAC main process.

Algorithm 4: FGMAC
Input: A graph dataset D, minimal support σ
Output: The set F of frequent subgraphs

1:  F1 ← detect all frequent (1-edge)-subgraphs in D;
2:  F2 ← detect all frequent (2-edge)-subgraphs in D;
3:  k ← 3;
4:  while Fk−1 ≠ ∅ do
5:      Ck ← fsg-gen(Fk−1);
6:      foreach candidate gk ∈ Ck do
7:          gk.count ← 0;
8:          foreach transaction t ∈ D do
9:              if gk ⇁ t then
10:                 gk.count ← gk.count + 1;
11:     Fk ← {AC-reduce(gk) | gk ∈ Ck, gk.count ≥ σ|D|};
12:     k ← k + 1;
13: return F;

B. Candidate Generation

This step is assured by the same fsg-gen function (seeAlgorithm 4, line 5) used in the FSG algorithm. This functionuses a precise joining operator (fsg-join) which generates(k+1)− edges subgraphs by joining two frequent k− edgessubgraphs. In order for two such frequent k−edges subgraphsto be eligible for joining they must contain the same (k−1)−edges subgraph named core. The complete description of thesefunctions as well as their detailed algorithms are given in [11].


TABLE I
CLASSIFICATION DATASETS STATISTICS

                         Distinct labels     Edges / Transaction    Vertices / Transaction
Dataset    Transactions  Edges   Vertices    Average    Max         Average    Max
HIA        86            3       8           24         44          22         40
PTC-FM     349           3       19          26         108         25         109
PTC-FR     351           3       20          27         108         26         109
PTC-MM     336           3       21          25         108         25         109
PTC-MR     344           3       19          26         108         26         109


Fig. 2. Runtime comparison of FGMAC versus FSG with the two datasets HIA and PTC-FM

C. Support Calculation

The key operator driving this step is the AC-projection previously described. To verify whether a pattern appears in a transaction or not, FGMAC calculates in polynomial time whether there is an AC-projection of the pattern into each of the transactions. In order to optimize this support calculation phase, the algorithm associates to each graph g of size k the set E(g) of transactions such that for each graph G ∈ E(g), g ⇁ G. For a graph g1 ∪ g2 representing the union of two graphs g1 and g2, the intersection of the two sets E(g1) ∩ E(g2) is first calculated.

As E(g1 ∪ g2) ⊆ E(g1) ∩ E(g2), it is possible to eliminate the graph g1 ∪ g2 if the number of transactions in E(g1) ∩ E(g2) is too low to make g1 ∪ g2 frequent. On the other hand, an AC-projection of g1 ∪ g2 needs to be searched only in the transactions of E(g1) ∩ E(g2).
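A sketch of this optimization (the function and parameter names are ours): the candidate is tested only against the transactions in E(g1) ∩ E(g2), and dropped early when that intersection is already too small to reach the support threshold.

def candidate_occurrences(candidate, E_g1, E_g2, transactions, min_count):
    """E_g1, E_g2: ids of the transactions that g1 and g2 AC-project into.
    Since E(g1 ∪ g2) ⊆ E_g1 ∩ E_g2, only those transactions are tested."""
    possible = E_g1 & E_g2
    if len(possible) < min_count:
        return None                     # prune: the candidate cannot be frequent
    return {t for t in possible
            if ac_projection(candidate, transactions[t]) is not None}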

D. AC-Reduction Of Frequent Subgraphs

This step is essential at the end of each iteration of the algorithm's main loop. It is intended to avoid the extraction of non-AC-reduced frequent graphs, keeping only the representative elements of the graph equivalence classes w.r.t. AC-equivalence. This process is based on the AC-reduce function described previously. We note that this step takes advantage of the polynomial complexity of the AC-reduction algorithm.

V. EXPERIMENTS AND COMPARATIVE STUDY

In order to prove the usefulness of the AC-projection for graph mining, we present in the following an experimental study of the FGMAC algorithm. We stress that the set of frequent AC-reduced graphs found by FGMAC is not exhaustive w.r.t. isomorphic patterns. So, in the following, we present a quantitative study of FGMAC's performance followed by a qualitative evaluation of the AC-reduced patterns, which consists in measuring their discriminative power within a supervised graph classification process.

A. Datasets

We carried out performance and classification experiments on five biological activity datasets widely cited in the literature. These datasets can be divided into two groups:

• The Predictive Toxicology Challenge (PTC) [12] contains a set of chemical compounds classified according to their toxicity in male rats (PTC-MR), female rats (PTC-FR), male mice (PTC-MM), and female mice (PTC-FM).

• The Human Intestinal Absorption (HIA) [13] contains chemical compounds classified by intestinal absorption activity.

B. Performance Point Of View

In this subsection we present a quantitative study of the computational performance of FGMAC compared to FSG. The results depicted in Figure 2 clearly show that FGMAC outperforms FSG in runtime for all selected minimal supports, and confirm the theoretical results about the polynomiality of the AC-projection operator compared to the exponential complexity of the subgraph isomorphism adopted by FSG.

In the following section, we present a qualitative study of frequent AC-reduced patterns.

C. Qualitative Point Of View: Graph Classification

Graph classification is a supervised learning problem in which the goal is to categorize an entire graph as a positive or negative instance of a concept. Feature mining on graphs is usually performed by finding all frequent or informative substructures in the graph instances. These substructures are used for transforming the graph data into data represented as a single table, and traditional classifiers are then used for classifying the instances. The aim of using graph classification in this paper is to evaluate the quality and discriminative power of frequent AC-reduced subgraph patterns, and to compare them with isomorphic frequent subgraphs.

Fig. 3. Comparison of the number of patterns of the different feature sets (Frequent and Closed) for the HIA and PTC-FM datasets.

We carried out classification experiments on five biological activity datasets, and measured classifier prediction accuracy using the well-known decision tree classifier C4.5 [14]. The classification methods are described in more detail in the following subsections, along with the associated results.

1) Methods: We evaluated the classification accuracy using four different feature sets. The first set of features (Frequent) consists of all frequent subgraphs. Those subgraphs are mined using the FSG software [5] with different minimal supports. Each chemical compound is represented by a binary vector whose length equals the number of mined subgraphs. Each subgraph is mapped to a specific vector index; if a chemical compound contains a subgraph, the bit at the corresponding index is set to one, otherwise it is set to zero.

The second feature set (Closed) is simply a subset of the first set: it consists of only the closed frequent subgraphs. Those subgraphs are also mined using FSG, with the special parameter (-x) to keep only closed frequent subgraphs.

The third feature set (AC-reduced) contains the FGMAC output, which consists of only AC-reduced frequent subgraphs. We represent each chemical compound by a binary vector whose length equals the number of AC-reduced mined subgraphs. Each AC-reduced subgraph is mapped to a specific vector index; if there is an AC-projection from the AC-reduced subgraph to the chemical compound, the bit at the corresponding index is set to one, otherwise it is set to zero.
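In Python, this binary encoding reduces to one line per compound (a sketch; `patterns` stands for the list of mined AC-reduced subgraphs):

def feature_vector(compound, patterns, projects=ac_projection):
    """One bit per pattern: 1 iff the pattern projects into the compound."""
    return [0 if projects(p, compound) is None else 1 for p in patterns]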

Finally, the fourth feature set (Closed AC-reduced) is similar to the third one; the difference is that we only consider closed AC-reduced frequent subgraphs, obtained with a special parameter passed to FGMAC.

2) Results: All classifications have been done with the Weka data-mining software package [15], and we report the prediction accuracy over the 10 cross-validation trials. In the following we analyze the AC-reduced patterns from quantitative and qualitative points of view.
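The authors used Weka; an equivalent evaluation can be sketched with scikit-learn (our stand-in, with DecisionTreeClassifier approximating C4.5; the variables graphs, labels and patterns are assumed to come from the mining step):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = [feature_vector(g, patterns) for g in graphs]   # binary pattern indicators
y = labels                                          # class of each compound
pcc = 100 * cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()
print(f"PCC over 10-fold cross-validation: {pcc:.2f}%")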

a) Patterns Count: According to the results shown in Figure 3, we see that for all datasets we have very few AC-reduced frequent patterns compared to the isomorphic ones: on average 35% fewer patterns. This reduction is larger for lower supports and can reach 70% for the HIA dataset with a minimal support of 10%. These experimental results confirm that the search space for extracting AC-reduced patterns is smaller than the one for classical isomorphic subgraphs. So, an algorithm which looks for all AC-reduced frequent subgraphs would benefit from the polynomiality of the projection operation as well as from a smaller search space (i.e., fewer AC-projection tests).

b) Classification Relevance: Since the number of frequent subgraph patterns decreases drastically after the AC-reduction process, one naturally wonders about the relevance of these fewer patterns for supervised graph classification. That is why we have conducted classification accuracy experiments comparing AC-reduced and isomorphic patterns.

As shown in Figure 4, for the average over all datasets and all classifiers, the percentage of correctly classified (PCC) instances is almost the same for every minimal support, as it also is for each dataset individually.

Taking a more in-depth look at the results, we see that, for some datasets and minimal support values, we even obtain a better PCC with the AC-reduced feature set. This is due to the better generalization power of the AC-reduction process, which helps supervised classifiers avoid the over-fitting problem.

Fig. 4. Comparison of the classification accuracy (PCC) of the different feature sets (Frequent and Closed) for all datasets (average) and for the HIA and PTC-FM datasets.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have studied the use of a new polynomial projection operator named AC-projection, initially introduced in [3] and based on a key technique of constraint programming, namely Arc Consistency (AC). We have shown that using the AC-projection and its properties yields far fewer patterns than the full sets of frequent or closed subgraphs, with very comparable quality and discriminative power. The AC-projection is intended to replace the exponential subgraph isomorphism, as well as to reduce the search space when looking for frequent subgraphs.

As an immediate perspective, we are working on a depth-first frequent subgraph mining approach based on the AC-projection operator. Given a graph dataset, this novel approach will be able to look for all frequent AC-reduced patterns within a reduced search space.

REFERENCES

[1] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, June 1994, pp. 478–499.

[2] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1979.

[3] M. Liquiere, “Arc consistency projection: A new generalization relation for graphs,” in ICCS, ser. LNCS, U. Priss, S. Polovina, and R. Hill, Eds., vol. 4604. Springer, 2007, pp. 333–346.

[4] M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen, “A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston,” in European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ser. LNCS, vol. 3721. Springer, 2005, pp. 392–403.

[5] M. Kuramochi and G. Karypis, “Frequent subgraph discovery,” in International Conference on Data Mining, N. Cercone, T. Y. Lin, and X. Wu, Eds. IEEE Computer Society, 2001, pp. 313–320.

[6] X. Yan and J. Han, “gSpan: Graph-based substructure pattern mining,” in International Conference on Data Mining. IEEE Computer Society, 2002, pp. 721–724.

[7] S. Nijssen and J. N. Kok, “The Gaston tool for frequent subgraph mining,” in International Workshop on Graph-Based Tools (GraBaTs), Electronic Notes in Theoretical Computer Science, 2004, pp. 77–87.

[8] J. Huan, W. Wang, and J. Prins, “Efficient mining of frequent subgraphs in the presence of isomorphism,” in International Conference on Data Mining. IEEE Computer Society, 2003, p. 549.

[9] A. K. Mackworth, “Consistency in networks of relations,” Artificial Intelligence, vol. 8, no. 1, pp. 99–118, 1977.

[10] P. Hell and J. Nešetřil, Graphs and Homomorphisms, ser. Oxford Lecture Series in Mathematics and its Applications, vol. 28. Oxford: Oxford University Press, 2004.

[11] M. Kuramochi and G. Karypis, “An efficient algorithm for discovering frequent subgraphs,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, pp. 1038–1051, 2004.

[12] C. Helma, R. D. King, S. Kramer, and A. Srinivasan, “The predictive toxicology challenge 2000–2001,” Bioinformatics, vol. 17, no. 1, pp. 107–108, 2001.

[13] M. D. Wessel, P. C. Jurs, J. W. Tolan, and S. M. Muskal, “Prediction of human intestinal absorption of drug compounds from molecular structure,” Journal of Chemical Information and Computer Sciences, vol. 38, no. 4, pp. 726–735, 1998.

[14] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed. Morgan Kaufmann, January 1993.

[15] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., ser. Morgan Kaufmann Series in Data Management Systems. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.

[16] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[17] B. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, 1991.

[18] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: John Wiley and Sons, 1973.


APPENDIX: FULL GRAPH CLASSIFICATION RESULTS (%)

TABLE II
PCC RESULTS FOR ALL DATASETS, ALL CLASSIFIERS AND DIFFERENT MINIMAL SUPPORTS.

For each dataset, the four column groups correspond to the Frequent, Closed, AC-reduced and Closed AC-reduced feature sets; in each group, # is the number of patterns, followed by the PCC of SVM, NN, NB and C4.5.

Minimal support = 10%
Dataset  | Frequent: # SVM NN NB C4.5        | Closed: # SVM NN NB C4.5       | AC-reduced: # SVM NN NB C4.5    | Closed AC-reduced: # SVM NN NB C4.5
HIA      | 1964 60,69 61,67 56,11 44,86      | 216 66,67 57,22 52,5 51,53     | 467 54,03 52,5 49,31 54,72      | 118 60,97 57,64 54,72 55,97
PTC-FM   | 2492 56,48 60,17 51,31 58,18      | 285 62,77 63,6 51,88 62,73     | 1271 59,34 60,48 53 61,87       | 225 59,06 61,63 51,87 63,04
PTC-FR   | 2749 64,96 62,98 54,13 64,39      | 336 65,53 61,83 54,98 61,25    | 1347 62,67 61,25 58,1 63,25     | 245 66,95 63,26 59,83 63,23
PTC-MM   | 2472 64,29 59,48 46,43 61,04      | 261 64,02 61,35 56,6 63,18     | 1270 59,55 59,8 46,16 60,17     | 212 63,73 63,44 59,27 62,58
PTC-MR   | 2665 63,03 56,95 54,39 57,24      | 345 59,27 53,43 58,16 57,51    | 1346 62,74 56,36 54,66 55,51    | 262 61,61 52,25 57,28 56,95

Minimal support = 20%
HIA      | 336 52,5 56,81 49,03 51,94        | 71 54,58 61,81 51,11 57,92     | 119 56,11 53,89 46,81 46,94     | 47 52,36 54,44 47,78 56,94
PTC-FM   | 631 61,92 57,29 52,15 59,02       | 103 59,34 57,56 49,01 57,02    | 408 60,46 59,88 55,3 57,3       | 86 55,05 58,17 48,72 50,98
PTC-FR   | 694 64,94 63,56 51,56 60,12       | 102 63,24 61,83 51,85 64,95    | 445 60,42 61,53 56,96 57,29     | 89 61,8 60,39 54,96 63,52
PTC-MM   | 634 62,78 58,3 49,11 65,77        | 99 61,33 58,34 47,92 56,85     | 416 55,98 55,29 49,67 64,01     | 83 58,98 53,87 48,81 57,12
PTC-MR   | 652 64,23 59,29 50,26 57,29       | 99 65,37 56,07 54,07 58,43     | 418 56,04 56,05 52,32 54,93     | 85 59,82 54,9 52,86 58,99

Minimal support = 30%
HIA      | 152 50,14 44,17 50,28 47,78       | 25 49,31 52,5 46,53 47,78      | 71 58,19 47,64 46,67 43,19      | 18 52,78 52,64 51,25 60,69
PTC-FM   | 214 55,87 58,15 55,61 56,18       | 25 55,28 57,59 51,86 51,31     | 149 59,61 57,89 56,18 58,48     | 20 55,56 60,47 53,6 53,01
PTC-FR   | 240 56,97 61,26 46,15 58,7        | 31 59,82 61,81 54,13 60,1      | 166 56,13 57,27 49,56 57,56     | 26 57,54 61,82 55,83 61,25
PTC-MM   | 221 58,9 55,08 53,86 56           | 32 53,85 52,16 50,35 53,57     | 158 53,81 53,3 53,56 55,67      | 28 52,99 50,3 47,9 54,75
PTC-MR   | 234 59,33 59,27 53,43 58,14       | 36 62,49 58,14 54,92 59,01     | 164 52,29 55,45 49,99 51,71     | 31 60,46 55,19 54,34 59,6

Minimal support = 40%
HIA      | 89 46,94 33,33 52,78 44,58        | 16 58,47 50,69 49,86 50,14     | 38 48,89 47,22 48,75 46,25      | 12 55,97 54,58 47,36 55
PTC-FM   | 102 54,73 53,6 55,88 55,3         | 9 57 52,13 59,03 58,17         | 92 50,97 59,04 55,03 62,49      | 8 54,73 55 59,03 57,59
PTC-FR   | 104 58,41 58,12 51,56 61,86       | 10 54,98 60,11 64,96 65,53     | 92 56,43 59,53 50,99 63,85      | 8 50,14 61,83 65,53 65,53
PTC-MM   | 99 54,79 59,55 54,45 61,34        | 9 58,04 58,66 62,78 55,94      | 93 52,99 56,87 53,85 60,42      | 8 53,54 60,12 63,07 57,13
PTC-MR   | 103 56,43 53,76 52,87 56,7        | 9 52,91 55,5 56,39 57,26       | 92 54,07 54,92 53,17 53,76      | 8 52,88 56,73 57,84 53,51

Minimal support = 50%
HIA      | 52 41,11 50,28 57,36 44,31        | 11 55,14 51,25 53,75 46,53     | 20 48,75 54,72 54,58 48,89      | 9 54,86 51,25 55 45,14
PTC-FM   | 65 54,16 53,92 54,74 62,2         | 9 54,16 61,33 55,88 61,61      | 61 47,02 58,19 55,31 61,34      | 8 50,16 59,03 55,31 61,61
PTC-FR   | 66 56,41 62,67 55,85 63,56        | 9 47,83 62,11 65,25 65,53      | 62 54,13 61,53 55,56 63,56      | 8 51,01 63,8 65,53 65,53
PTC-MM   | 65 58,59 58,94 53,54 58,92        | 8 51,46 57,74 60,11 61,91      | 61 52,38 55,66 53,54 58,91      | 7 52,64 61,88 59,21 60,1
PTC-MR   | 62 51,46 52,31 54,31 55,22        | 11 52,87 56,14 56,4 59,33      | 58 50,91 56,97 53,45 54,38      | 10 50,85 58,16 56,95 60,18

• SVM: Support Vector Machine [16];
• NN: Nearest Neighbors [17];
• NB: Naive Bayesian [18];
• C4.5: Decision Trees [14].
