Rochester Institute of Technology

Mining Frequent Subgraphs - gSpan with closed graph

Submitted by: Ankita Sambhare

Advisor: Dr. Carlos Rivero

May 13, 2017



Contents

1 Abstract

2 Introduction

3 Background Research
  3.1 Frequent subgraph mining
  3.2 Approximate graph pattern mining
  3.3 Graph pattern summarization
  3.4 Graph classification
  3.5 Graph clustering
  3.6 Graph indexing
  3.7 Graph searching
  3.8 Correlated graph pattern mining
  3.9 Optimal graph pattern mining
  3.10 Graph kernels
  3.11 Link mining
  3.12 Web structure mining
  3.13 Work-flow mining
  3.14 Biological network mining

4 Frequent Subgraph Mining

5 gSpan - graph-based Substructure Pattern Mining
  5.1 DFS Subscripting
  5.2 DFS Code
  5.3 DFS Lexicographic order
  5.4 Minimum DFS Code
  5.5 DFS Code Tree
  5.6 Algorithm

6 Experiments
  6.1 DataSets
  6.2 Dataset 1 - 340 graphs
  6.3 Dataset 2 - 10000 graphs

7 Future Work


1 Abstract

Graphs are frequently used to model complex data structures and to represent the relationships between data. They are applied to effectively represent data related to biological information, social networks, the web, etc. To find common patterns among a set of graphs, subgraph mining is performed. gSpan performs subgraph mining by ordering the graphs lexicographically using their DFS codes. It also associates each graph with a minimum DFS code, which can be termed its label. It then performs a depth-first search to mine frequent patterns. Using DFS codes, gSpan changes the graph mining problem into a sequential pattern mining problem. This is the reason for the efficiency of gSpan.


2 Introduction

Graph mining is a wide topic which has gained popularity in recent years. This popularity is owed to the variety of domains that can be modeled as graphs and mined for relevant information. Some of these domains are bioinformatics, networking, social networks, human interactions, etc. Although the data representation and the size of the dataset change from domain to domain, the overall idea of graph representation and graph mining remains the same[2].

Under graph mining, there exist 14 mining sub domains. Each of these sub domains is suitable for specific problems, and collections of related problems are grouped together to give the mining techniques and these sub domains[5].

The 14 graph mining sub domains are as follows[5]:

1. Frequent subgraph mining

2. Approximate graph pattern mining

3. Graph pattern summarization

4. Graph classification

5. Graph clustering

6. Graph indexing

7. Graph searching

8. Correlated graph pattern mining

9. Optimal graph pattern mining

10. Graph kernels

11. Link mining

12. Web structure mining

13. Work-flow mining

14. Biological network mining


3 Background Research

3.1 Frequent subgraph mining

The essence of this approach is to generate and grow candidate subgraphs from the graph under consideration, count the occurrences of each candidate, and compare the count with a threshold. It can be performed on a set of graphs as well as on a single massive graph. Frequent subgraphs are used mainly for characterizing, classifying and clustering graphs, as well as for indexing. Generating the candidate subgraphs is the crux of the whole technique[2].

Mining molecular fragments: finding relevant substructures of molecules[6]. In this paper the authors try to find different types of non-redundant substructures in molecules containing different types of atoms. They experimented with molecules containing nitrogen atoms, sulfur atoms, selenium-based fragments and aromatic bands. Their approach is an induction algorithm that starts from an atom or a core structure and develops a tree of all possible bonds that can be formed in a particular molecule to reach the required substructures. Although the paper uses this technique to limit the finding of each substructure to just once, the algorithm can also be used to identify frequent subgraphs or substructures in a molecule.

3.2 Approximate graph pattern mining

The problem under consideration covers scenarios such as biological data, social networks and other applications like the Web, where the data (graph) is massive in size and has a complex structure. For such scenarios, approximate graph pattern mining techniques are better than exact graph mining techniques due to the possibility of false positives. Due to the size and nature of the data, noise and data diversity are inevitable. To handle this noisy data, an approximation is required that tolerates variations, within a threshold, of the potentially interesting patterns[2].

The two main problems in the process of mining such massive and complex graphs are:

1. To mine frequently occurring patterns in a single graph, the graph needs to be partitioned into regions where the pattern appears once in each region. The regions may overlap; however, the partitioning changes for each pattern.

2. To approximate the potentially interesting patterns due to the noise in the data. This means that patterns whose similarity meets a threshold value can be identified together. Example: Social Network

3.3 Graph pattern summarization

The frequent subgraph patterns extracted from the super graph after mining are, most of the time, very large in number. This makes examining and exploiting the extracted subgraphs a huge task for the user and causes a bottleneck in the process of knowledge discovery on graphs. This bottleneck is not due to efficiency or scalability but due to the usability of the mined patterns.

The patterns mined currently have 2 problems:

1. Due to the noise in the data and its diversity, the patterns that are mined are not very useful in their exact form. This leaves us with two or more patterns that are quite similar to each other and which, after proper correspondence, can be viewed as the same.

2. The structural requirements also result in an excessive number of patterns that barely differ from each other. Mining such a large dataset is lengthy as well as largely redundant.

Thus graph pattern summarization algorithms are used. One approach to this problem is a smoothing-clustering framework. As the name suggests, the smoothing phase removes the error factor, and the clustering phase that follows reduces the cardinality by collapsing similar patterns together.

In the smoothing phase, patterns that are similar to each other are treated as equivalent. These are patterns whose edges are identical up to a threshold value. This threshold value is the error tolerance that blurs the rigid boundaries; it is basically an approximation performed on the patterns. In the second phase, the clustering phase, the patterns with blurred boundaries look identical and are therefore treated as equivalent. Thus, instead of viewing all the patterns, clustering is performed and the cluster centers act as the representatives of their clusters[2].
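The two phases can be illustrated with a minimal sketch. It assumes, purely for illustration, that each pattern is a set of edges over fixed vertex names (no isomorphism handling) and that "similar" means the two edge sets differ in at most `tolerance` edges. The names `edge_difference` and `summarize` are hypothetical and do not come from [2].

```python
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]
Pattern = FrozenSet[Edge]


def edge_difference(p: Pattern, q: Pattern) -> int:
    """Number of edges that occur in exactly one of the two patterns."""
    return len(p ^ q)


def summarize(patterns: List[Pattern], tolerance: int) -> List[List[Pattern]]:
    """Greedy smoothing and clustering (illustrative assumption, not the
    method of [2]): a pattern joins the first cluster whose representative
    (its first member) differs from it by at most `tolerance` edges."""
    clusters: List[List[Pattern]] = []
    for pattern in patterns:
        for cluster in clusters:
            if edge_difference(cluster[0], pattern) <= tolerance:
                cluster.append(pattern)
                break
        else:
            # No representative is close enough: start a new cluster.
            clusters.append([pattern])
    return clusters


patterns = [
    frozenset({("A", "B"), ("B", "C")}),
    frozenset({("A", "B"), ("B", "C"), ("C", "D")}),
    frozenset({("X", "Y"), ("Y", "Z")}),
]
clusters = summarize(patterns, tolerance=1)
```

With a tolerance of one edge, the first two patterns collapse into one cluster and the third stays on its own, so the user inspects two representatives instead of three patterns.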

3.4 Graph classification

The idea is similar to data classification algorithms: we know the target class of a few training graphs and try to derive the target class of the test graphs. The classification can be either supervised or unsupervised. In supervised graph classification, training data in which each graph has its expected target value is used to learn and create a model. In unsupervised graph classification, however, the graphs are classified into categories on the basis of their similarity value.

The two learning tasks considered in graph classification are as follows:

1. Label propagation: In this method, a subset of the nodes in a large graph is labeled.

2. Graph classification: In this method, a subset of the graphs in a set of graphs is labeled. The model is then expected to classify the unlabeled graphs by learning from the labeled graphs.

An example would be retail sellers tracking their customers on the basis of promotions. They characterize the customers into two categories: the ones who respond to their promotions are the positive ones and the ones who do not respond are the negative ones. Using this information, more promotions are sent to the positive customers. This helps the model learn about the characteristics of other customers and predict their response. Under the hood, the labels are expected to travel through the graph onto the unlabeled nodes[2].

3.5 Graph clustering

The idea here is to find the vertices related to each other within a graph and cluster them together. The idea is quite similar to traditional data clustering algorithms.

The main idea behind graph clustering is, given a set of objects (graphs/nodes), to categorize them on the basis of their similarity and divide them into groups. Each group represents a cluster. A mathematically defined objective function is used to calculate the similarity between these objects.

The two categories of graph clustering algorithms are as follows:

1. Node Clustering Algorithms:

Node clustering algorithms are applicable when we have one large graph under consideration. In this scenario, the individual nodes in the large graph are clustered on the basis of the similarity value or the distance on the edges connecting the nodes.

2. Graph Clustering Algorithms:

Graph clustering algorithms are applicable when we have a huge set of graphs under consideration. In this scenario, multiple graphs from the set are clustered together on the basis of their structural behavior. This is a more complex problem because the structural behavior of multiple graphs has to be matched.

One of the applications of graph clustering algorithms is graph pattern summarization[2].

3.6 Graph indexing

The use of graph and tree data structures has grown with the use of XML as a data exchange format. The problem under consideration is: given a set of graphs in a database, find the graphs that best match the graph pattern in the query.

The need for these techniques arises from the large and complex structural data in bioinformatics, chemical bonding, etc. To store and retrieve this graph data efficiently, graph indexing is the key component. Due to the size of the graphs and of the graph database, it is very inefficient to sequentially scan the entire graph database. Another problem with a sequential scan is that it requires subgraph isomorphism tests, and subgraph isomorphism is an NP-complete problem. Thus graph indexing is required to prune graphs efficiently and give better performance[2].

3.7 Graph searching

Graph searching algorithms mainly consist of traversing the graph in the most optimal manner, so as to search through the complex and connected graph world efficiently. The advancement of database systems and their support for complex data structures gives rise to a wide range of challenges. The basic requirement for these database systems is to search through these complex data structures efficiently. One approach is substructure search, where a similarity search is performed, because matching exact patterns in a huge graph is too restrictive[2].

3.8 Correlated graph pattern mining

Correlated graph pattern mining is not a single technique but a super domain of techniques which can be classified as correlated. It essentially concentrates on the relationships between nodes. Its crux lies in the similarity factor between a set of graphs. Correlated graph pattern mining has applications in a variety of scenarios, from searching for graphs relatively similar to a given graph to finding frequent subgraphs. To find the similarity (correlation) between a set of graphs, a variety of parameters can be used, but the most common is the similarity (correlation) among the nodes of the graphs under consideration. Any technique which concentrates on the similarity of the nodes in a graph can thus be classified as a correlated graph pattern mining technique. These techniques are best suited for applications which require spectral clustering, social graph networks, biological graph networks and frequent hyperclique pattern mining[2] [7].

3.9 Optimal graph pattern mining

It is practically infeasible to enumerate all the patterns from a set of graphs due to the size of the graphs. This is where optimal graph pattern mining comes into the picture, to handle the scalability problems associated with complex graph mining applications like social graph network mining, biological graph network mining, etc.

A few approaches within this domain are mining by leap search and the Correspondence-based Quality Criterion (CORK). In leap search, the main idea is to iteratively apply an objective function to the mined graph patterns, which returns an objective score for each pattern. Using these scores, the most significant patterns are used for branching out and the insignificant ones are pruned[2]. Another approach involves feature selection, which starts off by finding frequent graphs, applies gSpan to reduce the search space, and finally uses CORK to apply branch and bound to the search space. A feature is considered important as long as it improves the quality criterion throughout the process[8].

3.10 Graph kernels

Graph kernels are more or less like functions which can be applied to graphs to yield valuable information that is otherwise cryptic. Their major application lies in chemical compound analysis, where kernels are used to compare molecules directly in their graph format. A variety of graph kernels exist, but one of the interesting ones is the cyclic pattern kernel for predictive graph mining, which ignores the frequency of the patterns and concentrates on mapping natural sets of patterns. It is best used to identify cycles and trees in a set of graphs[3].


3.11 Link mining

The entire crux of link mining is the relationship between the data. It helps classify data which are linked together, and its application can be seen in web pages and bibliographies. In the world of web pages, each page is considered a node in the graph and any link from one web page to another is considered an edge. These nodes are then classified on the basis of the edge relationships between them. In the world of bibliographies, this challenge is a bit more complex though very similar to the web. Each paper or article is considered a node in the graph and any reference or citation from one resource to another is considered a link (edge) between them. These links are then used to cluster the resources. Link mining can be performed using the relationships between the nodes or the relationships between the attributes of the nodes[4].

A few tasks involved in link mining are as follows:

1. Link based classification

2. Link based cluster analysis

3. Link type

4. Link strength

5. Link cardinality

3.12 Web structure mining

Web structure mining is a mining technique designed for web pages. It handles hyperlinks, and thus link mining is a sub domain of web structure mining. The PageRank model is a famous web structure mining technique, which is an iterative algorithm for link mining. Another technique uses the HITS concept, which classifies pages as authority or hub pages. Authority pages are pages with a great amount of relatable content, whereas hub pages are pages with a great number of links to either authority pages or other hub pages[2].
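The iterative nature of PageRank can be sketched as a simple power iteration. This is a minimal sketch and not the formulation used in [2]: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the toy `web` graph are all assumptions made for illustration. Every link target is assumed to also appear as a key of `links`.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a dict: page -> pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same base share from the random jump.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web used only for illustration.
web = {"home": ["about", "blog"], "about": ["home"], "blog": ["home", "about"]}
ranks = pagerank(web)
```

In this toy graph "home" is linked from both other pages and ends up with the highest rank, while "blog", which is linked only from "home", ends up with the lowest.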

3.13 Work-flow mining

Work-flow mining is the process of scanning event logs to discover process models. Due to the importance of event logs, the structure of these logs plays a vital role. The best way to explain the discovery of the class of workflow processes which are retrievable on the basis of the event logs is the Petri net example. Some of the approaches use heuristic functions to handle the noise in real world data as well as incomplete event logs. One of the approaches to determine the efficiency and accuracy of the model is to trace it back through the event log. This will not work if there is any noise in the event logs or if the logs are incomplete.

3.14 Biological network mining

This technique deals with networks which are dense, similar to social media networks. These networks are very complex due to their dense nature. Example: modular network. Due to this complexity, it allows users to have heterogeneous datasets. Many algorithms for biological network mining use the Markov clustering algorithm, with proteins and genes as nodes and their interactions as edges. Graph mining is then performed on these graphs.


4 Frequent Subgraph Mining

Subgraph mining is the technique of discovering subgraphs in a graph database containing a set of graphs. This discovery is performed on the basis of the interestingness of a subgraph. Interestingness can be gauged by the occurrences of the subgraph in the database: the more frequently a subgraph occurs, the more interesting it is. The subgraph mining technique which is based on occurrence is called frequent subgraph mining.

Frequent subgraph mining takes a set of graphs or a single large graph as input, along with a minimum support value. The minimum support value is the minimum number of occurrences a subgraph must have in the database to be reported. In the case of a set of graphs, the number of occurrences (frequency) is the number of graphs in the dataset in which the subgraph occurs. In the case of a single massive graph, however, the graph is divided into regions, which can overlap, and the number of occurrences (frequency) is the number of regions in which the subgraph appears. These regions are different for different subgraphs.
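For the set-of-graphs case, the definition above can be written compactly. The notation is introduced here only for illustration: D is the graph database, g a candidate subgraph, "g ⊆ G_i" means that g is subgraph-isomorphic to G_i, and minsup is the minimum support value. The block assumes the amsmath package for \operatorname.

```latex
\[
\operatorname{sup}(g, D) = \bigl|\{\, G_i \in D : g \subseteq G_i \,\}\bigr|,
\qquad
g \text{ is frequent} \iff \operatorname{sup}(g, D) \ge \mathit{minsup}
\]
```

If the minimum support is given as a percentage of the dataset size, which is how we read the 5% and 10% settings in the experiments, then for a dataset of 340 graphs a 5% threshold corresponds to 0.05 × 340 = 17 graphs and a 10% threshold to 34 graphs.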

Figure 1: Frequent Subgraph Mining with 2 graphs and minimum support = 2


5 gSpan - graph-based Substructure Pattern Mining

gSpan is based on depth-first search (DFS) and is the first algorithm to do so. gSpan introduced two new techniques to represent graphs: the DFS lexicographic order and the minimum DFS code. Using these two techniques, gSpan labels subgraphs with novel canonical labels, which support the depth-first search. gSpan does not perform candidate generation and thus does not produce false positives. It outperforms other subgraph mining algorithms by growing and checking frequent subgraphs simultaneously.

gSpan performs the following:

1. DFS Subscripting

2. DFS Code

3. DFS Lexicographic order

4. Minimum DFS Code

5. DFS Code Tree


5.1 DFS Subscripting

As gSpan is based on DFS, one graph can be traversed in a variety of ways, thus creating multiple possible DFS trees. A DFS tree is the tree created when DFS is performed on a graph.

Figure 2: DFS Subscripting[9]

As shown in figure 2, figures (b-d) are isomorphic to figure (a). Figures (b-d) also show the DFS subscripting. DFS subscripting is the order in which the nodes are discovered while traversing the graph using DFS. The subscripts are assigned according to discovery time: if two nodes are subscripted v_i and v_j and i < j, then v_i was discovered before v_j. The root of the tree is always v_0, and the node v_n is the rightmost vertex. The straight path from v_0 to v_n is called the rightmost path in the DFS tree.

Another concept to look at is that of forward and backward edges. Forward edges are all the edges that are part of the DFS tree, whereas backward edges are the edges that are not part of the DFS tree. If we consider an edge from node v_i to v_j and i < j, then this edge is a forward edge; otherwise it is a backward edge. While growing the tree, a backward edge can only be extended from the rightmost vertex in the current tree, whereas a forward edge can be extended from any vertex on the rightmost path.
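The following sketch shows how subscripts, forward edges, backward edges and the rightmost path fall out of a single DFS. It is an illustration only, not the implementation from [9]: the function name `dfs_subscripts`, the adjacency-list input and the example graph (a triangle a-b-c with a pendant vertex d) are assumptions, and the graph is assumed to be connected, undirected and without parallel edges.

```python
from typing import Dict, List, Tuple


def dfs_subscripts(adj: Dict[str, List[str]], root: str):
    """Return (subscripts, forward edges, backward edges, rightmost path).

    Edges are reported as (i, j) pairs of subscripts: forward edges have
    i < j and belong to the DFS tree, backward edges have i > j.
    """
    order: Dict[str, int] = {}    # vertex -> discovery subscript
    parent: Dict[str, str] = {}   # vertex -> DFS tree parent
    forward: List[Tuple[int, int]] = []
    backward: List[Tuple[int, int]] = []

    def visit(v: str) -> None:
        order[v] = len(order)
        for w in adj[v]:
            if w not in order:
                parent[w] = v
                # len(order) is the subscript w receives when visited next.
                forward.append((order[v], len(order)))
                visit(w)
            elif w != parent.get(v) and order[w] < order[v]:
                # Non-tree edge back to an earlier discovered vertex.
                backward.append((order[v], order[w]))

    visit(root)
    # The rightmost vertex is the last one discovered; walk up to the root.
    rightmost = max(order, key=order.get)
    path = [rightmost]
    while path[-1] != root:
        path.append(parent[path[-1]])
    rightmost_path = [order[v] for v in reversed(path)]
    return order, forward, backward, rightmost_path


# Hypothetical example graph: triangle a-b-c plus a pendant vertex d.
adj = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["b", "a"], "d": ["a"]}
order, forward, backward, rightmost_path = dfs_subscripts(adj, "a")
```

Starting at "a", the traversal discovers b, c and d in that order; the edge c-a closes the triangle and becomes the only backward edge, and the rightmost path runs from v_0 directly to v_3.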

A linear order based on the subscripting is formulated among all the edges as shown in figure 3:


Figure 3: Linear Order Formulation[9]

5.2 DFS Code

The DFS code represents each edge as a 5-tuple. The 5-tuple format is (i, j, l_i, l_(i,j), l_j), which represents an edge between nodes v_i and v_j. l_i and l_j are the labels of nodes v_i and v_j respectively, and l_(i,j) is the label of the edge between nodes v_i and v_j. Table 1 shows the DFS code for figure 2.
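As an illustration of the 5-tuple format, the sketch below builds the DFS code of one particular traversal. It is a sketch under assumptions, not the code from [9]: the function name `dfs_code`, the input format and the labeled triangle are made up for the example, the graph is assumed to be connected and undirected, and the backward edges of a vertex are emitted right after the forward edge that discovers it, ordered by the subscript of their target.

```python
from typing import Dict, List, Tuple

CodeEntry = Tuple[int, int, str, str, str]


def dfs_code(vertex_labels: Dict[str, str],
             edges: List[Tuple[str, str, str]],
             root: str) -> List[CodeEntry]:
    """DFS code of the traversal that starts at `root` and follows the
    neighbor order implied by the order of `edges`."""
    adj: Dict[str, List[str]] = {v: [] for v in vertex_labels}
    edge_label: Dict[Tuple[str, str], str] = {}
    for a, b, label in edges:
        adj[a].append(b)
        adj[b].append(a)
        edge_label[(a, b)] = edge_label[(b, a)] = label

    index: Dict[str, int] = {}   # vertex -> DFS subscript
    code: List[CodeEntry] = []

    def entry(a: str, b: str) -> CodeEntry:
        return (index[a], index[b], vertex_labels[a],
                edge_label[(a, b)], vertex_labels[b])

    def visit(v: str, parent) -> None:
        index[v] = len(index)
        if parent is not None:
            code.append(entry(parent, v))          # forward edge
        back = [w for w in adj[v] if w in index and w != parent]
        for w in sorted(back, key=index.get):
            code.append(entry(v, w))               # backward edges
        for w in adj[v]:
            if w not in index:
                visit(w, v)

    visit(root, None)
    return code


# Hypothetical labeled triangle: vertex labels X, Y, Z and edge labels a, b.
labels = {"p": "X", "q": "Y", "r": "Z"}
triangle = [("p", "q", "a"), ("q", "r", "b"), ("r", "p", "a")]
code = dfs_code(labels, triangle, "p")
```

For this traversal the code consists of two forward edges followed by the backward edge that closes the triangle. A different root or neighbor order generally gives a different DFS code for the same graph, which is why a single representative code has to be chosen.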

Figure 4: DFS Code Formulation[9]

As gSpan uses neither candidate generation nor a branch and bound technique, string matching is the method through which gSpan performs graph matching. Two graphs are compared by comparing their DFS codes to check for isomorphism.

Figure 5: DFS Subscripting[9]

5.3 DFS Lexicographic order

A graph has multiple DFS codes, so we need a common measure by which these codes can be mapped to one single DFS code, thus allowing subgraphs to be matched and checked for isomorphism. This is done with the formula in figure 6.


Figure 6: Lexicographic ordering Formulation[9]

When this order is applied to the graphs in figure 2, they are ordered as shown in figure 7.

Figure 7: Lexicographic ordering of DFS code[9]

5.4 Minimum DFS Code

The formula for determining the minimum DFS code, and its importance, are described in figure 8.

From figure 8, it is clear that the graph matching problem boils down to a string matching problem once we compute the DFS codes and determine the minimum DFS code. This computation is still better than solving the subgraph isomorphism problem, which is an NP-complete problem. In this way the graph mining problem is transformed into a sequential pattern mining problem[9].
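The idea of using a minimum code as a label can be demonstrated by brute force on small graphs. Note the assumptions: this sketch enumerates every DFS traversal through vertex priorities and picks the smallest code under plain Python tuple comparison, which is not gSpan's DFS lexicographic order from figure 6. Any fixed total order yields a label that is identical for isomorphic graphs, but the code selected here may differ from gSpan's minimum DFS code. The function names and the example graphs are hypothetical, and the graphs are assumed to be connected.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def dfs_code(vertex_labels: Dict[str, str],
             edges: List[Tuple[str, str, str]],
             priority: Tuple[str, ...]) -> tuple:
    """DFS code of the traversal that starts at priority[0] and always
    explores the undiscovered neighbor that comes first in `priority`."""
    rank = {v: i for i, v in enumerate(priority)}
    adj: Dict[str, List[str]] = {v: [] for v in vertex_labels}
    edge_label: Dict[Tuple[str, str], str] = {}
    for a, b, label in edges:
        adj[a].append(b)
        adj[b].append(a)
        edge_label[(a, b)] = edge_label[(b, a)] = label
    index: Dict[str, int] = {}
    code = []

    def entry(a: str, b: str):
        return (index[a], index[b], vertex_labels[a],
                edge_label[(a, b)], vertex_labels[b])

    def visit(v: str, parent) -> None:
        index[v] = len(index)
        if parent is not None:
            code.append(entry(parent, v))
        back = [w for w in adj[v] if w in index and w != parent]
        for w in sorted(back, key=index.get):
            code.append(entry(v, w))
        for w in sorted(adj[v], key=rank.get):
            if w not in index:
                visit(w, v)

    visit(priority[0], None)
    return tuple(code)


def canonical_code(vertex_labels, edges) -> tuple:
    """Smallest code over all traversals (tuple order, an assumption)."""
    return min(dfs_code(vertex_labels, edges, p)
               for p in permutations(vertex_labels))


# Two hypothetical drawings of the same labeled triangle, and a variant
# whose Y-Z edge carries a different label.
g1 = ({"p": "X", "q": "Y", "r": "Z"},
      [("p", "q", "a"), ("q", "r", "b"), ("r", "p", "a")])
g2 = ({"s": "Z", "t": "X", "u": "Y"},
      [("s", "t", "a"), ("u", "s", "b"), ("t", "u", "a")])
g3 = ({"p": "X", "q": "Y", "r": "Z"},
      [("p", "q", "a"), ("q", "r", "a"), ("r", "p", "a")])
same = canonical_code(*g1) == canonical_code(*g2)
different = canonical_code(*g1) != canonical_code(*g3)
```

The comparison of `g1` and `g2` is a plain equality test on tuples, which is the "string matching" view of isomorphism checking. The enumeration over all vertex permutations is factorial in the number of vertices and is only meant for tiny examples.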


Figure 8: Minimum DFS Code Formulation[9]

Figure 9: Minimum DFS Code[9]

5.5 DFS Code Tree

Figure 10 shows the basis for creating the DFS code tree on which DFS is performed.

The search space of the dataset can contain an infinite number of graphs. The size of the graph dataset does not affect the overall performance as much as the minimum support does. gSpan first finds the frequency of all the edges and considers for graph mining only those edges whose frequency is greater than or equal to the minimum support frequency. This step greatly reduces the search space, allowing the user to mine efficiently.
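The edge-pruning step can be sketched as follows. The representation is an assumption made for illustration: each graph is a list of (vertex label, edge label, vertex label) triples, the graphs are undirected so the two vertex labels are sorted, and an edge is counted at most once per graph. The function name `frequent_edges` and the three toy molecules are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

LabeledEdge = Tuple[str, str, str]


def frequent_edges(graphs: List[List[LabeledEdge]],
                   min_support: int) -> Dict[LabeledEdge, int]:
    """Count in how many graphs each labeled edge occurs and keep only the
    edges whose count reaches min_support."""
    counts: Counter = Counter()
    for graph in graphs:
        seen = set()  # each graph contributes at most once per edge
        for label_u, edge_label, label_v in graph:
            a, b = sorted((label_u, label_v))
            seen.add((a, edge_label, b))
        counts.update(seen)
    return {edge: n for edge, n in counts.items() if n >= min_support}


# Hypothetical toy dataset of three small molecules.
graphs = [
    [("C", "single", "C"), ("C", "double", "O")],
    [("O", "double", "C"), ("C", "single", "N")],
    [("C", "single", "C"), ("C", "single", "S")],
]
frequent = frequent_edges(graphs, min_support=2)
```

With a minimum support of 2, only the C-C single bond and the C=O double bond survive; the C-N and C-S edges occur in one graph each and are removed before any subgraph is grown.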

Figure 10: DFS Code Tree Formulation[9]

After this, a tree is created as shown in figure 11. It shows that at each level an edge is added to the graph and the frequency of the resulting subgraph is checked. If this subgraph has a frequency greater than or equal to the minimum support frequency, then it is considered for further extensions as well as being a candidate output fragment.

Figure 11: DFS Code tree - edge based[9]


5.6 Algorithm

The overall algorithm is shown in figure 12. It uses the steps described above, performed sequentially.

Figure 12: Algorithm[9]


6 Experiments

6.1 DataSets

The initial plan was to develop a subgraph mining technique and test it against the criminal dataset at hand. However, midway into the project, we realized that the criminal database is a single massive graph and cannot be divided into multiple graphs. This led us to change the datasets to a set of graph datasets available online. The majority of these datasets were chemical compounds. Two such chemical datasets were used to test the accuracy as well as the time efficiency of the parsemis gSpan framework[1].

The two datasets and associated results were as follows:

6.2 Dataset 1 - 340 graphs

This dataset consists of 340 graphs which have a lot of edges. This explains the high number of fragments returned by gaston, another frequent subgraph mining algorithm, as shown in figure 13. However, due to the vast number of mined fragments, the importance of each fragment is reduced, and most of the fragments mined by gaston for this dataset at a frequency of 5% seem to be irrelevant. It is also evident that at a frequency of 10% the number of output fragments drops significantly.

With gSpan, however, the results seem proportional to the frequency and the dataset.

Figure 13: 340 graphs - output fragments - gSpan vs gaston


If we check the run time of both gaston and gSpan for this dataset, they are very similar and show barely any difference (figure 14).

Figure 14: 340 graphs - Run Time - gSpan vs gaston


6.3 Dataset 2 - 10000 graphs

This dataset consists of 10000 graphs which have sparse edges. As shown in figure 15, gaston could not perform graph mining on the dataset with 10000 graphs and a minimum frequency of 5%. It runs out of heap space, as the graph is stored in memory. The same computation, however, is performed very quickly with gSpan due to the use of DFS codes. For the remaining settings, the output fragments of the two algorithms are similar.

Figure 15: 10000 graphs - output fragments - gSpan vs gaston

If we check the run time of both gaston and gSpan for this dataset, it is very clear that gSpan runs faster than gaston and also runs for all the minimum support frequencies, whereas gaston fails for the smaller minimum support frequencies and is slower than gSpan for the rest (figure 16).

One more thing to notice is that the above results were obtained using the closeGraph parameter, which does not report redundant subgraphs among the mined subgraphs, thus allowing the user to concentrate only on relevant subgraphs.
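What "redundant" means here can be made concrete with the usual definition of a closed pattern, stated as an assumption about what the closeGraph parameter does: a frequent subgraph is closed if no proper supergraph has the same support. The sketch below is illustrative only. Patterns are represented as sets of edges, so the subset test stands in for a real subgraph isomorphism test, and the function name and support values are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Pattern = FrozenSet[Tuple[str, str]]


def closed_patterns(support: Dict[Pattern, int]) -> Dict[Pattern, int]:
    """Keep only patterns that have no proper super-pattern with the same
    support (edge-set inclusion is used instead of subgraph isomorphism)."""
    closed = {}
    for pattern, count in support.items():
        absorbed = any(
            pattern < other and count == other_count
            for other, other_count in support.items()
        )
        if not absorbed:
            closed[pattern] = count
    return closed


# Hypothetical mined patterns with their supports.
support = {
    frozenset({("A", "B")}): 3,
    frozenset({("B", "C")}): 4,
    frozenset({("A", "B"), ("B", "C")}): 3,
}
closed = closed_patterns(support)
```

The single edge A-B is dropped because the larger pattern A-B-C occurs in exactly the same number of graphs, so reporting it separately adds no information. The edge B-C is kept because it is more frequent than any pattern containing it.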

To parse these different datasets, parsers were written that transform the chemical data into graph data, which can then be fed into gSpan and gaston to yield results. These parsers convert the different datasets into graphML data, which has a specific syntax that allows gSpan to parse it. Small changes were made to gSpan's graphML parser to accept these datasets, and also to the mining algorithm to output the fragments in graphML format.
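A conversion of this kind could look like the sketch below. It is not the parser written for this project: the function name `molecule_to_graphml`, the input format (a list of atom symbols and a list of bonds given as index pairs with a bond type) and the use of a single "label" data key are assumptions, and the attributes actually expected by the modified graphML parser may differ.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def molecule_to_graphml(graph_id: str, atoms: List[str],
                        bonds: List[Tuple[int, int, str]]) -> str:
    """Serialize one molecule as an undirected GraphML graph whose nodes and
    edges carry a 'label' data element (assumed attribute name)."""
    root = ET.Element("graphml",
                      xmlns="http://graphml.graphdrawing.org/xmlns")
    ET.SubElement(root, "key", {"id": "label", "for": "all",
                                "attr.name": "label",
                                "attr.type": "string"})
    graph = ET.SubElement(root, "graph", id=graph_id,
                          edgedefault="undirected")
    for i, atom in enumerate(atoms):
        node = ET.SubElement(graph, "node", id=f"n{i}")
        ET.SubElement(node, "data", key="label").text = atom
    for source, target, bond in bonds:
        edge = ET.SubElement(graph, "edge",
                             source=f"n{source}", target=f"n{target}")
        ET.SubElement(edge, "data", key="label").text = bond
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: a water molecule with two single bonds.
xml_text = molecule_to_graphml("water", ["O", "H", "H"],
                               [(0, 1, "single"), (0, 2, "single")])
```

Atoms become nodes labeled with their element symbol and bonds become undirected edges labeled with the bond type, which matches the labeled-graph view used by the DFS codes.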


Figure 16: 10000 graphs - Run Time - gSpan vs gaston

7 Future Work

gSpan is a frequent subgraph mining technique which uses exact matching. In real world data, however, there is a lot of noise introduced when the data is collected, and the data might also be incomplete. To compensate for this, an error quotient can be introduced which allows subgraphs to be matched within an error threshold. This would also help mine more significant structures or subgraphs using gSpan. A lot of research has been performed on this under the name of approximate subgraph matching. In approximate subgraph matching, the idea is to blur the edges of the subgraphs if need be and then cluster the subgraphs together if, after blurring, the error threshold holds.


References

[1] gSpan framework. https://www.cs.ucsb.edu/~xyan/software/gspan.htm.

[2] Charu C. Aggarwal and Haixun Wang. Managing and Mining Graph Data. Springer Publishing Company, Incorporated, 1st edition, 2010.

[3] Karsten M. Borgwardt, Cheng S. Ong, Stefan Schonauer, S. V. N. Vishwanathan, Alex J. Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(S1):i47–i56, 2005.

[4] Lise Getoor. Link mining: A new data mining challenge. SIGKDD Explor. Newsl., 5(1):84–89, July 2003.

[5] Chuntao Jiang, Frans Coenen, and Michele Zito. A survey of frequent subgraph mining algorithms. Knowledge Eng. Review, 28(1):75–105, 2013.

[6] T. Meinl and M. R. Berthold. Hybrid fragment mining with mofa and fsg. In 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), volume 5, pages 4559–4564 vol.5, Oct 2004.

[7] Tomonobu Ozaki and Takenao Ohkawa. Mining Correlated Subgraphs in Graph Databases, pages 272–283. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

[8] X. Yan, X. J. Zhou, and J. Han. Mining closed relational graphs with connectivity constraints. In 21st International Conference on Data Engineering (ICDE'05), pages 357–358, April 2005.

[9] Xifeng Yan and Jiawei Han. gspan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM '02, pages 721–, Washington, DC, USA, 2002. IEEE Computer Society.
