
A Non-Parametric Spectral Model for Graph Classification

Andrea Gasparetto, Giorgia Minello and Andrea Torsello
Dipartimento di Scienze Ambientali, Informatica e Statistica

Università Ca' Foscari Venezia, Via Torino 155, 30172 Mestre (VE), Italy

{andrea.gasparetto, torsello}@unive.it, [email protected]

Keywords: Classification, Statistical Learning Framework, Structural Representation, Graph Model

Abstract: Graph-based representations have been used with considerable success in computer vision in the abstraction and recognition of object shape and scene structure. Despite this, the methodology available for learning structural representations from sets of training examples is relatively limited. In this paper we take a simple yet effective spectral approach to graph learning. In particular, we define a novel model of structural representation based on the spectral decomposition of the graph Laplacian of a set of graphs, which does away with the need for the one-to-one node-correspondences at the base of several previous approaches, and which directly handles a set of other invariances of the representation that are often neglected. An experimental evaluation shows that the approach significantly improves over the state of the art.

1 INTRODUCTION

Graph-based representations have been applied with considerable success to several tasks as convenient means of representing structural patterns. Examples include the arrangement of shape primitives or feature points in images, molecules, and social networks [Estrada and Jepson, 2009]. Their success lies in their ability to concisely capture the relational arrangement of primitives, in a manner which can be invariant to irrelevant transformations such as changes in object viewpoint. Despite their many advantages and attractive features, the methodology available for learning structural representations from sets of training examples is relatively limited, and the process of capturing the modes of structural variation for sets of graphs has proved to be elusive.

Structural representations are widely adopted in the context of Bayesian networks, or general relational models [Friedman and Koller, 2003], where structural learning processes are used to infer the stochastic dependency between these variables. However, these approaches rely on the availability of correspondence information for the nodes of the different structures used in learning. In many cases the identity of the nodes and their correspondences across samples of training data are not known; rather, the correspondences must be recovered from the structure itself.

In the last few years, there has been some effort aimed at learning structural archetypes and clustering data abstracted in terms of graphs. In this context, spectral approaches have provided simple and effective procedures. For example, Luo and Hancock [Luo et al., 2006] use graph spectral features to embed graphs in a (low) fixed-dimensional space where standard vectorial analysis can be applied. While embedding approaches like this one preserve the structural information present, they do not provide a means of characterizing the modes of structural variation encountered, and they are limited by the stability of the graph's spectrum under structural perturbation. Bonev et al. [Bonev et al., 2007] and Bunke et al. [Bunke et al., 2003] summarize the data by creating a super-graph representation from the available samples, while White and Wilson [White and Wilson, 2007] use a probabilistic model over the spectral decomposition of the graphs to produce a generative model of their structure. While these techniques provide a structural model of the samples, the way in which the super-graph is learned or estimated is largely heuristic in nature and is not rooted in a statistical learning framework. Torsello and Hancock [Torsello and Hancock, 2006] define a super-structure called tree-union that captures the relations and observation probabilities of all nodes of all the trees in the training set. The structure is obtained by merging the corresponding nodes and is critically dependent on the order in which trees are merged.


Todorovic and Ahuja [Todorovic and Ahuja, 2006] applied the approach to object recognition based on a hierarchical segmentation of image patches, and lifted the order dependence by repeating the merger procedure several times and picking the best model according to an entropic measure. While these approaches do capture the structural variation present in the data, the model structure and model parameters are tightly coupled, which forces the learning process to be approximated through a series of merges; furthermore, all the observed nodes must be explicitly represented in the model, which must then account in the same way for both proper structural variation and random noise.

In more recent work [Torsello, 2008, Torsello and Rossi, 2011] Torsello and co-workers proposed a generalization for graphs which allowed structure and model parameters to be decoupled, and used a stochastic process to marginalize the set of correspondences. The process, however, still requires a (stochastic) one-to-one relationship between model and observed nodes, and could only deal with size differences in the graphs by explicitly adding an isotropic noise model for the nodes.

In this paper we aim at defining a novel model of structural representation based on a spectral description of graphs which lifts the one-to-one node-correspondence assumption and is strongly rooted in a statistical learning framework. In particular, we follow White and Wilson [White and Wilson, 2007] in defining separate models for the eigenvalues and the eigenvectors, but cast the eigenvector model in terms of observations of an implicit density function over the spectral embedding space, and we learn the model through non-parametric density estimation. The eigenvalue model, on the other hand, is assumed to be log-normal, due to considerations similar to those in [Aubry et al., 2011].

2 SPECTRAL GENERATIVE MODEL

Let G = (V, E) be a graph, where V is the set of nodes and E ⊆ V × V is the set of edges, and let A = (a_ij) be its adjacency matrix. The degree d of a node is the number of edges incident to the node, and it can be represented through the degree matrix D = (d_ij), a diagonal matrix with d_ii = ∑_j a_ij. Starting from these two matrix representations of a graph, it is possible to compute the Laplacian matrix, which is defined as the difference between the degree matrix D and the adjacency matrix A:

L = D−A

The Laplacian is a symmetric positive semi-definite matrix. Its smallest eigenvalue is equal to 0, with multiplicity equal to the number of connected components in G. Further, the Laplacian is associated with random walks over the graph and has been extensively used to provide spectral representations of structures [?]. The spectral representation of the graph can be obtained from the Laplacian through singular value decomposition. Given a Laplacian L, its decomposition is L = Φ Λ Φ^T, where Λ = diag(λ_1, λ_2, ..., λ_|V|) is the matrix whose diagonal contains the ordered eigenvalues, while Φ = (φ_1 | φ_2 | ... | φ_|V|) is the matrix whose columns are the ordered eigenvectors. This decomposition is unique up to a permutation of the nodes of the graph, a change of sign of the eigenvectors, or a change of basis over the eigenspaces associated with a single eigenvalue, i.e., the following properties hold:

L ≅ P L P^T = (PΦ) Λ (PΦ)^T   (1)

L = Φ Λ Φ^T = (ΦS) Λ (ΦS)^T   (2)

L = Φ Λ Φ^T = (ΦB_λ) Λ (ΦB_λ)^T   (3)

where ≅ indicates isomorphism of the underlying graphs, P is a permutation matrix, S is a diagonal matrix with diagonal entries equal to ±1, and B_λ is a block-diagonal matrix in which the block corresponding to the eigenvalues equal to λ in Λ is orthogonal, while all the remaining diagonal blocks are equal to the identity matrices.
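To make the construction concrete, the following minimal sketch (Python with NumPy assumed; the function name is ours) computes the Laplacian and its ordered spectrum for a single graph. Since L is symmetric and positive semi-definite, its singular value decomposition and eigendecomposition coincide up to ordering, so a standard symmetric eigensolver can be used; for a connected graph only the single trivial (zero) eigenpair is dropped.

import numpy as np

def laplacian_spectrum(A, d):
    """Return the d smallest non-trivial eigenvalues/eigenvectors of L = D - A.

    A is an (n, n) symmetric adjacency matrix; the graph is assumed connected,
    so only the first (zero) eigenpair is discarded.
    """
    D = np.diag(A.sum(axis=1))       # degree matrix
    L = D - A                        # graph Laplacian
    lam, phi = np.linalg.eigh(L)     # eigenvalues in ascending order
    return lam[1:d + 1], phi[:, 1:d + 1]

# toy example: a 4-cycle
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
lam, phi = laplacian_spectrum(A, d=2)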

Our goal is to devise a model for the graph spectra that can capture the main modes of variation present in a set of sample graphs, and that takes into account the invariances of the spectral representation. Following [White and Wilson, 2007] we make two separate and independent models for the eigenvalues and eigenvectors of the Laplacian:

P(G | Θ) = P(Λ_G | Θ_Λ) P(Φ_G | Θ_Φ)   (4)

where Θ is the graph-class model, divided into its eigenvalue-model component Θ_Λ and its eigenvector-model component Θ_Φ.

For the eigenvalue model we follow [Aubry et al., 2011] and opt to model the observation distribution of a single eigenvalue as a log-normal distribution. In [Aubry et al., 2011] it was shown that this model derives directly from rather straightforward stability considerations based on matrix perturbation theory. As a result, we model the set of eigenvalues as a series of independent log-normal distributions, one per eigenvalue used, resulting in:

P(Λ_G | Θ_Λ) = (2π)^{−d/2} ∏_{i=1}^{d} 1/(λ_i σ_i) exp( −(ln λ_i − µ_i)² / (2σ_i²) )   (5)


where µ_i and σ_i are model parameters to be learned from the data, and d is the number of eigenvalues/eigenvectors used in the model.

On the other hand, the eigenvector component is modelled as an unknown distribution F on the d-dimensional spectral embedding space Ω_d ⊆ R^d. The d-dimensional spectral embedding of a graph is obtained from the eigenvector matrix Φ_G by taking its first d columns, corresponding to the eigenvectors associated with the d smallest eigenvalues, excluding the trivial constant eigenvector corresponding to a 0 eigenvalue. With the reduced n × d eigenvector matrix Φ at hand, we take its rows to be points in the d-dimensional spectral embedding space Ω_d.

Note that there is a length invariance in the eigenvectors, which are usually assumed to be of unit Euclidean norm. This, however, results in a size compression of the spectral embedding points as the graph size grows. To correct this issue we scale the embedding vectors by multiplying them by the graph size n.

With this model we cast the learning phase into a non-parametric density estimate of the distribution of the spectral embedding points φ^G_1, ..., φ^G_n. Under these assumptions, the eigenvector model parameter Θ_Φ is constituted of a collection of N d-dimensional vectors θ^Φ_1, ..., θ^Φ_N corresponding to samples from the unknown density function. In the learning phase these are obtained by aligning and merging the spectral embedding points from the sample graphs belonging to each class.

This per-vertex sample approach takes care of the permutational invariance, but we still need to explicitly deal with the other invariances, i.e., the sign of the eigenvectors and the choice of an eigenbasis. We solve those invariances by optimizing over the respective transformation groups. Furthermore, we lift the block constraint over the eigenbasis selection, relaxing it to an optimization over the orthogonal group O(d). This results in the following definition of the eigenvector probability:

P(Φ_G | Θ_Φ) = max_{R ∈ O(d)} max_{S ∈ {±1}^d} ∏_{i=1}^{n} 1/(N h^d) ∑_{j=1}^{N} exp( −‖R S φ^G_i − θ^Φ_j‖² / (2h²) )   (6)

which is the product of Parzen-Rosenblatt kernel density estimators. Here φ^G_i is the vector obtained by taking the first d elements of the i-th row of the eigenvector matrix Φ_G, and θ^Φ_j is the j-th component of the eigenvector model Θ_Φ. We assume that the model is simply an array of samples from the graph class.
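As an illustration, the following is a minimal sketch of the per-graph eigenvector likelihood of Eq. (6) for a fixed alignment (Python/NumPy assumed; the sign-flip S and rotation R of Sections 2.1.1-2.1.2 are taken to be already applied to the embedding, and the function name is ours).

import numpy as np

def log_eigenvector_likelihood(phi_G, theta, h):
    """Log of the product of Parzen-Rosenblatt estimators in Eq. (6).

    phi_G : (n, d) spectral-embedding points of the graph, already aligned.
    theta : (N, d) model samples theta_1, ..., theta_N.
    h     : kernel bandwidth.
    """
    n, d = phi_G.shape
    N = theta.shape[0]
    sq = ((phi_G[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)   # (n, N) squared distances
    dens = np.exp(-sq / (2.0 * h ** 2)).sum(axis=1) / (N * h ** d)    # per-node KDE value
    return float(np.sum(np.log(dens + 1e-300)))                       # product -> sum of logs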

In this work we use Silverman's rule of thumb [Silverman, 1986] for the multivariate case to estimate the bandwidth parameter h:

h = ( N (d + 2) / 4 )^{−1/(d+4)} σ   (7)

where σ is computed as the square root of the trace of the covariance matrix Σ of the eigenvector model divided by the number of nodes of the model:

σ = √( (1/n) Tr(Σ) )   (8)
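A direct transcription of Eqs. (7)-(8) might look as follows (NumPy assumed; following the text, the trace of the model covariance is divided by the number of model points, and the function name is ours).

import numpy as np

def silverman_bandwidth(theta):
    """Bandwidth h of Eq. (7), with sigma from Eq. (8), for model samples theta (N, d)."""
    N, d = theta.shape
    Sigma = np.atleast_2d(np.cov(theta, rowvar=False))        # covariance of the model samples
    sigma = np.sqrt(np.trace(Sigma) / N)                      # Eq. (8)
    return (N * (d + 2) / 4.0) ** (-1.0 / (d + 4)) * sigma    # Eq. (7)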

2.1 Model Learning

The learning process aims to estimate the parameters of the eigenvector and eigenvalue models. Given a set of graphs G = {G_1, G_2, ..., G_m} belonging to the same class C, we first compute their spectral decompositions, obtaining the set (Φ^C_1, Λ^C_1), (Φ^C_2, Λ^C_2), ..., (Φ^C_m, Λ^C_m). In particular, the Φ^C_i are composed of column vectors which are the first d non-trivial eigenvectors of the Laplacian matrix of the corresponding graph, while the Λ^C_i contain the first d non-zero eigenvalues. Hence, d represents our embedding dimension. The eigenvector model of the class C, denoted as Φ^C, is defined as

Φ^C = [ φ^1_1  φ^1_2  ...  φ^1_d ;
        φ^2_1  φ^2_2  ...  φ^2_d ;
        ...
        φ^m_1  φ^m_2  ...  φ^m_d ]

where φ^i_j denotes the j-th non-trivial eigenvector (still a column vector) of the i-th graph of the set G. In other words, we perform a vertical concatenation of all the eigenvector matrices of the graphs that belong to class C. Thus, the dimension of the eigenvector model of the class is (∑_{i=1}^{m} |G_i|) × d.
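In code, building Φ^C amounts to stacking the scaled per-graph embeddings; a sketch is given below, reusing the hypothetical laplacian_spectrum helper from the earlier snippet (in the actual method the sign-flips and rotations described next are applied before the merged model is used).

import numpy as np

def class_eigenvector_model(adjacencies, d):
    """Vertically concatenate the scaled d-dimensional embeddings of a class.

    adjacencies : list of (n_i, n_i) adjacency matrices of the graphs of class C.
    Returns a (sum_i n_i, d) matrix whose rows are the model samples.
    """
    blocks = []
    for A in adjacencies:
        n = A.shape[0]
        _, phi = laplacian_spectrum(A, d)   # first d non-trivial eigenvectors
        blocks.append(n * phi)              # scale by the graph size (Section 2)
    return np.vstack(blocks)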

2.1.1 Estimating the Eigenvector Sign-Flips

The eigenvector matrix produced by the eigendecomposition is unique up to a sign factor. Since our method characterizes every node of a graph with a feature vector, a sign disambiguation is mandatory. There are several techniques that allow one to detect and resolve this ambiguity, such as using the correlation between two functions (i.e., probability density functions): if the correlation grows after a flip, then the eigenvector sign should be flipped. Unfortunately, with increasing size, this method becomes computationally heavy.


For this reason, we employ a heuristic-based method to solve the sign-ambiguity problem. Since it is a heuristic approach, it does not guarantee the discovery of all the correct signs. Given two graphs G_A and G_B, which belong to the same class C, let Φ^A_j and Φ^B_j be the j-th eigenvectors of the spectral representations of the two graphs. We assume the eigenvectors to be random variables having unknown probability density functions, and that all the j-th eigenvectors of graphs in the same class share a very similar pdf, up to the sign. A flipped sign does not influence the shape of the pdf, but the peak of the function is shifted. Once a reference graph is selected (for example, A), the sign ambiguity is solved by checking the sign of the peaks of each eigenvector of the reference graph and of the others. An eigenvector is flipped when the signs of the peaks differ:

φ^B_j = { −φ^B_j   if x^{A*}_j < 0 and x^{B*}_j ≥ 0,
          −φ^B_j   if x^{A*}_j ≥ 0 and x^{B*}_j < 0,
           φ^B_j   otherwise.                          (9)

The pdfs of the eigenvectors are estimated using kernel density estimation. The density estimates are evaluated at 100 points covering the range of the eigenvectors. Those evaluations are then used to find the peaks x^{A*}_j and x^{B*}_j of the distributions.

Hence, to solve the sign-ambiguity issue, before the construction of Φ^C we flip each graph according to a reference graph G_f (chosen randomly within G) using (9).
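A possible implementation of this heuristic is sketched below (Python with NumPy and SciPy's gaussian_kde assumed; the function names are ours). Each column's pdf is estimated on a 100-point grid, and the column is negated whenever the sign of its peak disagrees with that of the reference graph, as in Eq. (9).

import numpy as np
from scipy.stats import gaussian_kde

def kde_peak(values, n_points=100):
    """Location of the mode of a 1-D kernel density estimate of `values`."""
    grid = np.linspace(values.min(), values.max(), n_points)
    return grid[np.argmax(gaussian_kde(values)(grid))]

def align_signs(phi_ref, phi_B):
    """Flip the columns of phi_B whose KDE peak sign disagrees with phi_ref (Eq. 9)."""
    phi_B = phi_B.copy()
    for j in range(phi_B.shape[1]):
        if (kde_peak(phi_ref[:, j]) < 0) != (kde_peak(phi_B[:, j]) < 0):
            phi_B[:, j] = -phi_B[:, j]
    return phi_B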

The next step involves the rotation of each eigenvector matrix according to the same reference graph G_f.

2.1.2 Estimating the Eigenvector Orthogonal Transformation

The sign disambiguation process produces a rough rotation which helps to align the eigenvectors of a graph with respect to the eigenvectors of a reference graph. In order to minimize the variance between the eigenvector matrix of a reference graph (one for each class) and the eigenvector matrices of the other graphs, another rotation step is applied. In particular, we are looking for the rotation which minimizes the distance between the nodes of two graphs. More formally, we want to maximize the following:

argmax_{R ∈ O(d)} ∏_i P(R x_i)   (10)

where

P(x) ∝ ∑_j exp( −‖x − x_j‖² / (2h²) )   (11)

The above formulation of the optimization problem is then applied to our definition of probability density, applying the constraints to a Parzen-Rosenblatt kernel density estimator and obtaining

argmax_R ∏_i ∑_j exp( −‖R x_i − y_j‖² / (2h²) )   (12)

We subdivide our rotation matrix into two rotation matrices, namely R (the initial rotation) and S (an additive rotation). The log-likelihood obtained after the introduction of the new rotation matrix into equation (12) can be written as

L = ∑_i log( ∑_j exp( −‖S R x_i − y_j‖² / (2h²) ) )   (13)

Let α_{ij} be defined as

α_{i,j} = exp( −‖R φ_i − φ^C_j‖² / (2h²) )   (14)

where, with a slight abuse of notation, φ_i = x_i are the embedding points of the graph and φ^C_j = y_j are the points of the class model.

In order to solve (10), we compute the gradient with respect to the additive rotation matrix S introduced in (13):

∂L/∂S_{hk} = ∑_i [ ∑_j α_{ij} ( −(1/(2h²)) ∂_{S_{hk}} ‖S R x_i − y_j‖² ) ] / [ ∑_j α_{ij} ]   (15)

where

∂_{S_{hk}} ‖S R x_i − y_j‖² = −2 (y_j)_h (R x_i)_k   (16)

Collecting these scalar entries into matrix form, we obtain

∂_S ‖S R x_i − y_j‖² = −2 y_j (R x_i)^T = −2 y_j x_i^T R^T   (17)

We can now write the gradient of (13) as

∂L/∂S = ( ∑_i [ ∑_j α_{ij} (1/h²) y_j x_i^T ] / [ ∑_j α_{ij} ] ) R^T   (18)

For the sake of readability, let A be defined as

A = ∑_i [ ∑_j α_{ij} (1/h²) y_j x_i^T ] / [ ∑_j α_{ij} ]   (19)

Figure 1: Example of the computation of the rotation matrix. A) KDE applied to the eigenvector matrix of the Laplacian of a graph, B) KDE of a synthetically rotated eigenvector matrix of the same graph, C) KDE of the eigenvector matrix after the application of the rotation matrix computed using the described method.

Since S is an orthogonal rotation matrix, it belongs to the Lie group O(d). The tangent space at the identity element of the Lie group is its Lie algebra, which is the space of skew-symmetric matrices; the skew-symmetric component of a matrix M is given by (M − M^T)/2.

In order to project the gradient onto the null space (to find the maximum), we have to make A R^T symmetric. The rotation matrix R which symmetrizes the previously computed gradient is obtained through the singular value decomposition (SVD) of A, svd(A) = U L V^T. In particular, we can compute R as

R = U V^T   (20)

which symmetrizes the gradient. Indeed,

A R^T = (U L V^T)(V U^T) = U L U^T   (21)

which is symmetric. Refer to Figure 1 for a graphical example of the described process.

To compute the rotation we used the following algorithm:

1. Set the initial value of R to the identity matrix.

2. Compute α_{ij} (14) for each i = 1, ..., n (where n is the number of nodes of the graph) and j = 1, ..., N (where N is the number of nodes of the model).

3. Compute the matrix A (19).

4. Compute the singular value decomposition of A, svd(A) = U L V^T.

5. Compute R as R = U V^T.

6. If convergence is achieved, i.e. A = A^T, or the maximum number of allowed iterations is reached, end the algorithm; otherwise repeat from step 2.

The maximum number of iterations was set to 10 for the results shown in Section 3.
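The algorithm above can be transcribed as follows (NumPy assumed; x holds the sign-corrected embedding of one graph, y the model samples, and the function name is ours). Note that A is built from the unrotated points, as in Eq. (19), while the responsibilities α use the current rotation, as in Eq. (14).

import numpy as np

def estimate_rotation(x, y, h, max_iter=10):
    """Orthogonal alignment of the embedding x (n, d) onto the model y (N, d)."""
    d = x.shape[1]
    R = np.eye(d)                                        # step 1: identity
    for _ in range(max_iter):
        xr = x @ R.T                                     # rotated embedding points
        sq = ((xr[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
        alpha = np.exp(-sq / (2.0 * h ** 2))             # step 2: Eq. (14)
        w = alpha / alpha.sum(axis=1, keepdims=True)     # per-node normalisation
        A = (y.T @ w.T @ x) / h ** 2                     # step 3: Eq. (19)
        U, _, Vt = np.linalg.svd(A)                      # step 4
        R = U @ Vt                                       # step 5: Eq. (20)
        if np.allclose(A, A.T, atol=1e-8):               # step 6: convergence check
            break
    return R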

2.1.3 Estimating the Eigenvalue Model

Let G_C = {G_1, G_2, ..., G_m} be a set of graphs belonging to the same class C, and let (Φ^C_i, Λ^C_i), i = 1, ..., m, be their spectral representations. The diagonal of the eigenvalue matrix Λ^C_i contains the eigenvalues λ^i_1, λ^i_2, ..., λ^i_d of the i-th graph of the set. Let

Λ^C = [ diag(Λ^C_1) ;
        diag(Λ^C_2) ;
        ...
        diag(Λ^C_m) ]

be the m × d matrix containing the first d non-zero eigenvalues of the spectral representations. We assume that the j-th eigenvalues of the Λ^C_i, with j = 1, ..., d, are distributed according to a log-normal distribution, as shown in (5). We compute a maximum likelihood estimate of the model parameters, resulting in:

µ = (1/d) ∑_j ln x_j ,    σ² = (1/d) ∑_j (ln x_j − µ)²   (22)
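A per-eigenvalue maximum likelihood fit can be sketched as follows (NumPy assumed; here the sample mean and variance of the log-eigenvalues are taken over the m training graphs for each eigenvalue index, which is how we read Eq. (22), and the function names are ours).

import numpy as np

def fit_eigenvalue_model(Lambda_C):
    """Log-normal ML estimates (Eq. 22) from the (m, d) matrix of class eigenvalues."""
    logs = np.log(Lambda_C)
    mu = logs.mean(axis=0)                  # one mu_i per eigenvalue index
    sigma = logs.std(axis=0) + 1e-12        # one sigma_i per eigenvalue index
    return mu, sigma

def eigenvalue_log_likelihood(lam, mu, sigma):
    """Log of Eq. (5) for the d observed eigenvalues lam of a test graph."""
    z = (np.log(lam) - mu) / sigma
    return float(np.sum(-0.5 * z ** 2 - np.log(lam * sigma) - 0.5 * np.log(2.0 * np.pi)))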

2.2 Prediction

Once the models are computed, we can combine them in order to classify a graph which does not belong to the training set used to compute Φ^C and Λ^C. Let G* be such a graph, and let Φ* and Λ* be the spectral decomposition of the Laplacian of G*. Thanks to the assumption of independence between the two models, we can define the prediction as the posterior probability

P(C | G*) = P(Φ* | Φ^C) P(Λ* | Λ^C)   (23)

Once both of the above probabilities are computed, i.e. the probabilities with respect to the eigenvector model and to the eigenvalue model, and still assuming independence between them, we can compute the conditional distribution with respect to the class C using equation (23). Since both P(Φ* | Φ^C) and P(Λ* | Λ^C) are computed in log form (equations (25) and (26)), it can be rewritten as

log P(C | G*) = ℓ_L(Φ* | Φ^C) + ℓ_L(Λ* | Λ^C)   (24)

In particular, the eigenvector model log-likelihood is defined as

ℓ_L(Φ* | Θ_Φ) = log ∏_{i=1}^{n} P(x_i | Θ_Φ) = ∑_{i=1}^{n} log P(x_i | Θ_Φ)   (25)

where n is the number of nodes of the graph G*, while x_i is the i-th row of the eigenvector matrix Φ*.

The eigenvalue model log-likelihood is defined as

ℓ_L(Λ* | µ^Θ, σ^Θ) = log ∏_{i=1}^{d} P(λ_i) = ∑_{i=1}^{d} log P(λ_i)   (26)

where µ^Θ_i and σ^Θ_i are the parameters estimated using (22).

Finally, a decision rule is applied in order to predict the membership of a graph to a certain class. In particular, in this work we classify graphs by assigning them to the most probable class (i.e. the class that yields the highest value).
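Putting the pieces together, the decision rule can be sketched as below (the helpers are the hypothetical ones from the previous snippets; in the full method the test embedding is first sign-flipped and rotated towards each class model before the likelihood is evaluated).

def classify(A_star, class_models, d):
    """Assign the graph with adjacency A_star to the class maximising Eq. (24).

    class_models maps a class label to (theta, h, mu, sigma): the aligned
    eigenvector samples with their bandwidth and the log-normal parameters.
    """
    n = A_star.shape[0]
    lam, phi = laplacian_spectrum(A_star, d)       # spectral decomposition of G*
    phi = n * phi                                  # same size scaling as the model
    scores = {}
    for label, (theta, h, mu, sigma) in class_models.items():
        scores[label] = (log_eigenvector_likelihood(phi, theta, h)      # Eq. (25)
                         + eigenvalue_log_likelihood(lam, mu, sigma))    # Eq. (26)
    return max(scores, key=scores.get)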

3 EXPERIMENTAL RESULTS

We now evaluate the proposed model by comparing it with a number of well-known alternative classification methods. More specifically, we compare our structure-based classifier with some popular graph kernels, namely the unaligned QJSD kernel [Bai et al., 2013], the Weisfeiler-Lehman kernel [Shervashidze et al., 2011], the graphlet kernel [Shervashidze et al., 2009], the shortest-path kernel [Borgwardt and Kriegel, 2005], and the random walk kernel [Kashima et al., 2003]. Note that for the Weisfeiler-Lehman kernel we set the number of iterations h = 3 and we attribute each node with its degree.

Figure 2: Average classification accuracy on all the datasets as we vary the embedding dimension for both the eigenvalue and eigenvector matrices.

The experiments were run on the following datasets: 1) the PPI dataset, which consists of protein-protein interaction (PPI) networks related to histidine kinase [Jensen et al., 2008] (40 PPIs from Acidovorax avenae and 46 PPIs from Acidobacteria); 2) the PTC (Predictive Toxicology Challenge) dataset, which records the carcinogenicity of several hundred chemical compounds for male rats (MR), female rats (FR), male mice (MM) and female mice (FM) [Li et al., 2012] (here we use the 344 graphs in the MR class); 3) the COIL dataset, which consists of 5 objects from [Nene et al., 1996], each with 72 views obtained from equally spaced viewing directions, where for each view a graph was built by triangulating the extracted Harris corner points; 4) the Reeb dataset, which consists of a set of adjacency matrices associated with the computation of Reeb graphs of 3D shapes [Biasotti et al., 2003]; and 5) the Mutag (Mutagenicity) dataset, which consists of graphs representing 188 chemical compounds, where the aim is to predict whether each compound possesses mutagenicity [Shervashidze et al., 2011]. Since the vertices and edges of each compound are labelled with a real number, we transform these graphs into unweighted graphs.

We use a binary C-SVM to test the efficacy of the kernels. We perform 10-fold cross-validation, where for each sample we independently tune the value of C, the SVM regularizer constant, by considering the training data from that sample. The process is averaged over 100 random partitions of the data, and the results are reported in terms of average accuracy ± standard error. We use a similar approach for the cross-validation of our method: we perform a 10-fold cross-validation over the datasets using the proposed model. We tested our method using different numbers of eigenvectors and eigenvalues, which can be seen as one of our free parameters.
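For reference, the evaluation protocol for the kernel baselines can be reproduced along the following lines (scikit-learn assumed; K is a precomputed kernel matrix, y the label vector, and the inner 3-fold selection of C is one reasonable reading of tuning C on the training data of each fold).

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

def kernel_svm_cv_accuracy(K, y, Cs=(0.01, 0.1, 1.0, 10.0, 100.0), seed=0):
    """10-fold cross-validated accuracy of a C-SVM on a precomputed kernel K."""
    accs = []
    for tr, te in StratifiedKFold(10, shuffle=True, random_state=seed).split(K, y):
        def inner_score(C):                      # tune C on the training fold only
            s = []
            for itr, ite in StratifiedKFold(3, shuffle=True, random_state=seed).split(tr, y[tr]):
                clf = SVC(C=C, kernel='precomputed')
                clf.fit(K[np.ix_(tr[itr], tr[itr])], y[tr[itr]])
                s.append(clf.score(K[np.ix_(tr[ite], tr[itr])], y[tr[ite]]))
            return np.mean(s)
        best_C = max(Cs, key=inner_score)
        clf = SVC(C=best_C, kernel='precomputed').fit(K[np.ix_(tr, tr)], y[tr])
        accs.append(clf.score(K[np.ix_(te, tr)], y[te]))
    return np.mean(accs), np.std(accs) / np.sqrt(len(accs))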


Table 1: Classification accuracy (± standard error) on unattributed graph datasets. OUR denotes the proposed model. QJSD and SA QJSD denote the Quantum Jensen-Shannon kernel in the unaligned [Bai et al., 2013] and aligned [Torsello et al., 2014] versions, WL is the Weisfeiler-Lehman kernel [Shervashidze et al., 2011], GR denotes the graphlet kernel computed using all graphlets of size 3 [Shervashidze et al., 2009], SP is the shortest-path kernel [Borgwardt and Kriegel, 2005], and RW is the random walk kernel [Kashima et al., 2003]. For each dataset, the best performance is highlighted in bold.

Datasets    PPI           PTC           COIL5         Reeb          MUTAG
OUR         79.60±0.86    76.80±1.52    86.41±0.38    67.36±1.52    87.74±0.47
QJSD        68.86±1.00    55.78±0.38    69.83±0.22    35.03±0.26    81.00±0.51
SA QJSD     68.56±0.87    57.07±0.34    69.90±0.22    35.78±0.42    82.11±0.30
WL          79.40±0.83    56.86±0.37    29.08±0.57    50.73±0.39    77.94±0.46
GR          51.06±1.00    55.70±0.18    66.49±0.25    22.90±0.36    81.05±0.41
SP          63.25±0.97    56.32±0.28    69.28±0.42    55.85±0.37    83.36±0.52
RW          49.93±0.83    55.78±0.07    11.83±0.17    15.98±0.42    79.61±0.64

Furthermore, we tested the model with different levels of sub-sampling, that is, we sub-sampled all the graphs of the datasets (both training and test sets) and applied our classification method to them.

Fig. 2 shows the average classification accuracy (± standard error) on all the datasets as we vary the number of eigenvectors used. As can be seen, every dataset behaves differently depending on the number of eigenvectors involved. In particular, for the COIL5 dataset the use of more eigenvectors yields worse results, which means that the eigenvectors associated with the smaller non-zero eigenvalues of the spectrum model the classes better, while the subsequent ones just add noise to the representation. On the contrary, the Mutag dataset benefits from increasing the number of eigenvectors (and eigenvalues) involved in the creation of the class model.

Fig. 3 shows the average classification accuracy (± standard error) on all the datasets as we vary the percentage of sub-sampling applied to each graph of each dataset. In particular, the first accuracy measure corresponds to the application of our model to the spectral decomposition of the graphs where only 10% of the nodes were preserved. All the datasets (except for the Mutag and PPI datasets) reach worse levels of accuracy with a lower number of nodes, meaning that the structural information given by each node of the model is useful for classification purposes; conversely, Mutag and PPI are more robust to sub-sampling.

Figure 3: Average classification accuracy (with the interval segment representing the ± standard error) on all the datasets as we vary the percentage of sub-sampling applied to each graph of each dataset.

Table 1 shows the average classification accuracy (± standard error) of the different kernels and of our method on the selected datasets. The proposed model yields an increase in performance with respect to the compared graph kernels on all the datasets used. In particular, we obtained similar results to the Weisfeiler-Lehman graph kernel on the PPI dataset; this is probably due to its use of node labels, which mitigates the localization problem and thus improves node localization in the evaluation process. Even though our model does not exploit node attributes, we were able to outperform all the kernels on all the other datasets.

4 CONCLUSIONS

In this paper we have introduced a novel model of structural representation based on a spectral description of graphs which lifts the one-to-one node-correspondence assumption and is strongly rooted in a statistical learning framework. We showed how the separate models defined for the eigenvalues and eigenvectors can be used within a statistical framework to address the graph classification task. We tested the proposed method against a number of alternative graph kernels and showed its effectiveness in a number of structural classification tasks.


Figure 4: Average classification accuracy (with the interval segment representing the ± standard error) on all the datasets as we vary the percentage of graphs of the training set used to build the model.

REFERENCES

Aubry, M., Schlickewei, U., and Cremers, D. (2011). The wave kernel signature: A quantum mechanical approach to shape analysis. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 1626–1633.

Bai, L., Hancock, E., Torsello, A., and Rossi, L. (2013). A quantum Jensen-Shannon graph kernel using the continuous-time quantum walk. In Kropatsch, W., Artner, N., Haxhimusa, Y., and Jiang, X., editors, Graph-Based Representations in Pattern Recognition, Lecture Notes in Computer Science, pages 121–131. Springer Berlin Heidelberg.

Biasotti, S., Marini, S., Mortara, M., Patanè, G., Spagnuolo, M., and Falcidieno, B. (2003). 3D shape matching through topological structures. In Nyström, I., Sanniti di Baja, G., and Svensson, S., editors, Discrete Geometry for Computer Imagery, volume 2886 of Lecture Notes in Computer Science, pages 194–203. Springer Berlin Heidelberg.

Bonev, B., Escolano, F., Lozano, M., Suau, P., Cazorla, M., and Aguilar, W. (2007). Constellations and the unsupervised learning of graphs. In Escolano, F. and Vento, M., editors, Graph-Based Representations in Pattern Recognition, volume 4538 of Lecture Notes in Computer Science, pages 340–350. Springer Berlin Heidelberg.

Borgwardt, K. M. and Kriegel, H.-P. (2005). Shortest-path kernels on graphs. In Proceedings of the 2005 International Conference on Data Mining, pages 74–81.

Bunke, H., Foggia, P., Guidobaldi, C., and Vento, M. (2003). Graph clustering using the weighted minimum common supergraph. In Hancock, E. and Vento, M., editors, Graph Based Representations in Pattern Recognition, volume 2726 of Lecture Notes in Computer Science, pages 235–246. Springer Berlin Heidelberg.

Estrada, F. and Jepson, A. (2009). Benchmarking image segmentation algorithms. International Journal of Computer Vision, 85(2):167–181.

Friedman, N. and Koller, D. (2003). Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1-2):95–125.

Jensen, L. J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks, T., Roth, E., Simonovic, M., Bork, P., and von Mering, C. (2008). STRING 8: a global view on proteins and their functional interactions in 630 organisms.

Kashima, H., Tsuda, K., and Inokuchi, A. (2003). Marginalized kernels between labeled graphs. In Proceedings of the Twentieth International Conference on Machine Learning, pages 321–328. AAAI Press.

Li, G., Semerci, M., Yener, B., and Zaki, M. J. (2012). Effective graph classification based on topological and label attributes. Statistical Analysis and Data Mining, pages 265–283.

Luo, B., Wilson, R. C., and Hancock, E. R. (2006). A spectral approach to learning structural variations in graphs. Pattern Recognition, 39(6):1188–1198.

Nene, S. A., Nayar, S. K., and Murase, H. (1996). Columbia Object Image Library (COIL-20). Technical report.

Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research.

Shervashidze, N., Vishwanathan, S. V. N., Petri, T. H., Mehlhorn, K., et al. (2009). Efficient graphlet kernels for large graph comparison.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London.

Todorovic, S. and Ahuja, N. (2006). Extracting subimages of an unknown category from a set of images. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 1, pages 927–934.

Torsello, A. (2008). An importance sampling approach to learning structural representations of shape. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–7.

Torsello, A., Gasparetto, A., Rossi, L., and Hancock, E. (2014). Transitive state alignment for the quantum Jensen-Shannon kernel.

Torsello, A. and Hancock, E. (2006). Learning shape-classes using a mixture of tree-unions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(6):954–967.

Torsello, A. and Rossi, L. (2011). Supervised learning of graph structure. In Pelillo, M. and Hancock, E. R., editors, SIMBAD, volume 7005 of Lecture Notes in Computer Science, pages 117–132. Springer.

White, D. and Wilson, R. (2007). Spectral generative models for graphs. In Image Analysis and Processing, 2007. ICIAP 2007. 14th International Conference on, pages 35–42.