Exploiting Edge Features in Graph Neural Networks

A blog post discussing the paper 'Exploiting Edge Features in Graph Neural Networks' by Liyu Gong and Qiang Cheng, presented at CVPR 2019.
Introduction
  Graph Learning
  Medical Applications
  Graph Neural Networks
Related Work
  Graph Convolutional Network
  Graph Attention Network
    Step 1: Linear Transformation
    Step 2: Computation of Attention Coefficients
    Step 3: Normalization of Attention Coefficients
    Step 4: Computation of Final Output Features
    Step 5: Computation of Multiple Attention Mechanisms
Methodology
  Graph Structure
  Proposed Network Architecture
  Doubly Stochastic Normalization of Edges
  Network Layers
    Attention based EGNN layer - EGNN(A)
    Convolution based EGNN layer - EGNN(C)
Experiment
  Dataset
    Citation Networks
    Molecular Analysis
  Experimental Setup
Results
  Citation Networks
  Molecular Analysis
Conclusion
  Limitation
References
Introduction
Graphs are an often overlooked data structure when we think of learning from data, owing to the difficulty of learning from their very complex structures. However, recent scientific publications have made significant inroads into this field.
Graphs
Graphs are data structures that consist of vertices (nodes) joined by edges. Edges can be directed or undirected, and weighted or unweighted. Both edges and nodes can have features associated with them.
Fig 1: An Example Graph
Fig 2: Visual Representation of the Cora Dataset as a Graph [3]
Graph Learning
Graph learning is important because many systems of interaction in the natural world, such as molecular structures and biological protein networks, can be naturally modeled as graphs. Graphs have an advantage over the conventional tabular (Euclidean) representation of data: a graph can be very compact, whereas the equivalent table would be sparse, which makes learning from such tables cumbersome.
Fig 3: Example Graph Problem of Edge Classification
Medical Applications
1. Disease Prediction [5] [7]: Patients are represented as nodes, with the patients' associated medical images as node features. The network learns the probability of a certain disease appearing in an individual.
2. Link Prediction [6]: The network learns the characteristic features of relevant edges in drug-disease prediction, protein-protein interaction and drug-drug interaction.
Graph Neural Networks
Graph Neural Networks are networks that operate directly on the graph structure. They are typically used for node classification, edge classification, or classification of the graph as a whole. The typical Graph Neural Network learns from two matrices that are constructed from the underlying graph.
Feature Description Matrix (H): Aggregate of features from all nodes of the network.
Adjacency Matrix (A): Encodes the Structure of Edges.
A typical layer of a Graph Neural Network can be compactly represented by the following expression: $H^{(l+1)} = \sigma\left(A H^{(l)} W^{(l)}\right)$, where $W^{(l)}$ is the learnable weight matrix of layer $l$ and $\sigma$ is a non-linearity.
Adjacency Matrix
This matrix compactly represents the structure of the graph, i.e. it denotes which nodes are connected and which are not. An example is presented below.
Fig 4: A Graph Structure with five nodes with non-directional, non-weighted edges
The Adjacency Matrix is binary: a 1 denotes the presence of an edge between the nodes given by the respective matrix indices, and a 0 denotes the absence of an edge. Each node is connected to itself. The matrix is binary because the graph under consideration is undirected and unweighted.
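To make this concrete, here is a minimal NumPy sketch of an adjacency matrix for a hypothetical five-node graph (the edges are illustrative, not read off Fig 4) together with one generic GNN propagation step $H' = \sigma(AHW)$ from the expression above:

```python
import numpy as np

# Hypothetical 5-node undirected, unweighted graph (a 5-cycle);
# the exact edges are illustrative.
A = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 1, 1],
], dtype=float)  # A[i, j] = 1 iff an edge connects nodes i and j; diagonal = self-loops

assert (A == A.T).all()  # undirected graphs have symmetric adjacency matrices

# One generic GNN layer: H' = sigma(A H W)
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))          # node feature matrix, 4 features per node
W = rng.normal(size=(4, 2))          # weight matrix (random stand-in for a learned one)
H_next = np.maximum(A @ H @ W, 0.0)  # ReLU as the non-linearity sigma
print(H_next.shape)                  # (5, 2)
```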
Related Work
In order to fully understand the contributions of the paper under discussion, we have to take a look back at the pre-existing methods.
Graph Convolutional Network
This network is directly inspired by the Convolutional Neural Network (CNN). The idea is similar: we transform the relevant information from the spatial domain to the spectral domain, convolve it with a filter, and then transform the result back to the spatial domain. The calculation begins with the computation of the Graph Laplacian.
The combinatorial graph Laplacian is $L = D - W$; in practice the normalized form $L = I - D^{-1/2} W D^{-1/2}$ is used. Here $W$ is the weight matrix and $D$ is the diagonal degree matrix.
This Laplacian matrix is diagonalizable by the Fourier basis $U$ such that $L = U \Lambda U^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ holds the eigenvalues of $L$ and the columns of $U$ are its orthonormal eigenvectors. This gives the graph Fourier transform of a signal $x$ as $\hat{x} = U^T x$ and the inverse transform as $x = U \hat{x}$.
This gives us the following representation for a node signal $x$ filtered by a filter $g_\theta$: $y = g_\theta(L)\,x = U\, g_\theta(\Lambda)\, U^T x$.
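To make the spectral pipeline concrete, here is a small NumPy sketch that builds the normalized Laplacian of a toy 4-node cycle, eigendecomposes it, and filters a node signal; the low-pass filter $g_\theta(\Lambda) = e^{-2\Lambda}$ is an illustrative choice of mine, not a learned GCN filter:

```python
import numpy as np

W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)       # weight matrix of a 4-cycle
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(4) - D_inv_sqrt @ W @ D_inv_sqrt     # normalized graph Laplacian

lam, U = np.linalg.eigh(L)   # eigenvalues and orthonormal eigenvectors of L
x = np.array([1.0, 0.0, 0.0, 0.0])              # a signal on the nodes

x_hat = U.T @ x              # graph Fourier transform
g = np.exp(-2.0 * lam)       # example spectral filter g_theta(Lambda)
y = U @ (g * x_hat)          # filter in the spectral domain, transform back
print(y)
```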
Drawbacks:
- GCN requires knowledge of the complete graph to perform the convolution operation, because all eigenvalues ($\lambda$) of the graph structure have to be computed, which is computationally very expensive.
- New additions to the graph require recalculation of the entire set of architecture parameters, since the eigenvalues ($\lambda$) change.
- Knowledge cannot be transferred from one domain to another; the approach is transductive.
Graph Attention Network
In order to overcome the drawbacks of the GCN, the GAT was proposed. It is a non-spectral approach: the convolutions are performed directly on spatially close neighbours. Attention is a mechanism by which the network finds out how influential the features of the neighbouring nodes are in learning the features of the node under consideration. The basic building block of the GAT is the Graph Attention Layer. The following sections explain how the GAT works.
Fig 5: An Example Graph fed into a Graph Attention Layer
Here each node $i$ has a feature vector $\vec{h}_i$ of length $F$.
Step 1: Linear Transformation
The first step performed by the Graph Attention Layer is to apply a linear transformation, a learnable weight matrix $W$, to the feature vectors of the nodes: $\vec{h}_i \mapsto W\vec{h}_i$.
Step 2: Computation of Attention Coefficients
Attention coefficients determine the relative importance of neighbouring features with respect to each other. They are calculated using the formula $e_{ij} = a(W\vec{h}_i, W\vec{h}_j)$. Here $a$ is a function that we choose, subject to the restriction $a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$, where $F'$ is the length of the transformed feature vectors; $i$ and $j$ are neighbouring nodes.
Fig 6: Individual Steps in calculation of attention coefficients for each pair of Neighbouring nodes
As we observe in the series of images above, we first calculate the self-attention coefficient and then compute attention coefficients with all of the neighbours.
Step 3: Normalization of Attention Coefficients
Due to the varied structure of graphs, nodes can have different numbers of neighbours. In order to have a common scaling across all neighbourhoods, the attention coefficients are normalized:

$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \dfrac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})}$

Here $N_i$ is the neighbourhood of node $i$.
Step 4: Computation of Final Output Features
Now we compute the learned features of the nodes: $\vec{h}_i' = \sigma\left(\sum_{j \in N_i} \alpha_{ij} W \vec{h}_j\right)$, where $\sigma$ is a non-linear transformation.
Fig 7: Example Network Architecture
Step 5: Computation of Multiple Attention Mechanisms
In order to improve the stability of the learning process, multi-head attention is employed: we compute multiple independent attention maps and finally aggregate all the learned representations, for example by concatenation:

$\vec{h}_i' = \big\Vert_{k=1}^{K}\, \sigma\left(\sum_{j \in N_i} \alpha_{ij}^{k} W^{k} \vec{h}_j\right)$

$K$ denotes the number of independent attention maps used.
Fig 8: Overall look at all operations involved in the update of learned features of nodes [3]
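The five steps above can be condensed into a short NumPy sketch of a single attention head; the random parameters and the use of tanh as $\sigma$ are illustrative stand-ins, not the reference implementation:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())           # overflow-safe softmax
    return e / e.sum()

def gat_layer(H, A, W, a, slope=0.2):
    """Minimal single-head GAT layer (steps 1-4); multi-head attention
    simply runs this K times with separate W, a and concatenates."""
    Z = H @ W                                    # step 1: linear transformation
    H_out = np.zeros_like(Z)
    for i in range(H.shape[0]):
        nbrs = np.nonzero(A[i])[0]               # neighbourhood of node i (incl. self-loop)
        # step 2: raw attention coefficients e_ij = LeakyReLU(a^T [z_i || z_j])
        e = np.array([np.concatenate([Z[i], Z[j]]) @ a for j in nbrs])
        e = np.where(e > 0, e, slope * e)        # LeakyReLU
        att = softmax(e)                         # step 3: normalization over neighbours
        H_out[i] = np.tanh(att @ Z[nbrs])        # step 4: weighted aggregation + sigma
    return H_out

# Toy usage with random parameters
rng = np.random.default_rng(1)
A = np.eye(4) + np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)  # path graph + self-loops
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
a = rng.normal(size=(4,))                        # attention vector of length 2*F'
print(gat_layer(H, A, W, a).shape)               # (4, 2)
```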
Methodology
Graph Structure
This paper presents a network architecture that can handle a much more complex graph, like the one shown below, with all features being real-valued.
Fig 9: Representation of Graph Structure to be learned
We see that the graph now has multi-dimensional edge features, which can be real-valued.
Proposed Network Architecture
The paper proposes a novel architecture that learns from edge features as well as node features. The proposed network is named the Edge Enhanced Graph Neural Network (EGNN).
Fig 10: Comparison of network architectures: the traditional GAT on the left and the proposed EGNN on the right [1]
Doubly Stochastic Normalization of Edges
The edge feature matrix has to be normalized after each layer of the graph network, as it is used as the filter in the convolution operation with the node feature matrix. The authors propose Doubly Stochastic Normalization of Edges in place of the conventional row normalization of the GAT and the symmetric normalization of the GCN:

$\tilde{e}_{ijp} = \dfrac{e_{ijp}}{\sum_{k=1}^{N} e_{ikp}}$

$\hat{e}_{ijp} = \sum_{k=1}^{N} \dfrac{\tilde{e}_{ikp}\,\tilde{e}_{jkp}}{\sum_{v=1}^{N} \tilde{e}_{vkp}}$

Here each edge has $P$ features, $e_{ijp}$ denotes the $p$-th raw feature of the edge connecting nodes $i$ and $j$, and $\hat{e}_{ijp}$ denotes the normalized feature.
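A NumPy sketch of this two-step normalization, applied channel-wise to an edge tensor of shape (N, N, P), could look like this (the variable names are mine, not the paper's):

```python
import numpy as np

def doubly_stochastic_normalize(E):
    """Doubly stochastic normalization of an edge tensor E of shape (N, N, P),
    applied channel-wise; assumes non-negative raw edge features."""
    E = E.astype(float)
    # step 1: row normalization, tilde_e_ijp = e_ijp / sum_k e_ikp
    T = E / np.clip(E.sum(axis=1, keepdims=True), 1e-12, None)
    # step 2: hat_e_ijp = sum_k tilde_e_ikp * tilde_e_jkp / sum_v tilde_e_vkp
    col_sums = T.sum(axis=0, keepdims=True)          # sum over v for each (k, p)
    Tn = T / np.clip(col_sums, 1e-12, None)
    return np.einsum('ikp,jkp->ijp', Tn, T)

# Each channel of the result is symmetric and doubly stochastic:
rng = np.random.default_rng(2)
E_hat = doubly_stochastic_normalize(rng.random((5, 5, 3)))
print(np.allclose(E_hat.sum(axis=0), 1.0), np.allclose(E_hat.sum(axis=1), 1.0))
```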
Network Layers
In contrast to the above discussed GAT, which handles only one-dimensional, binary-valued edge features, the Edge Enhanced Graph Neural Network can handle multi-dimensional, real-valued edge features. The Edge Enhanced Graph Neural Network can be constructed from two main types of layers: the attention based layer or the convolution based layer.
Attention based EGNN layer - EGNN(A)
This layer is designed as an extension of the GAT layer, but handles multi-dimensional real-valued features. The layer is described by the following mathematical representation:

$X^{l} = \sigma\left[ \big\Vert_{p=1}^{P} \left( \alpha^{l}_{\cdot\cdot p}\, g^{l} \right) \right]$

$P$ is the total number of features per edge, and $\alpha$ is the calculated attention coefficient tensor.

In this experiment, $g^{l} = X^{l-1} W^{l}$, where $W^{l}$ is a parameter matrix. Notice the similarity of the calculation to the GAT; the main point of difference is that the edge features are also used in the calculation of the attention coefficients.
Attention Coefficient Calculation

$\hat{\alpha}_{ijp} = \exp\big( L\big( a^{T} [\, W x_i \,\Vert\, W x_j \,] \big) \big)\, e_{ijp}, \qquad \alpha^{l}_{\cdot\cdot p} = \mathrm{DS}(\hat{\alpha}_{\cdot\cdot p})$

L - LeakyReLU ; W - Parameter Matrix ; || - Concatenation Operation ; DS - Doubly Stochastic Normalization Operator

Update of Edge Features

The normalized attention coefficients themselves serve as the edge features passed to the next layer, $E^{l}_{\cdot\cdot p} = \alpha^{l}_{\cdot\cdot p}$, which is how the EGNN adapts edge features across layers.
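Putting these pieces together, here is a rough sketch of an EGNN(A) layer; it reuses doubly_stochastic_normalize() from the earlier snippet, and the choice of tanh for $\sigma$ is illustrative rather than prescribed by the paper:

```python
import numpy as np

def egnn_a_layer(X, E, W, a, leaky=0.2):
    """Sketch of an attention based EGNN layer: GAT-style scores are
    modulated channel-wise by edge features, normalized with the doubly
    stochastic operator, and the P channel outputs are concatenated."""
    N, P = X.shape[0], E.shape[2]
    Z = X @ W                                  # shared linear transformation g = X W
    S = np.empty((N, N))                       # exp(LeakyReLU(a^T [Wx_i || Wx_j]))
    for i in range(N):
        for j in range(N):
            s = np.concatenate([Z[i], Z[j]]) @ a
            S[i, j] = np.exp(s if s > 0 else leaky * s)
    # modulate by every edge channel, then doubly stochastic normalization
    alpha = doubly_stochastic_normalize(S[:, :, None] * E)
    # aggregate per channel and concatenate the P results
    X_out = np.tanh(np.concatenate([alpha[:, :, p] @ Z for p in range(P)], axis=1))
    return X_out, alpha                        # alpha becomes the next layer's E
```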
Convolution based EGNN layer - EGNN(C)
The graph convolution operation is a special case of the graph attention operation: instead of computing attention coefficients, it directly uses the (doubly stochastically normalized) edge feature matrix as the filter:

$X^{l} = \sigma\left[ \big\Vert_{p=1}^{P} \left( \hat{E}_{\cdot\cdot p}\, X^{l-1} W^{l} \right) \right]$
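A corresponding sketch of the convolution based layer, again reusing doubly_stochastic_normalize() from above and with an illustrative non-linearity:

```python
import numpy as np

def egnn_c_layer(X, E_hat, W):
    """Sketch of a convolution based EGNN layer: each normalized edge
    channel E_hat[:, :, p] acts directly as the aggregation filter."""
    Z = X @ W
    P = E_hat.shape[2]
    return np.tanh(np.concatenate([E_hat[:, :, p] @ Z for p in range(P)], axis=1))

# Toy usage: 5 nodes, 4 node features, 3 edge channels (normalized as above)
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 4))
E_hat = doubly_stochastic_normalize(rng.random((5, 5, 3)))
print(egnn_c_layer(X, E_hat, rng.normal(size=(4, 2))).shape)  # (5, 6)
```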
Edge Features for Directed Graph
The proposed network architecture also takes the directed nature of a graph into account by expanding the number of edge channels: each original edge channel is expanded into three channels that denote the type of neighbourhood - forward, backward and undirected.
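One plausible reading of this channel expansion as code; treating the undirected channel as the combination of the forward and backward channels is my assumption, not something stated in the post:

```python
import numpy as np

def expand_directed_channels(E):
    """Expand each edge channel of a directed graph into three channels:
    forward (i -> j), backward (j -> i) and undirected. E: (N, N, P)."""
    forward = E
    backward = E.transpose(1, 0, 2)
    undirected = forward + backward   # assumption: undirected = both directions combined
    return np.concatenate([forward, backward, undirected], axis=2)

E = np.random.default_rng(4).random((4, 4, 2))
print(expand_directed_channels(E).shape)  # (4, 4, 6)
```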
Experiment
Dataset
Citation Networks
The citation network datasets are all directed graphs, with edge directions representing the directions of citations. The nodes (papers) are classified into one of several classes, so the problem being solved is multi-class node classification. The datasets used are Cora, Citeseer and Pubmed. The Cora dataset has a very high class imbalance, necessitating the use of weighted loss functions.
                Cora    Citeseer    Pubmed
#Nodes          2708    3327        19717
#Edges          5429    4732        44338
#Node Features  1433    3703        500
#Classes        7       6           3
Fig 11: Class Distribution of Citation Network Datasets [1]
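Given this imbalance, a weighted loss upweights the rare classes. Here is a minimal sketch using inverse-frequency class weights; this is one common choice, not necessarily the paper's exact weighting scheme:

```python
import numpy as np

def class_weights(labels, num_classes):
    """Inverse-frequency class weights, rescaled so the average weight is 1."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    w = counts.sum() / np.clip(counts, 1.0, None)
    return w / w.mean()

def weighted_cross_entropy(probs, labels, weights):
    # probs: (N, C) predicted class probabilities; labels: (N,) true classes
    return -np.mean(weights[labels] * np.log(probs[np.arange(len(labels)), labels]))

labels = np.array([0, 0, 0, 0, 1, 2])   # an imbalanced toy labelling
print(class_weights(labels, 3))         # rare classes receive larger weights
```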
Molecular Analysis
A molecule can be represented as a graph, with the nodes being atoms and the edges representing chemical bonds. The datasets used are Tox21, Lipophilicity and Freesolv. The Tox21 dataset is associated with a whole-graph classification problem, while the other two are associated with whole-graph regression problems.
                Tox21    Lipophilicity    Freesolv
#Compounds      7831     4200             642
#Node Features  25       25               25
#Edge Features  42       21               25
Experimental Setup
Each of the three datasets is split into three subsets: training, validation and testing. The authors used two splittings of the data for the citation networks:

- Sparse: 5%, 15% and 80% for training, validation and testing
- Dense: 60%, 20% and 20%, in the same order

The data split for molecular analysis is training (80%), validation (10%) and testing (10%). Two EGNN layers are used for all experiments, and the non-linear activation employed is the Exponential Linear Unit (ELU).
                Tox21                    Lipophilicity        Freesolv
Loss Function   Sigmoid Cross-Entropy    Mean Squared Loss    Mean Squared Loss
Results
Citation Networks
List of network architectures used for testing:
- GCN: baseline
- GAT: baseline
- EGNN(C): full EGNN(C) with doubly stochastic normalization and multi-dimensional edge features
- EGNN(A): full EGNN(A) with all functionalities
- *: architectures marked with * use a weighted loss function
- -D: no doubly stochastic normalization
- -M: ignores multi-dimensional edge features
- -A: no adaptiveness across layers
Fig 12: Table of Results obtained from Citation Networks Experiment [1]
Fig 1: Charts of Results for Citation Networks
Observations:
- Almost all EGNN variations perform better than the baselines, which indicates that the ideas have merit.
- The weighted loss function improves the baseline GAT and GCN performance on the Cora dataset considerably, for both sparse and dense splittings.
- The non-weighted loss variants of EGNN perform as well as the weighted loss variants on the Cora dataset, which indicates that the proposed additions inherently balance out the class imbalance.
- Adding weighted loss functions to the more balanced datasets does not lead to an improvement in performance.
- Performance on dense splittings of the data is much better than on the sparse splits. This is to be expected, as the network has more data to learn from in the dense split.
- The network variation with all of the ideas enabled does not perform best in most cases, which indicates that certain ideas do not contribute much when learning certain datasets.
Molecular Analysis
Fig 13: Table of Results obtained from Molecular Analysis Experiment* [1]
Fig 14: Charts of Results for Molecular Analysis
The baseline models used were Random Forest, Weave and a specialized GCN. EGNN(A) and EGNN(C) perform similarly on all three datasets. The greatest improvement over the baselines is observed in the regression tasks.
* AUC stands for Area Under Curve; higher values indicate better classification. RMSE stands for Root Mean Squared Error; lower values indicate better regression.
Conclusion
The authors have been able to learn from complex graphs without discarding information, unlike previous approaches, which were unable to handle the full complexity of graph networks. The proposals of multi-dimensional real-valued edge features, adapting edge features across layers, doubly stochastic normalization, and the inclusion of directed graphs will help us learn more complex graph architectures.
Limitation
Difficult to Perform ‘Deep Learning’
Due to the nature of graphs and the proposed methodology, each new layer added to the network increases the number of computations needed exponentially, since we need to keep updating the neighbours of neighbours.
Fig 15: Limitation of Graph Learning
This is not an issue if the hardware setup has enough memory to fit the entirety of the graph in memory.
References
[1] Liyu Gong, Qiang Cheng. Exploiting Edge Features in Graph Neural Networks. In Computer Vision and Pattern Recognition, 2019.
[2] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, pages 3844–3852. Curran Associates, Inc., 2016.
[3] P. Velickovic, G. Cucurull, A. Casanova, and A. Romero. Graph Attention Networks. In International Conference on Learning Representations, 2018.
[4] T. N. Kipf and M. Welling. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations, 2017.
[5] Sarah Parisot, Sofia Ira Ktena, Enzo Ferrante, Matthew Lee, Ricardo Guerrero, Ben Glocker, Daniel Rueckert. Disease Prediction using Graph Convolutional Networks: Application to Autism Spectrum Disorder and Alzheimer's Disease. In Medical Image Analysis, 2018.
[6] Xiang Yue, Zhen Wang, Jingong Huang, Srinivasan Parthasarathy, Soheil Moosavinasab, Yungui Huang, Simon M Lin, Wen Zhang, Ping Zhang, Huan Sun. Graph embedding on biomedical networks: methods, applications and evaluations. In CoRR, 2019.
[7] Xi Zhang, Lifang He, Kun Chen, Yuan Luo, Jiayu Zhou, and Fei Wang. Multi-View Graph Convolutional Network and Its Applications on Neuroimage Analysis for Parkinson's Disease. In AMIA Annual Symposium Proceedings, 2018.