Evaluating Autoencoder Methods for Building a Molecule Graph Autoencoder

Amelia Woodward {ameliawd}
CS229 Project (Spring 2020)

General Machine Learning/Physical Sciences
Mentors: Keiran Thompson 1 and Todd Martinez 2

The Martinez Group, Stanford Department of Chemistry and SLAC PULSE Institute

Abstract— We wish to build an autoencoder for molecules using molecule graphs (Molgraphs) as input. To begin developing this, we assess existing graphical autoencoding architectures for the purpose. By testing on organic molecules from the GDB13 database, we find that Kipf and Welling's VGAE model is the most promising model for development and can reconstruct some simple organic molecules before a large hyperparameter search. We will continue to explore VGAE as well as explore adjusting this encoder and decoder architecture going forward.

I. BACKGROUND

Motivation. Automated molecular design is highly desired to accelerate chemical innovation across industries, from pharmaceutical discovery to materials design. The Martinez Group is currently developing an automated synthesis planner for this purpose which uses graphs of molecules. From here on, let us call these molecule graphs 'Molgraphs'. For use in the synthesis planner and beyond, we want to build an autoencoder for Molgraphs (Figure 2).

This paper evaluates existing autoencoding techniques as applied to the task of autoencoding Molgraphs. In particular, we implement existing graphical autoencoder designs and evaluate their graph decoder architectures. Since one can never separate the loss function from the network architecture, we also analyze the reconstruction losses used. This work acts as the initial step to guide our development of an effective autoencoder for Molgraphs.

*Many thanks to the Martinez Group, especially Keiran and Todd, for their ongoing support and for access to GPUs. Thank you also to Stefan Seritan for his vital help with Docker container woes.

1 Keiran Thompson is a lead Research Associate in the Martinez Group whose impact on this project has been tremendous.

2 Todd Martinez is the head of the Martinez Group and a huge inspiration.

Fig. 1. Molgraph Autoencoder Design Overview [1]

Autoencoders. Broadly speaking, an autoencoder is a feedforward neural network which, when trained accurately, should return the same input as output. Autoencoders have an encoder and a decoder. The encoder maps the input to a latent space: a vectorized representation of its input. Depending on the encoding method, the latent space may be discrete or continuous. If the latent space is continuous, we call this a variational autoencoder (VAE). The decoder maps the latent space to an output, which we desire to be the same as the input. In the context of encoding a Molgraph, this means that ideally it takes in a Molgraph and returns the same Molgraph (Figure 2).
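As a schematic illustration of this encoder/decoder structure, here is a minimal PyTorch sketch; the layer widths and the ToyAutoencoder name are purely illustrative assumptions, not the architecture developed in this report.

```python
import torch
import torch.nn as nn

class ToyAutoencoder(nn.Module):
    """Illustrative encoder/decoder pair: input -> latent vector -> reconstruction."""
    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        # Encoder maps the input to a (here deterministic) latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        # Decoder maps the latent vector back to input space.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, input_dim))

    def forward(self, x):
        z = self.encoder(x)          # latent representation
        return self.decoder(z)       # reconstruction, ideally equal to x
```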

Autoencoders, and particularly VAEs, are useful in that their latent space may be used for downstream machine learning tasks. In our context, such tasks include:

(1) Link prediction between reactions: Link prediction in the context of synthesis planning means finding possible new reaction pathways in a graphical web of known reactions. This requires having some vectorized representation of Molgraphs.

(2) Optimization for desired chemical properties: Imagine there is some property you wish to optimize in a drug's design (perhaps you wish for it to have strong binding affinity to a particular protein, or to be highly polar in some area). In this scenario, you would encode a known molecule (in Molgraph form) that has similar properties to those you desire. You would then optimize the latent space for the particular property you desire, and then decode back to a Molgraph form, possibly uncovering a new molecule closer to the desired property.

Autoencoding Molecules. Given the wide-ranging uses of molecule autoencoders, they have been a topic of open and expanding research and development. Notably, the Aspuru-Guzik group present a chemical VAE which accepts SMILES strings: a text representation of molecules with specific rules (e.g. carbon dioxide = 'O=C=O' and ethane = 'CC') [2,3]. They encode SMILES strings to a latent space using a variational autoencoder, and decode back to SMILES strings. While this text autoencoder is a step forward, the SMILES format misses rich spatial information about molecules that may be captured by instead feeding autoencoders with graphical representations of molecules [4]. This is because molecule graphs explicitly take into account bonding and connectivity information. In general, nodes in a graph can represent atoms and edges can represent bonding information between atoms. Node and edge features can be adjusted to account for known chemical information, both at the atomic and bonding level. We expand on the particular design of Molgraphs in the Methods section.

Autoencoding Graphs. While research into graphical deep learning has grown exponentially, there remain challenges in autoencoding graphs, particularly in the decoding step. Specific to VAEs, it is challenging to take a discrete graphical structure, encode it into a continuous latent space, then accurately decode back to a discretized graphical structure.

Tackling this, Kipf and Welling pioneer a technique for variationally encoding graphs, called VGAE, initially demonstrating its use in large-graph link prediction tasks, and see improvements compared with a discretized graphical autoencoder (GAE) model [5]. They also see improvements over previously developed embedding techniques, including spectral clustering and DeepWalk [5]. In addition, Simonovsky and Komodakis make suggestions for adjustments to this structure specific to small graphs, coining their method graphVAE [6].

Fig. 2. Molgraph Preprocessing [7]

II. METHODS AND COMPONENTS

In summary, we implement GAE and VGAE, and take inspiration from graphVAE and our knowledge of small-molecule properties to design an adapted graphVAE. We train the models on subsets taken from the publicly available GDB13 database, first pre-processing the molecules into Molgraphs. We evaluate each model's ability to reconstruct the data by comparing reconstruction loss, L2 loss and average graph edit distance. We then draw conclusions about how each algorithm operates and where to take the development of an effective autoencoder.

Molgraphs. We define a Molgraph to be a graph G = (V, E) of a molecule, with V being the atoms in the molecule and E being edges representing bonds. For now there are no edge attributes, though these may (and likely will) be added when building a more fine-tuned model; at this stage we wish to build the simplest graphical model of a molecule possible. An edge is marked to exist if the covalent distance between two atoms is less than some purpose-defined threshold (α). In the standard case this would be ≈ 1.5 atomic radii, but there may be cases where we would want a higher threshold (i.e. when intramolecular properties may be particularly important), so this is adjustable. In this paper we restrict to Molgraphs with α = 1.5. We define what we believe to be the simplest meaningful node attributes: a one-hot encoding of atoms where each index in the array corresponds to a particular atomic number (e.g. hydrogen 'H' (atomic number 1) has a 1 in the 0th index, carbon 'C' (atomic number 6) has a 1 in the 5th index). We use NetworkX and RDKit to construct the Molgraphs from a SMILES (text representation) input of the molecule.
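For illustration, a minimal sketch of this preprocessing with RDKit and NetworkX. This is a sketch under our own assumptions: RDKit-embedded 3D coordinates stand in for the true geometry, MAX_Z fixes the one-hot length, and RDKit's covalent radii supply the per-atom radii; the actual pipeline used here may differ in details.

```python
import numpy as np
import networkx as nx
from rdkit import Chem
from rdkit.Chem import AllChem

MAX_Z = 17     # one-hot length covering H..Cl for GDB13-style molecules (assumption)
ALPHA = 1.5    # bonding threshold, in units of summed covalent radii

def smiles_to_molgraph(smiles: str) -> nx.Graph:
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=0)          # crude 3D geometry
    conf = mol.GetConformer()
    pt = Chem.GetPeriodicTable()
    g = nx.Graph()
    for atom in mol.GetAtoms():
        onehot = np.zeros(MAX_Z)
        onehot[atom.GetAtomicNum() - 1] = 1.0          # H -> index 0, C -> index 5, ...
        g.add_node(atom.GetIdx(), x=onehot)
    for i in range(mol.GetNumAtoms()):
        for j in range(i + 1, mol.GetNumAtoms()):
            d = conf.GetAtomPosition(i).Distance(conf.GetAtomPosition(j))
            r_sum = pt.GetRcovalent(mol.GetAtomWithIdx(i).GetAtomicNum()) + \
                    pt.GetRcovalent(mol.GetAtomWithIdx(j).GetAtomicNum())
            if d < ALPHA * r_sum:                      # edge if atoms are within the threshold
                g.add_edge(i, j)
    return g
```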

Autoencoding Methods

VGAE and GAE. [5] First we implement the VGAE architecture presented by Kipf and Welling [5]. Given a graph G = (V, E) with N = |V| nodes, we can construct an adjacency matrix A ∈ R^{N×N} and diagonal degree matrix D ∈ R^{N×N}. Node features are given in a matrix X ∈ R^{N×d}, where d is the length of the node featurization vectors. Latent variables are summarized in a matrix Z ∈ R^{N×f}, where f is the number of channels.

Kipf and Welling then define an inference model for Z given X and A, parameterized by a two-layer graphical convolutional network (GCN). Specifically,

q(Z | X, A) = Π_{i=1}^{N} q(z_i | X, A),    q(z_i | X, A) = N(z_i | μ_i, diag(σ_i^2)).

The two-layer GCN is

GCN(X, A) = Ã ReLU(Ã X W_0) W_1

with weight matrices W_0, W_1 to be learned. Here

Ã = D^{-1/2} A D^{-1/2},
μ = GCN_μ(X, A),
log σ = GCN_σ(X, A),

and ReLU(·) = max(0, ·).

The generative model takes a simple inner product between latent variables:

p(A | Z) = Π_{i=1}^{N} Π_{j=1}^{N} p(A_ij | z_i, z_j),    p(A_ij = 1 | z_i, z_j) = σ(z_i^T z_j).

Fig. 3. GAE/VGAE Architecture Overview

Here σ(·) is the sigmoid function. To learn the model, they optimize the variational lower bound L with respect to the weight matrices W_0, W_1:

L = E_{q(Z|X,A)}[log p(A | Z)] − KL(q(Z | X, A) || p(Z)).

Here, KL[q(·) || p(·)] is the Kullback-Leibler divergence between q(·) and p(·), and

p(Z) = Π_{i=1}^{N} N(z_i | 0, I)

is the Gaussian prior. The difference between GAE and VGAE is that for the GAE the latent variables are not sampled: we compute Â = σ(ZZ^T) directly with Z = GCN(X, A), and we do not include the KL loss in the learning step.
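For concreteness, the two-layer GCN encoder and inner-product decoder above can be sketched with PyTorch Geometric's built-in VGAE wrapper. This is a minimal sketch: the shared first layer, the hidden width of 2f, and the input dimension of 17 (matching our one-hot node features) are our assumptions, not the exact configuration reported here.

```python
import torch
from torch_geometric.nn import GCNConv, VGAE

class GCNEncoder(torch.nn.Module):
    """Two-layer GCN producing mu and log(sigma) for the latent variables."""
    def __init__(self, in_channels: int, latent_channels: int):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 2 * latent_channels)       # shared first layer
        self.conv_mu = GCNConv(2 * latent_channels, latent_channels)
        self.conv_logstd = GCNConv(2 * latent_channels, latent_channels)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv_mu(h, edge_index), self.conv_logstd(h, edge_index)

# VGAE adds the reparameterization, inner-product decoder and KL term.
model = VGAE(GCNEncoder(in_channels=17, latent_channels=16))
```

During training, z = model.encode(x, edge_index), and the loss model.recon_loss(z, edge_index) + model.kl_loss() mirrors the variational lower bound above (the PyTorch Geometric examples typically scale the KL term by 1/N).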

Adapting graphVAE. [6] The main difference between graphVAE and Kipf and Welling's VGAE is that graphVAE places a restriction on the size of the graph (a maximum of k nodes, where n is the original number of nodes and n ≤ k) and outputs a probabilistic fully connected graph of the form G̃ = (Ã, Ẽ, F̃). Here Ã ∈ R^{k×k}, where node probabilities lie along the diagonal and edge probabilities along the off-diagonal; Ẽ ∈ R^{k×k×d_e} holds the edge attributes; and the node attributes are in the matrix F̃ ∈ R^{k×d}. In this paper we will just adapt Ã and F̃, since we are currently not assigning any edge attributes to Molgraphs, and so from here on consider our adapted version.

Fig. 4. graphVAE Architecture Overview

In the decoding step, instead of using only an inner-product decoder like Kipf and Welling, they use a multi-layer perceptron with a sigmoid activation function outputting Ã. We implement a fully connected layer with softmax activation functions outputting F̃. We use a binary assignment matrix X ∈ R^{k×n} for graph matching, using the Hungarian Algorithm [8], and then calculate the loss function:

−log p(G | z) = −λ_A log p(A′ | z) − λ_F log p(F′ | z)

where the λ's are all set to 1, A′ = X A X^T and F′ = X^T F̃.

log p(A′ | z) = (1/k) Σ_a [ A′_{a,a} log Ã_{a,a} + (1 − A′_{a,a}) log(1 − Ã_{a,a}) ]
             + (1/(k(k−1))) Σ_{a≠b} [ A′_{a,b} log Ã_{a,b} + (1 − A′_{a,b}) log(1 − Ã_{a,b}) ],

log p(F | z) = (1/n) Σ_i log F_i^T F′_i.
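To make the adapted decoder concrete, below is a minimal PyTorch sketch of the MLP heads producing Ã and F̃, and of the Hungarian matching step. The layer widths, the graph-level latent vector, the node-matching cost, and the use of SciPy's linear_sum_assignment are our assumptions about implementation details, not the exact code used here.

```python
import torch
import torch.nn as nn
from scipy.optimize import linear_sum_assignment

class GraphVAEDecoder(nn.Module):
    """Maps a graph-level latent vector z to a probabilistic adjacency matrix
    A_tilde (k x k, sigmoid) and node-attribute matrix F_tilde (k x d, softmax)."""
    def __init__(self, latent_dim: int, k: int, d: int, hidden: int = 128):
        super().__init__()
        self.k, self.d = k, d
        self.mlp = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU())
        self.adj_head = nn.Linear(hidden, k * k)
        self.feat_head = nn.Linear(hidden, k * d)

    def forward(self, z):
        h = self.mlp(z)
        a_tilde = torch.sigmoid(self.adj_head(h)).view(self.k, self.k)
        f_tilde = torch.softmax(self.feat_head(h).view(self.k, self.d), dim=-1)
        return a_tilde, f_tilde

def match_nodes(cost: torch.Tensor) -> torch.Tensor:
    """Binary assignment matrix X (k x n) from a (k x n) node-matching cost,
    computed with the Hungarian algorithm."""
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    X = torch.zeros_like(cost)
    X[torch.as_tensor(row), torch.as_tensor(col)] = 1.0
    return X
```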

Dataset

We use the open-source GDB13 database, available from the University of Bern Department of Chemistry [7]. GDB13 is a multi-million molecule dataset containing only feasible organic molecules. The molecules in the database contain up to 13 atoms of C (carbon), O (oxygen), N (nitrogen), S (sulfur) and Cl (chlorine); these molecules often have more than 13 atoms in total because H (hydrogen) is not included in the count of 13. The molecules are written in SMILES (text) format, so we pre-process them to Molgraphs and then to appropriate tensors for deep learning. We extract subsets of molecules with particular size properties from this dataset in order to begin evaluating and understanding how well training graph autoencoders on Molgraphs works. The details of these subsets are explained in the Experiments section.

Metrics

We test on three main metrics: reconstruction loss (during training and testing), L2 loss between adjacency matrices, and graph edit distance.

(1) The reconstruction loss is directly calculated during training, validation and testing from the loss functions defined in the Methods section. (Note that this means we should not compare reconstruction loss between models, but rather between training instances of the same model type.)

(2) L2 loss: Since we have different reconstruction loss functions between models, we would like some consistent way of comparing the quality of decodings. Where A is the input adjacency matrix and Â is the decoded adjacency matrix, we first apply a threshold to Â such that if the entry in row i and column j satisfies â_ij < (mean(Â) + median(Â))/2, then â_ij is set to 0. We apply dynamic thresholding because the direct output Â is continuously probabilistic and we need some way to discretize the adjacency matrix. Let us call this thresholded matrix Â_thresh. Then over a test set of size n, we calculate:

loss_L2 = (1/n) Σ_{i=1}^{n} ||A_i − Â_thresh,i||_2.
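For illustration, a minimal NumPy sketch of this dynamic thresholding and the resulting L2 metric; reading ||·||_2 as the Frobenius norm of the difference is our assumption.

```python
import numpy as np

def threshold_adjacency(a_hat: np.ndarray) -> np.ndarray:
    """Dynamic threshold described above: zero out entries below (mean + median) / 2."""
    cutoff = 0.5 * (a_hat.mean() + np.median(a_hat))
    out = a_hat.copy()
    out[out < cutoff] = 0.0
    return out

def l2_metric(a_true: list, a_hat: list) -> float:
    """Average L2 (Frobenius) distance between input adjacency matrices and
    thresholded reconstructions over a test set."""
    dists = [np.linalg.norm(a - threshold_adjacency(b)) for a, b in zip(a_true, a_hat)]
    return float(np.mean(dists))
```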

(3) Average Graph Edit Distance (GED): GED is a measure of similarity between graphs used in graph theory. Its formula for the comparison of two graphs is:

GED(g_1, g_2) = min_{(e_1, ..., e_k) ∈ P(g_1, g_2)} Σ_{i=1}^{k} c(e_i)

where P(g_1, g_2) is the set of all edit paths transforming g_1 into g_2 and c(e) is the cost of each graph edit operation. We then naively compute an average graph edit distance over the test set of size n:

GED_average = (1/n) Σ_{i=1}^{n} GED_i.

We don't believe this is a perfect metric: some molecules, particularly much larger molecules, are likely to have much larger GEDs than a smaller molecule that is decoded incorrectly, so it is difficult to determine what is actually going on using this metric alone. However, it is a standard graph similarity metric nonetheless, so we include it. In addition, we examine how the models reconstruct particular molecules of interest (e.g. straight-chained vs. ringed structures) in attempting to understand what is going on.
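A sketch of how the average GED can be computed with NetworkX's exact (exponential-time) implementation, which is practical only for small Molgraphs; comparing one-hot node attributes in node_match is our assumption about how atom identity should be handled.

```python
import numpy as np
import networkx as nx

def average_ged(true_graphs, decoded_graphs):
    """Naive average graph edit distance over a test set of Molgraph pairs."""
    same_atom = lambda a, b: np.array_equal(a.get("x"), b.get("x"))   # match on one-hot features
    geds = [nx.graph_edit_distance(g1, g2, node_match=same_atom)
            for g1, g2 in zip(true_graphs, decoded_graphs)]
    return sum(geds) / len(geds)
```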

Fig. 5. Experiment 1 Results

Fig. 6. Experiment 2 Results

III. EXPERIMENTS

We conduct the following experiments to better understand how the models work and what is required for them to train well. Note that at this stage we do not expect the models to perform exceptionally well; rather, we are looking for the best path forward to effectively build a Molgraph autoencoder.

Experiment 1. We first attempt to understand how the models train on increasingly large molecules. We use three training/validation/test subsets extracted from GDB13, each with a 10K/1K/1K train/val/test split:
1. Up to 4 carbon molecules
2. Up to 6 carbon molecules
3. Up to 8 carbon molecules

We train each of GAE, VGAE and the adapted graphVAE. For the hyperparameters, we set the number of channels in the latent space to 16 and the number of epochs to 400. The learning rate is 0.01 for GAE/VGAE (which we deemed appropriate from pre-trials) and 0.001 for the adapted graphVAE. We achieve the results seen in Fig. 5.

Fig. 7. Training: Reconstruction Loss per Epoch, Experiment 2 (16 channels, 2 convolutional layers). VGAE training used learning rate 1e-3; graphVAE training used learning rate 1e-4.

Fig. 8. VGAE Reconstructions of Simple Molecules
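A minimal sketch of the corresponding VGAE training loop under these settings; per-graph updates without batching and the Adam optimizer are our assumptions, with `model` the VGAE sketched in the Methods section and `train_set` a list of PyTorch Geometric Data Molgraphs.

```python
import torch

# Hyperparameters from Experiment 1: 16 latent channels, 400 epochs, lr 0.01 for VGAE.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(400):
    model.train()
    total = 0.0
    for data in train_set:                        # one Molgraph at a time; batching omitted
        optimizer.zero_grad()
        z = model.encode(data.x, data.edge_index)
        # (1/N) scaling of the KL term follows the PyTorch Geometric examples.
        loss = model.recon_loss(z, data.edge_index) + model.kl_loss() / data.num_nodes
        loss.backward()
        optimizer.step()
        total += float(loss)
    print(f"epoch {epoch}: mean reconstruction loss {total / len(train_set):.4f}")
```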

Experiment 2. After analyzing small datasets in Experiment 1, we tried increasing the dataset size to 100K molecules. We restricted these experiments to 20 epochs. We then tried some simple hyperparameter changes which we thought might improve model performance.

We tried using 16 and 32 channels in the latent space. We also tried both keeping 2 graphical convolutional layers and adding a 3rd convolutional layer.

Plotting the reconstruction losses during training, we saw that each model converges in both Experiments 1 and 2. GAE and VGAE take much longer to converge, needing more than 20 epochs, and achieved much better results after the increase in epochs in Experiment 1. Therefore, going forward we will try using both the larger dataset and more epochs when training GAE/VGAE.

While GAE/VGAE had smooth reconstruction loss curves during training, graphVAE steeply decreased its reconstruction loss before plateauing after the second epoch. This same trend appeared in Experiment 2, even when we reduced the learning rate to 1e-4 (Figure 7). This means that we are likely in some way saturating the model and need to try extensive hyperparameter tuning (and potentially debugging) to improve the outcomes of this model.

Autoencoding a variety of standard organic molecules using the trained models, we found that the model that worked best was VGAE trained for 400 epochs on the up-to-4 and up-to-6 carbon datasets (see Figure 8). (In fact, GAE worked slightly better than VGAE with the same dataset and training parameters, but since we care about finding continuous representations of molecules, we really wish to compare VGAE and the adapted graphVAE.) We can effectively reconstruct some of the simplest molecules such as water and methane. We noticed that almost always when the graph outputs are incorrect, there is a tendency for the outputs to be overly connected graphs, and we want to delve more into why this is the case and into possible decoding or thresholding techniques that might help to alleviate this. This also explains why the average graph edit distance remains so high in Figs. 5 and 6: if graphs are very highly connected, like benzene in Figure 8, then the graph edit distance of arbitrary test molecules is likely to remain high.

We did not see a significant improvement in outcomes by increasing the number of channels or adding a convolutional layer. However, we would like to try this experiment again on VGAE with more epochs of training to come to a more informed conclusion on the matter.

IV. OUTCOMES AND NEXT STEPS

We come to the following main conclusions and outcomes towards the development of a Molgraph VAE. VGAE appears to be more flexible and to have significantly better Molgraph reconstruction outcomes (even when not fully converged) than the adapted graphVAE. We will consider adding node and edge attribute features to this model to explore further development.

Positively, all models are learning during training, in that their reconstruction loss is decreasing. To optimize, we will conduct a comprehensive hyperparameter search to determine optimal learning rates and explore adding more convolutional layers. In particular, we will investigate why graphVAE saturates after 1-2 epochs, which may lead to significantly improved results.

All models tend to over-draw edges and require high thresholds on the output adjacency matrix to recover 'molecule-like' outputs. We should explore why this occurs and which thresholding techniques make the most sense with these models.

We hypothesize that adding edge weights containing bond order information may drastically help the model learn. For instance, bond order information tells the model whether a carbon should connect to 2, 3 or 4 other atoms. Therefore, we will also try training with bond orders.
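A sketch of how bond orders could be attached as edge attributes using RDKit's bond perception; this is one possible extension under our assumptions, not code from this project.

```python
import networkx as nx
from rdkit import Chem

def smiles_to_bond_order_graph(smiles: str) -> nx.Graph:
    """Proposed extension: each edge carries a bond-order attribute (1.0, 1.5, 2.0, 3.0)."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    g = nx.Graph()
    for atom in mol.GetAtoms():
        g.add_node(atom.GetIdx(), atomic_num=atom.GetAtomicNum())
    for bond in mol.GetBonds():
        g.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(),
                   bond_order=bond.GetBondTypeAsDouble())
    return g
```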

REFERENCES

[1] Wang, L., Titov, A., McGibbon, R., Liu, F., Pande, V., Martínez, T. (2014). Discovering chemistry with an ab initio nanoreactor. Nature Chemistry, 6(12), 1044-1048. doi:10.1038/nchem.2099.

[2] Gomez-Bombarelli, R. et al. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science, 4(2), pp. 268-276.

[3] https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
[4] Woodward, A. (2019). Machine Learning on Chemical Reaction Networks: Summer Research Update. (My summer research last year.)

[5] Kipf and Welling. (2016). Variational Graph Auto-Encoders. arXiv:1611.07308v1.

[6] Simonovsky and Komodakis. (2018). GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders. arXiv:1802.03480v1.

[7] Reymond Research Group. (2007). GDB Database. University of Bern. http://gdb.unibe.ch/

[8] Peng, R. (2015). Hungarian Algorithm. https://www.cc.gatech.edu/ rpeng/ 8434S15 hungarianAlgorithm.pdf

V. RESOURCES

Deep Learning. We used PyTorch and the PyTorch Geometric library, as well as our own implementations of network architectures, to process Molgraph data, implement GAE and VGAE, and build the adapted graphVAE from scratch.