Supporting Information
A Universal Deep Learning Framework based on Graph Neural Network for Virtual Co-Crystal Screening
Yuanyuan Jiang a, Jiali Guo a, Yijing Liu b, Yanzhi Guo a, Menglong Li a, Xuemei Pu a,*
a College of Chemistry, Sichuan University, Chengdu, 610064
b College of Computer Science, Sichuan University, Chengdu, 610064
* Corresponding Author
Xuemei Pu ([email protected])
1. Construction of several machine learning models as controls
The DNN constructed in this work contains six fully-connected layers, as shown in Figure S1. Except for the final output layer, batch normalization1 and ReLU2 are applied in each layer.
Figure S1. Architecture of DNN.
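As an illustration, the following is a minimal PyTorch sketch of this control, not the authors' implementation; the hidden-layer widths are hypothetical, and only the layer pattern (Linear, batch normalization, ReLU, with a plain Linear output layer) follows the text.

```python
import torch.nn as nn

class DNN(nn.Module):
    """Six fully-connected layers; BN + ReLU everywhere except the output."""
    def __init__(self, in_dim, hidden_dims=(512, 256, 128, 64, 32), n_classes=2):
        super().__init__()
        layers, prev = [], in_dim
        for h in hidden_dims:  # hypothetical widths, for illustration only
            layers += [nn.Linear(prev, h), nn.BatchNorm1d(h), nn.ReLU()]
            prev = h
        layers.append(nn.Linear(prev, n_classes))  # final layer: no BN/ReLU
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```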
NCI1 is a spatial-based graph convolutional network from Such et al3, whose key components are three Graph-CNN layers and two Graph Embedding Pooling (GEP) layers, as depicted on the left of Figure S2. These two kinds of layers perform the message passing phase, and the readout phase is a flattening operation. Details of the Graph-CNN are given in Methods. Here we mainly introduce the GEP layer, shown on the right of Figure S2.
Figure S2. Architecture of NCI1.
Like pooling layers in conventional CNNs, the GEP layer reduces the dimensions of the input, which eliminates redundant information and lowers the computational cost. GEP transforms a graph with $N$ nodes into a graph with a given number $N'$ of nodes. For this purpose, an embedding matrix $\mathbf{X}_{emb} \in \mathbb{R}^{N \times N'}$ is produced by a filter tensor $\mathbf{H}_{emb} \in \mathbb{R}^{N \times N \times C \times N'}$. The calculation of $\mathbf{X}_{emb}$ is similar to the multiple-filter Graph-CNN (vide Methods), where the learnable filter $\mathbf{H}_{emb}$ is multiplied by the node features $\mathbf{X}_{in}$, as defined by equations (S1)-(S2):
$$\mathbf{X}_{emb}^{(n')} = \sum_{c=1}^{C} \mathbf{H}_{emb}^{(c,n')} \mathbf{X}_{in}^{(c)} + b \qquad (S1)$$
$$\mathbf{X}_{emb} = \mathrm{softmax}\left(\mathrm{GConv}_{emb}(\mathbf{X}_{in}, N') + \mathbf{b}\right) \qquad (S2)$$
where $\mathbf{H}_{emb}^{(c,n')} \in \mathbb{R}^{N \times N}$ is a slice of $\mathbf{H}_{emb}$ and $\mathbf{X}_{emb}^{(n')}$ is a column of $\mathbf{X}_{emb} \in \mathbb{R}^{N \times N'}$. The pooled graph data are then calculated by equations (S3)-(S4):
$$\mathbf{X}_{out} = \mathbf{X}_{emb}^{T} \mathbf{X}_{in} \qquad (S3)$$
$$\mathbf{A}_{out} = \mathbf{X}_{emb}^{T} \mathbf{A}_{in} \mathbf{X}_{emb} \qquad (S4)$$
where $\mathbf{A}_{in} \in \mathbb{R}^{N \times N}$ is the adjacency matrix, $\mathbf{A}_{out} \in \mathbb{R}^{N' \times N'}$ is the pooled adjacency matrix, and $\mathbf{X}_{out} \in \mathbb{R}^{N' \times C}$ is the pooled node feature matrix. Finally, GEP produces a pooled graph described by $\mathbf{A}_{out}$ and $\mathbf{X}_{out}$.
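A minimal PyTorch sketch of a GEP layer per equations (S1)-(S4) follows; it is not the authors' implementation. The embedding convolution is simplified here to a single linear filter propagated over a dense adjacency matrix, and the softmax dimension (over the $N'$ assignment scores) is an assumption.

```python
import torch
import torch.nn as nn

class GraphEmbeddingPooling(nn.Module):
    """Pools a graph with N nodes down to N' nodes per equations (S1)-(S4)."""
    def __init__(self, in_channels, n_out_nodes):
        super().__init__()
        # Simplified stand-in for GConv_emb: one linear filter mapping the C
        # node-feature channels to N' assignment scores (cf. equation (S1)).
        self.gconv_emb = nn.Linear(in_channels, n_out_nodes)

    def forward(self, X_in, A_in):
        # X_in: (N, C) node features; A_in: (N, N) dense adjacency matrix.
        # (S2): X_emb = softmax(GConv_emb(X_in, N') + b); softmax over the
        # N' dimension so each node distributes over the pooled nodes.
        X_emb = torch.softmax(A_in @ self.gconv_emb(X_in), dim=-1)  # (N, N')
        X_out = X_emb.T @ X_in           # (S3): pooled node features, (N', C)
        A_out = X_emb.T @ A_in @ X_emb   # (S4): pooled adjacency, (N', N')
        return X_out, A_out
```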
enn-s2s, proposed by Gilmer et al4, has two phases (a message passing phase and a readout phase), as shown in Figure S3. In Gilmer's work, enn-s2s is a regression model. Here, in order to extend its application to the classification of cocrystal formation, we modified the architecture of enn-s2s by changing the dimension of the output layer. The message passing phase includes two functions, i.e., a message passing function and an update function. The message passing function propagates node features, as expressed by equation (S5):
$$\mathbf{x}_i^{t} = \mathbf{W}\mathbf{x}_i^{t-1} + \sum_{j \in \mathcal{N}(i)} \mathbf{x}_j^{t-1} \cdot \mathrm{MLP}(\mathbf{e}_{i,j}) \qquad (S5)$$
where $\mathbf{x}_i^{t}$ is the feature of node $i$ at the $t$-th time step, $\mathbf{W}$ is a trainable weight matrix, $\mathcal{N}(i)$ is the set of nodes adjacent to node $i$, $\mathbf{e}_{i,j}$ is the feature of the edge between nodes $i$ and $j$, and MLP is a multi-layer perceptron.
Figure S3. Architecture of enn-s2s.
The update function used to update the node features is the Gated Recurrent Unit (GRU)5, as described by equation (S6):
$$\mathbf{h}_i^{t} = \mathrm{GRU}(\mathbf{h}_i^{t-1}, \mathbf{x}_i^{t}) \qquad (S6)$$
where $\mathbf{h}_i^{t}$ is the hidden state of node $i$ at the $t$-th time step.
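A minimal PyTorch sketch of one such message passing step follows; the dense adjacency representation and the edge-MLP width are illustrative assumptions, and the "$\cdot$" in (S5) is read here as elementwise multiplication.

```python
import torch
import torch.nn as nn

class EnnS2SMessagePassing(nn.Module):
    """One message passing step per equations (S5)-(S6); a sketch only."""
    def __init__(self, node_dim, edge_dim, hidden_dim=64):
        super().__init__()
        self.W = nn.Linear(node_dim, node_dim, bias=False)  # W in (S5)
        self.edge_mlp = nn.Sequential(                      # MLP(e_ij) in (S5)
            nn.Linear(edge_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, node_dim))
        self.gru = nn.GRUCell(node_dim, node_dim)           # GRU in (S6)

    def forward(self, h_prev, x_prev, E, adj_mask):
        # x_prev: (N, D) node features at step t-1; h_prev: (N, D) hidden
        # states; E: (N, N, edge_dim) edge features; adj_mask: (N, N) 0/1
        # matrix encoding the neighborhoods N(i).
        gate = self.edge_mlp(E) * adj_mask.unsqueeze(-1)    # zero non-neighbors
        msg = torch.einsum('ijd,jd->id', gate, x_prev)      # sum_j x_j * MLP(e_ij)
        x_t = self.W(x_prev) + msg                          # (S5)
        h_t = self.gru(x_t, h_prev)                         # (S6)
        return h_t, x_t
```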
For the readout phase, enn-s2s computes a feature vector for the whole graph based on the iterative content-based attention of Vinyals et al6 (vide equations (S7)-(S10)):
$$\mathbf{q}_t = \mathrm{LSTM}(\mathbf{q}_{t-1}^{*}) \qquad (S7)$$
$$\alpha_{i,t} = \frac{\exp(\mathbf{x}_i \cdot \mathbf{q}_t)}{\sum_{j \in \mathbf{G}} \exp(\mathbf{x}_j \cdot \mathbf{q}_t)} \qquad (S8)$$
$$\mathbf{r}_t = \sum_{i=1}^{N} \alpha_{i,t}\,\mathbf{x}_i \qquad (S9)$$
$$\mathbf{q}_t^{*} = \mathbf{q}_t \parallel \mathbf{r}_t \qquad (S10)$$
where $i$ indexes the node feature vectors $\mathbf{x}_i$, $\mathbf{q}_t$ is a query vector which allows us to read $\mathbf{r}_t$ from the memories at the $t$-th time step, $\alpha_{i,t}$ is the attention coefficient of node $i$ at the $t$-th time step, and LSTM is a Long Short-Term Memory network7 that computes a recurrent state. $\mathbf{G}$ is the graph to which nodes $i$ and $j$ belong, $N$ is the number of nodes in graph $\mathbf{G}$, and $\parallel$ denotes concatenation. $t$ is the step index, i.e., the number of times the state has been computed; its maximum is 3 in this work. After the three steps, $\mathbf{q}_t^{*}$ is the feature vector for the whole graph, which is fed to a classifier consisting of two dense layers.
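A minimal PyTorch sketch of this readout under equations (S7)-(S10) follows; the zero initialization of $\mathbf{q}^{*}$ and the use of the LSTM hidden state as $\mathbf{q}_t$ are assumptions.

```python
import torch
import torch.nn as nn

class Set2SetReadout(nn.Module):
    """Iterative content-based attention readout, equations (S7)-(S10)."""
    def __init__(self, node_dim, T=3):
        super().__init__()
        self.T = T  # maximum of t is 3 in this work
        # The LSTM consumes q* = q || r, so its input is 2 * node_dim wide.
        self.lstm = nn.LSTMCell(2 * node_dim, node_dim)

    def forward(self, X):
        # X: (N, D) final node features x_i of one graph G.
        N, D = X.shape
        q_star = X.new_zeros(1, 2 * D)
        h, c = X.new_zeros(1, D), X.new_zeros(1, D)
        for _ in range(self.T):
            h, c = self.lstm(q_star, (h, c))                # (S7): q_t = LSTM(q*_{t-1})
            q = h
            alpha = torch.softmax(X @ q.squeeze(0), dim=0)  # (S8): attention weights
            r = (alpha.unsqueeze(-1) * X).sum(0, keepdim=True)  # (S9): r_t
            q_star = torch.cat([q, r], dim=-1)              # (S10): q* = q || r
        return q_star  # graph-level feature vector fed to the classifier
```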
CCGNet-simple is proposed in this work to examine the impact of different feature integration operations. Its message passing phase consists of three Graph-CNN layers (vide Methods), and its readout function is multi-head global attention (vide Methods) with 10 heads. After the global attention, the global state U is fused into the graph embedding, as sketched after Figure S4.
Figure S4. Architecture of CCGNet-simple.
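The following is a minimal sketch of such a readout; the attention parameterization and the fusion of U by concatenation are illustrative assumptions, and the authors' multi-head global attention is described in Methods.

```python
import torch
import torch.nn as nn

class GlobalAttentionReadout(nn.Module):
    """Multi-head global attention over nodes, then fusion of global state U."""
    def __init__(self, node_dim, heads=10):
        super().__init__()
        self.score = nn.Linear(node_dim, heads)  # one attention score per head

    def forward(self, X, U):
        # X: (N, D) node features of a sample; U: (u_dim,) its global state.
        alpha = torch.softmax(self.score(X), dim=0)        # (N, heads), over nodes
        g = torch.einsum('nh,nd->hd', alpha, X).flatten()  # (heads * D,) embedding
        return torch.cat([g, U], dim=0)                    # fuse U by concatenation
```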
2. More examples of the attention visualization
Figure S5. Attention Visualization of BAFGEX.
Figure S6. Attention Visualization of VIHKUU.
Figure S9. Attention Visualization of MAQZEK.
Table S3. Solvents involved in collecting cocrystal positive samples from the Cambridge Structural Database.
Toluene 4-Chlorotoluene diglyme
DMSO-d6 1,3,5-trichlorobenzene iodobenzene
trichloromethane-d gamma-Butyrolactone 1,1,2-trichloroethane
ethoxyethane DL-sec-Butyl acetate formic acid
methylamine iodomethane dimethyl sulfoxide
p-Xylene methanamide 3-methyl-1-butanol
1-butanol Tetrahydrofuran bromobenzene
cyclohexanone chlorobenzene dimethoxymethane
1H-pyrrole Ethyl formate 2-butanone
2-butanol isobutanol N-Ethylmorpholine
1,1,2,2-tetrachloroethane N,N,N',N'-Tetramethylethylenediamine propan-2-ol
1,4-dioxane Ethanol 2-methyl-2-propanol
2-methylpyridine 3-methylpyridine 2-butoxyethanol
diethylenetriamine 2-methoxyethanol dibromomethane
1-methyl-2-pyrrolidone N,N-dimethylacetamide 2,2'-Dichlorodiethyl ether
Methyl acetate cyclopentane benzyl alcohol
benzene hexadecane water-d2
nitromethane hexamethyldisiloxane Hexane
1-Chloro-2-Methylpropane acetic anhydride propanenitrile
acetamide acetic acid Ethylene glycol
Diethylene glycol Isopropyl acetate Isopropyl ether
tetrachloromethane acetone acetophenone
nitrobenzene propionic acid 1,2-Propanediol
pentane 1,1-Dichloroethane butane-1,4-diol
1,3-dimethylbenzene 1,2-dihydrostilbene N,N-diethylethanamine
tribromomethane 2-propoxyethanol 1,2-Dichloroethane
1-propanol water phenylamine
heptane trichloromethane pyridine
cyclohexene cyclohexane Methanol
1,2-dimethoxyethane 3-pentanone fluorobenzene
epichlorohydrin acetonitrile dichloromethane
methanedithione 1-Octanol butanedioic acid
N,N-dimethylformamide 1,2-ethanediamine 2,4-pentanedione
o-Xylene Propylene glycol monomethyl ether acetate 1,3,5-trimethylbenzene
2-phenylacetonitrile 2-Chlorotoluene 1,2-dichlorobenzene
isophorone morpholine nitric acid
quinoline benzonitrile ethyl acetate
benzene-d6
Table S4. Performances of various models with different feature compositions on the validation set of 10-fold cross-validation.
Model PACC (%) NACC (%) BACC (%)
SVM 98.99 (±0.39) 87.55 (±2.72) 93.27 (±1.44)
RF 99.89 (±0.06) 91.00 (±2.70) 95.44 (±1.34)
DNN 99.53 (±0.29) 90.46 (±2.34) 95.00 (±1.07)
NCI1 99.01 (±0.50) 85.96 (±3.56) 92.49 (±1.63)
enn-s2s 98.44 (±0.45) 86.96 (±3.68) 92.70 (±1.76)
CCGNet-simple 99.46 (±0.45) 93.45 (±2.45) 96.46 (±1.05)
CCGNet 99.89 (±0.13) 96.98 (±2.20) 98.43 (±1.12)
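Here PACC and NACC denote the accuracy on the positive and negative samples, respectively, and BACC is consistent with their arithmetic mean (balanced accuracy), e.g. (99.89 + 96.98)/2 ≈ 98.43 for CCGNet. A minimal sketch of these metrics, assuming binary 0/1 labels:

```python
import numpy as np

def pacc_nacc_bacc(y_true, y_pred):
    """PACC/NACC: per-class accuracies; BACC: their mean (balanced accuracy)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pacc = np.mean(y_pred[y_true == 1] == 1)   # accuracy on positive samples
    nacc = np.mean(y_pred[y_true == 0] == 0)   # accuracy on negative samples
    return 100 * pacc, 100 * nacc, 100 * (pacc + nacc) / 2
```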
Table S5. Refcodes of energetic cocrystals collected from the CSD for the out-of-distribution prediction.
ABTNBA01 ABUNIU AJAKOL ANCTNB APANBZ
BIYXAL BIZZAO BNZTNB BZATNB20 CAZTBZ01
CBZTNB CECPEF CEZFOF DIFZOK DUKBOC
DUKBUI DUKCAP ERAFAE FETYAE FONHOH
FONJAV FUFSOQ GEXMAZ GEXMED GEXMIH
GEXMON HECREM HETTIM HETTOS HETTUY
HIVGAW HUZSEA IZUZUZ IZUZUZ01 JABYIX
JABYOD JOCTAZ KIZVAQ KOBFIQ KUMYOI
LOKJIH LUTGUD MAAZNB NIBJUF NIBZAM
NIKLOL NILCET POCVIP POSREV PUBMUU20
PUBWEO PUTWEI PUTWIM PUTWOS PUTWUY
PUTXAF PUTXEJ PVVBFD01 PVVBKP01 PYRTNB
QAPNAZ QARQUY QINLEH QOSRUN QOWBEJ
REDCIM REDCUY REDDAF REDDEJ RENPUV
RULLUF RUYKUR RUYLAY RUYLEC SERZIB
SKTNIB SOQPAQ STINBZ SUGCAY TETTAQ
TIVJUF TOZMUS UGUNAN URIHUZ URIJEL
URIJUB URILAJ USEZID VAZBIJ VIGKIF
VIGKUR VIGLEC WEPGEG WEPTAP WOJWIB
WOJWOH WOJXEY XAHZAH XAJJUQ XEMCID
XIZCER YEDVAH ZEBJOH01 ZEGKIF10 ZEVNUL
ZEZGIW ZEZHET ZILMUF ZOPGOC ZUBNOB
ZUBNUH ZZZAGS10 YOJQOG YOJXIH YOJXON
NILCIX ZEBJOH WOSFOB PEHSUS XAQFUS
ZASWAT ZASWEX ZASWIB GOWHIL ROSMOD
ROSMIX JAQVOP UWUGAW JABYIX MANLEV
BOXTET WUGWAY WIFYAN WIFXUG IDENEM
ZEZGOC ZEZHAP ZEZHOD URIJAH URIJIP
URIMAK URILOX URIKOW URILEN URIKUC
URIKIQ URIJOV URIKEM URIKAI UTEJAG
MEPWIQ FOYSUJ
References
1. Ioffe, S.; Szegedy, C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167 2015.
2. Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.;
Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R., Relational
inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261
2018.
3. Such, F. P.; Sah, S.; Dominguez, M. A.; Pillai, S.; Zhang, C.; Michael, A.; Cahill,
N. D.; Ptucha, R., Robust Spatial Filtering With Graph Convolutional Neural Networks.
IEEE J. Sel. Top. Signal Process. 2017, 11 (6), 884-896.
4. Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; Dahl, G. E., Neural message
passing for quantum chemistry. arXiv preprint arXiv:1704.01212 2017.
5. Cho, K.; Van Merrienboer, B.; Bahdanau, D.; Bengio, Y., On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv preprint arXiv:1409.1259 2014.
6. Vinyals, O.; Bengio, S.; Kudlur, M., Order Matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391 2015.
7. Hochreiter, S.; Schmidhuber, J., Long Short-Term Memory. Neural Computation
1997, 9 (8), 1735-1780.