PMML for QSAR Model Exchange
Rajarshi Guha, Ph.D.
NIH Center for Advancing Translational Sciences
[email protected] / http://rguha.net
Background
• Cheminformatics – QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks
• RNAi screening, high content imaging
• Extensive use of machine learning
• All tied together with software development (GUIs, libraries)
• Contributed pmml.lm to the PMML package
Quantitative Structure Activity Relationships
Why is QSAR Useful?
• Lets us predict whether a chemical is likely to be toxic, avoiding animal testing
• Prioritize molecules from a high throughput screen of 300K molecules
• Predict whether a molecule will be (sufficiently) soluble in water
• Identify molecules with anti-malarial properties
• Accurate, predictive models can save significant time and money (and cute bunnies)
Lots and Lots of Models
• Hundreds of such models published in the literature
– Usually in the form of tables of regression coefficients (if we’re lucky)
– If the paper describes an SVM model, no chance of reproducing the results
• How can we exchange QSAR models?
QSAR Model Exchange
• Build models in …
• Save them in PMML
• Distribute
• …
• Profit?
– Not always
The bottleneck is evaluating descriptors for the new observations to supply to the model
Cheminformatics in R
• rcdk provides cheminformatics support in R
– Load and parse molecular file formats
– Evaluate numerical descriptors from chemical structures
[Figure: R Programming Environment, showing rJava, CDK, Jmol, rcdk, XML, rpubchem and fingerprint]
Cheminformatics in R
library(pmml)
library(rcdk)
data(bpdata)
mols <- parse.smiles(bpdata[, 1])
descNames <- unique(unlist(sapply('topological', get.desc.names)))
descs <- eval.desc(mols, descNames)
model <- lm(BP ~ khs.sCH3 + khs.sF + TopoPSA + VABC, data.frame(bpdata, descs))
pmml(model)
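The descriptor bottleneck shows up as soon as this model is applied to new molecules: the same descriptors have to be evaluated for them before `predict` can run. The following is a minimal sketch, assuming the model is refit exactly as on the slide. The two SMILES strings are hypothetical inputs chosen only for illustration, and the `needed` check is an added sanity step that is not part of the original slides.

```r
library(rcdk)

# Refit the model from the slide so that this block stands alone.
data(bpdata)
mols <- parse.smiles(bpdata[, 1])
descNames <- unique(unlist(sapply("topological", get.desc.names)))
descs <- eval.desc(mols, descNames)
model <- lm(BP ~ khs.sCH3 + khs.sF + TopoPSA + VABC,
            data = data.frame(bpdata, descs))

# Hypothetical new observations (illustrative SMILES only).
newSmiles <- c("CCO", "CCCCCC")
newMols <- parse.smiles(newSmiles)

# The same descriptor classes must be evaluated for the new molecules.
newDescs <- eval.desc(newMols, descNames)

# Added sanity check: every predictor in the model formula must be
# available as a column of the new descriptor table.
needed <- all.vars(delete.response(terms(model)))
stopifnot(all(needed %in% names(newDescs)))

# Predicted boiling points for the new molecules.
preds <- predict(model, newdata = newDescs)
print(preds)
```

A consumer that only receives the PMML document has no `descNames` to reuse, which is why the descriptor step, rather than the model itself, limits exchange.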
R, rcdk, PMML
• rcdk provides the means to take in molecules and output a PMML encoded model
• One could record the appropriate functions/classes in the document and use that information to evaluate descriptors for new observations
• Since rcdk is based on the Java CDK library, one could also use jpmml, a Java API for PMML documents
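One way to record the descriptor classes in the document is the PMML `Extension` element, which carries `extender`, `name` and `value` attributes. The sketch below is an assumption about how this could look, not a convention established by the slides or by the pmml package. The mapping from model terms to CDK class names is written out by hand, and a bare `PMML` node stands in for the result of `pmml(model)`, so the block runs with only the XML package.

```r
library(XML)

# Assumed mapping from model terms to the CDK classes that compute them.
cdkPkg <- "org.openscience.cdk.qsar.descriptors.molecular."
descriptorClasses <- c(
  khs.sCH3 = paste0(cdkPkg, "KierHallSmartsDescriptor"),
  khs.sF   = paste0(cdkPkg, "KierHallSmartsDescriptor"),
  TopoPSA  = paste0(cdkPkg, "TPSADescriptor"),
  VABC     = paste0(cdkPkg, "VABCDescriptor")
)

# Stand-in for the node returned by pmml(model); a real document would
# already contain Header, DataDictionary and the model element.
doc <- xmlNode("PMML", attrs = c(version = "4.1"))

# Producer side: one Extension element per model term.
exts <- lapply(names(descriptorClasses), function(term) {
  xmlNode("Extension",
          attrs = c(extender = "rcdk",
                    name = term,
                    value = descriptorClasses[[term]]))
})
doc <- addChildren(doc, kids = exts)

# Consumer side: recover the classes to evaluate for new observations.
extNodes <- Filter(function(n) xmlName(n) == "Extension", xmlChildren(doc))
recordedClasses <- unique(unname(
  sapply(extNodes, function(n) xmlGetAttr(n, "value"))
))
print(recordedClasses)
```

The recovered class names could then be passed to `eval.desc` in R, or resolved directly against the CDK when the document is consumed from Java.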