Mining Big Datasets to Create and Validate Machine Learning Models
Alex M. Clark1* and Sean Ekins2,3,4*
1 Molecular Materials Informatics, 1900 St. Jacques #302, Montreal H3J 2S1, Quebec, Canada
2 Collaborations Pharmaceuticals, Inc., 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA
3 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010, USA
4 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA
The Prevailing Modeling Paradigm
• Generate own dataset from own wet experiment
• Work with a collaborator to get experimental data
• Go out and mine literature for data
– Curate; issues with intra-lab variability, data quality
• Mine databases
– Issues with chemistry quality, errors in data
Just a matter of scale?
Drug Discovery’s definition of Big data
Everyone else’s definition of Big data
• Data Sources
• PubChem
• ChEMBL
• ToxCast: over 1800 molecules tested against over 800 endpoints
Where can we get the datasets
Mining for gold
Melting point and solubility
• But no structures!
Open source – but much smaller
400 diverse, drug-like molecules active against neglected diseases
400 compounds selected from around 20,000 hits generated by a screening campaign of ~ four million compounds from the libraries of St. Jude Children's Research Hospital, TN, USA, Novartis and GSK.
Many screens completed
Bigger datasets and model collections
• Profiling “big datasets” is going to be the norm.
• A recent study mined PubChem datasets for compounds that have rat in vivo acute toxicity data
• This could be used in other big data initiatives like ToxCast (> 1000 compounds x 800 assays) and Tox21 etc.
• Kinase screening data (1000s mols x 100s assays)
• GPCR datasets etc (1000s mols x 100s assays)
Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal Toxicants by Automatically Mining Public Bioassay Data: A Big Data Approach for Computational Toxicology. PLoS ONE 9(6): e99863. doi:10.1371/journal.pone.0099863
‘Bigger’ and not ‘Big’
[Figure: numeric labels recovered without their captions: (220463), (102633), (23797), (346893); (2273), (1783), (1248), (5304); (218640), (102634), (23737), (345011); 1771924]
Are bigger models better for tuberculosis?
Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)
No relationship between internal or external ROC and the number of molecules in the training set?
PCA of combined data and ARRA (red)
Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)
Internal and leave-out 50% × 100 ROC track each other
External ROC shows less correlation
Smaller models do just as well with external testing
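The ROC values compared on these slides are areas under the ROC curve. As a reminder of what that number measures, here is a minimal sketch that computes it by pairwise comparison of scored actives and inactives. The function name and the toy scores are hypothetical and are not taken from the cited study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve: the fraction of (active, inactive) pairs
    in which the active is scored higher; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): two actives and two inactives.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic, so a rank-based formula would be preferable for large test sets.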
~350,000
Models reside in papers
Not accessible… this is undesirable
How do we share them?
How do we use them?
Open Extended Connectivity Fingerprints
ECFP_6 FCFP_6
• Collected, deduplicated, hashed
• Sparse integers
• Invented for Pipeline Pilot: public method, proprietary details
• Often used with Bayesian models: many published papers
• Built a new implementation: open source, Java, CDK
– stable: fingerprints don't change with each new toolkit release
– well defined: easy to document precise steps
– easy to port: already migrated to iOS (Objective-C) for TB Mobile app
• Provides core basis feature for CDD open source model service
Clark et al., J Cheminform 6:38 2014
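The "collected, deduplicated, hashed" idea can be illustrated with a toy circular fingerprint. This is only a sketch under simplifying assumptions: atoms are described by element symbol and degree alone, duplicates are removed by hash value only, and CRC32 stands in for the hash function. It is not the CDK implementation, and the function names are hypothetical.

```python
import zlib


def _hash(values):
    """Deterministic 32-bit hash of a tuple (CRC32 of its repr)."""
    return zlib.crc32(repr(values).encode("utf-8"))


def circular_fingerprint(atoms, bonds, iterations=3):
    """Return sorted, deduplicated integer identifiers of atom environments.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Three iterations correspond to a diameter of 6, as in the "_6" suffix.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    # Iteration 0: each atom is identified by its own properties only.
    ids = [_hash((sym, len(neighbors[i]))) for i, sym in enumerate(atoms)]
    collected = set(ids)
    for level in range(1, iterations + 1):
        # Each atom absorbs the sorted identifiers of its neighbours.
        ids = [
            _hash((level, ids[i], tuple(sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        collected.update(ids)
    return sorted(collected)


# Heavy atoms of ethanol, C-C-O (toy input).
ethanol = circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

Because neighbour identifiers are sorted before hashing, the result does not depend on atom ordering, and because the hash is fixed, the same structure always yields the same sparse integers.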
Uses Bayesian algorithm and FCFP_6 fingerprints
Bayesian models
Clark et al., J Cheminform 6:38 2014
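For readers unfamiliar with Bayesian models over fingerprints, the sketch below shows one widely used formulation, a Laplacian-corrected naive Bayes estimator. That this matches the model service in every detail is an assumption; the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def train_bayesian(fingerprints, actives):
    """Return a per-feature contribution table.

    fingerprints: list of iterables of integer features.
    actives: list of booleans, one per molecule.
    Contribution = log((A + 1) / (T * P + 1)), where A is the number of
    actives containing the feature, T the number of molecules containing
    it, and P the overall fraction of actives.
    """
    total = Counter()
    hits = Counter()
    for fp, active in zip(fingerprints, actives):
        for bit in set(fp):
            total[bit] += 1
            if active:
                hits[bit] += 1
    base_rate = sum(1 for a in actives if a) / len(actives)
    return {
        bit: math.log((hits[bit] + 1.0) / (total[bit] * base_rate + 1.0))
        for bit in total
    }


def predict(model, fp):
    """Sum the contributions of known features; unseen features add 0."""
    return sum(model.get(bit, 0.0) for bit in set(fp))


# Toy training set (made up): features 1-3 occur in actives, 4-6 in inactives.
fps = [[1, 2], [1, 3], [4, 5], [4, 6]]
labels = [True, True, False, False]
model = train_bayesian(fps, labels)
```

The raw score is unscaled: positive values indicate enrichment among actives, negative values the opposite. The model is just a table of feature contributions, which is why it stays compact.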
Exporting models from CDD
Clark et al., JCIM 55: 1231-1245 (2015)
What if the models were already built for you
• Instead of having to go into a database and find data
• The models are already prebuilt
• Ready to use
• Shareable
• Create a repository of models
Previous work by others
• Using large datasets to predict targets with Bayesian algorithm
• Bayesian classifier - 698 target models (> 200,000 molecules, 561,000 measurements) Paolini et al 2006
• 246 targets (65,241 molecules) Similarity ensemble analysis Keiser et al 2007
• 2000 targets (167,000 molecules) target identification from zebrafish screen Laggner et al 2012
• 70 targets (100,269 data points) Bender et al 2007
• Many others…
• None of these enables you to predict activity for a single target, either qualitatively or quantitatively.
Recent Studies
• Bit folding – trade off between performance & efficacy
• Model cut-off selection for cross validation
• Scalability of ECFP6 and FCFP6 using ChEMBL 20 mid size datasets
• CDK codebase on Github (http://github.com/cdk/cdk: look for class org.openscience.cdk.fingerprint.model.Bayesian)
• Made the models accessible http://molsync.com/bayesian2
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
What do 2000 ChEMBL models look like
[Figure: average ROC versus folding bit size]
http://molsync.com/bayesian2
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
ChEMBL 20
• Skipped targets with > 100,000 assays and sets with < 100 measurements
• Converted data to –log
• Dealt with duplicates
• 2152 datasets
• Cutoff determination
• Balance active/ inactive ratio
• Favor structural diversity and activity distribution
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
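The data preparation steps listed above (convert to –log, deal with duplicates, apply a cutoff) might look like the following sketch. The slide does not say how duplicates were handled, so averaging is an assumption here, as are the unit table, the function names, the fixed cutoff of 6.0 (1 µM) and the toy records.

```python
import math
from statistics import mean

# Conversion factors to mol/L (the unit spellings are hypothetical).
TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def minus_log(value, unit="nM"):
    """Convert a concentration to -log10 of its molar value."""
    if value <= 0:
        raise ValueError("activity must be positive")
    return -math.log10(value * TO_MOLAR[unit])


def merge_duplicates(records):
    """Collapse repeated measurements per molecule.

    records: iterable of (molecule_id, minus_log_value).
    Taking the mean is an assumption, not the documented procedure.
    """
    grouped = {}
    for mol_id, value in records:
        grouped.setdefault(mol_id, []).append(value)
    return {mol_id: mean(values) for mol_id, values in grouped.items()}


def is_active(minus_log_value, cutoff):
    """A molecule is active when its -log value reaches the cutoff."""
    return minus_log_value >= cutoff


# Toy records (made up): mol-1 was measured twice.
records = [
    ("mol-1", minus_log(100, "nM")),
    ("mol-1", minus_log(1, "uM")),
    ("mol-2", minus_log(10, "uM")),
]
merged = merge_duplicates(records)
labels = {m: is_active(v, cutoff=6.0) for m, v in merged.items()}
```

On the –log scale a larger number means a more potent compound, so the active class is everything at or above the cutoff.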
Desirability score
• ROC integral for a model built using a subset of molecules and the threshold for partitioning active / inactive (higher is better)
• Second derivative of population interpolated from the current threshold (lower is better)
• Ratio of actives to inactives if the collection is partitioned at the threshold: (actives+1) / (inactives+1) or its reciprocal, whichever is greater
Ekins et al Drug Metab Dispos In Press 2015
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
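Of the three desirability terms, the active/inactive ratio is simple enough to sketch directly. The example below covers only that term; the ROC and second-derivative terms, and the way the three are combined, are not reproduced. Function names, candidate thresholds and activity values are hypothetical.

```python
def balance_ratio(actives, inactives):
    """(actives+1)/(inactives+1) or its reciprocal, whichever is greater.

    The result is always >= 1; a value of 1 means a perfectly balanced split.
    """
    ratio = (actives + 1) / (inactives + 1)
    return max(ratio, 1 / ratio)


def ratios_by_threshold(values, thresholds):
    """Balance ratio for each candidate threshold on -log activity values."""
    result = {}
    for t in thresholds:
        n_active = sum(1 for v in values if v >= t)
        result[t] = balance_ratio(n_active, len(values) - n_active)
    return result


# Toy -log activities (made up) and three candidate cutoffs.
values = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ratios = ratios_by_threshold(values, [5.5, 6.5, 8.5])
```

Adding 1 to both counts keeps the ratio finite when a threshold leaves one class empty.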
Models from ChEMBL data
http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
Results
• Bit folding – plateau at 4096, can use 1024 with little degradation
• Cut off – works well
• Evaluated balanced training:test splits and “diabolical” splits, where test and training sets are structurally different
• Easy ROC 0.83 ± 0.11; Hard ROC 0.39 ± 0.23
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
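Bit folding maps the sparse integer fingerprint hashes into a fixed number of slots, trading some collisions for a smaller feature space. A minimal sketch, assuming folding is done by masking the low-order bits of each hash (the function name and example hashes are hypothetical):

```python
def fold(hashes, size=1024):
    """Fold integer hashes into `size` slots (size must be a power of two)."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    # Masking keeps the low-order bits, equivalent to h % size here.
    return sorted({h & (size - 1) for h in hashes})


# Toy hashes (made up): 5 and 1029 collide at 1024 slots but not at 4096.
hashes = [5, 1029, 4100]
folded_1024 = fold(hashes, 1024)
folded_4096 = fold(hashes, 4096)
```

With 1024 slots two of the three features merge into one; with 4096 they stay distinct. Collisions of this kind are the cost of a smaller folding size.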
Models in mobile app
• Added atom coloring using ECFP6 fingerprints
• Red and green indicate high and low probability of activity, respectively
Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
ToxCast data
• Few studies use the ToxCast data for machine learning
• Recent reviews Sipes et al., Chem Res Toxicol. 2013 Jun 17; 26(6): 878–895.
• Liu et al., Chem Res Toxicol. 2015 Apr 20;28(4):738-51
• A set of 677 chemicals was represented by 711 in vitro bioactivity descriptors (from ToxCast assays), 4,376 chemical structure descriptors (from QikProp, OpenBabel, PaDEL, and PubChem), and three hepatotoxicity categories (from animal studies)
• Six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector machines (SVM), classification and regression trees (CART), k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB)
• nuclear receptor activation and mitochondrial functions were frequently found in highly predictive classifiers of hepatotoxicity
• CART, ENSMB, and SVM classifiers performed the best
CDD Models for human P450s (NVS data) from ToxCast (n=1787) <1uM cutoff
CYP1A1 CYP1A2 CYP2B6 CYP2C18
CYP2C19 CYP2C9 CYP3A4 CYP3A5
ToxCast models in a mobile app
IC50 1A2 = 2.25 uM
IC50 2C9 = 3.55 uM
IC50 2C19 = 10.8 uM
In vitro data
Courtesy Dr. Joel Freundlich
Future work
• Extend approach further to bigger ADME datasets and others in ChEMBL
• PubChem
• Overlap with ChEMBL
• ToxCast
• Smaller dataset 1800 molecules and 800 assays
• Overlap of targets with PubChem and ChEMBL?
• Create a tox prediction system with 800 models?
• Validation of models
• Challenge: how do you do this on a large scale when you have 100s to 1000s of models?
• How will algorithms handle the big datasets 500K to 1M compounds?
• Bigger data possibly not needed for good models
• Mobile apps have useful cheminformatics features - aid anyone to do drug discovery
• Models are compact < 1MB and portable
• The age of model sharing is here
Wanted
• “Bigger” small molecule screening datasets
• Preferably > 500,000 – 1,000,000 molecules with data
• To test how machine learning algorithms scale
• Contact [email protected]
Krishna Dole and all at CDD, and many others…
Funding: Bill and Melinda Gates Foundation (Grant #49852); 9R44TR000942-02 NIH NIAD
Software: Biovia
Joel Freundlich
Alex Clark
Robert Reynolds
Antony Williams