Mining Big Datasets to Create and Validate Machine Learning Models

Alex M. Clark1* and Sean Ekins2,3,4*

1 Molecular Materials Informatics, 1900 St. Jacques #302, Montreal H3J 2S1, Quebec, Canada

2 Collaborations Pharmaceuticals, Inc., 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA

3 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010, USA

4 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA

http://www.collaborationspharma.com/


The Prevailing Modeling Paradigm

• Generate own dataset from own wet experiment

• Work with a collaborator to get experimental data

• Go out and mine literature for data

• Curate: issues with intra-lab variability, data quality

• Mine databases

• Issues with chemistry quality, errors in data

Just a matter of scale?

Drug Discovery’s definition of Big data

Everyone else’s definition of Big data

• Data Sources

• PubChem

• ChEMBL

• ToxCast over 1800 molecules tested against over 800 endpoints

Where can we get the datasets

Mining for gold

Melting point and solubility

• But no structures!

Open source – but much smaller

400 diverse, drug-like molecules active against neglected diseases

400 compounds selected from around 20,000 hits generated by a screening campaign of ~four million compounds from the libraries of St. Jude Children's Research Hospital (TN, USA), Novartis and GSK.

Many screens completed

Bigger datasets and model collections

• Profiling “big datasets” is going to be the norm.

• A recent study mined PubChem datasets for compounds that have rat in vivo acute toxicity data

• This could be used in other big data initiatives like ToxCast (> 1000 compounds x 800 assays) and Tox21 etc.

• Kinase screening data (1000s mols x 100s assays)

• GPCR datasets etc (1000s mols x 100s assays)

Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal Toxicants by Automatically Mining Public Bioassay Data: A Big Data Approach for Computational Toxicology. PLoS ONE 9(6): e99863. doi:10.1371/journal.pone.0099863

‘Bigger’ and not ‘Big’

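[Figure: dataset counts; labels not recoverable. In each group the last value is the sum of the first three.]

(220463) (102633) (23797) (346893)

(2273) (1783) (1248) (5304)

(218640) (102634) (23737) (345011)

1771924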

Are bigger models better for tuberculosis?

Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)

No relationship between internal or external ROC and the number of molecules in the training set?

PCA of combined data and ARRA (red)

Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)

Internal and leave-out 50% × 100 ROC values track each other

External ROC is less well correlated

Smaller models do just as well with external testing
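The comparison above rests on ROC values from internal validation and from repeated 50% leave-out validation. As a rough sketch only (not the authors' code), the two calculations might look like this in Python. The function names `roc_auc` and `leave_half_out`, the `train_fn` callback, and the random seed are all hypothetical choices made for illustration.

```python
import random

def roc_auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def leave_half_out(data, train_fn, repeats=100, seed=0):
    """Train on a random 50%, score the held-out 50%, repeat, collect AUCs.

    data: list of (features, label) pairs.
    train_fn: hypothetical callback taking the training half and returning
              a function that maps features to a score.
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        labels = [y for _, y in test]
        if len(set(bool(y) for y in labels)) < 2:
            continue  # held-out half has one class only; AUC is undefined
        score = train_fn(train)
        aucs.append(roc_auc([score(x) for x, _ in test], labels))
    return aucs
```

Averaging the returned list gives one leave-out ROC value per model, which can then be compared with the internal and external ROC values for the same model.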

~350,000

Models reside in papers

Not accessible… this is undesirable

How do we share them?

How do we use them?

Open Extended Connectivity Fingerprints

ECFP_6 FCFP_6

• Collected, deduplicated, hashed

• Sparse integers

• Invented for Pipeline Pilot: public method, proprietary details

• Often used with Bayesian models: many published papers

• Built a new implementation: open source, Java, CDK

– stable: fingerprints don't change with each new toolkit release

– well defined: easy to document precise steps

– easy to port: already migrated to iOS (Objective-C) for TB Mobile app

• Provides core basis feature for CDD open source model service

Clark et al., J Cheminform 6:38 2014
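To make "collected, deduplicated, hashed" and "sparse integers" concrete, here is a toy circular (Morgan-style) fingerprint in Python. It is a deliberately simplified illustration and not the CDK or Pipeline Pilot algorithm: it ignores bond orders, charges, and stereochemistry, and skips the substructure-level deduplication that real ECFP performs. The function names and the use of CRC32 as the hash are assumptions. The default radius of 3 corresponds to the diameter of 6 in the name ECFP_6.

```python
import zlib

def _hash(parts):
    """Deterministic 32-bit hash of a tuple (CRC32 chosen only for illustration)."""
    return zlib.crc32(repr(parts).encode("utf-8"))

def toy_circular_fingerprint(elements, bonds, radius=3):
    """Toy circular fingerprint over a molecular graph.

    elements: list of element symbols, one per atom.
    bonds: list of (i, j) atom index pairs.
    Returns a sorted list of unique integer hashes (a sparse fingerprint).
    """
    neighbours = [[] for _ in elements]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    # Iteration 0: identifiers from each atom's own properties only.
    ids = [_hash((el, len(neighbours[i]))) for i, el in enumerate(elements)]
    collected = set(ids)  # a set deduplicates identical hashes
    for level in range(1, radius + 1):
        # Each atom absorbs its neighbours' identifiers, growing the environment.
        ids = [
            _hash((level, ids[i], tuple(sorted(ids[n] for n in neighbours[i]))))
            for i in range(len(elements))
        ]
        collected.update(ids)
    return sorted(collected)
```

Because neighbour identifiers are sorted before hashing, the result does not depend on the order in which atoms are listed, which is one reason such fingerprints can stay stable across implementations.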

Uses Bayesian algorithm and FCFP_6 fingerprints

Bayesian models

Clark et al., J Cheminform 6:38 2014
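For readers unfamiliar with Bayesian models built on fingerprints, the sketch below shows a Laplacian-corrected naive Bayesian estimator of the kind commonly described in the cheminformatics literature. It is an assumption that this matches what the slide refers to in every detail. The function names are hypothetical, and the raw score it produces is an unscaled sum, not a probability.

```python
import math
from collections import defaultdict

def train_bayesian(fingerprints, actives):
    """Build per-feature log contributions with a Laplacian correction.

    fingerprints: list of sets of integer feature hashes, one set per molecule.
    actives: list of booleans, True where the molecule is active.
    Returns a dict mapping feature hash to its log contribution.
    """
    total = len(fingerprints)
    if total == 0:
        raise ValueError("training set is empty")
    base_rate = sum(1 for a in actives if a) / total
    counts = defaultdict(lambda: [0, 0])  # feature -> [active count, total count]
    for fp, active in zip(fingerprints, actives):
        for feature in fp:
            counts[feature][1] += 1
            if active:
                counts[feature][0] += 1
    # Contribution is log((A + 1) / (T * P + 1)), where A is the number of
    # actives containing the feature, T the number of molecules containing it,
    # and P the overall fraction of actives.
    return {
        feature: math.log((a + 1.0) / (t * base_rate + 1.0))
        for feature, (a, t) in counts.items()
    }

def predict(contribs, fingerprint):
    """Sum the contributions of a molecule's features; unseen features add 0."""
    return sum(contribs.get(feature, 0.0) for feature in fingerprint)
```

Features seen mostly in actives receive positive contributions and features seen mostly in inactives receive negative ones, so a model is just a table of integers and weights, which helps explain why such models are compact and easy to share.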

Exporting models from CDD

Clark et al., JCIM 55: 1231-1245 (2015)

What if the models were already built for you

• Instead of having to go into a database and find data

• The models are already prebuilt

• Ready to use

• Shareable

• Create a repository of models

Previous work by others

• Using large datasets to predict targets with Bayesian algorithm

• Bayesian classifier - 698 target models (> 200,000 molecules, 561,000 measurements) Paolini et al 2006

• 246 targets (65,241 molecules) Similarity ensemble analysis Keiser et al 2007

• 2000 targets (167,000 molecules) target identification from zebrafish screen Laggner et al 2012

• 70 targets (100,269 data points) Bender et al 2007

• Many others…

• None of these enables you to predict activity for a single target, either qualitatively or quantitatively.

Recent Studies

• Bit folding – trade off between performance & efficacy

• Model cut-off selection for cross validation

• Scalability of ECFP6 and FCFP6 using ChEMBL 20 mid size datasets

• CDK codebase on GitHub (http://github.com/cdk/cdk: look for class org.openscience.cdk.fingerprint.model.Bayesian)

• Made the models accessible http://molsync.com/bayesian2

Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60


What do 2000 ChEMBL models look like

[Figure: average ROC as a function of folding bit size]

http://molsync.com/bayesian2

Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60


ChEMBL 20

• Skipped targets with > 100,000 assays and sets with < 100 measurements

• Converted data to –log

• Dealt with duplicates

• 2152 datasets

• Cutoff determination

• Balance active/ inactive ratio

• Favor structural diversity and activity distribution
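Two of the preparation steps above, conversion to –log units and handling of duplicates, can be sketched as follows. The slides do not state the input units or how duplicates were resolved, so this example assumes nanomolar inputs and takes the mean of the converted values per molecule. Both are assumptions, as are the function names.

```python
import math
from collections import defaultdict

def to_p_value(value_nm):
    """Convert a nanomolar activity to -log10 of the molar concentration."""
    if value_nm <= 0:
        raise ValueError("activity must be positive")
    return -math.log10(value_nm * 1e-9)

def merge_duplicates(records):
    """Collapse repeated measurements to one value per molecule.

    records: iterable of (molecule_key, value_nM) pairs.
    Assumption: duplicates are averaged after conversion to -log units.
    """
    grouped = defaultdict(list)
    for key, value_nm in records:
        grouped[key].append(to_p_value(value_nm))
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}
```

For example, 1000 nM (1 µM) converts to 6, so a molecule measured at both 1000 nM and 10 nM would be stored as 7 under the averaging assumption.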


Desirability score

• ROC integral for a model using a subset of molecules and the threshold for partitioning active / inactive (higher is better)

• Second derivative of population interpolated from the current threshold (lower is better)

• Ratio of actives to inactives if the collection is partitioned at the threshold: (actives+1) / (inactives+1) or its reciprocal, whichever is greater
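The third term of the desirability score is simple enough to write out. The sketch below computes only that balance ratio and does not attempt the ROC or second-derivative terms, nor how the three are combined, since the slides do not specify that. It assumes a molecule counts as active when its –log value is at or above the threshold. The function names are hypothetical.

```python
def balance_ratio(values, threshold):
    """(actives+1)/(inactives+1) or its reciprocal, whichever is greater.

    values: activities in -log units.
    Assumption: a value >= threshold is treated as active.
    """
    actives = sum(1 for v in values if v >= threshold)
    inactives = len(values) - actives
    ratio = (actives + 1) / (inactives + 1)
    return max(ratio, 1.0 / ratio)

def most_balanced_threshold(values, candidates):
    """Return the candidate threshold whose balance ratio is closest to 1."""
    return min(candidates, key=lambda t: balance_ratio(values, t))
```

A ratio of 1 means the threshold splits the collection evenly. The +1 terms keep the ratio finite when one side of the split is empty.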

Ekins et al Drug Metab Dispos In Press 2015


Models from ChEMBL data

http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60


Results

• Bit folding – plateau at 4096, can use 1024 with little degradation

• Cut off – works well

• Evaluated balanced training/test splits and “diabolical” splits, where test and training sets are structurally different

Easy ROC 0.83 ± 0.11; Hard ROC 0.39 ± 0.23
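The bit folding result can be pictured with a short sketch: sparse integer hashes are mapped into a fixed number of slots, trading some collisions for a smaller, fixed-size representation. The modulo scheme below is one common way to fold and is an assumption here; the CDK implementation may fold differently. The function name is hypothetical.

```python
def fold_fingerprint(hashes, size=1024):
    """Fold sparse integer hashes into slot indices in the range [0, size).

    Distinct hashes can land in the same slot (a collision), which is the
    source of the performance loss at small sizes.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    # Python's % yields a non-negative result even for negative hashes.
    return sorted({h % size for h in hashes})
```

With `size=1024`, the hashes 5 and 1029 both fold to slot 5, so two distinct features become indistinguishable. Larger sizes such as 4096 make such collisions rarer.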


Models in mobile app

• Added atom coloring using ECFP6 fingerprints

• Red and green indicate high and low probability of activity, respectively


ToxCast data

• Few studies use the ToxCast data for machine learning

• Recent reviews Sipes et al., Chem Res Toxicol. 2013 Jun 17; 26(6): 878–895.

• Liu et al., Chem Res Toxicol. 2015 Apr 20;28(4):738-51

• A set of 677 chemicals was represented by 711 in vitro bioactivity descriptors (from ToxCast assays), 4,376 chemical structure descriptors (from QikProp, OpenBabel, PaDEL, and PubChem), and three hepatotoxicity categories (from animal studies)

• Six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector machines (SVM), classification and regression trees (CART), k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB)

• nuclear receptor activation and mitochondrial functions were frequently found in highly predictive classifiers of hepatotoxicity

• CART, ENSMB, and SVM classifiers performed the best

CDD Models for human P450s (NVS data) from ToxCast (n=1787), < 1 µM cutoff

CYP1A1 CYP1A2 CYP2B6 CYP2C18

CYP2C19 CYP2C9 CYP3A4 CYP3A5

ToxCast models in a mobile app

IC50 1A2 = 2.25 µM

IC50 2C9 = 3.55 µM

IC50 2C19 = 10.8 µM

In vitro data

Courtesy Dr. Joel Freundlich

Future work

• Extend approach further to bigger ADME datasets and others in ChEMBL

• PubChem

• Overlap with ChEMBL

• ToxCast

• Smaller dataset 1800 molecules and 800 assays

• Overlap of targets with PubChem and ChEMBL?

• Create a tox prediction system with 800 models?

• Validation of models

• Challenge: how do you do this on a large scale when you have 100s to 1000s of models?

• How will algorithms handle the big datasets 500K to 1M compounds?

• Bigger data possibly not needed for good models

• Mobile apps have useful cheminformatics features that can help anyone do drug discovery

• Models are compact (< 1 MB) and portable

• The age of model sharing is here

Wanted

• “Bigger” small molecule screening datasets

• Preferably > 500,000 – 1,000,000 molecules with data

• To test how machine learning algorithms scale

• Contact [email protected]

Krishna Dole and all at CDD, and many others…

Funding: Bill and Melinda Gates Foundation (Grant #49852); 9R44TR000942-02 NIH NIAD

Software: Biovia

Joel Freundlich

Alex Clark

Robert Reynolds

Antony Williams

Science
