Mining Big Datasets to Create and Validate Machine Learning Models

Alex M. Clark 1* and Sean Ekins 2,3,4*

1 Molecular Materials Informatics, 1900 St. Jacques #302, Montreal H3J 2S1, Quebec, Canada
2 Collaborations Pharmaceuticals, Inc., 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA
3 Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010, USA
4 Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA


The Prevailing Modeling Paradigm

• Generate your own dataset from your own wet-lab experiments

• Work with a collaborator to get experimental data

• Go out and mine the literature for data
– Curate; issues with intra-lab variability, data quality

• Mine databases
– Issues with chemistry quality, errors in data


Just a matter of scale?

Drug Discovery’s definition of Big data

Everyone else’s definition of Big data


Where can we get the datasets?

• Data sources
– PubChem
– ChEMBL
– ToxCast: over 1800 molecules tested against over 800 endpoints


Mining for gold


Melting point and solubility

• But no structures!


Open source – but much smaller

400 diverse, drug-like molecules active against neglected diseases

400 compounds selected from around 20,000 hits generated by a screening campaign of ~ four million compounds from the libraries of St. Jude Children's Research Hospital, TN, USA, Novartis and GSK.

Many screens completed


Bigger datasets and model collections

• Profiling “big datasets” is going to be the norm.

• A recent study mined PubChem datasets for compounds that have rat in vivo acute toxicity data

• This could be used in other big data initiatives like ToxCast (> 1000 compounds x 800 assays) and Tox21 etc.

• Kinase screening data (1000s mols x 100s assays)

• GPCR datasets etc (1000s mols x 100s assays)

Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal Toxicants by Automatically Mining Public Bioassay Data: A Big Data Approach for Computational Toxicology. PLoS ONE 9(6): e99863. doi:10.1371/journal.pone.0099863


‘Bigger’ and not ‘Big’


Are bigger models better for tuberculosis?

[Figure: diagram of datasets with numeric labels (220463), (102633), (23797), (346893), (2273), (1783), (1248), (5304), (218640), (102634), (23737), (345011) and 1771924]

Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)


No relationship between internal or external ROC and the number of molecules in the training set?

PCA of combined data and ARRA(red)

Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)

Internal ROC and leave-out 50% x 100 ROC track each other

External ROC shows less correlation

Smaller models do just as well with external testing

~350,000
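
The "leave out 50% x 100" validation mentioned on this slide can be sketched roughly as follows: repeatedly hold out a random half of the molecules, train on the other half, and average the ROC area over the repeats. This is a minimal illustration and not the authors' code; the `fit` callback, the function names and the default seed are assumptions made for the example.

```python
import random

def roc_auc(scores, labels):
    """ROC area as the fraction of (active, inactive) pairs ranked correctly; ties count 0.5."""
    actives = [s for s, y in zip(scores, labels) if y]
    inactives = [s for s, y in zip(scores, labels) if not y]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = sum((a > i) + 0.5 * (a == i) for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))

def leave_out_half_roc(molecules, labels, fit, repeats=100, seed=0):
    """Mean ROC area over `repeats` random 50/50 train/test splits.

    `fit(train_molecules, train_labels)` is a hypothetical callback that
    returns a scoring function mapping one molecule to a number.
    """
    rng = random.Random(seed)
    indices = list(range(len(molecules)))
    aucs = []
    for _ in range(repeats):
        rng.shuffle(indices)
        half = len(indices) // 2
        train, test = indices[:half], indices[half:]
        score = fit([molecules[i] for i in train], [labels[i] for i in train])
        try:
            aucs.append(roc_auc([score(molecules[i]) for i in test],
                                [labels[i] for i in test]))
        except ValueError:
            continue  # skip splits whose test half contains a single class
    if not aucs:
        raise ValueError("no usable splits")
    return sum(aucs) / len(aucs)
```

The pairwise ROC calculation is quadratic in the number of molecules, so for training sets in the hundreds of thousands a rank-based calculation would be the more practical choice.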


Models reside in papers

Not accessible… this is undesirable

How do we share them?

How do we use them?


Open Extended Connectivity Fingerprints

ECFP_6 FCFP_6

• Collected, deduplicated, hashed

• Sparse integers

• Invented for Pipeline Pilot: public method, proprietary details

• Often used with Bayesian models: many published papers

• Built a new implementation: open source, Java, CDK
– stable: fingerprints don't change with each new toolkit release
– well defined: easy to document precise steps
– easy to port: already migrated to iOS (Objective-C) for TB Mobile app

• Provides core basis feature for CDD open source model service

Clark et al., J Cheminform 6:38 2014
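
The "collected, deduplicated, hashed" idea behind circular fingerprints can be illustrated with a toy sketch. It is not the CDK or Pipeline Pilot algorithm: it assumes a molecule given as atom labels plus bond index pairs, uses only label and degree as the starting atom identifier, ignores bond orders, and uses CRC32 purely as a convenient deterministic hash. The function name and input format are hypothetical.

```python
import zlib

def circular_fingerprint(atoms, bonds, radius=3):
    """Toy circular fingerprint: returns a sorted list of unique 32-bit hashes.

    atoms: list of atom labels, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs; bond orders are ignored in this sketch
    radius: number of growth iterations (3 corresponds to diameter 6)
    """
    neighbors = [[] for _ in atoms]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def hash_ints(values):
        data = ",".join(str(v) for v in values).encode("ascii")
        return zlib.crc32(data)

    # Iteration 0: one identifier per atom from its label and degree
    current = [hash_ints([zlib.crc32(label.encode("ascii")), len(neighbors[k])])
               for k, label in enumerate(atoms)]
    collected = set(current)
    for level in range(1, radius + 1):
        # Grow each environment by mixing in the neighbors' identifiers
        current = [hash_ints([level, current[k]] + sorted(current[n] for n in neighbors[k]))
                   for k in range(len(atoms))]
        collected.update(current)  # the set deduplicates repeated hashes
    return sorted(collected)
```

Because neighbor identifiers are sorted before hashing, the result does not depend on the order in which atoms are listed, which is the property that makes the output usable as a sparse set of integers.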


Bayesian models

Uses Bayesian algorithm and FCFP_6 fingerprints

Clark et al., J Cheminform 6:38 2014
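
One common formulation used with these fingerprints is the Laplacian-corrected naive Bayesian estimator, in which each fingerprint bit contributes log((A + 1) / (T·P + 1)), where T is the number of training molecules containing the bit, A is the number of those that are active, and P is the overall fraction of actives. The sketch below assumes that formulation; the function names are illustrative and this is not the CDD or CDK code.

```python
import math
from collections import Counter

def train_bayesian(fingerprints, actives):
    """Return {bit: contribution} from fingerprint sets and boolean activity labels."""
    total = Counter()    # T: molecules containing each bit
    active = Counter()   # A: active molecules containing each bit
    for bits, is_active in zip(fingerprints, actives):
        for bit in set(bits):
            total[bit] += 1
            if is_active:
                active[bit] += 1
    base_rate = sum(1 for a in actives if a) / len(actives)  # P
    return {bit: math.log((active[bit] + 1) / (total[bit] * base_rate + 1))
            for bit in total}

def predict(model, bits):
    """Sum the contributions of known bits; unseen bits contribute nothing."""
    return sum(model.get(bit, 0.0) for bit in set(bits))
```

The summed score is a relative ranking value rather than a probability, which is why such models are usually judged by ROC area.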


Exporting models from CDD

Clark et al., JCIM 55: 1231-1245 (2015)


What if the models were already built for you?

• Instead of having to go into a database and find data

• The models are already prebuilt

• Ready to use

• Shareable

• Create a repository of models


Previous work by others

• Using large datasets to predict targets with Bayesian algorithm

• Bayesian classifier - 698 target models (> 200,000 molecules, 561,000 measurements) Paolini et al 2006

• 246 targets (65,241 molecules) Similarity ensemble analysis Keiser et al 2007

• 2000 targets (167,000 molecules) target identification from zebrafish screen Laggner et al 2012

• 70 targets (100,269 data points) Bender et al 2007

• Many others…

• None of these enables you to qualitatively or quantitatively predict activity for a single target.


Recent Studies

• Bit folding – trade off between performance & efficacy

• Model cut-off selection for cross validation

• Scalability of ECFP6 and FCFP6 using ChEMBL 20 mid size datasets

• CDK codebase on GitHub (http://github.com/cdk/cdk: look for class org.openscience.cdk.fingerprint.model.Bayesian)

• Made the models accessible: http://molsync.com/bayesian2

Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
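
Bit folding can be pictured as mapping each sparse 32-bit hash into a fixed number of slots, so that different hashes may collide. The sketch below assumes folding keeps the low-order bits of each hash and that the folded size is a power of two; the slides do not state the exact scheme, so treat both as assumptions.

```python
def fold_fingerprint(hashes, size=1024):
    """Fold sparse hash codes into the range [0, size) and deduplicate."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    return sorted({h & (size - 1) for h in hashes})
```

For example, `fold_fingerprint([5, 1029, 4101], 1024)` gives `[5]` because all three hashes collide, while folding the same hashes to 4096 gives `[5, 1029]`. Smaller sizes save space at the cost of more collisions.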


What do 2000 ChEMBL models look like?

[Figure: axes labeled "Folding bit size" and "Average ROC"]

http://molsync.com/bayesian2

Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60


ChEMBL 20

• Skipped targets with > 100,000 assays and sets with < 100 measurements

• Converted activity data to –log units

• Dealt with duplicates

• 2152 datasets

• Cutoff determination

• Balance active/ inactive ratio

• Favor structural diversity and activity distribution

Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
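
The preparation steps on this slide can be sketched for a single target as follows. Several details are assumptions made for illustration and are not stated on the slide: activities are taken to be in nM, duplicates of a molecule are merged by averaging, and the function name and record layout are hypothetical.

```python
import math
from collections import defaultdict

def prepare_dataset(records, min_measurements=100):
    """records: iterable of (molecule_id, activity_nM) pairs for one target.

    Returns {molecule_id: mean -log10(molar activity)}, or None if the set
    has fewer than `min_measurements` usable measurements.
    """
    values = defaultdict(list)
    count = 0
    for molecule_id, activity_nm in records:
        if activity_nm is None or activity_nm <= 0:
            continue  # cannot take a logarithm; skip the measurement
        values[molecule_id].append(9.0 - math.log10(activity_nm))  # nM -> -log10(M)
        count += 1
    if count < min_measurements:
        return None
    # Assumption: duplicates of a molecule are merged by averaging
    return {mol: sum(v) / len(v) for mol, v in values.items()}
```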


Desirability score

• ROC integral for model using subset of molecules and threshold for partitioning active / inactive (higher is better)

• Second derivative of population interpolated from the current threshold (lower is better)

• Ratio of actives to inactives if the collection is partitioned: (actives+1) / (inactives+1) or its reciprocal, whichever is greater

Ekins et al., Drug Metab Dispos, In Press 2015

Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60
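
The third term of the desirability score is simple enough to write out directly. The sketch below computes only that balance ratio for a set of candidate thresholds; it assumes activities are on a –log scale so that values at or above the threshold count as active, and it does not attempt to show how the three terms are combined, since the slide does not say. Function names are illustrative.

```python
def balance_ratio(actives, inactives):
    """(actives+1)/(inactives+1) or its reciprocal, whichever is greater."""
    ratio = (actives + 1) / (inactives + 1)
    return max(ratio, 1 / ratio)

def ratios_by_threshold(activities, thresholds):
    """Balance ratio at each candidate threshold (activity >= threshold is active)."""
    result = {}
    for t in thresholds:
        n_active = sum(1 for a in activities if a >= t)
        result[t] = balance_ratio(n_active, len(activities) - n_active)
    return result
```

A perfectly balanced partition gives a ratio of 1, and the value grows as the split becomes more lopsided in either direction.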


Models from ChEMBL data

http://molsync.com/bayesian2 Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60


Results

• Bit folding – plateau at 4096, can use 1024 with little degradation

• Cut off – works well

• Evaluated balanced training:test splits and "diabolical" splits, where test and training sets are structurally different
– Easy ROC 0.83 ± 0.11
– Hard ROC 0.39 ± 0.23

Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60


Models in mobile app

• Added atom coloring using ECFP6 fingerprints

• Red and green indicate high and low probability of activity, respectively

Clark and Ekins, J Chem Inf Model. 2015 Jun 22;55(6):1246-60


ToxCast data

• Few studies use the ToxCast data for machine learning

• Recent reviews Sipes et al., Chem Res Toxicol. 2013 Jun 17; 26(6): 878–895.

• Liu et al., Chem Res Toxicol. 2015 Apr 20;28(4):738-51

• A set of 677 chemicals was represented by 711 in vitro bioactivity descriptors (from ToxCast assays), 4,376 chemical structure descriptors (from QikProp, OpenBabel, PaDEL, and PubChem), and three hepatotoxicity categories (from animal studies)

• Six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector machines (SVM), classification and regression trees (CART), k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB)

• Nuclear receptor activation and mitochondrial functions were frequently found in highly predictive classifiers of hepatotoxicity

• CART, ENSMB, and SVM classifiers performed the best


CDD Models for human P450s (NVS data) from ToxCast (n=1787) <1uM cutoff

CYP1A1 CYP1A2 CYP2B6 CYP2C18

CYP2C19 CYP2C9 CYP3A4 CYP3A5


ToxCast models in a mobile app

IC50 1A2 = 2.25 uM
IC50 2C9 = 3.55 uM
IC50 2C19 = 10.8 uM

In vitro data

Courtesy Dr. Joel Freundlich


Future work

• Extend approach further to bigger ADME datasets and others in ChEMBL

• PubChem
– Overlap with ChEMBL

• ToxCast
– Smaller dataset: 1800 molecules and 800 assays
– Overlap of targets with PubChem and ChEMBL?
– Create a tox prediction system with 800 models?

• Validation of models
– Challenge: how do you do this on a large scale when you have 100s-1000s of models?

• How will algorithms handle big datasets of 500K to 1M compounds?

• Bigger data possibly not needed for good models

• Mobile apps have useful cheminformatics features - aid anyone to do drug discovery

• Models are compact < 1MB and portable

• The age of model sharing is here


Wanted

• “Bigger” small molecule screening datasets

• Preferably > 500,000 – 1,000,000 molecules with data

• To test how machine learning algorithms scale

• Contact [email protected]


Krishna Dole and all at CDD, and many others…

Funding: Bill and Melinda Gates Foundation (Grant #49852); 9R44TR000942-02 NIH NIAD

Software: Biovia

Joel Freundlich

Alex Clark

Robert Reynolds

Antony Williams