Why have one model when you could have thousands? Alex M. Clark, Ph.D. January 2016 © 2016 Molecular Materials Informatics, Inc. http://molmatinf.com

SLAS2016: Why have one model when you could have thousands?


Page 1: SLAS2016: Why have one model when you could have thousands?

Why have one model when you could have thousands?

Alex M. Clark, Ph.D.

January 2016

© 2016 Molecular Materials Informatics, Inc. http://molmatinf.com

Page 2: SLAS2016: Why have one model when you could have thousands?

MOLECULAR MATERIALS INFORMATICS

Cheminformatics

• Generally 2D structures with activities:

• Look for trends: structure-activity relationships

• Leverages quantity rather than detail... but quality is also supremely important

Page 3: SLAS2016: Why have one model when you could have thousands?

Structure-Activity Models

• Bayesian models very effective

• ECFP6 fingerprints, e.g. 10001001000001101001011101110111

• Tabulate structure fingerprints for actives vs. inactives

• Prediction: ordering, probability

• Low maintenance

[Figure: ROC integral 0.8343]
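The tabulation step can be sketched in a few lines. This is a minimal illustration that assumes a Laplacian-corrected naive Bayes estimator, a common choice for fingerprint-based models; the slides do not state the exact formula. Fingerprints are represented as sets of integer feature hashes, and the names `train_bayesian` and `predict` are hypothetical.

```python
import math
from collections import Counter


def train_bayesian(fingerprints, actives):
    """Tabulate fingerprint features for actives vs. inactives.

    fingerprints: list of sets of hashed feature ids (e.g. ECFP6 bits)
    actives:      list of booleans, one per molecule
    Returns a dict mapping each feature to its log contribution.
    """
    total = len(fingerprints)
    base_rate = sum(actives) / total  # overall fraction of actives
    seen = Counter()  # molecules containing each feature
    hit = Counter()   # active molecules containing each feature
    for fp, is_active in zip(fingerprints, actives):
        for bit in fp:
            seen[bit] += 1
            if is_active:
                hit[bit] += 1
    # Assumed estimator: Laplacian-corrected ratio of observed to expected actives.
    return {
        bit: math.log((hit[bit] + 1) / (seen[bit] * base_rate + 1))
        for bit in seen
    }


def predict(model, fingerprint):
    """Sum the contributions of known features; unseen features add nothing."""
    return sum(model.get(bit, 0.0) for bit in fingerprint)
```

The raw score is only useful for ordering molecules; turning it into a probability needs a separate calibration step, which is not shown here.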

Page 4: SLAS2016: Why have one model when you could have thousands?

The Data Problem

• > 10 years ago: quantity the biggest issue

- open structure-activity data rare and small

- paid collections, big pharma registration

• ~5 years ago: quality the biggest issue

- huge databases, e.g. PubChem, ChemSpider, ZINC, vendors, etc.

- generally no provenance: anything goes

• Cheminformatics seemed to be stagnant...

- new methods, same mediocre performance

Page 5: SLAS2016: Why have one model when you could have thousands?

The Data Solution

• Recently: some excellent developments

- Open Melting Points: models actually work

- PubChem: direct submission by scientists

- CDD: store and share with same platform

- ChEMBL: large, open, high quality, broad

• Can now have quantity and quality, without fees or restrictions

• Evidence suggests that the data was holding us back, not the methods

Page 6: SLAS2016: Why have one model when you could have thousands?

ChEMBL

• Hierarchy looks like this:

[Diagram: target, assay, activity, molecule]

• What we need it to be:

[Diagram: dataset, target, assay, activity, molecule, merged activity, materials for model]

Page 7: SLAS2016: Why have one model when you could have thousands?


Slicing & Dicing

• Divide by target, species and type of assay (protein binding, whole cell, ADMET, etc.)

• Measurements: [Ki, Kd] or [IC50, EC50, AC50, GI50]

• Units: [M, mM, μM, nM]

• Relations: [=, <, >, ≤, ≥]

• Total of 8646 groups of structure-activity data
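The grouping and unit handling above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used for the talk: the family labels `"K"` and `"X50"`, the function names, and the choice to convert everything to a -log10 molar scale (as in pIC50) are all introduced here for the example.

```python
import math

# Conversion factors to molar for the units listed on the slide.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "μM": 1e-6, "nM": 1e-9}

# The two measurement families from the slide; the labels are hypothetical.
MEASUREMENT_FAMILY = {
    "Ki": "K", "Kd": "K",
    "IC50": "X50", "EC50": "X50", "AC50": "X50", "GI50": "X50",
}

# Taking -log10 of a concentration reverses the direction of an inequality.
FLIP = {"=": "=", "<": ">", ">": "<", "≤": "≥", "≥": "≤"}


def group_key(target, species, assay_type, measurement):
    """Activities that share this key land in the same group."""
    return (target, species, assay_type, MEASUREMENT_FAMILY[measurement])


def to_p_scale(relation, value, unit):
    """Convert e.g. ('>', 10, 'μM') into a relation and -log10(molar) value."""
    molar = value * UNIT_TO_MOLAR[unit]
    return FLIP[relation], -math.log10(molar)
```

For example, an activity reported as "> 10 μM" becomes "< 5" on the log scale, which is why the relation has to be carried along with the value.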


Page 8: SLAS2016: Why have one model when you could have thousands?

Consolidation

• Strip salts / adducts

• Common organic elements only:

- [H, C, N, O, P, S, F, Cl, Br, I, B, Si, Se, As, Sb, Te]

• Duplicate molecules: merge activities, e.g.

- [1.2, 1.8] ➡ 1.5 ± 0.3

- [> 5, 5.5] ➡ > 5

- [< 1, 3.5] ➡ invalid

• Keep groups with at least 100 molecules remaining

• Now down to 1839 datasets
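The three merge examples above are consistent with a rule along the following lines. This is a guess at the logic rather than the actual implementation: the function name `merge_activities` is hypothetical, the ± term is assumed to be a population standard deviation, and cases that mix upper and lower bounds are simply rejected.

```python
from statistics import mean, pstdev


def merge_activities(measurements):
    """Merge duplicate measurements for one molecule.

    measurements: list of (relation, value), relation in '=', '<', '>'
    Returns (relation, value, error), or None when the data conflict.
    """
    exact = [v for r, v in measurements if r == "="]
    lower = [v for r, v in measurements if r == ">"]  # true value lies above
    upper = [v for r, v in measurements if r == "<"]  # true value lies below

    if not lower and not upper:
        if not exact:
            return None
        # Only exact values: report mean and spread.
        return ("=", mean(exact), pstdev(exact))
    if lower and upper:
        # Assumption: mixed bounds are not merged in this sketch.
        return None
    if lower:
        bound = max(lower)
        # Every exact value must be compatible with the bound.
        if any(v <= bound for v in exact):
            return None
        return (">", bound, 0.0)
    bound = min(upper)
    if any(v >= bound for v in exact):
        return None
    return ("<", bound, 0.0)
```

With this rule, `[("=", 1.2), ("=", 1.8)]` merges to a mean of 1.5 with a spread of 0.3, `[(">", 5), ("=", 5.5)]` keeps the bound `> 5`, and `[("<", 1), ("=", 3.5)]` is rejected.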


Page 9: SLAS2016: Why have one model when you could have thousands?

Model Building

• Bayesian models need a threshold...

[Figure: pIC50 scale divided by a threshold into inactive and active]

• Suitable values often known; large-scale automation must estimate them

• Score: population, balance, trial Bayesian

• See J. Chem. Inf. Model. 55, 1246-1260 (2015)
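As a rough illustration of automated threshold estimation, the sketch below scores candidate thresholds on population and balance only. It is not the published scoring function (see the cited paper for that), it omits the trial Bayesian term entirely, and the name `pick_threshold` and the `min_side` parameter are hypothetical.

```python
def pick_threshold(p_values, candidates, min_side=10):
    """Pick the candidate threshold that splits the data most evenly.

    p_values:   activities on a -log10 scale (higher = more active)
    candidates: thresholds to try
    min_side:   minimum number of molecules required on each side
    Returns the best threshold, or None if no candidate qualifies.
    """
    best, best_score = None, -1.0
    n = len(p_values)
    for t in candidates:
        n_active = sum(1 for v in p_values if v >= t)
        n_inactive = n - n_active
        # Population check: both classes need enough members to model.
        if min(n_active, n_inactive) < min_side:
            continue
        # Balance score: 0.5 means a perfectly even split.
        score = min(n_active, n_inactive) / n
        if score > best_score:
            best, best_score = t, score
    return best
```

A fuller version would also build a trial model at each candidate and include its cross-validated performance in the score.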

Page 10: SLAS2016: Why have one model when you could have thousands?

Model Results

• Metrics generally good for Bayesian models using ECFP6 fingerprints

• Note that not all datasets have any SAR

[Figure: AUC (easy) vs. population; AUC (hard) vs. population]
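For reference, the AUC metric (the ROC integral) can be computed directly from model scores and known labels. The sketch below uses the pairwise-comparison definition; the function name `roc_auc` is illustrative, and how the "easy" and "hard" partitions were constructed is not covered here.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen active scores higher
    than a randomly chosen inactive, counting ties as half.
    """
    pos = [s for s, is_active in zip(scores, labels) if is_active]
    neg = [s for s, is_active in zip(scores, labels) if not is_active]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect separation. The double loop is quadratic, which is fine for a sketch but slow for large datasets.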

Page 11: SLAS2016: Why have one model when you could have thousands?

Deliverable

• Datasets with acceptable models: 1826

- list of unique molecules

- activity (standard molar units)

- threshold (active/inactive)

- target & assay provenance

- Bayesian model (ECFP6)

• Targets are diverse, data is high quality: thanks to the ChEMBL project

• Can apply all models to any molecule...

• Start with a set of discontinued drugs...

Page 12: SLAS2016: Why have one model when you could have thousands?

Discontinued Drugs

• ~50 drugs that passed most tests, but never made it to market

• Maybe they cure something else?

Page 13: SLAS2016: Why have one model when you could have thousands?

Detail & Visualisation

Atom-centric Bayesian

Honeycomb clustering

Page 14: SLAS2016: Why have one model when you could have thousands?

PolyPharma app

• Proof-of-concept tools being explored for several drug discovery collaborations

• Interactive functionality demonstrated as a mobile app for iPhone & iPad

• Free to use

http://itunes.apple.com/app/polypharma/id1025327772


Page 24: SLAS2016: Why have one model when you could have thousands?

Acknowledgments

http://molmatinf.com http://molsync.com http://cheminf20.org

@aclarkxyz

• Collaborative Drug Discovery

• Sean Ekins

• Society for Laboratory Automation & Screening

• Inquiries to [email protected]