Klaus Gubernator, Craig James, eMolecules Inc. ACS 232nd National Meeting Division of Chemical Information San Francisco, September 14, 2006 Chemical Structure Search Engines in Cyberspace

Page 1

Klaus Gubernator, Craig James, eMolecules Inc.

ACS 232nd National Meeting

Division of Chemical Information

  San Francisco, September 14, 2006

Chemical Structure Search Engines in Cyberspace

Page 2

The web has revolutionized the way we retrieve information

Chemistry is a late participant in this revolution

Chemistry on the Internet

Page 3

Search Google Images for “Aspirin”

Page 4

http://scripts.iucr.org/cgi-bin/paper?cnor=a12172&buy=yes

Acta Cryst. (1975). B31, 1427-1429

The crystal structure of 7-amino-2H,4H-vic-triazolo[4,5-c]-1,2,6-thiadiazine 1,1-dioxide (ATT)

C. Foces-Foces, F. H. Cano and S. García-Blanco

Buy online

You may purchase this article in PDF and/or HTML formats. For purchasers in the UK, and for purchasers elsewhere in the European Community who do not have a VAT number, VAT will be added to the price of the article.

Format*   PDF (US $40, plus US $7 for EC purchases)

Structure of “triazolo thiadiazine”

Page 5

Datasets (which are, in contrast to other dataset lists, available in a structural format)

This list will be expanded continuously. Please don't hesitate to make published datasets publicly available here.

Currently available: 44 Datasets

Note: The Briem/Lessel and Hert/Willett datasets are only available as MDDR IDs due to license reasons. Please contact MDL for further information on the database. The datasets have nonetheless been included here because they are standard datasets for similarity searching. – Andreas Bender

Binary (active/inactive) datasets

QSAR datasets

QSPR datasets

Toxicity datasets

Metabolism datasets

Permeability datasets

Docking datasets

Mechanistic datasets

Mixed/Other datasets

Cheminformatics.org

Page 6

CS(=O)(=O)Nc1ccc(cc1OC2CCCCC2)N(=O)=O
CS(=O)(=O)Nc1cc2CCC(=O)c2cc1Oc3ccc(F)cc3F
CS(=O)(=O)Nc1cc2CCC(=O)c2cc1Sc3ccc(F)cc3F
CS(=O)(=O)Nc1ccc(cc1Sc2ccc(F)cc2F)C(=O)N
CS(=O)(=O)Nc1ccc(cc1Sc2ccc(Cl)cc2Cl)S(=O)(=O)N
COc1ccc(cc1)c2sc(nc2c3ccc(cc3)S(=O)(=O)C)c4ccccc4Cl
COc1ccc(cc1)c2sc(nc2c3ccc(SC)cc3)c4ccccc4Cl
CS(=O)(=O)c1ccc(cc1)n2nc(cc2c3ccc(F)cc3)C(F)(F)F
CS(=O)(=O)c1ccc(cc1)n2nc(cc2c3ccc(Br)cc3)C(F)(F)F
Cc1ccc(cc1)c2cc(nn2c3ccc(cc3)S(=O)(=O)N)C(F)(F)F
CS(=O)(=O)c1ccc(cc1)c2snnc2c3ccc(F)cc3
CC(=O)c1nc(c(o1)c2ccc(c(F)c2)S(=O)(=O)N)c3ccccc3
Cc1nc(C2CCCCC2)c(o1)c3ccc(c(F)c3)S(=O)(=O)N
CS(=O)(=O)c1ccc(cc1)c2[nH]c(nc2C3CCCCC3)C(F)(F)F
CS(=O)(=O)c1ccc(cc1)c2[nH]c(nc2c3ccc(F)cc3)C(F)(F)F
CS(=O)(=O)c1ccc(cc1)C2=C(C(=O)OC32CC3)c4ccccc4
CS(=O)(=O)c1ccc(cc1)C2=C(C(=O)OC32CCCC3)c4ccccc4
CS(=O)(=O)c1ccc(cc1)c2cnn(Cc3ccccc3)c(=O)c2c4ccccc4
CS(=O)(=O)c1ccc(cc1)c2nn(Cc3ccccc3)c(c2c4ccc(F)cc4)C(F)(F)F
NS(=O)(=O)c1ccc(cc1)c2c(CO)onc2c3ccccc3
CS(=O)(=O)c1ccc(cc1)c2cc(Cl)nn2c3ccc(F)cc3
NS(=O)(=O)c1ccc(cc1)c2cc(nn2c3ccc(F)cc3)C(F)(F)F
NS(=O)(=O)c1ccc(cc1)n2nc(cc2c3nc4cccc(F)c4s3)C(F)F

Stahl dataset

Page 7

Unnamed
-MTS- 06200418093D 0 0.00000 0.00000 0

13 13 0 0 0 0 0 0 0 0 1 V2000
0.0180 -0.0030 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
1.7880 0.0070 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4880 0.0110 1.2120 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8880 0.0200 1.2120 C 0 0 0 0 0 0 0 0 0 0 0 0
4.5880 0.0240 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.0030 0.0330 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
6.6610 1.1880 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.0400 2.2500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
8.1410 1.1970 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.7570 0.1440 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
8.7890 2.3360 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.8880 0.0200 -1.2130 C 0 0 0 0 0 0 0 0 0 0 0 0
2.4880 0.0120 -1.2120 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 2 0 0 0 0
3 4 1 0 0 0 0
4 5 2 0 0 0 0
…
M  END
> <BIO>
48.00

$$$$

Yokoyama dataset
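The record on this slide follows the SD file layout: a connection table ending in `M  END`, then tagged data items such as `> <BIO>`, then a `$$$$` line closing the record. As a minimal sketch of how such data items could be pulled out of a file, the following reads only the tagged fields and ignores the connection table. The function name `sd_data_fields` and the shortened sample record are assumptions made for illustration; only the `BIO` tag, its value, and the `$$$$` terminator come from the slide.

```python
import re

# A data header line starts with ">" and carries the field name in angle brackets.
TAG = re.compile(r"^>.*<([^>]+)>")

def sd_data_fields(sd_text):
    """Return one {tag: value} dict per record found in SD-file text."""
    records, fields, tag = [], {}, None
    for line in sd_text.splitlines():
        if line.startswith("$$$$"):
            # End of record: store what was collected and start afresh.
            records.append(fields)
            fields, tag = {}, None
            continue
        match = TAG.match(line)
        if match:
            tag = match.group(1)
            fields[tag] = []
        elif tag is not None:
            if line.strip():
                fields[tag].append(line.strip())
            else:
                # A blank line ends the current data item.
                tag = None
    return [{k: "\n".join(v) for k, v in rec.items()} for rec in records]

# Shortened, hypothetical record: the connection table is empty on purpose.
sample = "\n".join([
    "Unnamed",
    "  hypothetical header line",
    "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "> <BIO>",
    "48.00",
    "",
    "$$$$",
])

print(sd_data_fields(sample))
```

A production reader would also parse the atom and bond blocks, which use fixed-width columns; a cheminformatics toolkit is the usual choice for that.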

Page 8

Search GenBank for “aattccgg”

Page 9

Page 10

Page 11

Why is there so little chemistry on the web?

Tradition?

Strong providers of subscription services?

Searching for chemical structures is significantly more difficult than text searching?

Chemical identifiers are not standardized?

Page 12

Open Access Chemical Search Engines

PubChem - NIH

ChemBank – Harvard

ZINC – UCSF

ChemDB – UC Irvine

ChemExper - Lausanne

ChemFinder – CambridgeSoft

Page 13

www.emolecules.com

New Chemistry Search Engine

A large database of publicly available molecular structures

Launched November 2005

50,000 searches per month, rapidly growing

Page 14

www.emolecules.com

Free chemistry search site for publicly available chemical information

Page 15
Page 16

Advanced Search

Powerful features:

hit list management

union, intersect, subtract, difference

manual selection

export lists in many formats

persistent hitlists
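The four hit list operations named on this slide map naturally onto set operations. The sketch below shows one way they could behave, not how eMolecules implements them. The compound IDs and the function name `combine` are hypothetical, and reading "difference" as symmetric difference (entries in exactly one of the two lists) is an assumption, since the slide does not define it.

```python
def combine(op, left, right):
    """Combine two hit lists (sets of compound IDs) with a named operation."""
    if op == "union":
        return left | right
    if op == "intersect":
        return left & right
    if op == "subtract":
        return left - right
    if op == "difference":
        # Assumption: "difference" means symmetric difference,
        # i.e. hits that appear in exactly one of the two lists.
        return left ^ right
    raise ValueError(f"unknown hit list operation: {op}")

# Hypothetical hit lists from two separate searches.
hits_a = {"cmpd-001", "cmpd-002", "cmpd-003"}
hits_b = {"cmpd-002", "cmpd-003", "cmpd-004"}

for op in ("union", "intersect", "subtract", "difference"):
    print(op, sorted(combine(op, hits_a, hits_b)))
```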

Page 17

Page 18
Page 19

Content: 16M entries, 5.6M structures

Academic and government databases

NIST WebBook

DrugBank

Protein Ligands

Chemical suppliers

150 electronic catalogs included

Future goal

All publicly available chemical information

Page 20

Why is it so fast?

Novel chemical search engine technology

Method represents a major departure from previously known algorithms:

- Molecular keys (MDL)

- Fingerprints (Daylight)

- Feature Trees (BioSolveIT)
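For contrast, fingerprint-style screening of the kind listed above is commonly described as a bitwise subset test: a target can contain the query only if every bit set in the query fingerprint is also set in the target fingerprint. The sketch below illustrates that general idea only. The hashing scheme, the 64-bit width, the feature strings, and the function names are assumptions for illustration, not Daylight's actual algorithm.

```python
import zlib

def fingerprint(features, n_bits=64):
    """Fold feature strings into an n_bits-wide integer bitmask (toy scheme)."""
    bits = 0
    for feature in features:
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        bits |= 1 << (zlib.crc32(feature.encode("utf-8")) % n_bits)
    return bits

def may_contain(query_fp, target_fp):
    """True if every bit of the query fingerprint is also set in the target."""
    return (query_fp & target_fp) == query_fp

# Hypothetical feature strings standing in for real substructure paths.
query = fingerprint({"C-N"})
target = fingerprint({"C-N", "C=O", "C-Cl"})
print(may_contain(query, target))
```

A passing screen only marks a candidate, because different features can collide on the same bit; a full atom-by-atom match is still needed to confirm a hit.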

Page 21

Search engine technology

Analyze each molecule for distinguishing structural features

Generate all features algorithmically

Normalize features and use them for indexing

Result: very fast searches
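The slide does not disclose how features are generated or normalized, so the following is only a sketch of the general pattern it describes: derive features per molecule, normalize them, and index molecules by feature so that a query touches only the relevant entries. Everything concrete here is an assumption for illustration: the features are hand-written (atom, bond, atom) triples rather than algorithmically generated ones, normalization is just an alphabetical ordering of the two atoms, and the names `normalize`, `build_index`, and `candidates` are hypothetical.

```python
from collections import defaultdict

def normalize(feature):
    """Toy normalization: put the two atoms of a bond feature in a fixed order."""
    a, bond, b = feature
    return (a, bond, b) if a <= b else (b, bond, a)

def build_index(molecules):
    """Map each normalized feature to the set of molecule IDs that contain it."""
    index = defaultdict(set)
    for mol_id, features in molecules.items():
        for feature in features:
            index[normalize(feature)].add(mol_id)
    return index

def candidates(index, query_features):
    """Return IDs of molecules that contain every feature of the query."""
    postings = [index.get(normalize(f), set()) for f in query_features]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical molecules described by hand-written bond features.
molecules = {
    "mol-1": [("C", "-", "N"), ("C", "=", "O")],
    "mol-2": [("N", "-", "C"), ("C", "-", "Cl")],
    "mol-3": [("C", "=", "O")],
}
index = build_index(molecules)
print(sorted(candidates(index, [("N", "-", "C")])))
print(sorted(candidates(index, [("C", "-", "N"), ("O", "=", "C")])))
```

The lookup cost depends on the number of query features and the size of their posting sets rather than on the size of the whole database, which is the usual reason index-based searches of this kind are fast.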

Page 22

Who is eMolecules?

Klaus Gubernator

Craig A. James

Rashmi Mistry

Page 23

Summary

Free for depositors and users

Very fast search engine

High quality user interface

Rich functionality

Complementary with other engines

Page 24

Contact Information

Klaus Gubernator

[email protected]

Skype: emolecules

+1-858-764-1941