The NIH Roadmap and PubChem Gary Wiggins I533 Spring 2006

The NIH Roadmap and PubChem

Gary Wiggins

I533

Spring 2006

NIH Roadmap

• Series of initiatives designed to pursue major opportunities in biomedical research and gaps in current knowledge that cannot be addressed by any single NIH Institute or Center

• Goal: enable rapid transformation of new scientific knowledge into tangible benefits for public health

• http://nihroadmap.nih.gov/

NIH Molecular Libraries and Imaging Initiative

• Part of the “New Pathways to Discovery” area

• Goal: augment the “toolbox” for understanding the functionally interconnected molecular events that maintain health and lead to disease

• Build on high-throughput, highly specific, mechanism-based biological assays

• Aims to develop and discover small molecules that hold promise as research tools to probe cellular physiology and pathophysiology

NIH Molecular Imaging Roadmap

• High specificity/high sensitivity molecular imaging probes

• Molecular imaging and contrast database

• Imaging probe development center

NIH Roadmap Molecular Libraries Initiative (MLI)

• A series of integrated research programs with the goal of making small molecule screening and screening data more widely available to the research community

• http://nihroadmap.nih.gov/molecularlibraries/index.asp

MLI Aims

• Go beyond the identification of compounds with potential therapeutic properties

• Will result in the identification of compounds to use as probes to study cellular processes in health and disease

• Biological screening data, assay protocols, and chemical structures for compounds to be publicly available in PubChem

NIH MLI Components

• Molecular Libraries Screening Center Network (MLSCN)

• Cheminformatics (centered around PubChem)

• Technology development

NIH MLI Technology Development Areas

• Chemical diversity

– Pilot-scale libraries for investigation of novel chemical diversity space

– Novel methods for natural product chemistry

• Development of assays

• Novel instrumentation and detection technologies for high throughput screening

• Datasets and algorithms for better prediction of absorption, distribution, metabolism, excretion, and toxicity properties of small molecules

Assay Guidance Manual

• Originally written as a guide for therapeutic project teams within Eli Lilly; covers:

– Identifying potential assay formats compatible with High Throughput Screen (HTS) and Structure Activity Relationship (SAR)

– Developing optimal assay reagents

– Optimizing assay protocol with respect to sensitivity, dynamic range, signal intensity and stability

– Adaptation of the assay to the microtiter plate formats

– Validation of the assay performance

– Orthogonal follow-up assays for chemical probe validation and refinement

• http://www.ncgc.nih.gov/guidance/index.html

NIH Molecular Libraries Small Molecule Repository

• Run under contract by Discovery Partners International

• Collects samples for high throughput biological screening and distributes them to the NIH Molecular Libraries Screening Center Network

• http://mlsmr.discoverypartners.com/MLSMR_HomePage/

Roadmap MLI Funded Areas

• Molecular Libraries Screening Centers (MLSCN)

– Ten of them at academic institutions

– NIH Chemical Genomics Center

• http://www.ncgc.nih.gov/

• http://nihroadmap.nih.gov/molecularlibraries/fundedresearch.asp

Roadmap MLI Funded Areas

• Submitting assays for HTS in the MLSCN

– 28 different submissions

• Pilot-scale libraries for HTS (8)

• New methodologies for natural product chemistry (6)

• Assay development for HT molecular screening (39)

• Molecular libraries screening instrumentation (4)

Roadmap MLI Funded Areas

• Novel preclinical tools for predictive ADME-Toxicology (5)

• Innovation in molecular imaging probes (11)

• Development of high-resolution probes for cellular imaging (9)

Roadmap MLI Funded Areas

• Exploratory Centers for Cheminformatics Research at:

– Indiana University

– University of Michigan

– Rensselaer Polytechnic Institute

– MIT

– North Carolina State University, Raleigh

– University of North Carolina, Chapel Hill

IU Projects Underway

• Innovative cross-screen analysis of NIH Developmental Therapeutics Program Human Tumor Cell Line data

• Development of cheminformatics web services and use cases in Taverna

• Development of a novel interface for the analysis of PubChem HTS data

• A structure storage and searching system for Distributed Drug Discovery

• Quantum chemical computer simulations database

• Training modules for cheminformatics instruction on the Web

• Web guide for essential cheminformatics resources (http://www.indiana.edu/~cheminfo/cicc/resources.html)

• Design of a grid-based distributed data architecture for chemistry

NIH NCI Developmental Therapeutics Program

• The NCI has been collecting and testing compounds for 50 years. For about 30 years this has been managed by the Developmental Therapeutics Program (DTP). From 1955 to 1985 the primary test was to look for increase in survival of mice bearing transplantable tumors. In 1990, the primary screen switched to looking for inhibition of growth of 60 human tumor cell lines in culture. DTP also ran the anti-HIV screen for about 10 years and managed the yeast anti-cancer screen in which compounds were tested for their ability to inhibit the growth of yeast strains with defined mutations in cell cycle genes. These assays provide the bulk of the data DTP makes publicly available.

NIH NCI DTP

• DTP’s correlation analyses allow one to associate a list of genes with a given compound or vice versa

• Want to get workflows running that integrate chemical structure data with the gene expression and sequence data in the bioinformatics world

• Need help in the practical details of creating web services that will work in the mygrid/Taverna (or equivalent) framework

NIH DTP Data

                      Compounds    Data Points
Chemical Structures   ~265,000
60 Cell Assay         ~43,000      ~12,000,000
Anti-HIV Assay        ~45,000      ~90,000
Yeast Assay           ~110,000     ~600,000
in vivo Antitumor     ~120,000     ~1,100,000

NCI Panel of 60 Human Cancer Cell Lines

• Protein levels

• RNA measurements

• Mutation status

• Enzyme activity levels

NIH DTP’s COMPARE Program

• The pattern of activity across all 60 cell lines that a compound exhibits is related to the mechanism of action

– Can be used to discover the mechanism of a compound’s actions by looking at which compounds of known activity are correlated with the unknown

– Has been used to discover novel compounds with a given activity by testing the top correlating compounds to a compound with the activity of interest

– Used to prioritize compounds that seem to have a novel mechanism

– Calculates a correlation coefficient between two vectors in 60-dimensional space
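
The correlation step can be illustrated with a minimal sketch. It assumes the coefficient is a Pearson correlation over per-cell-line activity values; the slides do not say which coefficient COMPARE uses. The compound names and the five-value profiles are made up for illustration (the real screen has 60 values per compound), and `pearson` and `rank_by_correlation` are hypothetical helper names.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("a constant vector has no defined correlation")
    return cov / norm

def rank_by_correlation(seed, library):
    """Return (name, r) pairs sorted from most to least correlated with seed."""
    scored = [(name, pearson(seed, profile)) for name, profile in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up activity profiles over 5 cell lines (assumption: illustration only).
seed = [5.0, 6.2, 4.8, 7.1, 5.5]
library = {
    "compound_A": [5.1, 6.0, 4.9, 7.0, 5.6],  # pattern similar to the seed
    "compound_B": [7.0, 5.0, 7.2, 4.5, 6.4],  # roughly opposite pattern
    "compound_C": [5.0, 5.1, 5.0, 5.2, 4.9],  # nearly flat response
}
ranking = rank_by_correlation(seed, library)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

The top of such a ranking corresponds to the "top correlating compounds" that would be tested next.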

NIH DTP

• Given a compound tested in the 60 cell assay, one can look for the genes whose expression most highly correlates with the ability of the compound to inhibit cell growth. Conversely, given a gene, one can look for compounds whose ability to inhibit cell growth is most highly correlated with the expression of that gene.

NIH DTP Needs

• Grid Web services

• Visualization – may use VOTables

• Tools to squish a set of points in a high-dimensional space down into 2D or 3D while attempting to preserve the relative distances

– Looking at the nearest neighbors of the point of interest with such a map could reveal relations that would be missed in just a table listed by distance
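
One family of methods that fits this description is multidimensional scaling. The sketch below is a deliberately simple variant, chosen here for illustration and not named in the slides: it starts from random 2D positions and uses plain gradient descent to reduce the squared mismatch ("stress") between the original and the embedded pairwise distances. The function names, step count, learning rate, and sample points are all assumptions.

```python
import math
import random

def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(p, q) for q in points] for p in points]

def stress(target, coords):
    """Sum of squared differences between target and embedded distances."""
    n = len(coords)
    return sum(
        (math.dist(coords[i], coords[j]) - target[i][j]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )

def embed_2d(points, steps=500, rate=0.05, seed=0):
    """Place each point in 2D so that pairwise distances are roughly preserved."""
    rng = random.Random(seed)  # fixed seed keeps the layout reproducible
    target = pairwise_distances(points)
    n = len(points)
    coords = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j]) or 1e-9  # avoid dividing by zero
                scale = 2 * (d - target[i][j]) / d  # derivative of the squared mismatch
                grads[i][0] += scale * (coords[i][0] - coords[j][0])
                grads[i][1] += scale * (coords[i][1] - coords[j][1])
        for i in range(n):
            coords[i][0] -= rate * grads[i][0]
            coords[i][1] -= rate * grads[i][1]
    return coords

# Two made-up clusters of three points each in 5-dimensional space.
points = [
    [0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0],
    [5, 5, 5, 5, 5], [6, 5, 5, 5, 5], [5, 6, 5, 5, 5],
]
coords = embed_2d(points)
target = pairwise_distances(points)
for original, (x, y) in zip(points, coords):
    print(original, "->", (round(x, 2), round(y, 2)))
print("remaining stress:", round(stress(target, coords), 3))
```

In the resulting map the two clusters should sit far apart while the members of each cluster stay close, which is the kind of neighborhood structure a distance-sorted table hides. A fixed step size is only reasonable for small inputs like this one.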

NIH DTP Main Search Page

• http://dtp.nci.nih.gov/docs/dtp_search.html

High-Throughput Screening (HTS)

• the integration of biological, chemical and clinical data

• automated & standardized statistical analysis of large and complex data volumes

• biological and chemical profiling by use of statistical analyses on combined data from screening, pharmacological profiling, and structural properties

Other Potential Partners

• Center for Chemical Genomics at the University of Michigan

– http://www.lifesciences.umich.edu/institute/labs/ccg/index.html

• Milos Novotny (IUB Chemistry): $3.5 million National Center for Research Resources (NIH) grant to conduct research in the analysis of glycoproteins

• David Flockhart (IUB School of Medicine): Cytochrome P450 database http://medicine.iupui.edu/flockhart/

PubChem

• 5,298,729 compounds as of 1/16/2006

• the place to go for biological and related data

• the central depository of all information related to the NIH Roadmap project

• expected that the actual data will reside there, and only some things may be held elsewhere, with PubChem acting as a pointer

– May even have the images from screens and assays

• chemical structures from Elsevier's xPharm database

PubChem Data (as of 10/25/2005)

• Bioassays deposited 177

• Bioassay test results 3,158,669

• Substances deposited 7,848,390

• Unique Substances 5,269,228

PubChem Technical Details

• Entrez database system

– For all textual information in the database

• NCBI Toolkit - an open-source infrastructure toolkit

• OpenEye OEChem toolkit and associated software

– for most structure standardization tasks, plus some structure identifier computations like SMILES and IUPAC name generation

• NIST InChI library

– for computing the InChI identifier

• CACTVS Chemoinformatics Toolkit

– for structure depictions, structure database system, structure query execution, structure deduplication, some property calculations and the WWW structure and image editors

• Various general low-level support libraries, e.g.,

– zlib, png, gd and freetype libraries

• In-house code

– for the queuing system, deposition system, display CGIs, structure standardization set-up, update scripts, etc.

PubChem Database Display and Query Subsystems - 1

• A special Entrez version

– stores textual and numerical data

– hosted on a MS SQL Server relational database cluster

– holds precomputed structure images for display, ASN.1 structure data blobs for download, and extensive crosslinking functions for linking to other NCBI databases

PubChem Display and Query Subsystems - 2

• structure search component

– based on the CACTVS structure search system

– pseudo-relational in nature (the underlying storage manager is the Sleepycat BDB database manager)

– hosted on a Linux server cluster

– structure search file is not stored in the SQL database, but there is an automatic synchronization and update mechanism

– Some data, such as Lipinski filter criteria, are stored in both databases

PubChem Programming Utilities

• Entrez Programming Utilities

– http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

• CACTVS chemoinformatics toolkit

– a full ASN.1 parser for CACTVS understands the full data spec for structures and assay data

– modules for talking to the Entrez database for accessing structure blobs and some other NCBI systems
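
As a rough illustration of the Entrez Programming Utilities, the sketch below only builds an ESearch request URL and does not contact the server. The base address, the `esearch.fcgi` endpoint, and the `pccompound` database name reflect the E-utilities service as commonly documented today and are assumptions relative to the 2006 help page linked above; `esearch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumption: current E-utilities base address, not taken from the slides.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, db="pccompound", retmax=20):
    """Build an ESearch request URL for a text query against an Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"

url = esearch_url("aspirin")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return an XML list of matching record identifiers; NCBI's usage guidelines on request rates should be checked before scripting many calls.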

PubChem Data Deposition

• PubChem Deposition Gateway

• http://pubchem.ncbi.nlm.nih.gov/deposit/deposit.cgi

PubChem Sketcher

• No need to worry about the type of structure definition displayed in the top line

• uses a hidden internal representation to transfer the information

• http://pubchem.ncbi.nlm.nih.gov/search/

InChI, The IUPAC International Chemical Identifier

• Official site: http://www.iupac.org/inchi/

• Unofficial InChI FAQ: http://wwmm.ch.cam.ac.uk/inchifaq/

• WSDL InChI server at http://wwmm.ch.cam.ac.uk/gridsphere/gridsphere

Searching InChIs

• Sample search:

• "InChI=1/C17H14O4S/c1-22(19,20)14-9-7-12(8-10-14)15-11-21-17(18)16(15)13-5-3-2-4-6-13/h2-10H,11H2,1H3"

• Must include the quotation marks

• no carriage return or line feed in the string

• InChI code for C60 fullerene:

– InChI=1/C60/c1-2-5-6-3(1)8-12-10-4(1)9-11-7(2)17-21-13(5)23-24-14(6)22-18(8)28-20(12)30-26-16(10)15(9)25-29-19(11)27(17)37-41-31(21)33(23)43-44-34(24)32(22)42-38(28)48-40(30)46-36(26)35(25)45-39(29)47(37)55-49(41)51(43)57-52(44)50(42)56(48)59-54(46)53(45)58(55)60(57)59
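
The two search rules above (quotation marks required, no line breaks) are easy to violate when an InChI is copied from a wrapped page. A small sketch of a clean-up step follows; the helper names `inchi_search_term` and `formula_layer` are hypothetical, and the sketch only prepares the string without submitting it anywhere.

```python
def inchi_search_term(raw):
    """Join a wrapped InChI into one line and wrap it in double quotes."""
    joined = "".join(raw.split())              # drops spaces, tabs, CR and LF
    joined = joined.strip('"\u201c\u201d')     # remove straight or curly quotes
    if not joined.startswith("InChI="):
        raise ValueError("expected a string starting with 'InChI='")
    return f'"{joined}"'

def formula_layer(inchi):
    """Return the molecular formula, the first layer after the version prefix."""
    return inchi.strip('"').split("/")[1]

# The sample search string from the slide, wrapped across two lines.
wrapped = (
    "InChI=1/C17H14O4S/c1-22(19,20)14-9-7-12(8-10-14)15-11-21-\n"
    "17(18)16(15)13-5-3-2-4-6-13/h2-10H,11H2,1H3"
)
term = inchi_search_term(wrapped)
print(term)
print(formula_layer(term))
```

Removing all whitespace is safe here because an InChI string contains none of its own.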

ACD Labs and InChIs

• Transferring structures from PubChem to ACD/ChemSketch

• http://www.acdlabs.com/download/technotes/90/draw_db/pubchem.pdf

InChI Support in BKChem

• BKChem - a free chemical drawing program

• Successfully reads most InChIs

• http://bkchem.zirael.org/inchi_en.html

InChI

• PubChem sketcher also supports generation of InChI strings

• http://pubchem.ncbi.nlm.nih.gov/edit/

– change the format selector to "InChI"

Protein Data Bank (PDB) Data Dictionaries

• develop software and data definitions to support the structural genomics efforts

• enable high-throughput data deposition

• data dictionaries define items at the level of detail of the materials and methods section of a journal

• uses macromolecular Crystallographic Information File (mmCIF) data dictionaries

• http://mmcif.pdb.org/index.html

Translate WSDL to Human Readable Form

• http://soapclient.com/soaptest.html