EBI is an Outstation of the European Molecular Biology Laboratory. Accessing small molecule data using ChEBI Janna Hastings, Duncan Hull and Nico Adams Programmatic Access to Biological Databases (Perl) 22-26 February 2010 @ EBI

Accessing small molecule data using ChEBI

DESCRIPTION

Presentation on Chemical Entities of Biological Interest (ChEBI) for the Programmatic Access to Biological Databases (Perl) course, 22-26 February 2010 @ EBI

Page 1: Accessing small molecule data using ChEBI

EBI is an Outstation of the European Molecular Biology Laboratory.

Accessing small molecule data using ChEBI

Janna Hastings, Duncan Hull and Nico Adams

Programmatic Access to Biological Databases (Perl)

22-26 February 2010 @ EBI

Page 2: Accessing small molecule data using ChEBI

ChEBI – Chemical Entities of Biological Interest

Overview

• Introduction to ChEBI

• Searching and browsing

• Understanding the ontology

• Downloads and programmatic access

Page 3: Accessing small molecule data using ChEBI

Introduction to ChEBI

Block 1

Page 4: Accessing small molecule data using ChEBI

Small Molecules within Bioinformatics

Literature

Nucleotide sequences

Genomes

Expressions

Protein sequences

Protein domains, families

3D structures

Enzymes

Small molecules

Pathways

Systems

Page 5: Accessing small molecule data using ChEBI

Small Molecules within Bioinformatics

[Figure: the same data types as on the previous slide, now with "Small molecules" repeated across the diagram]

Page 6: Accessing small molecule data using ChEBI

Small molecules participate in all the processes of life

Page 7: Accessing small molecule data using ChEBI

Signaling

γ-aminobutyric acid

GABA: chief inhibitory neurotransmitter in the mammalian central nervous system.

In humans, also regulates muscle tone.

• synthesized by neurons

• found mostly as a zwitterion, that is, with the carboxyl group deprotonated and the amino group protonated (ChEBI:16865)

• conformational flexibility of GABA is important for its biological function, as it has been found to bind to different receptors with different conformations

• GABA deficiency linked to
  • anxiety disorder, depression, alcoholism
  • multiple sclerosis, action tremors, tardive dyskinesia

Page 8: Accessing small molecule data using ChEBI

Metabolism

Adenosine 5'-triphosphate (ATP): the "molecular unit of currency" of intracellular energy transfer. (ChEBI:15422)

• generated in the cell by energy-consuming processes, broken down by energy-releasing processes

• many proteins that bind ATP do so in a characteristic protein fold known as the Rossmann fold, which is a general nucleotide-binding structural domain that can also bind the cofactor NAD

Adenosine 5'-triphosphate

Page 9: Accessing small molecule data using ChEBI

Enzymes

• Enzyme inhibitors are molecules that bind to enzymes and decrease their activity.

• Many drugs are enzyme inhibitors. They are also used as herbicides and pesticides.

• Enzyme activators bind to enzymes and increase their enzymatic activity.

• Enzyme activators are often involved in the allosteric regulation of enzymes in the control of metabolism.

clavulanic acid (ChEBI:48947) acts as a suicide inhibitor of bacterial β-lactamase enzymes

Page 10: Accessing small molecule data using ChEBI

Pathways

http://www.genome.jp/kegg-bin/highlight_pathway?scale=1.0&map=map00231&keyword=tryptophan

Page 11: Accessing small molecule data using ChEBI

Systems biology

BioModels: quantitative models of biochemical and cellular systems

tryptophan

D-enantiomer: sweet

L-enantiomer: bitter

Page 12: Accessing small molecule data using ChEBI

Drug design

• Ligand-based: relies on knowledge of other molecules that bind to the biological target of interest.

• Structure-based: relies on knowledge of the 3D structure of the biological target.

• A lead has
  • evidence that modulation of the target will have therapeutic value: e.g. disease linkage studies showing associations between mutations in the biological target and certain disease states.
  • evidence that the target is druggable, i.e. capable of binding to a small molecule and that its activity can be modulated by the small molecule.

• Target is cloned and expressed, then libraries of potential drug compounds are screened using screening assays

Page 13: Accessing small molecule data using ChEBI

Drug types 2003 - 2009

'Small molecules' in various shades of blue (http://chembl.blogspot.com/)

Page 14: Accessing small molecule data using ChEBI

Getting the chemistry right

• Thalidomide is a non-barbiturate hypnotic

• Thalidomide displays immunosuppressive and anti-angiogenic activity. It inhibits release of tumor necrosis factor-alpha from monocytes, and modulates other cytokine action.

• Thalidomide is racemic: it contains both left- and right-handed isomers in equal amounts. One enantiomer is effective against morning sickness, and the other is teratogenic.

• Enantiomers are interconverted in vivo. That is, if a human is given D-thalidomide or L-thalidomide, both isomers can be found in the serum. Hence, administering only one enantiomer does not prevent the teratogenic effect in humans.

http://www.drugbank.ca/drugs/DB01041

Page 15: Accessing small molecule data using ChEBI

Small molecule data sources

• PubChem (http://pubchem.ncbi.nlm.nih.gov/): deposition-driven publicly available compound repository, containing more than 25 million unique structures.

• ChemSpider (http://www.chemspider.com/): automatic aggregation of publicly available chemistry data with crowdsourced annotation.

• ChEBI (http://www.ebi.ac.uk/chebi/): manually annotated database and ontology

Page 16: Accessing small molecule data using ChEBI

Small molecule annotations

• Often appear as free text in biological databases, in which they are not the core data

• Are frequently referred to by common names which may be chemically ambiguous
  • e.g. adrenaline = (S)-adrenaline? (R)-adrenaline?

• May be referred to by several different names
  • paracetamol, acetaminophen, 4-acetamidophenol, N-(4-hydroxyphenyl)acetamide, …

Page 17: Accessing small molecule data using ChEBI

Chemicals - ChEBI

Visualisation: [structure image of caffeine]

Nomenclature: caffeine; 1,3,7-trimethylxanthine; methyltheobromine

Chemical data: Formula: C8H10N4O2; Charge: 0; Mass: 194.19

Ontology: metabolite; CNS stimulant; trimethylxanthines

Database Xrefs: MSDchem: CFF; KEGG DRUG: D00528

Chemical Informatics:

InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

SMILES CN1C(=O)N(C)c2ncn(C)c2C1=O
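The mass listed for caffeine can be checked from its formula. The following is a minimal sketch, not part of ChEBI: the function names are made up for illustration, the atomic weights are rounded standard values, and the parser only handles simple formulae without brackets or charges.

```python
import re

# Rounded standard atomic weights; only the elements needed for this example.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C8H10N4O2'."""
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts


def molecular_mass(formula):
    """Sum the atomic weights of all atoms in the formula."""
    return sum(ATOMIC_WEIGHTS[el] * n for el, n in parse_formula(formula).items())


if __name__ == "__main__":
    # Caffeine, as shown on the slide.
    print(round(molecular_mass("C8H10N4O2"), 2))  # 194.19
```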

Page 18: Accessing small molecule data using ChEBI

What is ChEBI?

• Chemical Entities of Biological Interest

• Freely available

• Focused on ‘small’ chemical entities (no proteins or nucleic acids)

• Illustrated dictionary of chemical nomenclature

• High quality, manually annotated

• Provides chemical ontology

Access ChEBI at http://www.ebi.ac.uk/chebi/

Page 19: Accessing small molecule data using ChEBI

ChEBI home page

Page 20: Accessing small molecule data using ChEBI

How is ChEBI maintained?

• Automatic loading of preliminary data

• Automatic loading of 2 star annotated data (ChEMBL and others)

• Manual annotation

• User requests via Submission Tool

• Public release: First Wednesday of every month.

Page 21: Accessing small molecule data using ChEBI

ChEBI entries contain

• A unique, unambiguous, recommended ChEBI name and an associated stable unique identifier

• An illustration where appropriate (compounds and groups, but generally not classes)

• A definition where appropriate (mostly classes)

• A collection of synonyms, including the IUPAC recommended name for the entity where appropriate

• A collection of cross-references to other databases

• Links to the ChEBI ontology

Page 22: Accessing small molecule data using ChEBI

ChEBI entry view

Page 23: Accessing small molecule data using ChEBI

Automatic Cross-references

Page 24: Accessing small molecule data using ChEBI

Chemical Structures

• Chemical structure may be interactively explored using the MarvinView applet

• Available in formats
  • Image
  • Molfile
  • InChI and InChIKey
  • SMILES

Page 25: Accessing small molecule data using ChEBI

Molfile format

Page 26: Accessing small molecule data using ChEBI

Time for Exercises

Page 27: Accessing small molecule data using ChEBI

Searching and browsing ChEBI

Block 2

Page 28: Accessing small molecule data using ChEBI

Simple text search

Wildcard: *

Enter any text

Page 29: Accessing small molecule data using ChEBI

Advanced text search

Narrow to category

AND, OR and BUT NOT

Page 30: Accessing small molecule data using ChEBI

Structure search

Search options

Structure drawing tools

Page 31: Accessing small molecule data using ChEBI

Search Results

Click to go to entry page

Hover-over for search menu

Page 32: Accessing small molecule data using ChEBI

Fingerprints

• Chemical substructure searching is computationally expensive…

Page 33: Accessing small molecule data using ChEBI

Fingerprints [2]

• … so heuristics must be used to decrease the number of search candidates

• Example: C8H9NO2 cannot be a substructure of an entity which does not have at least 8 carbon atoms, 9 hydrogen atoms…

• Fingerprints are a generalized, abstract encoding of structural features which can be used as an effective screening device
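The element-count argument for C8H9NO2 can be written as a cheap pre-filter that runs before any real substructure matching. This is only an illustrative sketch with made-up function names, not how ChEBI implements its search; a candidate that passes the filter still has to go through the full substructure algorithm.

```python
import re


def element_counts(formula):
    """Return {element: count} for a simple formula such as 'C8H9NO2'."""
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts


def may_contain(candidate_formula, query_formula):
    """Screen: the candidate needs at least as many atoms of each element as the query."""
    candidate = element_counts(candidate_formula)
    query = element_counts(query_formula)
    return all(candidate.get(el, 0) >= n for el, n in query.items())


if __name__ == "__main__":
    query = "C8H9NO2"
    for name, formula in [("water", "H2O"), ("caffeine", "C8H10N4O2")]:
        verdict = "needs full check" if may_contain(formula, query) else "ruled out"
        print(name, verdict)
```

Water is ruled out immediately, while caffeine survives the screen only because it has enough atoms of each element; whether it really contains the query structure is a separate question.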

Page 34: Accessing small molecule data using ChEBI

Fingerprints [3]

• Encoding of structural patterns

water (HOH)
0-bond paths: H, O, H
1-bond paths: HO, OH
2-bond paths: HOH

• Hashed to create bit strings, which are combined (bitwise OR) to give the final fingerprint

Pattern  Hashed bitmap
H        0000010000
O        0010000000
HO       1010000000
OH       0000100010
HOH      0000000101
Result:  1010110111
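The water example can be reproduced by combining the per-pattern bitmaps with a bitwise OR. In this sketch the bitmaps are copied from the slide rather than produced by a real hash function, and the function name is hypothetical.

```python
# Hashed bitmaps for the paths found in water, copied from the slide.
PATTERN_BITS = {
    "H": "0000010000",
    "O": "0010000000",
    "HO": "1010000000",
    "OH": "0000100010",
    "HOH": "0000000101",
}


def combine(bitmaps):
    """OR together equal-length bit strings and return the resulting bit string."""
    width = len(bitmaps[0])
    value = 0
    for bits in bitmaps:
        value |= int(bits, 2)
    return format(value, "0{}b".format(width))


if __name__ == "__main__":
    print(combine(list(PATTERN_BITS.values())))  # 1010110111
```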

Page 35: Accessing small molecule data using ChEBI

Types of structure search

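• Identity – based on InChI, e.g. InChI=1/H2O/h1H2

• Substructure – uses fingerprints to narrow search range, then performs full substructure search algorithm

• Similarity – based on Tanimoto coefficient calculated between the fingerprints

Example, for the fingerprints 1010110111 (7 bits set) and 0010110010 (4 bits set), which have 4 set bits in common:

Tanimoto(a,b) = c / (a+b-c) = 4 / (4+7-4) = 0.57

where a and b are the numbers of bits set in each fingerprint and c is the number of bits set in both.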
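The Tanimoto calculation for the two example fingerprints can be sketched as follows. The function name is made up for illustration, and fingerprints are handled as plain bit strings rather than in whatever form a real search engine would use.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient c / (a + b - c) for two equal-length bit strings."""
    a = fp_a.count("1")  # bits set in the first fingerprint
    b = fp_b.count("1")  # bits set in the second fingerprint
    c = sum(1 for x, y in zip(fp_a, fp_b) if x == "1" and y == "1")  # bits set in both
    if a + b - c == 0:
        return 0.0  # neither fingerprint has any bit set
    return c / (a + b - c)


if __name__ == "__main__":
    print(round(tanimoto("1010110111", "0010110010"), 2))  # 0.57
```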

Page 36: Accessing small molecule data using ChEBI

Browse via Periodic Table

Molecular entities / Elements

Page 37: Accessing small molecule data using ChEBI

Navigate via links in ontology

Click to follow links

Page 38: Accessing small molecule data using ChEBI

Time for Exercises

Page 39: Accessing small molecule data using ChEBI

Understanding the ChEBI ontology

Block 3

Page 40: Accessing small molecule data using ChEBI

Annotation of bioinformatics data

• Essential for capturing understanding and knowledge associated with core data

• Often captured in free text, which is easier to read and better for conveying understanding to a human audience, but…

  • Difficult for computers to parse
  • Quality varies from database to database
  • Terminology used varies from annotator to annotator

• Towards annotation using standard vocabularies: ontologies within bioinformatics

Page 41: Accessing small molecule data using ChEBI

The ChEBI ontology

Organised into three sub-ontologies, namely

• Molecular structure ontology

• Subatomic particle ontology

• Role ontology

(R)-adrenaline

Page 42: Accessing small molecule data using ChEBI

Molecular structure ontology

Page 43: Accessing small molecule data using ChEBI

Role ontology

Page 44: Accessing small molecule data using ChEBI

ChEBI ontology relationships

• Generic ontology relationships

• Chemistry-specific relationships

Page 45: Accessing small molecule data using ChEBI

Viewing ChEBI ontology

Page 46: Accessing small molecule data using ChEBI

Viewing ChEBI ontology [2]

Tree view

Page 47: Accessing small molecule data using ChEBI

Browsing ChEBI ontology (OLS)

Browse the ontology

Ontology Lookup Service (OLS): http://www.ebi.ac.uk/ontology-lookup/

Page 48: Accessing small molecule data using ChEBI

Ontology Lookup Service

• Provides a centralised query interface for ontology and controlled vocabulary lookup

• Can integrate any ontology available in OBO format

• At last release, 58 ontologies integrated, including
  • GO
  • ChEBI
  • Molecular interaction (PSI MI)
  • Pathway ontology (PW)
  • Human disease (DOID)
  • and many more…

• Provides a search and a browse facility, as well as displaying a graph of terms and relationships

Page 49: Accessing small molecule data using ChEBI

OBO Foundry

“The OBO Foundry is a collaborative experiment involving developers of science-based ontologies who are establishing a set of principles for ontology development with the goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain.”

Page 50: Accessing small molecule data using ChEBI

Time for Exercises

Page 51: Accessing small molecule data using ChEBI

Download and programmatic access

Block 4

Page 52: Accessing small molecule data using ChEBI

ChEBI domain model

Self-referencing - merging

Page 53: Accessing small molecule data using ChEBI

Compound IDs and Merging

• Compound accessions are maintained after merging, but only the main accession of a merged group is displayed

Navigated accession: CHEBI:5585

Main accession: CHEBI:15377

Page 54: Accessing small molecule data using ChEBI

Compound IDs and Merging [2]

ID     STATUS  CHEBI_ACCN   SOURCE  PARENT_ID  NAME   DEFINITION
15377  C       CHEBI:15377  ChEBI   null       water  null
5585   C       CHEBI:5585   KEGG    15377      null   null

Callouts: CHEBI:5585 is the additional accession; its PARENT_ID (15377) is the parent ID.

ID     COMPOUND  ACCN_NUMBER  TYPE          STATUS  SOURCE  URL_ABBR
16213  5585      C00001       KEGG accn     C       KEGG    KEGG
17314  5585      7732-18-5    CAS Registry  C       KEGG    null

Callout: this compound ID (5585) = additional accession.
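Resolving a merged accession to its main accession amounts to following PARENT_ID until a record without a parent is reached. The sketch below assumes a simplified in-memory copy of the two water records shown above; the tuple layout and the function name are illustrative, not ChEBI's own API.

```python
# Simplified records: (ID, CHEBI_ACCN, PARENT_ID, NAME); None stands for null.
COMPOUNDS = [
    (15377, "CHEBI:15377", None, "water"),
    (5585, "CHEBI:5585", 15377, None),
]


def main_accession(compound_id, compounds):
    """Follow PARENT_ID links until reaching the main (parentless) record."""
    by_id = {row[0]: row for row in compounds}
    row = by_id[compound_id]
    while row[2] is not None:
        row = by_id[row[2]]
    return row[1]


if __name__ == "__main__":
    print(main_accession(5585, COMPOUNDS))   # CHEBI:15377
    print(main_accession(15377, COMPOUNDS))  # CHEBI:15377
```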

Page 55: Accessing small molecule data using ChEBI

Downloading ChEBI flavours

• All downloads come in two flavours
  • 3 star only entries (manually annotated ChEBI entries)
  • 2 and 3 star entries (manually annotated ChEBI, ChEMBL and user submissions)

Page 56: Accessing small molecule data using ChEBI

Downloading ChEBI

• OBO file
  • Use in OBO-Edit

• SDF file
  • Compatible with chemistry software such as Bioclipse

• Flat file, tab delimited
  • Import all the data into Excel
  • Parse it into your own database structure

• Oracle binary dumps
  • Import into an Oracle database

• Generic SQL insert statements
  • Import into a MySQL or PostgreSQL database

Page 57: Accessing small molecule data using ChEBI

OBO File Format

• File format defined specifically for capturing biological ontologies

• Why use this format?
  • Use it if you are primarily interested in the ontology.
  • Don’t use it if you are interested in chemical structural information.

• What can you do with it?
  • Can parse it directly using parsers such as OBO-Edit
  • Can upload and browse the ontology using OBO-Edit

General header information

Synonym types used in terms

Root terms

Relationships to other terms
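For readers who want to read an OBO file without OBO-Edit, the tag-value layout is simple enough to parse by hand. The following is a minimal sketch: the sample stanzas use made-up EX: identifiers rather than real ChEBI terms, the function name is hypothetical, and many OBO features (escapes, trailing modifiers, typedefs) are ignored.

```python
# Illustrative OBO text with invented identifiers; not taken from the ChEBI file.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0001
name: example parent class

[Term]
id: EX:0002
name: example child entity
is_a: EX:0001 ! example parent class
relationship: has_role EX:0003
"""


def parse_obo_terms(text):
    """Return one dict per [Term] stanza, mapping each tag to a list of values."""
    terms = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            value = value.split(" ! ")[0]  # drop the trailing human-readable comment
            current.setdefault(tag, []).append(value)
    return terms


if __name__ == "__main__":
    for term in parse_obo_terms(SAMPLE_OBO):
        print(term["id"][0], term["name"][0], term.get("is_a", []))
```

Header lines before the first stanza are skipped, which matches the layout annotated on the slide: general header information first, then the term stanzas with their relationships.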

Page 58: Accessing small molecule data using ChEBI

SDF File Lite format

• Chemistry software compliant format

• Why use this format?
  • Use it to obtain the ChEBI entries with their chemical structural information.
  • Don’t use it for the ontology.

• What can I do with this format?
  • Parse it using existing software libraries such as CDK.
  • Open it in standalone tools such as Bioclipse
  • Copy and paste individual structures into JChemPaint

Entries separated by $$$$

Page 59: Accessing small molecule data using ChEBI

SDF File complete format

Entries separated by $$$$
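Because entries are separated by $$$$, the data fields of an SDF file can be pulled out with a few lines of code when a full chemistry toolkit is not needed. This is a rough sketch: the sample text omits the real structure blocks, the field names "ChEBI ID" and "ChEBI Name" are assumptions about the file's tags, and the function name is made up.

```python
# Illustrative SDF text; structure blocks are replaced by placeholders.
SAMPLE_SDF = """\
(structure block for record 1 omitted)
M  END
> <ChEBI ID>
CHEBI:15377

> <ChEBI Name>
water

$$$$
(structure block for record 2 omitted)
M  END
> <ChEBI ID>
CHEBI:15422

$$$$
"""


def parse_sdf_fields(text):
    """Return one {field name: value} dict per $$$$-terminated record."""
    records = []
    for chunk in text.split("$$$$"):
        if not chunk.strip():
            continue  # nothing after the final separator
        fields = {}
        tag = None
        for line in chunk.splitlines():
            if line.startswith("> <") and line.rstrip().endswith(">"):
                tag = line.strip()[3:-1]  # text between '> <' and '>'
                fields[tag] = []
            elif tag is not None:
                if line.strip():
                    fields[tag].append(line.strip())
                else:
                    tag = None  # a blank line ends the current data item
        records.append({name: "\n".join(values) for name, values in fields.items()})
    return records


if __name__ == "__main__":
    for record in parse_sdf_fields(SAMPLE_SDF):
        print(record)
```

For anything involving the structures themselves, a library such as CDK (mentioned on the slide) is the safer route.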

Page 60: Accessing small molecule data using ChEBI

Flat-file tab and comma delimited

• Why use this format?
  • Use it to obtain the entire ChEBI database structure.

• What can I do with this format?
  • Open it using Excel
  • Import it into a relevant database such as Oracle
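As an alternative to Excel, a tab-delimited file can be read directly with a standard CSV reader. The sketch below assumes a header row with the column names from the compound records shown on the merging slide and the literal text "null" for empty values; the actual download files may differ, and the function name is illustrative.

```python
import csv
import io

# Assumed layout: a header row followed by tab-separated values.
SAMPLE_TSV = "\n".join([
    "\t".join(["ID", "STATUS", "CHEBI_ACCN", "SOURCE", "PARENT_ID", "NAME", "DEFINITION"]),
    "\t".join(["15377", "C", "CHEBI:15377", "ChEBI", "null", "water", "null"]),
    "\t".join(["5585", "C", "CHEBI:5585", "KEGG", "15377", "null", "null"]),
]) + "\n"


def load_compounds(handle):
    """Read tab-delimited rows into dicts, turning the text 'null' into None."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({key: (None if value == "null" else value) for key, value in row.items()})
    return rows


if __name__ == "__main__":
    for row in load_compounds(io.StringIO(SAMPLE_TSV)):
        print(row["CHEBI_ACCN"], row["NAME"], row["PARENT_ID"])
```

With a real download, `io.StringIO(SAMPLE_TSV)` would be replaced by an open file handle.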

Page 61: Accessing small molecule data using ChEBI

Table dumps

• Similar structure to the flat-file tab delimited files

• Why use this format?
  • Use it to obtain the entire ChEBI database structure.

• Oracle binary dumps
  • Import into an Oracle database

• Generic SQL insert statements
  • Import into a MySQL or PostgreSQL database

Page 62: Accessing small molecule data using ChEBI

Web services

• Allow users to create their own applications to query data

User application

Page 63: Accessing small molecule data using ChEBI

The ChEBI web service

• Programmatic access to a ChEBI entry

• SOAP based Java implementation
  • Clients currently available in Java and Perl

• Methods
  • getLiteEntity
  • getCompleteEntity and getCompleteEntityByList
  • getOntologyParents
  • getOntologyChildren and getAllOntologyChildrenInPath
  • getStructureSearch

• Documented at http://www.ebi.ac.uk/chebi/webServices.do.
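To give a feel for what a SOAP client sends, the sketch below builds (but does not send) a request for getCompleteEntity. The namespace URI and the parameter name chebiId are assumptions that should be checked against the WSDL on the documentation page above; only the method name comes from the slide, and the helper function is hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
CHEBI_NS = "http://www.ebi.ac.uk/webservices/chebi"     # assumption: service namespace


def build_request(method, params):
    """Return a SOAP 1.1 envelope (as text) calling `method` with the given parameters."""
    envelope = ET.Element("{%s}Envelope" % SOAP_ENV)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_ENV)
    call = ET.SubElement(body, "{%s}%s" % (CHEBI_NS, method))
    for name, value in params.items():
        ET.SubElement(call, "{%s}%s" % (CHEBI_NS, name)).text = value
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    # "chebiId" is an assumed parameter name; CHEBI:15377 is water.
    print(build_request("getCompleteEntity", {"chebiId": "CHEBI:15377"}))
```

In practice a SOAP library (SOAP::Lite in the Perl client listed under "For more information") generates this envelope from the WSDL and parses the response.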

Page 64: Accessing small molecule data using ChEBI

Web service client object model

getLiteEntity

getCompleteEntity

getOntology (Parents and Children)

Page 65: Accessing small molecule data using ChEBI

Methods and parameters (1)

Page 66: Accessing small molecule data using ChEBI

Methods and parameters (2)

Page 67: Accessing small molecule data using ChEBI

Methods and parameters (3)

Page 68: Accessing small molecule data using ChEBI

Time for Exercises

Page 69: Accessing small molecule data using ChEBI

For more information

• ftp://ftp.ebi.ac.uk/pub/software/webservices/Perl/WSChebiSOAPLite-2.0.zip

• Email: [email protected]

• SourceForge: https://sourceforge.net/projects/chebi/

• User Manual: http://www.ebi.ac.uk/chebi/userManualForward.do

• RSS Feed

Page 70: Accessing small molecule data using ChEBI

Acknowledgements

• The ChEBI team: Nico Adams, Paula de Matos, Adriano Dekker, Marcus Ennis, Janna Hastings, Duncan Hull, Zara Josephs, Steve Turner, Christoph Steinbeck

• Everyone @ the EBI and elsewhere who uses or contributes to ChEBI

ChEBI is funded by the European Commission under SLING, grant agreement number 226073 (Integrating Activity) within Research Infrastructures of the FP7 Capacities Specific Programme; and by the BBSRC, grant agreement number BB/G022747/1 within the "Bioinformatics and biological resources" fund.

Page 71: Accessing small molecule data using ChEBI

Thank you