DESCRIPTION
This presentation was given to a group of students at the UNC Eshelman School of Pharmacy. As chemists, many of us want to source information that is high quality, accurate and relevant to our query. With the proliferation of online chemistry resources, it is common for us to turn to them for data. However, are resources such as Wikipedia, PubChem and the plethora of databases delivering information for metabolism, medicinal chemistry and synthetic chemistry trustworthy? Which of these resources, if any, should be treated as authorities? What is the most integrated approach to sourcing chemistry-related data online? What approaches can be taken to validate the data that is available, and how can individual scientists help to improve the content and quality of chemistry-related data on the web?
Antony Williams is ChemSpiderman. He started the ChemSpider database (www.chemspider.com) as a hobby to deliver a free platform for the community to source chemistry-related data. Within three years the system was acquired by the Royal Society of Chemistry. It now serves up close to 25 million chemical structures linked to over 400 data sources across the internet, and it offers individual scientists the opportunity to host and share their data with the community and to participate in data curation and annotation. Tony will share his experiences of building this chemistry database, with a focus on data validation, curation and sourcing high-quality data. During the presentation he will discuss ways to check chemical structure representations before submitting them to public systems for searching, and will give an overview of chemical identifiers such as SMILES strings and the International Chemical Identifier (InChI), which allow resources to be interlinked. Attendees can expect to leave the session with a deeper understanding of how to use the internet to source chemistry-related data.
Chemicals, Chemical Identifiers and Navigating Through Databases
Antony Williams
UNC Chapel Hill, October 2010
Chemistry on the Internet
Where do you source chemistry information?
What can you trust online?
How can you recognize potential issues?
Cross-referencing and curating data
What is the Structure of Vitamin K?
MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
What is the Structure of Vitamin K1?
Wikipedia
CAS’s Common Chemistry
PubChem
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
Variants of systematic names on PubChem
2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
Bioassay Data are Associated…
Lack of Stereochemistry
ChEBI – Manual Curation
Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)
Molfiles
 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END
Molfiles
Molfiles are the primary exchange format between structure drawing packages
Can be different between different drawing packages
Most commonly carry X,Y coordinates for layout
Can support polymers, organometallics, etc.
Can carry 3D coordinates
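To make the layout of a Molfile concrete, here is a minimal sketch that reads the counts line, atom block and bond block of the V2000 example shown on the slide. It is an illustration only, not a full parser: the function name `parse_v2000` is hypothetical, the three header lines of a complete Molfile are absent (as on the slide), and whitespace splitting is assumed to be safe because the atom and bond counts are below 100.

```python
from collections import Counter

# The V2000 block from the slide (header lines are missing there too).
MOLBLOCK = """\
 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END
"""


def parse_v2000(text):
    """Return (atoms, bonds) from a V2000 block; hypothetical helper."""
    lines = text.splitlines()
    # Locate the counts line rather than assuming three header lines exist.
    start = next(i for i, ln in enumerate(lines) if ln.rstrip().endswith("V2000"))
    counts = lines[start].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[start + 1 : start + 1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    first_bond = start + 1 + n_atoms
    for ln in lines[first_bond : first_bond + n_bonds]:
        a, b, order, stereo = (int(v) for v in ln.split()[:4])
        bonds.append((a, b, order, stereo))
    return atoms, bonds


atoms, bonds = parse_v2000(MOLBLOCK)
element_counts = Counter(symbol for symbol, *_ in atoms)
# Bonds whose fourth field is non-zero carry a wedge/hash stereo flag.
stereo_bonds = [bond for bond in bonds if bond[3] != 0]
print(dict(element_counts), stereo_bonds)
```

The only stereo information in this record is the flag on the bond between atoms 4 and 10; its meaning depends on the X,Y coordinates, which is one reason the same molecule can be written differently by different drawing packages.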
SMILES (http://en.wikipedia.org/wiki/SMILES)
SMILES is a common format
Can support polymers, organometallics, etc.
Does NOT carry X, Y or Z coordinates for layout, so it requires layout algorithms, which can be problematic!
Generally different between drawing packages
Stereo
Tautomers
SMILES
ACD/Labs CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O
OpenEye CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
ChEMBL CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C
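The three strings above look different, yet they are meant to describe the same compound. A crude sanity check, sketched below, is to count heavy atoms and marked stereocentres in each string. This is only a plausibility test under simplifying assumptions (the regular expression covers the organic subset plus bracket atoms, and the helper name `summarize` is hypothetical). It does not prove that two SMILES are identical; that requires canonicalizing all of them with a single toolkit or converting them to InChI.

```python
import re
from collections import Counter

# The three SMILES strings from the slide, as raw strings because of "\".
SMILES = {
    "ACD/Labs": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": r"CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C",
}

# Bracket atoms first, then two-letter halogens, then the organic subset.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]+)")


def summarize(smiles):
    """Count heavy atoms per element and bracket atoms carrying '@' marks."""
    elements = Counter()
    chiral = 0
    for token in ATOM.findall(smiles):
        if token.startswith("["):
            symbol = BRACKET_SYMBOL.match(token).group(1)
            if "@" in token:
                chiral += 1
        else:
            symbol = token
        # Aromatic atoms are lower case in SMILES; fold them onto the element.
        elements[symbol.capitalize()] += 1
    return dict(elements), chiral


summaries = {source: summarize(s) for source, s in SMILES.items()}
for source, (elements, chiral) in summaries.items():
    print(source, elements, "marked stereocentres:", chiral)
```

Note that ACD/Labs and ChEMBL write `@@` where OpenEye writes `@`. The chirality symbol depends on the order in which neighbours are written, so differing symbols do not by themselves indicate differing stereochemistry.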
The InChI Identifier
InChI
SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES
InChI Strings can be reversed to structures – same problem as with SMILES – no layout
Well adopted by the community (databases, publishers, blogs, Wikipedia) – good for searching the internet
Multiple Layers
Tautomers – “Mobile H Perception”
Double Bond Orientation
Stereo
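The layered design can be illustrated by splitting an InChI string on its `/` separators. The sketch below is a simplification: the helper `split_inchi` and the layer labels are hypothetical, repeated layers after an isotopic or fixed-H layer are ignored (only the first occurrence is kept), and a formula layer is assumed to be present. The example string is the standard InChI for L-alanine, which does not appear on the slides and is used here only because it is short.

```python
# Prefix letter -> informal layer label (labels are illustrative only).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
    "f": "fixed-H",
}


def split_inchi(inchi):
    """Split an InChI string into a dict of layers (first occurrence wins)."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if not part:
            continue
        name = LAYER_NAMES.get(part[0], part[0])
        layers.setdefault(name, part[1:])
    return layers


# L-alanine (example chosen for brevity, not taken from the talk).
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
layers = split_inchi(L_ALANINE)
for name, value in layers.items():
    print(f"{name:22s} {value}")
```

The parenthesised group `(H,5,6)` in the hydrogen layer is the mobile-H notation: one hydrogen is shared between atoms 5 and 6, which is how tautomeric forms can collapse to a single string. A structure drawn without stereo would simply lack the `t`, `m` and `s` layers.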
Checking for Stereochemistry
Use your drawing package!
InChI Strings Hash to InChIKeys
PubChem InChIKeys
MBWXNTAXLNYFJB-NKFFZRIASA-N
MBWXNTAXLNYFJB-LKUDQCMESA-N
MBWXNTAXLNYFJB-UHFFFAOYSA-N
MBWXNTAXLNYFJB-FAKCLFGASA-N
MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
MBWXNTAXLNYFJB-ODDKJFTJSA-N
MBWXNTAXLNYFJB-KSVLJPARSA-N
MBWXNTAXLNYFJB-UDCSOKOMSA-N
MBWXNTAXLNYFJB-JHBCSKSVSA-N
MBWXNTAXLNYFJB-JXAKDHTRSA-N
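All ten keys share their first block and differ only in the second. A short sketch makes that explicit by splitting each key into its blocks and grouping on the first one. The helper name `split_inchikey` is hypothetical. The interpretation rests on how standard InChIKeys are built: the first block hashes the connectivity (the skeleton), the second block hashes the remaining layers such as stereo and isotopes, and `UHFFFAOY` is the value that appears when those layers are empty.

```python
from collections import defaultdict

# The ten PubChem InChIKeys listed on the slide.
KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",  # O-18 label
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]


def split_inchikey(key):
    """Return (skeleton_block, detail_block, protonation_flag)."""
    first, second, last = key.split("-")
    if (len(first), len(second), len(last)) != (14, 10, 1):
        raise ValueError(f"unexpected InChIKey shape: {key}")
    return first, second, last


by_skeleton = defaultdict(list)
for key in KEYS:
    skeleton, detail, _ = split_inchikey(key)
    by_skeleton[skeleton].append(detail)

# Keys whose second block shows no stereo or isotopic information.
no_stereo = [k for k in KEYS if split_inchikey(k)[1].startswith("UHFFFAOY")]

for skeleton, details in by_skeleton.items():
    print(skeleton, "->", len(set(details)), "distinct second blocks")
print("no stereo/isotope layers:", no_stereo)
```

Searching on the first block alone retrieves every record with the same skeleton, while searching on the full key retrieves only one specific stereoisomer or isotopologue.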
Databases and Standardization
InChI
No support for polymers, organometallics
Many option settings can lead to variability and make integration across databases difficult – FixedH option especially problematic
“Slight” chance of collisions of InChIKeys
VERY USEFUL FOR INTEGRATING THE WEB
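How slight is the chance of a collision? A rough feel comes from the birthday approximation, sketched below. Two inputs are assumptions rather than statements from the talk: that the first InChIKey block behaves like an ideal 65-bit hash of the connectivity, and that the hashed structures are distinct and independent. The 25 million figure is the database size quoted in the talk; the function name is hypothetical.

```python
import math


def collision_probability(n_items, hash_bits):
    """Birthday approximation: chance that any two of n_items share a hash."""
    space = 2.0 ** hash_bits
    # 1 - exp(-n(n-1) / (2 * space)), computed stably for tiny probabilities.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


# Assumed: 65-bit first block; 25 million structures as quoted in the talk.
p_first_block = collision_probability(25_000_000, 65)
print(f"chance of at least one first-block collision: {p_first_block:.2e}")
```

Under these assumptions the probability is small but not zero, and it grows roughly with the square of the number of structures, so the risk is worth keeping in mind as databases grow.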
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
Linked Data on the Web
Taken from: Rafael Sidi’s Blog
www.chemspider.com
Search for a Chemical…by name
Available Information…
Linked to vendors, safety data, toxicity, metabolism
How do we build it?
25 million chemicals from 400 data sources
We deal in Molfiles or SDF files – including coordinates
We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
We have our own “business logic” to standardize
We use InChI to “aggregate tautomers” to one record
We link out to external sites where possible using their IDs
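The rudimentary filtering step (valence checking and charge imbalance) might look something like the following sketch. It is not ChemSpider's actual logic: the `MAX_VALENCE` table, the function name `basic_filters` and the input layout are all hypothetical simplifications, the valence limits apply only to neutral atoms of a few common elements, and charged atoms are skipped by the valence test.

```python
# Illustrative maximum bond-order sums for neutral atoms (simplified).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def basic_filters(atoms, bonds):
    """Return a list of problems found in one structure.

    atoms: list of (element_symbol, formal_charge)
    bonds: list of (atom_index_a, atom_index_b, bond_order), 1-based indices
    """
    problems = []

    # Valence check: sum the bond orders around every atom.
    bond_sum = [0] * len(atoms)
    for a, b, order in bonds:
        bond_sum[a - 1] += order
        bond_sum[b - 1] += order
    for idx, ((symbol, charge), total) in enumerate(zip(atoms, bond_sum), start=1):
        limit = MAX_VALENCE.get(symbol)
        if limit is not None and charge == 0 and total > limit:
            problems.append(
                f"atom {idx} ({symbol}): bond order sum {total} exceeds {limit}"
            )

    # Charge imbalance: formal charges of the whole record should cancel.
    net = sum(charge for _, charge in atoms)
    if net != 0:
        problems.append(f"net charge {net:+d}")
    return problems


# A sensible fragment (C=O) and two deliberately broken records.
ok = basic_filters([("C", 0), ("O", 0)], [(1, 2, 2)])
pentavalent = basic_filters(
    [("C", 0), ("O", 0), ("O", 0), ("O", 0)],
    [(1, 2, 2), (1, 3, 2), (1, 4, 1)],
)
unbalanced = basic_filters([("N", 1), ("C", 0)], [(1, 2, 1)])
print(ok, pentavalent, unbalanced, sep="\n")
```

Checks like these catch only gross drawing errors. A structure can pass both and still be the wrong compound, which is why curation against other sources remains necessary.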
Inherited Errors
We have inherited errors from every database… all public compound databases, including ours, have errors
“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE
Compounds and Identifiers
Be careful searching by Name!
Determining the correct structure by name searching is difficult online!
Good, not perfect: Wikipedia, ChEBI/ChEMBL, ChemIDPlus, ChemSpider
Be VERY careful with MOST databases
Validating structures
Check for “full stereo” and use stereo descriptors especially for checking!
Check for quality of associated data sources
Check against reference literature when available – but it can be wrong
Question EVERYTHING!
Online Curation
Online databases generally do NOT allow curation or annotation
If you find errors they stay there!
ChemSpider is unique…immediate curation
ChemSpider live demo following this lecture
Searching
Deposition and Curation
ChemSpider SyntheticPages
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams