orcid-0000-0002-2668-4821
Today ChemSpider (www.chemspider.com) is one of the chemistry community’s primary online resources. Now hosting over 28 million unique chemical compounds linked to over 400 data sources, ChemSpider offers its users a structure-centric platform that facilitates access to publications and patents, experimental and predicted property data, spectral data, and many other forms of data and information that can benefit a chemist. ChemSpider is also a crowdsourcing platform: the community can contribute directly to the database by depositing and sharing structure data, properties, spectra and reaction syntheses. Crowdsourcing also allows the annotation and curation of existing data, so the community can assist in the much-needed curation and validation of chemistry data on the internet. This work is imperative in order to provide the chemistry underpinnings for semantic web projects such as Open PHACTS (www.openphacts.org), from which Merck is sure to benefit when it is released to the community. This presentation will provide an overview of the ChemSpider platform and will also examine the challenges of dealing with heterogeneous data quality when attempting to provide a rich resource of data for the community. If you use the internet to research chemistry-based data, this presentation will be an essential guide to sourcing high-quality data.
Chemistry Online – The Vision and Challenges Associated With Building the ChemSpider Resource for Chemists
Antony Williams
Merck, October 2012
We Have… Too Much Data!!!
It is so difficult to navigate…
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known pathways?
Working on now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?
The World of Online Chemistry
Property databases
Compound aggregators
Screening assay results
Scientific publications
Encyclopedic articles (Wikipedia)
Metabolic pathway databases
ADME/Tox data – eTOX for example
Blogs/Wikis and Open Notebook Science
Contributing Open Source code to projects
PubChem
ChEMBL
Collaborative Knowledge Management
Data on the Web
RSC’s ChemSpider
We Want to Answer Questions
Questions a chemist might ask…
What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
Available Information…
Linked to vendors, safety data, toxicity, metabolism
Crowdsourced “Annotations”
Users can add
Descriptions/Syntheses/Commentaries
Links to PubMed articles
Links to articles via DOIs
Add spectral data
Add Crystallographic Information Files
Add photos
Add MP3 files
Add Videos
ChemSpider: Spectra Linked
Chemistry Data online is messy
We have inherited errors
All public compound databases, including ours, have errors
“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE
What could create change?
Harvard Business Review (2010)
“One change would make a substantial difference [to drug R&D]: the creation of
agreed-upon standards for digitally representing drug assets.”
Consider drug structures ONLY…
The Structure of Vitamin K?
MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K
The Structure of Vitamin K1?
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
Wikipedia
ChEBI – Manual Curation
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
Variants of systematic names on PubChem
2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
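The name variants listed above differ mainly in their stereodescriptors. As a minimal sketch (not the method used by PubChem or ChemSpider), the Python below strips parenthesised stereodescriptor groups from the name fragments exactly as they are truncated on the slide and checks that they collapse to a single stereo-free stem. The regular expression and the function names are assumptions made for illustration; real normalisation would more likely work from the structures themselves than from name text.

```python
import re
from collections import defaultdict

# Name fragments copied as truncated on the slide; they are not completed here.
FRAGMENTS = [
    "2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
    "2-methyl-3-(3,7,11,15-tetramethyl",
    "2-methyl-3-[(E)-3,7,11,15-tetramethyl",
]

# Hypothetical, deliberately narrow pattern: a group such as (E), (E,11S)
# or (E,7R,11R) followed by a hyphen.
STEREO_GROUP = re.compile(r"\((?:\d*[EZRS])(?:,\d*[EZRS])*\)-")


def strip_stereo(name: str) -> str:
    """Drop stereodescriptor groups and normalise '[' to '('."""
    return STEREO_GROUP.sub("", name).replace("[", "(")


def group_by_stem(names):
    """Map each stereo-free stem to the set of distinct names that share it."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_stereo(name)].add(name)
    return dict(groups)


if __name__ == "__main__":
    for stem, variants in group_by_stem(FRAGMENTS).items():
        print(f"{stem}: {len(variants)} distinct variants")
```

Grouping by a stereo-free stem only shows that the names are related. Deciding which variant is the correct structure for Vitamin K1 still needs curation.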
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What ELSE is Methane???
NLM’s DailyMed
With Great Fanfare…
NPC Browser http://tripod.nih.gov/npc/
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem
C&E News (from ACS)
People Use Trusted Resources…
Earlier this month…
Stop Whining – Fix it
Crowdsourced Curation
Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Search “Vitamin H”
“Curate” Identifiers
What is the outcome of this???
IF we can get the community to help clean up the internet of chemistry then we have:
High quality online reference resources
Freely available reference data
Ongoing iterative curation – how many chemical structures are “reworked”?
And what is the value of “curated chemical dictionaries”???
Successful Semantic Markup Depends on Dictionaries
Dictionaries Enhance Publications
I want to know about “Vincristine”
Vincristine Identifiers
Vincristine: Patents Linked by Name
Vincristine: Articles Linked by Name
What are the names for this compound just in patents????
A disambiguation NIGHTMARE!
Ambiguity in Identifiers
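One way to make this ambiguity concrete is to index names against records and flag any name that points to more than one record. The sketch below is an illustration only: the record labels (`rec-K1` and so on) are hypothetical placeholders, not ChemSpider IDs, and the name groupings simply follow the MeSH description of vitamin K quoted earlier in the talk.

```python
from collections import defaultdict

# (name, record) pairs. Record labels are hypothetical placeholders;
# the groupings follow the MeSH text quoted earlier in the talk.
PAIRS = [
    ("Vitamin K", "rec-K1"),
    ("Vitamin K", "rec-K2"),
    ("Vitamin K", "rec-K3"),
    ("Vitamin K 1", "rec-K1"),
    ("Phytomenadione", "rec-K1"),
    ("Menaquinone", "rec-K2"),
    ("Menadione", "rec-K3"),
]


def normalise(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def build_index(pairs):
    """Map each normalised name to the set of records it is attached to."""
    index = defaultdict(set)
    for name, record in pairs:
        index[normalise(name)].add(record)
    return index


def ambiguous_names(index):
    """Return only the names that resolve to more than one record."""
    return {name: sorted(recs) for name, recs in index.items() if len(recs) > 1}


if __name__ == "__main__":
    for name, records in ambiguous_names(build_index(PAIRS)).items():
        print(f"{name!r} is ambiguous: {records}")
```

A name flagged this way cannot safely be used to link a patent or article to a single structure without further curation.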
Crowdsourcing Works
>130 people have deposited data and participated in data curation
Curators at different levels check each other’s work
More curators and depositors encouraged! 28 million chemicals is a long list…
ChemSpider for Analytical Sciences
ChemSpider is being developed with the intention of:
Being the world’s richest resource of freely accessible curated analytical data
Serving as a platform for structure verification and dereplication
Providing access to supporting prediction algorithms
Spectral Uploading
Locate the structure of interest and deposit spectrum
Supported formats: JCAMP, PDF
Various types of NMR spectra supported
Regular Updates
Multiple Spectra for One Structure
ChemSpider ID 24528095 1H NMR
ChemSpider ID 24528095 13C NMR
ChemSpider ID 24528095 HH COSY
ChemSpider ID 24528095 HSQC
ChemSpider ID 24528095 HMBC
Full 13C assignment uploaded
Available Spectra http://www.chemspider.com/spectra.aspx
How do these data get curated?
Every spectrum can be commented on
Incorrect spectra have been annotated and curated by users…
But curation through gaming is also possible…
Web Services
www.SpectralGame.com
http://www.jcheminf.com/content/1/1/9
Spectral Game
Increasing Complexity
Spectral Game
Reversed Spectrum
True Curation of Data
SpectralGame in the hand
In progress… Storage and display of ASSIGNED spectra
Mass Spec Analysis
ChemSpider Interface
Tinuvin 328
Position sorted by references
Position 1 only
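One reading of “Position sorted by references” is that the candidates returned by a mass search are ordered by how many references each record carries, in the hope that the compound of interest lands at position 1. The sketch below illustrates that idea only. The `Candidate` class, the 5 ppm tolerance, the labels, the masses and the reference counts are all hypothetical values chosen for the example, and nothing here calls or describes the actual ChemSpider interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str            # hypothetical record label
    mono_mass: float      # monoisotopic mass in Da (illustrative value)
    reference_count: int  # number of linked references (illustrative value)


# Illustrative candidates only; none of these numbers is measured data.
CANDIDATES = [
    Candidate("A", 351.2311, 3),
    Candidate("B", 351.2311, 120),
    Candidate("C", 351.2400, 500),  # falls outside the mass window below
    Candidate("D", 351.2309, 0),
]


def rank_by_references(candidates, query_mass, ppm=5.0):
    """Keep candidates within +/- ppm of query_mass, most-referenced first."""
    tolerance = query_mass * ppm / 1e6
    hits = [c for c in candidates if abs(c.mono_mass - query_mass) <= tolerance]
    # Sort by descending reference count, then by label for a stable order.
    return sorted(hits, key=lambda c: (-c.reference_count, c.label))


if __name__ == "__main__":
    ranked = rank_by_references(CANDIDATES, query_mass=351.2311)
    for position, cand in enumerate(ranked, start=1):
        print(position, cand.label, cand.reference_count)
```

Reference count is only a proxy for how well known a compound is, so a ranking like this narrows the candidate list rather than confirming an identification.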
Web Services
Web Services Open Up Collaboration
Agilent, Bruker, Waters and Thermo all use our web-based services for compound lookup
Many academic sites are integrating directly: metabonomics, name lookup, semantic markup
Where do data come from?
ChemSpider users deposit data
Some contributions from NIST
Chemical vendors are starting to provide data. Synthonix are one of our major contributors (www.synthonix.com)
Commercial Database Access
Recently deposited to ChemSpider
EPA/NIST IR Database >5000 spectra
Presently under development
NIST MS database >200,000 MS spectra
Where next with Analytical Support?
PharmaSea project for the identification of natural products – dereplication approaches
Use mass spectrometry searches of natural product slices to identify
Pre-fragment compounds and develop searches
Dereplication using NMR data NMR features
Predicted spectra and “Verification approaches”
NMRShiftDB: http://www.ebi.ac.uk/nmrshiftdb/
NMR Prediction
NMRShiftDB Data Review
• High quality NMR shift set of ca. 100,000 shifts
• Derived prediction algorithms give very similar performance statistics to commercial algorithms
Crowdsourcing Chemical Synthesis
How much data generated in a lab, that COULD go public, is lost forever?
Public Domain reference databases of value?
Properties
Spectra
CIFs
Images
Syntheses
An Adventure into the World of Small but Significant Contributions
ChemSpider SyntheticPages
Micropublishing with Peer Review
(a chemical synthesis blog?)
Multi-Step Synthesis
Interactive Data
MOBILE Structure Database Lookup
It is so difficult to navigate…
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known pathways?
Working on now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?
Open PHACTS Project
Develop a set of robust standards…
Implement the standards in a semantic integration hub
Deliver services to support drug discovery programs in pharma and public domain
22 partners, 8 pharmaceutical companies, 3 biotechs
36-month project – goes live next month
Guiding principle is open access, open usage, open source
- Key to standards adoption -
Internet Data
The Future
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals
The Future of Chemistry on the Web?
Public compound databases federate & build a linked environment of validated data!
Data validation needs are not ignored
Publishers layer on information to make publications discoverable
Public-Private databases can be linked
Open Data proliferate
The “Semantic Web” in action
Can Merck Contribute to this Project?
Do you have any data that you can release into the public domain?
Measured property data
How many “common” spectra are thrown away?
How many syntheses are published and locked behind paywalls? (www.chemspider.com/reactions)
Can your scientists contribute annotations and curations if they use ChemSpider?
Is the challenge of Legal Clearance too big?
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams