Chemistry Online – The Vision and Challenges Associated With Building the ChemSpider Resource for Chemists

Antony Williams

Merck, October 2012

We Have… Too Much Data!!!

It is so difficult to navigate…

What’s the structure?

Are they in our file?

What’s similar?

What’s the target?

Pharmacology data?

Known Pathways?

Working On Now?

Connections to disease?

Expressed in right cell type?

Competitors?

IP?

The World of Online Chemistry

Property databases

Compound aggregators

Screening assay results

Scientific publications

Encyclopedic articles (Wikipedia)

Metabolic pathway databases

ADME/Tox data – eTOX for example

Blogs/Wikis and Open Notebook Science

Contributing Open Source code to projects

PubChem

ChEMBL

Collaborative Knowledge Management

Data on the Web

RSC’s ChemSpider

We Want to Answer Questions

Questions a chemist might ask…

What is the melting point of n-heptanol?

What is the chemical structure of Xanax?

Chemically, what is phenolphthalein?

What are the stereocenters of cholesterol?

Where can I find publications about xylene?

What are the different trade names for Ketoconazole?

What is the NMR spectrum of Aspirin?

What are the safety handling issues for Thymol Blue?

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Available Information….

Crowdsourced “Annotations”

Users can add

Descriptions/Syntheses/Commentaries

Links to PubMed articles

Links to articles via DOIs

Add spectral data

Add Crystallographic Information Files

Add photos

Add MP3 files

Add Videos

Spectra Linked

Chemistry Data online is messy

We have inherited errors

All public compound databases, including ours, have errors

“Incorrect” structures – assertions, timelines, etc.

“Incorrect” names associated with structures

Properties

Links

Publications

ENORMOUS CHALLENGE

What could create change?

Harvard Business Review (2010)

“One change would make a substantial difference [to drug R&D]: the creation of agreed-upon standards for digitally representing drug assets.”

Consider drug structures ONLY…

The Structure of Vitamin K?

MeSH

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.

The Structure of Vitamin K1?

What is the Structure of Vitamin K1?

CAS’s Common Chemistry

Wikipedia

ChEBI – Manual Curation

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl

2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl

2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl

2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl

2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl

2-methyl-3-[(E)-3,7,11,15-tetramethyl

2-methyl-3-(3,7,11,15-tetramethyl

2-methyl-3-[(E)-3,7,11,15-tetramethyl
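
The name variants above differ mainly in how much stereochemistry they state. The following is a minimal sketch of why so many variants can coexist, not a description of how PubChem generates names. It assumes, based on the descriptors visible in the names above, that the side chain has three stereo elements (the E/Z double bond and the centres at positions 7 and 11) and that a depositor may specify each one or leave it out. The function name `stereo_descriptors` is hypothetical.

```python
from itertools import product

def stereo_descriptors():
    """List every stereo prefix obtainable when each element is set or omitted."""
    # Assumption: three stereo elements, each either specified or left out (None).
    double_bond = ["E", "Z", None]
    centre_7 = ["7R", "7S", None]
    centre_11 = ["11R", "11S", None]
    prefixes = []
    for combo in product(double_bond, centre_7, centre_11):
        parts = [p for p in combo if p is not None]
        # An empty string stands for a name with no stereo descriptor at all.
        prefixes.append("(" + ",".join(parts) + ")" if parts else "")
    return prefixes

prefixes = stereo_descriptors()
fully_specified = [p for p in prefixes if p.count(",") == 2]
print(len(prefixes), "possible prefixes;", len(fully_specified), "fully specified")
```

Under these assumptions there are 27 possible prefixes, of which only 8 identify a single stereoisomer; every other variant leaves part of the structure undefined.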

Chemistry on The Internet Is Messy

It’s Methane…

What’s Methane?

What ELSE is Methane???

NLM’s DailyMed

With Great Fanfare…

NPC Browser http://tripod.nih.gov/npc/

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem

C&E News (from ACS)

People Use Trusted Resources…

Earlier this month…

Stop Whining – Fix it

Crowdsourced Curation

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Search “Vitamin H”

“Curate” Identifiers

What is the outcome of this???

IF we can get the community to help clean up the internet of chemistry then we have:

High quality online reference resources

Freely available reference data

Ongoing iterative curation – how many chemical structures are “reworked”?

And what is the value of “curated chemical dictionaries”???

Successful Semantic Markup Depends on Dictionaries

Dictionaries Enhance Publications
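
To make the dependence on dictionaries concrete, here is a minimal sketch of dictionary-based markup: chemical names found in text are wrapped with the identifier of the record they point to. The dictionary contents, the `record-N` identifiers, the `[name|id]` output format and the function name `mark_up` are all hypothetical, and this is not a description of any publisher's actual markup pipeline.

```python
import re

# Hypothetical mini-dictionary: name or synonym -> record identifier (placeholder IDs).
DICTIONARY = {
    "vincristine": "record-1",
    "vitamin h": "record-2",
    "biotin": "record-2",
}

def mark_up(text, dictionary=DICTIONARY):
    """Wrap every dictionary term found in text as [term|record id]."""
    # Longest names first so multi-word names win over shorter ones.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f"[{m.group(0)}|{dictionary[m.group(0).lower()]}]", text
    )

print(mark_up("Vincristine was compared with vitamin H (biotin)."))
# prints: [Vincristine|record-1] was compared with [vitamin H|record-2] ([biotin|record-2]).
```

The markup can only be as good as the dictionary: a wrong name-to-structure association is propagated into every marked-up publication that uses it.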

I want to know about “Vincristine”

Vincristine Identifiers

Vincristine: Patents Linked by Name

Vincristine: Articles Linked by Name

What are the names for this compound just in patents????

A disambiguation NIGHTMARE!

Ambiguity in Identifiers
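
One way to surface this kind of ambiguity for curators is to invert the name lists and flag any name attached to more than one structure. The sketch below assumes a simple mapping from structure identifier to deposited names; the identifiers, the sample records and the function name `ambiguous_names` are hypothetical.

```python
from collections import defaultdict

def ambiguous_names(records):
    """Return names attached to more than one structure identifier."""
    structures_by_name = defaultdict(set)
    for structure_id, names in records.items():
        for name in names:
            # Compare names case-insensitively and ignore stray whitespace.
            structures_by_name[name.strip().lower()].add(structure_id)
    return {n: sorted(ids) for n, ids in structures_by_name.items() if len(ids) > 1}

# Hypothetical records: structure identifier -> names deposited against it.
records = {
    "structure-A": ["Methane", "Marsh gas"],
    "structure-B": ["methane"],  # a wrongly named deposit
    "structure-C": ["Vincristine"],
}
print(ambiguous_names(records))
# prints: {'methane': ['structure-A', 'structure-B']}
```

A flagged name is only a candidate for review; deciding which association is correct still needs a curator.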

Crowdsourcing Works

>130 people have deposited data and participated in data curation

Curators at different levels check each other’s work

More curators and depositors encouraged! 28 million chemicals is a long list…

ChemSpider for Analytical Sciences

ChemSpider is being developed with the intention of:

Being the world’s richest resource of freely accessible curated analytical data

Serving as a platform for structure verification and dereplication

Providing access to supporting prediction algorithms

Spectral Uploading

Various types of NMR spectra supported

Multiple Spectra for One Structure

ChemSpider ID 24528095 1H NMR

ChemSpider ID 24528095 13C NMR

ChemSpider ID 24528095 H-H COSY

ChemSpider ID 24528095 HSQC

ChemSpider ID 24528095 HMBC

Full 13C assignment uploaded

Available Spectra http://www.chemspider.com/spectra.aspx

How do these data get curated?

Every spectrum can be commented on

Incorrect spectra have been annotated and curated by users…

But curation through gaming is also possible…

www.SpectralGame.com

http://www.jcheminf.com/content/1/1/9

Spectral Game

Increasing Complexity

Spectral Game

Reversed Spectrum

True Curation of Data

SpectralGame in the hand

Mass Spec Analysis

ChemSpider Interface

Tinuvin 328

Position sorted by references

Position 1 only
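
The slide titles above suggest that candidates returned by a mass search are ordered by how many references each record carries, and that the test is whether the correct compound lands in position 1. A minimal sketch of that idea follows; it is not ChemSpider's actual implementation. The candidate labels, masses, reference counts, the 5 ppm default tolerance and the function name `rank_candidates` are all hypothetical.

```python
def rank_candidates(candidates, observed_mass, tolerance_ppm=5.0):
    """Keep candidates within a ppm window of the observed mass, most-cited first."""
    hits = []
    for name, mass, references in candidates:
        error_ppm = abs(mass - observed_mass) / observed_mass * 1e6
        if error_ppm <= tolerance_ppm:
            hits.append((name, references))
    # Sort by reference count, highest first; ties broken by name for stable output.
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# Hypothetical candidates: (label, monoisotopic mass, number of references).
candidates = [
    ("candidate-1", 351.2311, 4),
    ("candidate-2", 351.2310, 120),
    ("candidate-3", 351.2600, 900),  # outside the mass window
]
print(rank_candidates(candidates, observed_mass=351.2311))
# prints: [('candidate-2', 120), ('candidate-1', 4)]
```

Ranking by reference count rests on the assumption that a compound seen in a real sample is more likely to be a well-documented one than an obscure isomer of the same mass.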

Web Services

Web Services Open Up Collaboration

Agilent, Bruker, Waters and Thermo all use our web-based services for compound lookup

Many academic sites are integrating directly – metabonomics, name lookup, semantic markup

Where do data come from?

ChemSpider users deposit data

Some contributions from NIST

Chemical vendors are starting to provide data. Synthonix are one of our major contributors (www.synthonix.com)

Commercial Database Access

Recently deposited to ChemSpider

EPA/NIST IR Database >5000 spectra

Presently under development

NIST MS database >200,000 MS spectra

Where next with Analytical Support?

PharmaSea project for the identification of natural products – dereplication approaches

Use mass spectrometry searches of natural product slices to identify

Pre-fragment compounds and develop searches

Dereplication using NMR data and NMR features

Predicted spectra and “Verification approaches”

NMRShiftDB: http://www.ebi.ac.uk/nmrshiftdb/

NMR Prediction

NMRShiftDB Data Review

• High quality NMR shift set of ca. 100,000 shifts

• Derived prediction algorithms give very similar performance statistics to commercial algorithms
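
The slides do not say which performance statistics were compared. As an illustration only, two common choices for shift prediction are the mean absolute error and the root-mean-square error between predicted and experimental shifts. The shift values and the function name `shift_errors` below are hypothetical.

```python
from math import sqrt

def shift_errors(experimental, predicted):
    """Return (mean absolute error, root-mean-square error) in ppm."""
    if len(experimental) != len(predicted) or not experimental:
        raise ValueError("need two non-empty shift lists of equal length")
    diffs = [p - e for e, p in zip(experimental, predicted)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

# Hypothetical 13C shifts (ppm) for one small molecule.
experimental = [20.0, 50.0, 130.0, 170.0]
predicted = [21.0, 49.0, 132.0, 170.0]
mae, rmse = shift_errors(experimental, predicted)
print(f"MAE = {mae:.2f} ppm, RMSE = {rmse:.2f} ppm")
# prints: MAE = 1.00 ppm, RMSE = 1.22 ppm
```

Running the same function over the same test set for two predictors gives directly comparable numbers.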

Crowdsourcing Chemical Synthesis

How much data generated in a lab, that COULD go public, is lost forever?

Public Domain reference databases of value?

Properties

Spectra

CIFs

Images

Syntheses

An Adventure into the World of Small but Significant Contributions

ChemSpider SyntheticPages

Micropublishing with Peer Review

(a chemical synthesis blog?)

Multi-Step Synthesis

Interactive Data

MOBILE Structure Database Lookup

It is so difficult to navigate…

What’s the structure?

Are they in our file?

What’s similar?

What’s the target?

Pharmacology data?

Known Pathways?

Working On Now?

Connections to disease?

Expressed in right cell type?

Competitors?

IP?

Open PHACTS Project

Develop a set of robust standards…

Implement the standards in a semantic integration hub

Deliver services to support drug discovery programs in pharma and public domain

22 partners, 8 pharmaceutical companies, 3 biotechs

36-month project – goes live next month

Guiding principle is open access, open usage, open source

- Key to standards adoption -

Internet Data

The Future

Commercial Software

Pre-competitive Data

Open Science

Open Data

Publishers

Educators

Open Databases

Chemical Vendors

Small organic molecules

Undefined materials

Organometallics

Nanomaterials

Polymers

Minerals

Particle bound

Links to Biologicals

The Future of Chemistry on the Web?

Public compound databases federate & build a linked environment of validated data!

Data validation needs are not ignored

Publishers layer on information to make publications discoverable

Public-Private databases can be linked

Open Data proliferate

The “Semantic Web” in action

Can Merck Contribute to this Project?

Do you have any data that you can release into the public domain?

Measured property data

How many “common” spectra are thrown away?

How many syntheses are published and locked behind paywalls? (www.chemspider.com/reactions)

Can your scientists contribute annotations and curations if they use ChemSpider?

Is the challenge of Legal Clearance too big?

Thank you

Email: [email protected]

Twitter: ChemConnector

Blog: www.chemspider.com/blog

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams