ChemSpider – A Platform to Gather, Host and Integrate Structure Based Data Across the Web

Declaration

ChemSpider does NOT do toxicity prediction, yet
We are building a content database for you to use

What ChemSpider does can be invaluable to those who do toxicity prediction:
Find “correct” chemical structures
Find associated data (experimental/predicted)
Link out to rich sources of information online
Engage the community in sharing data

A Pragmatic Vision in 2006

“Build a Structure Centric Community”

December 2006 – A project initiated to connect chemistry on the web

Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data

Three Years of Experience

Internet-based chemistry is a mess!

Most public compound databases on the web are contaminated. Including ours!

The annotation/curation of data online is difficult

Most database hosts are non-responsive to feedback – “We are a host/repository of data”

Who cares?

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

What is the Structure of Vitamin K?

MeSH – Medical Subject Headings

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

What is the Structure of Vitamin K1?

Chemical Abstracts “Common Chemistry” Database

Wikipedia

Incorrect Structures

Wow!

Lack of Stereochemistry

Does stereochemistry matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon, Thalidomide

Comparative Toxicogenomics Database

PubChem

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl 2-methyl-3-(3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl
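To see why so many name variants can exist for one skeleton, the stereo descriptors that appear in the names above can be enumerated. This is only an illustrative sketch: it assumes, from the names themselves, that the side chain has one stereogenic double bond (E/Z) and two stereocentres (positions 7 and 11), and that each may also be left unspecified. The function and variable names are hypothetical and are not taken from PubChem.

```python
from itertools import product

# Stereo elements read off the names above (an assumption of this sketch):
# one double bond (E/Z) and two stereocentres at positions 7 and 11.
# None means "not specified in the name".
DOUBLE_BOND = ("E", "Z", None)
CENTRE_7 = ("R", "S", None)
CENTRE_11 = ("R", "S", None)


def stereo_label(bond, c7, c11):
    """Build a descriptor prefix such as '(E,7R,11R)', or '' if nothing is specified."""
    parts = []
    if bond:
        parts.append(bond)
    if c7:
        parts.append(f"7{c7}")
    if c11:
        parts.append(f"11{c11}")
    return f"({','.join(parts)})" if parts else ""


# Every combination of specified / unspecified stereo elements.
all_labels = [
    stereo_label(bond, c7, c11)
    for bond, c7, c11 in product(DOUBLE_BOND, CENTRE_7, CENTRE_11)
]

# Labels in which all three elements are specified contain exactly two commas.
fully_specified = [label for label in all_labels if label.count(",") == 2]

print(len(all_labels), "possible descriptor prefixes")
print(len(fully_specified), "fully specified stereoisomers")
```

Under these assumptions only one of the fully specified prefixes describes a given stereoisomer, while partially specified or unspecified names leave the stereochemistry ambiguous.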

ChEBI – Manual Curation

What’s Methane?

What ELSE is Methane???

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem C&E News (from ACS)

http://www.chemspider.com/RecordView.aspx?id=4445428

Online Datasets

[Chemical structure drawings from online datasets; only scattered atom labels survived extraction]

What Sources Do You Trust?

QSAR World

Online Datasets

The dataset for QSAR appears to have been generated with Name-to-Structure algorithms

Many systematic errors in the data – non-curated?

Using such data for modeling is risky

Internet-Based Chemistry is a Mess

Algorithms can get you only so far in data cleaning

Human curation is necessary

Only the crowds can help with big data…

But, if we DID have a highly curated dataset…
Reference database/dictionary of chemicals
High quality data for modeling
Centralized repository for models/data?

www.chemspider.com

We Answer Questions for Chemists

Questions a chemist might ask…
What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Aspirin?
What is the NMR spectrum of Benzoic Acid?
What are the safety handling issues for toluene?

Search for a Chemical…by name

Available Information…

Linked to vendors, safety data, toxicity, metabolism

Search for chemicals

ChemSpider Today

24.8 million structures
400 data sources
Grows daily
Community annotation and curation

We curate, edit, change, enhance data daily

Search “Vitamin H”

“Curate” Identifiers




General curation activities:
Remove incorrect names
Correct spellings
Add multilingual names
Add alternative names

In 3 years over 1 million structure-identifier relationships have been validated – robotically and manually
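One kind of check a robot can apply to an identifier before any human looks at it is a checksum test. As a hedged illustration (not a description of how ChemSpider validates its data), the sketch below applies the published check-digit rule for CAS Registry Numbers: the digits before the check digit are weighted 1, 2, 3… from right to left, and the sum modulo 10 must equal the final digit. The function name is hypothetical.

```python
import re

# CAS Registry Numbers have the form NNNNNNN-NN-R:
# 2 to 7 digits, then 2 digits, then a single check digit R.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas_rn(cas_rn: str) -> bool:
    """Return True if the string is well formed and its check digit is consistent."""
    match = CAS_PATTERN.match(cas_rn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the rightmost digit by 1, the next by 2, and so on.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ("7732-18-5", "50-78-2", "7732-18-4", "not-a-cas"):
        print(candidate, is_valid_cas_rn(candidate))
```

A passing checksum only shows that the number is internally consistent. It says nothing about whether the number is attached to the right structure, which is where manual curation comes in.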

130 people have participated in validation or annotation. “Crowds” can be quite small!

Crowdsourced “Annotations”

Registered Users can:
Add descriptions/syntheses/commentaries
Add links to articles, blogs, wikis etc.
Add spectral data
Add photos
Add MP3 files
Add videos

Data Validation – ONE Cymarin

Question Quality in Big Databases

Data Validation – Cortisol

Data Validation in Databases

ADNPLDHMAVUMIW   509
WQZGKKKJIJFFOK   119
RUDATBOHQWOJDD   118   Ursodeoxycholic acid
GUBGYTABKSRVRQ    89   Lactose
BHQCQFFYRZLCQQ    80   Cholic acid
RCINICONZNJXQF    76   Taxol
KXGVEGMKQFWNSR    73   Deoxycholic acid
PXGPLTODNUVGFL    71
HVYWMOMLDIMFJA    69
QGXBDMJGAMFCBF    63
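The 14-letter strings above have the form of the first block of an InChIKey, which hashes the connectivity (skeleton) of a structure; stereochemistry and isotopes are encoded in the second block. Assuming the numbers beside them count database records that share the same skeleton, a tally of this kind could be produced along the following lines. This is a minimal sketch: the function names are hypothetical, and the second blocks in the sample keys are made-up placeholders, not real InChIKeys.

```python
from collections import Counter


def skeleton(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    return inchikey.split("-")[0]


def count_by_skeleton(inchikeys):
    """Count records per skeleton, most frequent first."""
    return Counter(skeleton(key) for key in inchikeys).most_common()


# Placeholder keys: the first blocks come from the list above, the second
# blocks are invented for illustration only.
sample_keys = [
    "BHQCQFFYRZLCQQ-AAAAAAAASA-N",
    "BHQCQFFYRZLCQQ-BBBBBBBBSA-N",
    "BHQCQFFYRZLCQQ-CCCCCCCCSA-N",
    "GUBGYTABKSRVRQ-AAAAAAAASA-N",
    "GUBGYTABKSRVRQ-BBBBBBBBSA-N",
    "RCINICONZNJXQF-AAAAAAAASA-N",
]

for first_block, n_records in count_by_skeleton(sample_keys):
    print(first_block, n_records)
```

A skeleton with many distinct full keys is a hint that the database holds several stereo variants of what may be a single compound, which makes it a natural starting point for curation.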

First request to Database Hosts!

Every public compound database host should add ONE feature – “Leave Comments”

Second request to Database Hosts!

Show Comments

Linked Data on the Web

Taken from: Rafael Sidi’s Blog

What is a compound?

The InChI Identifier

Linking and Modeling Bad Data

What is the value of linking bad data?

How can we model suspect data efficiently?

Commonly data are incorrect:
Measured data are suspect
Structures associated with data are not correct
Identifiers are incorrectly associated

Properties on the Database



Linked Out to Resources

Properties Linked Off the Database

LASSO:
uses 23 kinds of Interactive Surface Point Descriptors and is conformation independent
screens at 1 million structures/min
is proven to enrich screened databases
provides scaffold hopping

Hbond Donors (5 kinds)
Acceptors (5 kinds)
Ambivalent H donor/acceptor
Aromatic Pi-stacking (5 kinds)
Hydrophobic (3 kinds)
Metal ions
Misc (Sulfur, Halogens)

http://dx.doi.org/10.1007/s10822-007-9164-5

SimBioSys LASSO

LASSO Linked Out

Present Activities

Enhancing data model to manage more experimental properties – data available for download and modeling

Developing relationships with other software vendors and model developers for integration

Curating QSARWorld datasets for deposition

ChemSpider Tomorrow

6 months: >1.2M compounds/month
6 months: >800,000 new uniques
6 months: >60 new data sources added

Continue the curation effort and keep cleaning
Finish depositions – millions left to deposit
Integrate RSC content – a massive archive!
Integrate RSC publishing workflows and databases
Enable the semantic web for chemistry – RDF

Future Activities – Data Management

Aggregating and managing data from publications

Specifically aggregating:
Data from MedChemComm
Reaction Data (SyntheticPages)
Spectral Data

Access Data Through Web Services

Mobile Data Access

The Future of Linked Chemistry on the Internet?

Public compound databases federate to build a truly linked environment of validated data!
Data validation needs are not ignored
Publishers layer on information to make publications discoverable
Public-Private databases can be linked
Open Data proliferate
RDF is everywhere

ChemSpider & Toxicity Prediction

Continue the curation effort and keep cleaning
Web services allow integration and data download
Presently collaborating with groups to provide access to data for modeling
Intention is to provide the highest quality online database with associated data

Community Contribution and Innovation

“Community contribution” best practice award
i-Expo Innovation Award: June 2010
ALPSP Innovation Award: September 2010

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
